Interpreters

Modules under Interpreters carry out all the input transformations needed on the data supplied by the data loader. The level of transformation depends on the feature the interpreter serves. Interpreter outputs are either passed directly to the visualizers or undergo secondary processing before being passed to a visualizer for graphing.

Int - General Metrics

class rarity.interpreters.structured_data.IntGeneralMetrics(data_loader: Union[rarity.data_loader.data_loader.CSVDataLoader, rarity.data_loader.data_loader.DataframeLoader], viz_plot: Optional[str] = None)[source]

Transform raw data into an input format suitable for visualization. General metrics cover the confusion matrix, classification report, ROC curve and precision-recall curve.

Parameters
  • data_loader (CSVDataLoader or DataframeLoader) – Class object from data_loader module

  • viz_plot (str) – Supported visualization types: confMat, classRpt, rocAUC, precRecall, stdErr, None

Important Attributes:

  • analysis_type (str):

    Analysis type defined by the user when preparing the initial inputs at the data_loader stage.

Returns

For a regression task, a DataFrame with the essential information needed for visualization

Return type

DataFrame

Note

If classification, returns:

  • yTrue data in Series

  • yPred data in Series for [confMat, classRpt] or Dataframe for [rocAuc, precRecall]

If multiclass, returns:

  • yPred data in List[List[Tuple]] pairing class label and yPred in Series

  • model_names in List[str]
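
To make the yTrue / yPred pairing above more concrete, the following is a minimal sketch of how a confusion matrix can be tallied from two label sequences. It uses only the standard library and is not taken from rarity; the helper name confusion_counts and the sample labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def confusion_counts(y_true: List[str], y_pred: List[str]) -> Dict[Tuple[str, str], int]:
    """Count (true label, predicted label) pairs. Hypothetical helper, not part of rarity."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return dict(Counter(zip(y_true, y_pred)))


# Illustrative labels only.
y_true = ["cat", "dog", "cat", "dog", "cat"]
y_pred = ["cat", "cat", "cat", "dog", "dog"]
counts = confusion_counts(y_true, y_pred)

# Each key is (true label, predicted label); off-diagonal keys are errors.
for (true_label, pred_label), n in sorted(counts.items()):
    print(f"true={true_label} pred={pred_label} count={n}")
```

Keys whose two labels differ correspond to the off-diagonal cells of a confusion matrix.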

Int - Miss Predictions

class rarity.interpreters.structured_data.IntMissPredictions(data_loader: Union[rarity.data_loader.data_loader.CSVDataLoader, rarity.data_loader.data_loader.DataframeLoader])[source]

Transform raw data into an input format suitable for visualizing miss-prediction points

Parameters

data_loader (CSVDataLoader or DataframeLoader) – Class object from data_loader module

Returns

For a regression task, a DataFrame with the essential information needed for visualization

Return type

DataFrame

Note

If classification, returns:

Compact outputs consisting of the following:

  • ls_dfs_viz (List[~pd.DataFrame]): list of dataframes for the overview visualization

  • ls_class_labels (List[str]): list of class labels

  • ls_dfs_by_label (List[~pd.DataFrame]): list of dataframes by individual label class

  • ls_dfs_by_label_state (List[~pd.DataFrame]): list of dataframes storing basic stats of each label class
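
As a rough illustration of the per-label outputs listed above, the sketch below tags each record with a prediction state and groups the records by true label. It is an assumption-laden stand-in that uses plain dictionaries instead of dataframes; the function names, the pred_state key and the sample records are hypothetical and do not come from rarity.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_true_label(records: List[dict]) -> Dict[str, List[dict]]:
    """Tag each record as correct or miss-predict and group by true label (hypothetical helper)."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for rec in records:
        state = "correct" if rec["yTrue"] == rec["yPred"] else "miss-predict"
        grouped[rec["yTrue"]].append({**rec, "pred_state": state})
    return dict(grouped)


def miss_rate_by_label(grouped: Dict[str, List[dict]]) -> Dict[str, float]:
    """Basic per-label statistic: share of records that were miss-predicted."""
    return {
        label: sum(r["pred_state"] == "miss-predict" for r in rows) / len(rows)
        for label, rows in grouped.items()
    }


# Illustrative records only.
records = [
    {"yTrue": "cat", "yPred": "cat"},
    {"yTrue": "cat", "yPred": "dog"},
    {"yTrue": "dog", "yPred": "dog"},
    {"yTrue": "dog", "yPred": "dog"},
]
by_label = group_by_true_label(records)
stats = miss_rate_by_label(by_label)
```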

Int - Loss Clusters

class rarity.interpreters.structured_data.IntLossClusterer(data_loader: Union[rarity.data_loader.data_loader.CSVDataLoader, rarity.data_loader.data_loader.DataframeLoader])[source]

Transform raw data into an input format suitable for visualizing loss clusters

Parameters

data_loader (CSVDataLoader or DataframeLoader) – Class object from data_loader module

extract_misspredictions()[source]

Function to output a list of dataframes with prediction state info included

xform(num_cluster: int, log_func: math.log, specific_dataset: str)[source]

Core transformation function that converts the data into an input format suitable for Plotly graphs

Parameters
  • num_cluster (int) – Number of clusters to form

  • log_func (math.log) – Mathematical logarithm function used to calculate the log-loss between yTrue and yPred

  • specific_dataset (str) – Defaults to ‘All’, which includes all miss-predicted labels. The other options depend on the class labels available
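
The role of a pluggable logarithm such as log_func can be sketched as follows. This is a generic binary log-loss under stated assumptions (0/1 labels, predicted probability of the positive class, clipping with a hypothetical eps), not rarity's own calculation; the function name per_sample_log_loss is hypothetical.

```python
import math
from typing import Callable, List


def per_sample_log_loss(
    y_true: List[int],
    y_prob: List[float],
    log_func: Callable[[float], float] = math.log,
    eps: float = 1e-15,
) -> List[float]:
    """Binary log-loss per sample; the log base is set by log_func (hypothetical helper)."""
    losses = []
    for t, p in zip(y_true, y_prob):
        # Clip probabilities so the logarithm is never evaluated at 0.
        p = min(max(p, eps), 1 - eps)
        losses.append(-(t * log_func(p) + (1 - t) * log_func(1 - p)))
    return losses


# Illustrative values only: a confident correct prediction and a confident wrong one.
natural_losses = per_sample_log_loss([1, 1], [0.9, 0.1])
# Swapping the logarithm changes the unit of the loss (bits instead of nats).
bit_losses = per_sample_log_loss([1, 1], [0.9, 0.1], log_func=math.log2)
```

A confidently wrong prediction yields a much larger loss than a confidently correct one, which is what makes per-sample loss a useful quantity to cluster on.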

Returns

Compact outputs consisting of the following:

  • df (DataFrame): dataframe for the overview visualization, with offset values included

  • ls_score (List[float]): list of silhouette scores, indication of clustering quality

  • ls_cluster_range (List[List[int]]): list of lists containing the cluster number range from 1 to 10

  • ls_ssd (List[float]): sums of squared distances generated via kmean_inertia

Note

If classification, returns:

Compact outputs consisting of the following:

  • ls_dfs_viz (List[~pd.DataFrame]): dataframes for the overview visualization, with offset values included

  • ls_class_labels (List[str]): list of all class labels

  • ls_class_labels_misspred (List[str]): list of class labels with at least 1 miss-prediction

  • ls_score (List[float]): list of silhouette scores, indication of clustering quality

  • ls_cluster_range (List[List[int]]): list of lists containing the cluster number range from 1 to 10

  • ls_ssd (List[float]): sums of squared distances generated via kmean_inertia
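
The sum of squared distances reported in ls_ssd is the usual k-means inertia. The sketch below shows that quantity for a tiny one-dimensional k-means written with the standard library only. It assumes a simple deterministic initialisation and is not the clustering routine rarity uses; kmeans_1d and the sample loss values are hypothetical.

```python
from typing import List, Tuple


def kmeans_1d(values: List[float], k: int, n_iter: int = 20) -> Tuple[List[float], float]:
    """Tiny 1-D k-means; returns (centroids, sum of squared distances). Hypothetical helper."""
    ordered = sorted(values)
    # Deterministic initialisation: spread the starting centroids over the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        clusters: List[List[float]] = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: (v - centroids[j]) ** 2)
            clusters[nearest].append(v)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]
    # Inertia: squared distance from each point to its closest centroid.
    ssd = sum(min((v - c) ** 2 for c in centroids) for v in values)
    return centroids, ssd


# Illustrative per-sample losses forming two obvious groups.
losses = [0.1, 0.2, 0.15, 2.0, 2.1, 1.9]
ls_ssd = [kmeans_1d(losses, k)[1] for k in range(1, 4)]
```

Plotting such values against the number of clusters gives the familiar elbow curve: the drop from one to two clusters is large here, and adding a third cluster helps far less.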

Int - xFeature Distribution

class rarity.interpreters.structured_data.IntFeatureDistribution(data_loader: Union[rarity.data_loader.data_loader.CSVDataLoader, rarity.data_loader.data_loader.DataframeLoader])[source]

Transform raw data into an input format suitable for visualizing feature distributions

Parameters

data_loader (CSVDataLoader or DataframeLoader) – Class object from data_loader module

_generate_kl_div_info_base(df: pandas.core.frame.DataFrame, feature_to_exclude: List)[source]

Function to generate a dictionary-like output storing the kl-divergence score for each feature, arranged in descending order.

_get_df_feature_with_pred_state_cls(df_overall: pandas.core.frame.DataFrame)[source]

For classification tasks only. Function to output a customized df combining features and relevant prediction info for use in visualization.

_get_df_sliced(start_idx: int, stop_idx: int)[source]

Slice the dataframe to the specified range.

_get_probabilities_by_bin_group(df_viz: pandas.core.frame.DataFrame, bin_count: int)[source]

For regression tasks only. Function to output a customized df that makes it easy to get probabilities by bin group for the reference df and the sliced df
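
Turning binned values into probabilities can be sketched as below: count how many values fall into each bin and divide by the total. This is a generic illustration with hypothetical names (bin_probabilities, reference, sliced) and hand-picked bin edges, not the method's actual implementation.

```python
from typing import List


def bin_probabilities(values: List[float], bin_edges: List[float]) -> List[float]:
    """Share of values per bin; bins are half-open except the last, which is closed."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            upper_ok = v <= bin_edges[i + 1] if is_last else v < bin_edges[i + 1]
            if bin_edges[i] <= v and upper_ok:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Illustrative data: a reference set and a slice of it, binned on the same edges.
reference = [1, 2, 3, 4, 5, 6, 7, 8]
sliced = [1, 2, 2, 3]
edges = [0, 4, 8]
p_reference = bin_probabilities(reference, edges)
p_sliced = bin_probabilities(sliced, edges)
```

Using the same bin edges for both sets is what makes the two probability vectors comparable.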

_get_probabilities_by_feature(df_viz: pandas.core.frame.DataFrame, specific_feature: str)[source]

For classification tasks only. Function to calculate the probabilities of correct vs miss-predict for a specific feature

_get_single_feature_df_with_binning(df: pandas.core.frame.DataFrame, feature: str)[source]

For regression tasks only. Function to find the optimum bin size on the sliced df for distribution comparison

xform(feature_to_exclude: Optional[List[str]] = None, start_idx: Optional[int] = None, stop_idx: Optional[int] = None)[source]

Core transformation function that converts the data into an input format suitable for Plotly graphs

Parameters
  • feature_to_exclude (List of str, optional) – A list of features to be excluded from the kl-div calculation and visualization

  • start_idx (int, optional) – Start index position for slicing the dataframe

  • stop_idx (int, optional) – Stop index position for slicing the dataframe

Returns

Dictionary storing the kl-divergence score for each feature in descending order

Return type

Dict or List(Dict)
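
A kl-divergence ranking of this kind can be sketched as follows, assuming each feature has already been reduced to two discrete probability vectors over the same bins. The function name kl_divergence, the smoothing constant eps and the feature names are hypothetical; rarity's own computation may differ.

```python
import math
from typing import Dict, List, Tuple


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-10) -> float:
    """KL(p || q) for discrete distributions; eps guards against division by zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Illustrative (reference, sliced) distributions per feature.
distributions: Dict[str, Tuple[List[float], List[float]]] = {
    "income": ([0.5, 0.5], [0.5, 0.5]),
    "age": ([0.5, 0.5], [0.9, 0.1]),
    "tenure": ([0.5, 0.5], [0.6, 0.4]),
}
scores = {name: kl_divergence(p, q) for name, (p, q) in distributions.items()}

# Arrange features from most to least divergent.
ranked = dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

Since Python dictionaries preserve insertion order, iterating over ranked yields the features in descending order of divergence.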

Int - Similarities (+CounterFactuals)

class rarity.interpreters.structured_data.IntSimilaritiesCounterFactuals(data_loader: Union[rarity.data_loader.data_loader.CSVDataLoader, rarity.data_loader.data_loader.DataframeLoader])[source]

Transform raw data into an input format suitable for visualizing Similarities / Counter-Factuals

Parameters

data_loader (CSVDataLoader or DataframeLoader) – Class object from data_loader module

_apply_standard_scale(df: pandas.core.frame.DataFrame)[source]

Apply standard scaling to the features
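
Standard scaling rescales a column to zero mean and unit variance. A minimal sketch, assuming the population standard deviation is used and that a constant column maps to zeros; standard_scale is a hypothetical helper, not the method itself.

```python
from statistics import fmean, pstdev
from typing import List


def standard_scale(column: List[float]) -> List[float]:
    """Return (x - mean) / std for each value, using the population standard deviation."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        # A constant column carries no spread; map it to zeros instead of dividing by zero.
        return [0.0] * len(column)
    return [(x - mean) / std for x in column]


# Illustrative column with mean 5 and population standard deviation 2.
scaled = standard_scale([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Scaling matters for distance-based similarity because features with large numeric ranges would otherwise dominate the distance.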

_get_categorical_features(df: pandas.core.frame.DataFrame)[source]

Identify categorical features

_get_ranking_and_distance_metrics(user_defined_idx: int, feature_to_exclude: List, top_n: int)[source]

Compute distance scores and generate a list of indices sorted by distance ranking

_label_encode_categorical_features(df: pandas.core.frame.DataFrame, categorical_cols: List)[source]

Fit-transform categorical features with LabelEncoder
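
Label encoding replaces each category with an integer code. The sketch below mimics that idea with the standard library, assuming codes are assigned in sorted order of the distinct categories; label_encode and the sample values are hypothetical.

```python
from typing import List, Tuple


def label_encode(values: List[str]) -> Tuple[List[int], List[str]]:
    """Map each category to an integer code based on sorted unique values (hypothetical helper)."""
    classes = sorted(set(values))
    mapping = {category: code for code, category in enumerate(classes)}
    return [mapping[v] for v in values], classes


# Illustrative categorical column.
codes, classes = label_encode(["red", "blue", "red", "green"])

# The codes can be mapped back to the original categories through classes.
decoded = [classes[c] for c in codes]
```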

xform(user_defined_idx: int, feature_to_exclude: Optional[List[str]] = None, top_n: int = 3)[source]

Core transformation function that converts the data into an input format suitable for Plotly graphs

Parameters
  • user_defined_idx (int) – Index of the data point of interest, specified by the user

  • feature_to_exclude (List of str, optional) – A list of features to be excluded from the ranking and similarities distance calculation

  • top_n (int) – Maximum number of records to display, based on the distance ranking

Returns

Outputs consisting of the following:

  • idx_for_top_n (List[int]): list of integer numbers indicating the ranking position in ascending order

  • calculated_distance (List[float]): list of calculated Euclidean distances

Note

If classification, returns:

Outputs consisting of the following:

  • df_viz (DataFrame): dataframe for the overview visualization, with true labels and predicted labels included

  • idx_for_top_n (List[int]): list of integer numbers indicating the ranking position in ascending order

  • calculated_distance (List[float]): list of calculated Euclidean distances
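
The ranking behind idx_for_top_n and calculated_distance can be sketched as a nearest-neighbour lookup on Euclidean distance. The example assumes numeric, already scaled feature rows and assumes the query point itself is left out of the ranking; top_n_similar and the sample rows are hypothetical and not rarity's implementation.

```python
import math
from typing import List, Tuple


def top_n_similar(
    rows: List[List[float]], user_defined_idx: int, top_n: int = 3
) -> Tuple[List[int], List[float]]:
    """Indices and Euclidean distances of the top_n rows closest to the chosen row."""
    target = rows[user_defined_idx]
    distances = [
        (math.dist(target, row), idx)
        for idx, row in enumerate(rows)
        if idx != user_defined_idx  # assumption: the query point is not ranked against itself
    ]
    distances.sort()  # ascending distance, ties broken by index
    nearest = distances[:top_n]
    return [idx for _, idx in nearest], [dist for dist, _ in nearest]


# Illustrative feature rows; row 0 is the data point of interest.
rows = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
idx_for_top_n, calculated_distance = top_n_similar(rows, user_defined_idx=0, top_n=3)
```

With these rows the three closest points to row 0 are rows 2, 3 and 1, in that order.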