Interpreters¶
Modules under Interpreters carry out all necessary input transformations on the data supplied by the data loader.
The level of transformation depends on the feature it is used for. The outputs from interpreters can either serve as direct inputs to visualizers
or undergo secondary processing before being passed to a visualizer for graphing.
Int - General Metrics¶
- class rarity.interpreters.structured_data.IntGeneralMetrics(data_loader: Union[rarity.data_loader.data_loader.CSVDataLoader, rarity.data_loader.data_loader.DataframeLoader], viz_plot: Optional[str] = None)[source]¶
Transform raw data into an input format suitable for visualization. General metrics cover the confusion matrix, classification report, ROC curve and precision-recall curve.
- Parameters
data_loader (CSVDataLoader or DataframeLoader) – Class object from data_loader module
viz_plot (str) – Supported visualization types: confMat, classRpt, rocAUC, preRacall, stdErr, None
Important Attributes:
- analysis_type (str):
Analysis type defined by user during initial inputs preparation via data_loader stage.
- Returns
Dataframe with essential info suitable for visualization on regression task
- Return type
DataFrame
Note
If classification, returns:
yTrue data in Series
yPred data in Series for [confMat, classRpt] or Dataframe for [rocAuc, precRecall]
If multiclass, returns:
yPred data in List[List[Tuple]] pairing class label and yPred in Series
model_names in List[str]
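As a rough illustration of the kind of data behind the confMat view, the sketch below counts (true, predicted) label pairs in plain Python. The function name `confusion_counts` and the sample labels are hypothetical and are not part of rarity; the library's own implementation may differ.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def confusion_counts(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Tuple[List[Hashable], List[List[int]]]:
    """Return sorted labels and a matrix with true labels as rows, predicted labels as columns."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    # Count how often each (true, predicted) pair occurs
    pair_counts = Counter(zip(y_true, y_pred))
    matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


# Hypothetical sample data, for illustration only
y_true = ["cat", "dog", "dog", "cat", "bird"]
y_pred = ["cat", "dog", "cat", "cat", "dog"]
labels, matrix = confusion_counts(y_true, y_pred)
```

Diagonal cells hold correct predictions; every off-diagonal cell is a miss-prediction for the label of its row.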
Int - Miss Predictions¶
- class rarity.interpreters.structured_data.IntMissPredictions(data_loader: Union[rarity.data_loader.data_loader.CSVDataLoader, rarity.data_loader.data_loader.DataframeLoader])[source]¶
Transform raw data into input format suitable for visualization on miss-prediction points
- Parameters
data_loader (CSVDataLoader or DataframeLoader) – Class object from data_loader module
- Returns
Dataframe with essential info suitable for visualization on regression task
- Return type
DataFrame
Note
If classification, returns:
Compact outputs consisting of the following:
ls_dfs_viz (List[pd.DataFrame]): list of dataframes for overview visualization
ls_class_labels (List[str]): list of class labels
ls_dfs_by_label (List[pd.DataFrame]): list of dataframes by individual label class
ls_dfs_by_label_state (List[pd.DataFrame]): list of dataframes storing basic stats of each label class
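The per-label outputs can be pictured as grouping records by their true label and tallying the prediction state. The following is a minimal sketch under that assumption, using plain lists instead of dataframes; `split_by_label` and the state names are hypothetical and not taken from rarity.

```python
from collections import defaultdict
from typing import Dict, List


def split_by_label(y_true: List[str], y_pred: List[str]) -> Dict[str, Dict[str, int]]:
    """Count correct and miss-predicted records for each true label."""
    stats: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"correct": 0, "miss-predict": 0}
    )
    for true_label, pred_label in zip(y_true, y_pred):
        state = "correct" if true_label == pred_label else "miss-predict"
        stats[true_label][state] += 1
    return dict(stats)


# Hypothetical sample data, for illustration only
y_true = ["a", "a", "b", "b", "b"]
y_pred = ["a", "b", "b", "b", "a"]
label_stats = split_by_label(y_true, y_pred)
```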
Int - Loss Clusters¶
- class rarity.interpreters.structured_data.IntLossClusterer(data_loader: Union[rarity.data_loader.data_loader.CSVDataLoader, rarity.data_loader.data_loader.DataframeLoader])[source]¶
Transform raw data into input format suitable for visualization on loss clusters
- Parameters
data_loader (CSVDataLoader or DataframeLoader) – Class object from data_loader module
- extract_misspredictions()[source]¶
Function to extract a list of dataframes with prediction state info included
- xform(num_cluster: int, log_func: math.log, specific_dataset: str)[source]¶
Core transformation function that converts data into an input format suitable for plotly graphs
- Parameters
num_cluster (int) – Number of clusters to form
log_func (math.log) – Mathematical logarithm function used to calculate log-loss between yTrue and yPred
specific_dataset (str) – Defaults to ‘All’, which includes all miss-predicted labels. Other options depend on the class labels
- Returns
Compact outputs consisting of the following:
df (DataFrame): dataframe for overview visualization with offset values included
ls_score (List[float]): list of silhouette scores, an indication of clustering quality
ls_cluster_range (List[List[int]]): list of lists containing the cluster number range from 1 to 10
ls_ssd (List[float]): sum of squared distances generated via kmean_inertia
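To make the ls_ssd entry more concrete, here is a hedged sketch of what a sum of squared distances measures for a fixed cluster assignment. It works on one-dimensional values in plain Python and does not run k-means itself; `sum_squared_distance` and the sample values are hypothetical, and rarity obtains the figure from k-means inertia rather than from code like this.

```python
from typing import Dict, List, Sequence


def sum_squared_distance(values: Sequence[float], assignments: Sequence[int]) -> float:
    """Sum of squared distances from each value to the mean of its assigned cluster."""
    clusters: Dict[int, List[float]] = {}
    for value, cluster_id in zip(values, assignments):
        clusters.setdefault(cluster_id, []).append(value)
    ssd = 0.0
    for members in clusters.values():
        centroid = sum(members) / len(members)
        ssd += sum((value - centroid) ** 2 for value in members)
    return ssd


# Hypothetical loss values, grouped first into one cluster and then into two
values = [1.0, 2.0, 3.0, 10.0, 12.0]
one_cluster = sum_squared_distance(values, [0, 0, 0, 0, 0])
two_clusters = sum_squared_distance(values, [0, 0, 0, 1, 1])
```

A sharp drop in this value as the cluster count grows, followed by a plateau, is the usual cue for choosing the number of clusters.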
Note
If classification, returns:
Compact outputs consisting of the following:
ls_dfs_viz (List[pd.DataFrame]): dataframes for overview visualization with offset values included
ls_class_labels (List[str]): list of all class labels
ls_class_labels_misspred (List[str]): list of class labels with at least 1 miss-prediction
ls_score (List[float]): list of silhouette scores, an indication of clustering quality
ls_cluster_range (List[List[int]]): list of lists containing the cluster number range from 1 to 10
ls_ssd (List[float]): sum of squared distances generated via kmean_inertia
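The logarithm parameter of xform is described as the function used to calculate log-loss between yTrue and yPred. The sketch below assumes the standard per-sample definition, the negative log of the probability predicted for the true class. Whether rarity uses exactly this form is an assumption, and `per_sample_log_loss` and the clipping value `eps` are hypothetical additions.

```python
import math
from typing import Callable, Dict


def per_sample_log_loss(
    true_label: str,
    predicted_probs: Dict[str, float],
    log_func: Callable[[float], float] = math.log,
    eps: float = 1e-15,
) -> float:
    """Negative log of the probability assigned to the true label."""
    p = predicted_probs[true_label]
    # Clip to avoid log(0) when a model outputs a hard 0 or 1 (assumed safeguard)
    p = min(max(p, eps), 1.0 - eps)
    return -log_func(p)


# Hypothetical predictions for a sample whose true label is "a"
confident = per_sample_log_loss("a", {"a": 0.9, "b": 0.1})
wrong = per_sample_log_loss("a", {"a": 0.2, "b": 0.8})
```

Swapping `math.log` for `math.log2` or `math.log10` changes only the scale of the loss, not the ordering of samples.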
Int - xFeature Distribution¶
- class rarity.interpreters.structured_data.IntFeatureDistribution(data_loader: Union[rarity.data_loader.data_loader.CSVDataLoader, rarity.data_loader.data_loader.DataframeLoader])[source]¶
Transform raw data into input format suitable for visualization on feature distribution
- Parameters
data_loader (CSVDataLoader or DataframeLoader) – Class object from data_loader module
- _generate_kl_div_info_base(df: pandas.core.frame.DataFrame, feature_to_exclude: List)[source]¶
Function to generate dictionary-like output storing the kl-divergence score for each feature, arranged in descending order.
- _get_df_feature_with_pred_state_cls(df_overall: pandas.core.frame.DataFrame)[source]¶
For classification task only. Function to build a customized df combining features and relevant prediction info for use in visualization.
- _get_probabilities_by_bin_group(df_viz: pandas.core.frame.DataFrame, bin_count: int)[source]¶
For regression task only. Function to build a customized df for ease of getting probabilities based on bin group for the reference df and the sliced df
- _get_probabilities_by_feature(df_viz: pandas.core.frame.DataFrame, specific_feature: str)[source]¶
For classification task only. Function to calculate probabilities of correct vs miss-predict for specific feature
- _get_single_feature_df_with_binning(df: pandas.core.frame.DataFrame, feature: str)[source]¶
For regression task only. Function to find optimum bin-size on sliced df for distribution comparison
- xform(feature_to_exclude: Optional[List[str]] = None, start_idx: Optional[int] = None, stop_idx: Optional[int] = None)[source]¶
Core transformation function that converts data into an input format suitable for plotly graphs
- Parameters
feature_to_exclude (List of str, optional) – A list of features to be excluded from the kl-div calculation and visualization
start_idx (int, optional) – Integer number indicating the start index position to slice dataframe
stop_idx (int, optional) – Integer number indicating the stop index position to slice dataframe
- Returns
Dictionary storing the kl-divergence score for each feature in descending order
- Return type
Dict or List(Dict)
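For readers unfamiliar with the score, the sketch below computes a discrete KL divergence between two binned distributions and orders features by it, highest first. It is an illustration only: the feature names, the bin probabilities, the `eps` guard and both function names are hypothetical, and which distribution rarity treats as the reference is an assumption.

```python
import math
from typing import Dict, List, Tuple


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """Discrete KL divergence D(P || Q) = sum_i p_i * log(p_i / q_i)."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:  # bins with p_i == 0 contribute nothing
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


def rank_features(
    binned: Dict[str, Tuple[List[float], List[float]]]
) -> Dict[str, float]:
    """Score every feature and return the scores sorted in descending order."""
    scores = {name: kl_divergence(p, q) for name, (p, q) in binned.items()}
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Hypothetical bin probabilities: (sliced distribution, reference distribution)
binned = {
    "age": ([0.5, 0.5], [0.5, 0.5]),
    "income": ([0.9, 0.1], [0.5, 0.5]),
}
ranked = rank_features(binned)
```

A feature whose sliced distribution matches the reference scores zero, so it sinks to the bottom of the ranking.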
Int - Similarities (+CounterFactuals)¶
- class rarity.interpreters.structured_data.IntSimilaritiesCounterFactuals(data_loader: Union[rarity.data_loader.data_loader.CSVDataLoader, rarity.data_loader.data_loader.DataframeLoader])[source]¶
Transform raw data into input format suitable for visualization on Similarities / Counter-Factuals
- Parameters
data_loader (CSVDataLoader or DataframeLoader) – Class object from data_loader module
- _get_ranking_and_distance_metrics(user_defined_idx: int, feature_to_exclude: List, top_n: int)[source]¶
Compute distance scores and generate index list sorted by distance ranking
- _label_encode_categorical_features(df: pandas.core.frame.DataFrame, categorical_cols: List)[source]¶
Fit-transform categorical features with LabelEncoder
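Label encoding replaces each category with an integer code so that distances can be computed on numeric values. The plain-Python sketch below mimics the sorted-class ordering that scikit-learn's LabelEncoder uses; `label_encode` and the sample values are hypothetical stand-ins, not rarity code.

```python
from typing import Dict, List, Tuple


def label_encode(values: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map each distinct category to an integer code, in sorted category order."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping


# Hypothetical categorical column
codes, mapping = label_encode(["red", "blue", "red", "green"])
```

The codes are arbitrary identifiers, so a distance between two encoded categories carries no real ordinal meaning.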
- xform(user_defined_idx: int, feature_to_exclude: Optional[List[str]] = None, top_n: int = 3)[source]¶
Core transformation function that converts data into an input format suitable for plotly graphs
- Parameters
user_defined_idx (int) – Index of the data point of interest specified by user
feature_to_exclude (List of str, optional) – A list of features to be excluded from the ranking and similarities distance calculation
top_n (int) – Number indicating the max limit of records to be displayed based on the distance ranking
- Returns
Outputs consisting of the following:
idx_for_top_n (List[int]): list of integer numbers indicating the ranking position in ascending order
calculated_distance (List[float]): list of calculated euclidean_distances
Note
If classification, returns:
Outputs consisting of the following:
df_viz (DataFrame): dataframe for overview visualization with true labels and predicted labels included
idx_for_top_n (List[int]): list of integer numbers indicating the ranking position in ascending order
calculated_distance (List[float]): list of calculated euclidean_distances
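The ranking idea can be sketched as follows: compute the Euclidean distance from the chosen data point to every other record, sort in ascending order, and keep the closest top_n. This is a minimal sketch on plain lists; `rank_by_distance` is a hypothetical name, and excluding the query point from its own ranking is an assumption rather than documented rarity behaviour.

```python
import math
from typing import List, Sequence, Tuple


def rank_by_distance(
    rows: Sequence[Sequence[float]], query_idx: int, top_n: int = 3
) -> Tuple[List[int], List[float]]:
    """Return indices and distances of the top_n rows closest to rows[query_idx]."""
    query = rows[query_idx]
    scored = [
        (idx, math.dist(query, row))  # math.dist requires Python 3.8+
        for idx, row in enumerate(rows)
        if idx != query_idx  # assumed: the query point is not ranked against itself
    ]
    scored.sort(key=lambda pair: pair[1])
    top = scored[:top_n]
    return [idx for idx, _ in top], [dist for _, dist in top]


# Hypothetical numeric feature rows; row 0 is the data point of interest
rows = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
idx_for_top_n, calculated_distance = rank_by_distance(rows, query_idx=0, top_n=2)
```

Features on very different scales would dominate such a distance, so scaling the columns first is usually worth considering.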