Image by PxFuel

Pytolemaic — A Toolbox for Model Quality

A short intro to Pytolemaic package

Orion Talmi
Towards Data Science
6 min read · Feb 16, 2020


This blog post provides a short introduction to the Pytolemaic package (github) and its capabilities. The post covers the following components:

Model analysis techniques

  1. Feature sensitivity
  2. Scoring and confidence intervals
  3. Covariate shift measurement

Model’s predictions analysis

  1. Prediction’s uncertainty
  2. Lime explanations

Intro

Building a Machine Learning (ML) model is quite easy nowadays. However, building a good model still requires experience to avoid the many pitfalls along the way. Fortunately, there are several techniques that can be used to identify these pitfalls.

The Pytolemaic package uses such techniques to analyze ML models and measure their quality. Built with convenience in mind, the package has a simple interface that makes it easy to use.

The post is organized as follows:

  1. Getting Started — how to install the Pytolemaic package and initiate the PyTrust object.
  2. Feature Sensitivity — the feature sensitivity report and the vulnerability report.
  3. Scoring Report — divided into two parts: a scoring report for a regression task, including confidence intervals and covariate shift, followed by a similar section for classification tasks.
  4. Predictions’ uncertainty — an overview of the uncertainty methods supported by the package.
  5. Lime explanation — a section on the lime integration.

Getting started

The Pytolemaic package is built for supervised models (regression and classification) trained on structured data (e.g. the Titanic dataset is in scope, MNIST is not). The model is treated as a black box, so no additional information about the model is required. If a pre-processing phase (e.g. imputation) precedes the estimator, it needs to be encapsulated together with the estimator into a single prediction function, e.g. by using Sklearn’s Pipeline class, as sketched below.
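For illustration, a minimal sketch of such an encapsulation, assuming a simple mean-imputation step (the model choice and parameters here are arbitrary):

# Bundle pre-processing and the estimator into one object, so the whole
# pipeline can be handed to PyTrust as a single black-box model.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

model = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('estimator', RandomForestRegressor(n_estimators=100, random_state=0)),
])
model.fit(xtrain, ytrain)  # xtrain, ytrain: your training data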

The analysis itself is relatively light-weight. However, some analysis techniques require creating new predictions (e.g. Lime), a process that might be time-consuming and computationally intensive, depending on your model complexity and the size of your dataset.

The package is not built for heavy-lifting. Please let me know if it needs more muscle.

Installation

The Pytolemaic package is hosted on GitHub and available on PyPI, so installation is a single pip command:

pip install pytolemaic

Initiate PyTrust Object

To analyze the model, just initiate a PyTrust object with the following information:

  1. [Required] The trained model, the training set, a holdout testing set and the metric you are using to evaluate the model.
  2. [Optional] Class labels, splitting strategy
  3. [Optional] Columns’ metadata: e.g. feature names, feature types, feature encoding
  4. [Optional] Samples’ metadata: e.g. sample weights

While initiating the PyTrust object is quite simple, it is the most involved part of using the package. Thus, on first use, consider providing only the required information.

Let’s see how it’s done.

Example #0: Initiate PyTrust

Initiating Pytrust with California Housing dataset
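Roughly, the initialization looks like the sketch below; the import path and the PyTrust keyword names (model, xtrain, ytrain, xtest, ytest, metric) follow my reading of the package’s examples and may differ between versions:

# Sketch of initiating PyTrust on the California Housing data.
# The PyTrust keyword names below are indicative and may differ between versions.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from pytolemaic import PyTrust

data = fetch_california_housing()
xtrain, xtest, ytrain, ytest = train_test_split(
    data.data, data.target, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(xtrain, ytrain)

pytrust = PyTrust(
    model=model,
    xtrain=xtrain, ytrain=ytrain,
    xtest=xtest, ytest=ytest,
    metric='mae')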

Analysis reports

A. Feature sensitivity (FS)
The Pytolemaic package implements two variations of FS: sensitivity to shuffle and sensitivity to missing values. Calling pytrust.sensitivity_report() will calculate both and return a SensitivityFullReport object.

Note: If you are not familiar with the feature sensitivity method, see this great post.

There are 2 ways to retrieve the FS information:
1. API - sensitivity_report.to_dict() will export the report as a dictionary.
2. Graphically - sensitivity_report.plot() will plot any plottable information.

Examples
For this example, we’ll use a Random Forest regressor trained on the California Housing dataset (full example here). The California Housing dataset relates the characteristics of a district to the district’s median house value.

Example #1: feature sensitivity analysis

Exporting SensitivityFullReport as a dictionary
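The call itself is a one-liner (a minimal sketch, continuing with the pytrust object from Example #0):

from pprint import pprint

sensitivity_report = pytrust.sensitivity_report()
pprint(sensitivity_report.to_dict(), width=120)  # export the report as a dictionary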

As with most reports, some fields are not self-explanatory. To provide convenient documentation, the package offers a to_dict_meaning() function.

Example #2: Retrieve documentation for the dictionary fields:

Explanation of the items in the sensitivity report dictionary
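A sketch of the call that produces this documentation:

# Print a textual explanation of every field in the sensitivity report dictionary.
pprint(sensitivity_report.to_dict_meaning(), width=120)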

We saw the FS report by calling to_dict() and saw the documentation available through to_dict_meaning(). Now let’s see it graphically by calling plot().

Example #3: Creating graphs for feature sensitivity reports

Graphs created by .plot() functionality.
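A sketch of the plotting call (whether an explicit plt.show() is needed depends on your matplotlib backend):

import matplotlib.pyplot as plt

sensitivity_report.plot()  # plot all plottable information in the report
plt.show()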

Note: the functions to_dict(), to_dict_meaning(), and plot() are available in all Pytolemaic’s reports.

As you can see, there are 3 quality measurements in the feature sensitivity report:

  1. Imputation — quantifies the vulnerability to imputation via the discrepancy between the sensitivity to shuffle and the sensitivity to missing values.
  2. Leakage — estimates the chance of data leakage by comparing the sensitivity scores.
  3. Too_many_features — indicates whether too many features are used by counting the number of low-sensitivity features.

Note: The logic behind the vulnerability report will be explained in a separate post.

B. Scoring report for a regression task
With the same pytrust object as above, we call pytrust.scoring_report() to analyze the scoring quality and create a ScoringFullReport object.

As before, we will use a Random Forest regressor for the California Housing dataset.

Example #4: Create a scoring report

Scoring report information, regression task
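A minimal sketch of the call that produces this output:

scoring_report = pytrust.scoring_report()
pprint(scoring_report.to_dict(), width=120)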

Confidence intervals
The metric_scores field provides the model’s performance (value) for each metric, as well as the confidence interval limits (ci_low & ci_high). Additionally, it provides the ci_ratio, a dimensionless value that represents the uncertainty in the score calculation (lower is better). In most cases, the quality of the performance evaluation can be improved by enlarging the test set.
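For illustration, the confidence-interval fields could be read out of the exported dictionary along these lines (the exact nesting of the dictionary is an assumption and may differ between versions):

# Hypothetical traversal of the report dictionary - the nesting is assumed.
for metric_name, scores in scoring_report.to_dict()['metric_scores'].items():
    print('{}: value={:.3f}, CI=[{:.3f}, {:.3f}], ci_ratio={:.3f}'.format(
        metric_name, scores['value'], scores['ci_low'],
        scores['ci_high'], scores['ci_ratio']))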

Covariate shift
The separation_quality measures whether the train and test sets are drawn from the same distribution (a.k.a. covariate shift) using ROC-AUC. Since it is a ‘quality’ measure, higher values are better. Further explanation can be found here and here.
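The underlying idea can be sketched as follows (this is the general technique, not the package’s exact implementation): label train rows 0 and test rows 1, train a classifier to tell them apart, and look at its ROC-AUC. An AUC close to 0.5 means the two sets are practically indistinguishable, i.e. there is no noticeable covariate shift.

# Conceptual sketch: measure how well a classifier can separate train from test rows.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

x_all = np.vstack([xtrain, xtest])
origin = np.concatenate([np.zeros(len(xtrain)), np.ones(len(xtest))]).astype(int)

auc = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=0),
                      x_all, origin, cv=3, scoring='roc_auc').mean()
print('train/test discriminator ROC-AUC: {:.3f}'.format(auc))  # ~0.5 means no shift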

As before, creating graphs for the scoring report is done with .plot()

Example #5: Creating graphs for the scoring report

Scoring report graphs, regression task

As can be seen, the scatter plot contains error bars. These error bars represent the uncertainty of the model’s prediction. More on the uncertainty calculations in the model’s prediction analysis section.

C. Scoring report for a classification task
The scoring report for classification tasks has the same structure but provides different information.

For this example, we will use a Random Forest classifier trained on UCI’s Adult dataset.

Example #6: Create the scoring report

Scoring report information, classification task
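In code, the setup is analogous to the regression case; here xtrain, ytrain, xtest, ytest are assumed to be a numerically encoded train/test split of the Adult dataset, and the metric name is an illustrative assumption:

# Sketch of the classification setup (keyword names and metric are indicative).
from pprint import pprint
from sklearn.ensemble import RandomForestClassifier
from pytolemaic import PyTrust

classifier = RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(xtrain, ytrain)  # Adult data, assumed already encoded numerically

pytrust = PyTrust(
    model=classifier,
    xtrain=xtrain, ytrain=ytrain,
    xtest=xtest, ytest=ytest,
    metric='recall')

pprint(pytrust.scoring_report().to_dict(), width=120)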

Note: in this dataset, the train and test sets have different distributions. In real-life datasets, such a low value would be a source of concern.

As before, we can plot some graphs.

Example #7: Creating graphs for scoring report

Scoring report graphs, classification task

Model’s predictions analysis

A. Uncertainty of predictions
The Pytolemaic package can provide an estimate of the uncertainty in the model’s predictions. For a regression task, the uncertainty value represents an error bar, on the same scale as the target variable. For a classification task, the uncertainty value represents how unsure the model is about its prediction, on a scale from 0 (maximal confidence) to 1 (no confidence).

The package supports several techniques, as listed below. Thus, the exact meaning of the uncertainty value depends on the method used.

Regression:
* MAE: an estimation of the absolute error based on a regressor trained on the absolute error of the test set predictions.
* RMSE: an estimation of the absolute error based on a regressor trained on the squared error of the test set predictions.

Classification:
* Probability: an uncertainty measure based on the ratio between the probability values of the 1st and 2nd most probable classes.
* Confidence: an uncertainty measure based on a classifier trained on test set predictions.

Examples
For the uncertainty examples, we will use the Adult dataset as before. However, this time we will initiate the PyTrust object with only half of the test set, and use the other half (let’s call it the prediction set) to see how the uncertainty measurement relates to the prediction errors.

Example #8: Calculating uncertainty based on ‘confidence’

Full code here
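A sketch of the call, assuming the pytrust object was initiated with only half of the Adult test set as described above; the method names here reflect my reading of the package’s examples and may differ from the current API:

# Sketch of uncertainty estimation on the held-back 'prediction set' (x_pred).
# Method names are an assumption - see the linked full code for the exact API.
uncertainty_model = pytrust.create_uncertainty_model(method='confidence')
uncertainty = uncertainty_model.uncertainty(x_pred)  # values in [0, 1]; higher = less certain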

We expect samples with higher uncertainty to have a higher chance of being classified incorrectly. We will now verify this by binning the samples of the prediction set according to their uncertainty and then measuring the recall for the samples in each bin, as sketched below.
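A sketch of this binning step, assuming the labels are encoded as 0/1 and y_true holds the true labels of the prediction set:

# Bin the prediction-set samples by uncertainty and compute the recall per bin.
import numpy as np
from sklearn.metrics import recall_score

bins = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
y_pred = classifier.predict(x_pred)
bin_index = np.digitize(uncertainty.ravel(), bins[1:-1])  # bin id 0..4 per sample

for i in range(len(bins) - 1):
    mask = bin_index == i
    if mask.any():
        print('uncertainty {:.1f}-{:.1f}: recall={:.3f} ({} samples)'.format(
            bins[i], bins[i + 1], recall_score(y_true[mask], y_pred[mask]), mask.sum()))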

Following this process (code here), we obtain the following graph, which behaves just as expected.

The recall score for the entire dataset is 0.762. Bins are [0–0.2, 0.2–0.4, 0.4–0.6, 0.6–0.8, 0.8–1.0]

B. Lime explanation

Lime is a well-known method for explaining a model’s predictions. The Pytolemaic package essentially wraps lime’s functionality while improving it in two significant ways (the first is sketched below):

  1. Introducing a simple imputation to overcome lime’s vulnerability to missing values.
  2. Introducing a convergence mechanism to overcome lime’s sensitivity to the generated samples.
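To give a feel for the first improvement, here is a conceptual sketch (not the package’s internal code) of imputing missing values before handing a sample to lime’s tabular explainer; feature_names is a placeholder, and classifier, xtrain and x_pred are assumed from the earlier examples:

# Conceptual sketch of improvement #1: impute missing values before calling lime,
# since LimeTabularExplainer does not handle NaNs on its own.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean').fit(xtrain)

explainer = LimeTabularExplainer(imputer.transform(xtrain),
                                 mode='classification',
                                 feature_names=feature_names)

sample = imputer.transform(x_pred[[0]])[0]  # a single, imputed sample to explain
explanation = explainer.explain_instance(sample, classifier.predict_proba, num_features=5)
print(explanation.as_list())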

Summary

The package implements techniques that help verify the model works as expected. The package is built to be easy-to-use and aims to be used during the model building phase, so give it a go and let me know what you think.

In future posts, I will elaborate more on the logic behind the various quality measurements and how the package can help you to identify errors.

Future work will focus on the model’s predictions (explanation and uncertainty) and on measuring the dataset’s quality.

I hope you’ve enjoyed this post and that you’ve found Pytolemaic package interesting.
