
A simple yet useful data visualization library for your EDA

Visualize the relationship between a dependent variable and any feature in a meaningful way


Exploratory Data Analysis, or EDA, is particularly relevant for projects that involve tabular data. During EDA, the cautious data scientist generally tries to assess the quality of their datasets, for instance by looking for outliers or by checking whether certain columns have an abnormal number of missing values. EDA is also the ideal stage to get a first idea of the bivariate relation between the predictors and the dependent variable.

For continuous data, the good old linear correlation coefficient generally does a decent job and quickly indicates which features are likely to be the most predictive. However, things get a bit more complicated with categorical data.

To my surprise, I was not able to find a Python library* that allows you to quickly understand the relationship between a target and any predictor, irrespective of their type (continuous or categorical). I have therefore decided to write my own tool with the following requirements in mind:

  • Provide a unified interface irrespective of the data type
  • Generate visual output that is easy to communicate to a non-technical audience (like business stakeholders)

This post gives a tour of the main functionalities of the library. The library is named tprojection. It is available on GitHub and can be easily installed with pip:

pip install tprojection

I will use the popular Titanic dataset to illustrate the main functionalities of the library.

Case 1: both the target and the predictor are categorical

When building a predictive model, you’re generally looking for features that allow you to define segments in which the target behaves very differently from one segment to the next. Those are good predictors.

When both the target and the predictor are categorical, you can assess the quality of your predictor by projecting the target on each modality. This can be done with a single pandas statement:

df.groupby(predictor).agg({target: ["mean", "count"]})
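
To make the statement above concrete, here is a minimal sketch on a tiny hand-made DataFrame. The six rows are invented for illustration only and are not taken from the Titanic dataset.

```python
import pandas as pd

# Toy data, invented for illustration only (not the Titanic dataset).
df = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "male"],
    "survived": [0, 0, 1, 1, 1, 0],
})

target = "survived"
predictor = "sex"

# Conditional mean of the target and number of rows for each modality.
projection = df.groupby(predictor).agg({target: ["mean", "count"]})
print(projection)
```

The result has one row per modality and two columns under the target name: the conditional mean and the number of observations.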

If we apply this command to the predictor "sex" and the target "survived" from the Titanic dataset, we obtain the survival rate and the number of passengers for each sex.

From this output, it is pretty clear that men are less likely to survive than women. In its most basic use, the library tprojection will simply provide a visual way to carry the information above.

This chart represents the relation between the target "survived" and the predictor "sex". The blue bars denote the number of observations for each modality (left y-axis). The red line represents the average value of the target for each modality (right y-axis). The black dashed line represents the average value of the target over the whole dataset (right y-axis). The red shaded area is a confidence interval estimated by bootstrapping the data.

The chart above can be generated with a few lines of code:

from tprojection import Tprojection
from tprojection.datasets import load_data
df = load_data("titanic")
target = "survived"
predictor = "sex"
tproj = Tprojection(df, target, predictor, target_type="categorical", feature_type="categorical", n_estimators=100)
tproj.plot()

First, an instance of Tprojection is created with the desired arguments and options. Then, the method plot is called and draws the chart. A full description of the plot is provided in the caption. Note that the figure axes are stored as attributes (tproj.ax1 and tproj.ax2), so you can easily change the properties of the chart.

The confidence interval (red shaded area) is particularly useful when the number of observations gets small. In this case, we might observe large deviations of the conditional average from the baseline probability, suggesting that the modality is good for segmenting the target. However, this interpretation has to be taken with care when the confidence interval is large. A large interval means that the relation between the predictor and the target varies significantly across the bootstrapped samples, which indicates a substantial risk of overfitting.

This effect can be clearly seen in the chart below, which analyses the dependence of the target on the number of parents/children aboard (parch). For modalities parch = 3 and parch = 5, we observe that the survival rate is relatively high while the associated confidence intervals span both sides of the baseline probability.

Relationship between the target "survived" and the predictor "parch".

Interestingly, the width of the confidence interval associated with parch = 4 and parch = 6 is equal to zero even though the corresponding number of observations is small. This happens when every observation of a modality has the same target value, so that each bootstrapped sample yields the same conditional average. In other words, there are no survivors in the passenger group with 4 or 6 parents/children aboard.
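
The behaviour of the confidence interval can be reproduced with a rough sketch of a percentile bootstrap. This is my own simplified illustration, not the code used inside tprojection: the function name bootstrap_ci, the alpha parameter and the toy data are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def bootstrap_ci(df, target, predictor, n_estimators=100, alpha=0.05, seed=0):
    """Percentile bootstrap interval of the target mean for each modality.

    Hypothetical helper written for this post, not part of tprojection.
    """
    rng = np.random.default_rng(seed)
    means = []
    for _ in range(n_estimators):
        # Resample rows with replacement, then recompute the conditional means.
        idx = rng.integers(0, len(df), size=len(df))
        sample = df.iloc[idx]
        means.append(sample.groupby(predictor)[target].mean())
    # Rows: bootstrap replicates, columns: modalities
    # (NaN when a modality was not drawn in a replicate).
    means = pd.concat(means, axis=1).T
    lower = means.quantile(alpha / 2)
    upper = means.quantile(1 - alpha / 2)
    return pd.DataFrame({"lower": lower, "upper": upper})

# Toy data: a large mixed group, a small mixed group, a small constant group.
toy = pd.DataFrame({
    "group": ["a"] * 200 + ["b"] * 4 + ["c"] * 4,
    "survived": [0, 1] * 100 + [1, 1, 0, 1] + [0, 0, 0, 0],
})
ci = bootstrap_ci(toy, "survived", "group")
ci["width"] = ci["upper"] - ci["lower"]
print(ci)
```

The small mixed group "b" gets a much wider interval than the large group "a", while the constant group "c" gets a zero-width interval because every resample of it has the same mean.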

The library can also deal with high-cardinality predictors. In this situation, displaying the full list of modalities can make the chart hard to read. To address this point, you can pass an optional argument to Tprojection (nb_modalities in the snippet below) that caps the number of displayed values by bucketizing the modalities.

tproj = Tprojection(df, target, "cabin", target_type="categorical", feature_type="categorical", n_estimators=100, nb_modalities=10)
tproj.plot()
Relationship between the predictor "cabin" and the target "survived". The left plot shows the raw modalities while the right plot shows the bucketized modalities.

The charts above clearly show the benefit of the approach. The left plot, which displays the relationship between the predictor "cabin" and the target for the raw modalities, is impossible to read. The right plot is much clearer since it only shows a handful of encoded modalities. The mapping between the original and encoded modalities can be accessed through an attribute:

print(tproj.encoding)

Tprojection will try to keep modalities with a sufficient number of observations "as-is" while the other modalities are grouped. It aims at building the requested number of groups, each containing approximately the same number of observations. However, this is not always possible, especially if the distribution of the modalities is strongly skewed, which is often the case in practice. In the example above, we initially requested 10 buckets but ended up with only 3. This happens because the modality cabin = nan makes up almost 80% of the observations. As a consequence, the remaining observations are spread over 3 buckets only. The buckets g1 and g2 contain about 10% of the observations each, and the rest is assigned to g3.
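
The general idea can be sketched with a simple greedy grouping. This is only my illustration of the principle, under the assumption that a modality is "frequent enough" when it fills at least one ideal bucket. The function bucketize and its threshold rule are hypothetical and do not reproduce the exact algorithm of the library.

```python
import pandas as pd

def bucketize(series, nb_buckets=10):
    """Keep frequent modalities as-is and group the rare ones into g1, g2, ...

    Hypothetical helper written for this post, not part of tprojection.
    """
    # Treat missing values as a modality of their own.
    values = series.astype(object).where(series.notna(), "nan")
    counts = values.value_counts()
    threshold = len(values) / nb_buckets  # size of an ideal bucket

    encoding = {}
    rare = []
    for modality, count in counts.items():
        if count >= threshold:
            encoding[modality] = modality  # frequent enough: kept as-is
        else:
            rare.append((modality, count))

    # Greedily fill a group until it reaches the ideal size, then open a new one.
    group_id, filled = 1, 0
    for modality, count in rare:
        encoding[modality] = f"g{group_id}"
        filled += count
        if filled >= threshold:
            group_id, filled = group_id + 1, 0
    return values.map(encoding), encoding

# Toy column dominated by missing values, invented for illustration.
cabin = pd.Series([None] * 16 + ["C85", "E46", "B42", "D33"])
encoded, encoding = bucketize(cabin, nb_buckets=10)
print(encoded.value_counts())
```

Even though 10 buckets are requested, the toy column ends up with only three encoded values (nan, g1 and g2) because one modality dominates the distribution.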

Case 2: the target is continuous while the predictor is categorical

This is somewhat a variation of the first case. In addition to the target mean value, I have added a boxplot depicting the distribution of the target over each modality. The computation and display of the confidence interval have been disabled to keep the viz readable.

tproj = Tprojection(df, "fare", "cabin", nb_modalities=10)
tproj.plot()
Relationship between the predictor "cabin" (bucketized) and the target "fare".

In the snippet above, the options target_type and feature_type are not specified. In this case, the library uses a rule of thumb to assess whether the variables are continuous or categorical.
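
I will not detail the exact rule here, but a heuristic of this kind could look like the sketch below. The function guess_type and the max_categories cut-off of 20 are assumptions chosen for the example, not the values used by tprojection.

```python
import pandas as pd

def guess_type(series, max_categories=20):
    """Hypothetical rule of thumb: non-numeric or low-cardinality means categorical."""
    if not pd.api.types.is_numeric_dtype(series):
        return "categorical"
    if series.nunique(dropna=True) <= max_categories:
        return "categorical"
    return "continuous"

print(guess_type(pd.Series(["S", "C", "Q", "S"])))   # strings
print(guess_type(pd.Series([0, 1, 0, 1])))           # numeric, 2 distinct values
print(guess_type(pd.Series(range(100)) * 0.5))       # numeric, 100 distinct values
```

Whatever the rule, passing the types explicitly remains the safest option when a numeric column with few distinct values should be treated as continuous.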

Case 3: the target is categorical while the predictor is continuous

In this scenario, we simply compare the distribution of the predictor conditioned on the target value. To facilitate the comparison, the histograms are normalized. The count for each target value is provided in the legend.

tproj = Tprojection(df, "survived", "fare")
tproj.plot()
Relationship between the predictor "fare" and the target "survived".
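
To see why normalization matters when the classes have very different sizes, here is a small numpy sketch. The data is synthetic and the variable names are mine. It only illustrates the normalization step and does not draw the chart.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic predictor values for two target classes of very different sizes.
fare_class0 = rng.exponential(scale=20.0, size=500)
fare_class1 = rng.exponential(scale=40.0, size=100)

# Shared bin edges so that both histograms are directly comparable.
edges = np.histogram_bin_edges(np.concatenate([fare_class0, fare_class1]), bins=20)

# density=True rescales each histogram so that it integrates to 1.
dens0, _ = np.histogram(fare_class0, bins=edges, density=True)
dens1, _ = np.histogram(fare_class1, bins=edges, density=True)

# Area under each histogram: both values should be close to 1.
widths = np.diff(edges)
area0 = (dens0 * widths).sum()
area1 = (dens1 * widths).sum()
print(area0, area1)
```

With raw counts, the larger class would dominate every bin. Once both histograms integrate to 1, only the difference in shape remains visible.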

By default, the positive class corresponds to the minority class for a binary target. You can change this behavior by specifying the desired value with the optional target_modality argument. At this stage, tprojection cannot genuinely cope with multiclass problems, but note that you can mimic a one-against-all approach by using the target_modality kwarg.
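
For readers who are not familiar with the one-against-all trick, this is what it amounts to when done by hand with pandas. The helper one_vs_all and the toy column are mine and only illustrate the idea. They say nothing about how target_modality is implemented.

```python
import pandas as pd

# Toy multiclass target, invented for illustration.
df = pd.DataFrame({"pclass": [1, 2, 3, 3, 1, 3]})

def one_vs_all(target, positive_value):
    """Return a 0/1 series: 1 where the target equals positive_value, 0 elsewhere."""
    return (target == positive_value).astype(int)

# One binary target per class: each one can be analysed like a binary problem.
for value in sorted(df["pclass"].unique()):
    binary = one_vs_all(df["pclass"], value)
    print(value, binary.tolist())
```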

Case 4: both the target and the predictor are continuous

This case was included for the sake of being exhaustive but it is certainly not where tprojection brings the most value. The library just displays a scatter plot of the two variables together with the linear correlation coefficient. The best regression line derived from the seaborn regplot method is shown as well. There are no additional special features since there are many good tools out there that provide advanced functionalities to analyze the correlation between two continuous variables.

tproj = Tprojection(df, "fare", "age")
tproj.plot()
Relationship between the predictor "age" and the target "fare".
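
For completeness, the two quantities shown on this chart can be computed in a couple of lines with numpy. The sketch below uses synthetic data and plain least squares. It is a stand-in for what seaborn draws, not the code of the library.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 80, size=200)             # synthetic predictor
y = 0.5 * x + rng.normal(0, 10, size=200)    # synthetic target with noise

# Pearson linear correlation coefficient.
r = np.corrcoef(x, y)[0, 1]

# Least-squares regression line y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"r={r:.2f}, slope={slope:.2f}, intercept={intercept:.2f}")
```

The two are closely related: the slope of the regression line equals the correlation coefficient multiplied by the ratio of the standard deviations of y and x.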

That’s all folks! I hope you will find this simple lib useful. Feel free to fork the repo, hack the code, and adapt it to your needs.

*Disclaimer: I discovered the great sweetviz library just before releasing this post. Even though sweetviz includes similar functionalities and much more, I still think that tprojection brings interesting features, especially the built-in estimation of the confidence interval and the way it copes with high-cardinality predictors.

