mlmachine - Clean ML Experiments, Elegant EDA & Pandas Pipelines

This new Python package accelerates notebook-based machine learning experimentation

TL;DR

mlmachine is a Python library that organizes and accelerates notebook-based machine learning experiments.

In this article, we use mlmachine to accomplish actions that would otherwise take considerable coding and effort, including:

  • Data intake & feature type identification
  • Easy, elegant exploratory data analysis
  • Pandas-in / Pandas-out pipelines

Check out the Jupyter Notebook for this article.

Check out the project on GitHub.

Check out other entries in the mlmachine series.

What is mlmachine?

Notebooks often serve as scratch paper for data scientists. Machine learning experiments tend to become messy, disjointed series of hard-coded blocks. Even when time is taken to write general-purpose functions, those functions live in isolation, locked away from new projects.

mlmachine is a Python package that facilitates clean and organized notebook-based machine learning experimentation and accomplishes many key aspects of the experimentation life cycle.

The central hub of mlmachine is the Machine() class. A Machine() object retains the dataset, the target data and feature metadata. More importantly, a Machine() object has numerous built-in methods for quickly executing key parts of the machine learning experiment workflow.

Here are a few of the core areas of mlmachine’s functionality that we will explore in detail:

  1. Data intake & mlm_dtype identification
  2. Exploratory data analysis - a single method call creates a full panel of visualizations and data summaries for a feature
  3. Pandas-friendly transformers and pipelines - simply wrapping the mlmachine utility PandasTransformer() around OneHotEncoder() preserves our DataFrame

mlmachine contains an immense amount of functionality aimed at saving time and improving model performance, all while keeping our workflow clean and organized.

Let’s get started.

The Machine Class - A Hub With Many Spokes

Data Intake

We begin by instantiating a Machine() object:
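
In code, the instantiation looks roughly like this. Treat it as a hedged sketch: parameter names such as target, is_classification, identify_as_continuous, identify_as_nominal and ordinal_encodings follow this article’s descriptions rather than a confirmed signature, and the feature lists are abbreviated for illustration.

    import pandas as pd
    from mlmachine import Machine

    # load the full Titanic dataset, including the target column
    df = pd.read_csv("titanic.csv")  # hypothetical file path

    mlmachine_titanic = Machine(
        data=df,                                 # full dataset as a DataFrame
        target="Survived",                       # column containing the target variable
        is_classification=True,                  # supervised learning task type
        identify_as_continuous=["Age", "Fare"],  # feature type guidance, abbreviated
        identify_as_nominal=["Embarked", "Sex", "Cabin"],
        ordinal_encodings={"Pclass": [1, 2, 3]},  # desired ordinal hierarchy
    )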

Let’s unpack what we just did. Using the canonical Titanic dataset, we instantiate a Machine() object, called mlmachine_titanic, by:

  • Passing in the full dataset as a DataFrame
  • Specifying the column that contains the target variable
  • Specifying the supervised learning task as a classification task

The most basic purpose of mlmachine_titanic is to maintain our dataset of observations and our target values. Our dataset is stored as a DataFrame and can be accessed by calling mlmachine_titanic.data:
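
    # the observations, still a DataFrame
    mlmachine_titanic.data.head()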

Our target variable, stored as a named Pandas Series, can be accessed just as easily by calling mlmachine_titanic.target:
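
    # the target, a named Pandas Series
    mlmachine_titanic.target.head()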

The target values will be automatically label encoded, if needed.

We also passed several lists containing feature names to parameters such as identify_as_continuous and identify_as_nominal. Let’s get into the purpose of these parameters.

mlm dtypes: Adding Feature Meaning to Pandas dtypes

Pandas dtypes describe the values contained within a column, but have no regard for what the values actually mean. Nominal categories, ordinal categories, continuous numbers, counts…it’s often impossible to make these distinctions from Pandas dtypes alone.

In the mlmachine ecosystem, these distinctions are referred to as mlm_dtypes. mlmachine catalogs and, most importantly, updates mlm_dtypes as the dataset evolves through feature engineering.

Our dataset, stored as mlmachine_titanic.data, has a metadata attribute called mlm_dtypes:
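
    # illustrative contents - keys are mlm dtypes, values are lists of feature names
    mlmachine_titanic.data.mlm_dtypes
    # {'continuous': ['Age', 'Fare'],
    #  'ordinal': ['Pclass'],
    #  'nominal': ['Embarked', 'Sex', 'Cabin'],
    #  ...}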

This dictionary is a cornerstone of the mlmachine workflow and permeates all of the package’s functionality. Notice the dictionary’s keys. Per the guidance we provided when instantiating the Machine() object, mlm_dtypes stores the mlm dtype for each feature.

The dictionary keys make it particularly easy to reference all features of a certain type without having to type out feature names. The practical benefit of this is obvious, especially if we consider datasets larger than this tiny Titanic dataset.

In this article, we’re going to leverage this efficiency in two key areas:

  • Exploratory data analysis
  • Transformations and pipelines

Let’s put mlm_dtypes to use as we introduce mlmachine’s exploratory data analysis capabilities.

Because EDA is Tedious and Takes Forever

We’re all guilty of performing a cursory EDA, if any at all (“let’s just get to the model training!”). Even with all of the great Python visualization libraries out there, EDA can take a considerable amount of setup. We all find ourselves coding the same, slightly modified functions for the hundredth time. And remembering which visualization types work best for each feature type and feature/target combination is not easy.

Skipping EDA is absolutely a mistake, so a portion of mlmachine’s functionality is dedicated to quickly making panels that are as beneficial as they are good looking.

Categorical Feature Panels

We saw a teaser EDA panel in the introduction for the nominal category feature “Embarked”. Let’s go beyond this example by using our mlm_dtypes dictionary to quickly generate panels for all of our categorical features in mlmachine_titanic.data:
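
A sketch of the loop, assuming the EDA methods hang off the Machine() object and accept a feature keyword argument:

    # one panel per categorical (nominal and ordinal) feature
    mlm = mlmachine_titanic.data.mlm_dtypes
    for feature in mlm["nominal"] + mlm["ordinal"]:
        mlmachine_titanic.eda_cat_target_cat_feat(feature=feature)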

eda_cat_target_cat_feat() generates an EDA panel for a categorical or count feature in the context of a categorical target. There are three summary tables at the top:

  1. Feature summary - Simple count of each level in the category, along with the percent of values each level constitutes in the feature.
  2. Feature vs. target summary - Count of each level in the category, grouped by the classes in the target.
  3. Target proportion - The percent of values of a particular feature level, grouped by the classes in the target.

The panel includes three visualizations. From left to right:

  1. Tree map of the categorical feature.
  2. Bar chart of the categorical feature, faceted by the target.
  3. 100% horizontal stacked bar chart, faceted by the target.

Effortless Extension to Multi-class Problems

Now let’s use eda_cat_target_cat_feat() to generate a panel for a multi-class example. We’ll use the Scikit-learn wine dataset to visualize a numeric feature called “alcalinity_of_ash” that has been segmented into 5 bins, effectively making it a categorical column:
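
A hedged sketch of that setup. The Machine() parameters carry the same assumptions as before, and pd.cut() is one reasonable way to create the five bins; the original may differ:

    import pandas as pd
    from sklearn.datasets import load_wine
    from mlmachine import Machine

    # build a DataFrame from the wine dataset
    wine = load_wine()
    df_wine = pd.DataFrame(wine.data, columns=wine.feature_names)
    df_wine["target"] = wine.target

    # five bins turn the numeric column into a categorical one
    df_wine["alcalinity_of_ash"] = pd.cut(df_wine["alcalinity_of_ash"], bins=5)

    mlmachine_wine = Machine(
        data=df_wine,
        target="target",
        is_classification=True,
        identify_as_nominal=["alcalinity_of_ash"],
    )
    mlmachine_wine.eda_cat_target_cat_feat(feature="alcalinity_of_ash")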

Each component of the panel adapts accordingly to the multi-class problem in this dataset.

Continuous Feature Panels

Now let’s see what mlmachine can do with numeric features:
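
Once again, the mlm_dtypes dictionary does the heavy lifting (same hedged pattern as before):

    # one panel per continuous feature
    for feature in mlmachine_titanic.data.mlm_dtypes["continuous"]:
        mlmachine_titanic.eda_cat_target_num_feat(feature=feature)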

eda_cat_target_num_feat() is a method that generates a panel for a numeric feature in the context of a categorical target. At the top, we display three Pandas DataFrames:

  1. Feature summary - All of the summary statistics we get by executing the standard df.describe() command, plus “percent missing”, “skew” and “kurtosis”.
  2. Feature vs. target summary - Count, proportion, mean and standard deviation of the numeric feature, grouped by the different classes in the target.
  3. Statistical test - If the target column has only two classes, this reports the result of a z-test (or t-test, in the case of small samples) and the associated p-value.

Below the summary tables is a panel containing four visualizations. From left to right, starting in the top left corner:

  1. Univariate distribution plot of the numeric feature.
  2. QQ plot of the numeric feature.
  3. Bivariate distribution plot of the numeric feature, faceted by the target.
  4. Horizontal box plot, faceted by the target.

Another Effortless Extension to Multi-class Problems

eda_cat_target_num_feat() also adapts to multi-class problems easily. Let’s look at another quick example:
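
A sketch reusing the mlmachine_wine object from above; the feature shown here, malic_acid, is a hypothetical stand-in for whichever numeric feature the panel visualizes:

    # numeric feature vs. the three-class wine target
    mlmachine_wine.eda_cat_target_num_feat(feature="malic_acid")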

We again create this panel using the Scikit-learn wine dataset with the same minimal code. Notice the changes:

  1. The “Feature vs. target summary” table expands to reflect all three classes.
  2. The faceted plots expand to visualize all three classes.
  3. The x-axis and y-axis tick labels are decimals rather than whole numbers. This modification happens dynamically under the hood based on the scale of the data being visualized. Less time formatting, more time exploring.

mlmachine brings a great deal of simplicity and dynamism to EDA. Now let’s see how mlmachine facilitates a Pandas-friendly ML experimentation workflow.

Pandas-in / Pandas-out Pipelines

Scikit-learn Dismantles Pandas DataFrames

A major drawback of putting a DataFrame through a Scikit-learn transformer is the loss of the DataFrame wrapper around the underlying NumPy array. The issue is particularly pronounced with transformers like PolynomialFeatures():
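
A small, self-contained illustration (the toy values are made up):

    import pandas as pd
    from sklearn.preprocessing import PolynomialFeatures

    # a tiny two-column DataFrame with illustrative values
    df = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.93]})

    poly = PolynomialFeatures(degree=2)
    out = poly.fit_transform(df)

    type(out)  # numpy.ndarray - the DataFrame wrapper is gone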

And if we think we’ve outsmarted this transformer by calling the poly.get_feature_names() method, we’ll be woefully disappointed once we see the output:
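
    # the default feature names carry no connection to the original columns
    poly.get_feature_names()
    # ['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2']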

Not helpful.

Due to this, we lose the ability to:

  • Easily perform EDA on the transformed dataset
  • Evaluate feature importance after training a model
  • Use model explainability methods such as SHAP or LIME
  • Merely identify which columns are which

Of course, we could wrap the NumPy array back in a DataFrame and do whatever is needed to get the columns to match up, but…what a chore.

Transformers, Now with DataFrames

mlmachine leverages a class called PandasTransformer() to ensure that if a DataFrame passes into a transformer, a DataFrame comes out on the other side.

All we have to do is wrap PolynomialFeatures() with PandasTransformer() and we get a DataFrame with meaningful column names on the other side:
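
A sketch reusing the toy df from above; the import path for PandasTransformer() is an assumption:

    from mlmachine import PandasTransformer  # exact import path may differ
    from sklearn.preprocessing import PolynomialFeatures

    # the same transformation, wrapped to preserve the DataFrame
    poly = PandasTransformer(PolynomialFeatures(degree=2))
    df_poly = poly.fit_transform(df)

    type(df_poly)    # pandas.DataFrame
    df_poly.columns  # descriptive names built from 'Age' and 'Fare'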

It’s that easy.

Now that we’ve seen how to preserve our DataFrame when executing a single transformation, let’s build on this with Scikit-learn’s Pipeline() and FeatureUnion() functionality to perform multiple actions on multiple sets of features in one shot.

PandasFeatureUnion & DataFrameSelector - Intuitive, Familiar, Flexible

Vanilla FeatureUnion

Scikit-learn includes a class called FeatureUnion(). To quote the documentation, FeatureUnion() “Concatenates results of multiple transformer objects…This is useful to combine several feature extraction mechanisms into a single transformer.”

This is a great tool for applying different data processing actions to different features. For example, we may want to mean impute continuous features and mode impute categorical features:
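
A minimal sketch of that pattern in plain Scikit-learn. The select() helper is hypothetical - vanilla Scikit-learn needs some column-routing step inside each branch:

    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.preprocessing import FunctionTransformer

    # hypothetical helper that selects columns from a DataFrame inside a pipeline
    def select(columns):
        return FunctionTransformer(lambda X: X[columns], validate=False)

    union = FeatureUnion([
        ("continuous", Pipeline([
            ("select", select(["Age", "Fare"])),
            ("impute", SimpleImputer(strategy="mean")),
        ])),
        ("categorical", Pipeline([
            ("select", select(["Embarked"])),
            ("impute", SimpleImputer(strategy="most_frequent")),
        ])),
    ])

    out = union.fit_transform(mlmachine_titanic.data)
    type(out)  # numpy.ndarray - the DataFrame is gone again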

Unfortunately, FeatureUnion() also suffers from the same disadvantage as other transformers - it returns a NumPy array. This is where PandasFeatureUnion() comes to the rescue.

PandasFeatureUnion & DataFrameSelector

Just like we need PandasTransformer() to retain the DataFrame post-transformation, we need PandasFeatureUnion() to maintain the final DataFrame post-concatenation.

Basic Example

We start fresh here by again instantiating a Machine() object called mlmachine_titanic. Then we use mlmachine’s PandasFeatureUnion() class to create a DataFrame-friendly, FeatureUnion-style pipeline called impute_pipe, sketched after the list below. The output is still a DataFrame.

Specifically, we perform three different types of imputations on three different columns that have nulls:

  • Impute “Age” with the mean
  • Impute “Embarked” with the mode
  • Impute “Cabin” with a constant value (X).
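
Here is a hedged sketch of impute_pipe; the import paths and the exact wrapping of each imputer are assumptions based on the descriptions in this section:

    from mlmachine import PandasFeatureUnion, DataFrameSelector, PandasTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline

    impute_pipe = PandasFeatureUnion([
        ("age", Pipeline([
            ("select", DataFrameSelector(include_columns=["Age"])),
            ("impute", PandasTransformer(SimpleImputer(strategy="mean"))),
        ])),
        ("embarked", Pipeline([
            ("select", DataFrameSelector(include_columns=["Embarked"])),
            ("impute", PandasTransformer(SimpleImputer(strategy="most_frequent"))),
        ])),
        ("cabin", Pipeline([
            ("select", DataFrameSelector(include_columns=["Cabin"])),
            ("impute", PandasTransformer(SimpleImputer(strategy="constant", fill_value="X"))),
        ])),
        ("diff", Pipeline([
            # carry every non-imputed column through the union untouched
            ("select", DataFrameSelector(exclude_columns=["Age", "Embarked", "Cabin"])),
        ])),
    ])

    # fit the union and keep the imputed DataFrame as our working dataset
    mlmachine_titanic.data = impute_pipe.fit_transform(mlmachine_titanic.data)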

A keen observer will notice the presence of another class - DataFrameSelector() - within each pipeline. This class is an essential element of the PandasFeatureUnion() workflow, and serves different purposes depending on how it’s used. In the first three branches of the union, DataFrameSelector() is used to select the column for that particular branch. The columns are selected by name using the include_columns parameter.

In the final “diff” branch, we do something a bit different. Since FeatureUnion() operations, by design, act on specific columns and concatenate the results, we would be left with only the transformed columns without further intervention.

That is why DataFrameSelector() has the flexibility to select all columns except those specified. By way of the exclude_columns parameter, we select all features except for the features we imputed. This ensures we keep our full dataset.

Now that we have filled in our null values, let’s advance to a slightly more complicated preprocessing step using the PandasFeatureUnion() workflow. And if there is any question about the purpose of mlm_dtypes, that will become even clearer right now.

Less Basic Example
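
Here is a hedged sketch of the encoding union; import paths, the KBinsDiscretizer() settings and other details are assumptions, and the branch-by-branch walkthrough follows below:

    from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, OrdinalEncoder

    encode_pipe = PandasFeatureUnion([
        ("nominal", Pipeline([
            ("select", DataFrameSelector(include_mlm_dtypes=["nominal"],
                                         exclude_columns=["Cabin"])),
            ("encode", PandasTransformer(OneHotEncoder())),
        ])),
        ("ordinal", Pipeline([
            ("select", DataFrameSelector(include_mlm_dtypes=["ordinal"])),
            ("encode", PandasTransformer(OrdinalEncoder(
                categories=list(mlmachine_titanic.ordinal_encodings.values())))),
        ])),
        ("bin", Pipeline([
            ("select", DataFrameSelector(include_mlm_dtypes=["continuous"])),
            ("bin", PandasTransformer(KBinsDiscretizer(encode="ordinal"))),
        ])),
        ("diff", Pipeline([
            # keep everything except the original nominal (minus "Cabin") and ordinal columns
            ("select", DataFrameSelector(exclude_columns=[
                column
                for column in mlmachine_titanic.data.mlm_dtypes["nominal"]
                if column != "Cabin"
            ] + list(mlmachine_titanic.ordinal_encodings.keys()))),
        ])),
    ])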

The result is a DataFrame with encoded columns, all clearly named.

Let’s take this PandasFeatureUnion branch by branch:

  • “nominal” pipeline - Here we see the flexibility of DataFrameSelector(). First, we select all nominal columns by passing [“nominal”] to the include_mlm_dtypes parameter. DataFrameSelector() directly references mlm_dtypes to make column selections. Second, we exclude “Cabin” (also a nominal feature) by passing the feature name to the exclude_columns parameter. DataFrameSelector() reconciles our include/exclude specifications by selecting all nominal columns except “Cabin”. Lastly, we pass our selected columns to OneHotEncoder(), wrapped in PandasTransformer().
  • “ordinal” pipeline - We again use the DataFrameSelector() parameter include_mlm_dtypes, this time to select all ordinal columns. Then we pass our result to OrdinalEncoder(), wrapped in PandasTransformer(). We also provide encoding instructions. When we instantiated our Machine() object, we passed in a dictionary called ordinal_encodings, which mlmachine_titanic stores as an attribute. We wrap this dictionary’s values in a list and pass it to the OrdinalEncoder() parameter categories. This will ensure the desired hierarchy is enforced during encoding.
  • “bin” pipeline - We again use include_mlm_dtypes to select all continuous features and pass the result to KBinsDiscretizer(), wrapped in PandasTransformer().
  • “diff” pipeline - The last step is to recombine any features that would otherwise be lost in the union operation, and drop any features we no longer need. We perform a list comprehension on the mlm_dtypes attribute to remove “Cabin”, and append the keys of mlmachine_titanic.ordinal_encodings to the result. This will ensure that the original nominal and ordinal features are not in the transformed dataset, but that we retain “Cabin”. Notice that we do not exclude the continuous columns, despite the fact we transformed these features with KBinsDiscretizer(). The reason is simple - we want to keep the original continuous columns in our dataset.

We call encode_pipe with the familiar fit_transform() method.
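
    # execute the union; reassigning the data attribute is an assumption
    mlmachine_titanic.data = encode_pipe.fit_transform(mlmachine_titanic.data)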

Updating mlm_dtypes

Since we have new features in the dataset, it is best practice to follow fit_transform() with update_mlm_dtypes().
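
In code, assuming the method hangs off the Machine() object:

    # refresh the mlm_dtypes dictionary to match the transformed dataset
    mlmachine_titanic.update_mlm_dtypes()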

As alluded to earlier in this article, we can update mlm_dtypes to reflect the current state of the data attribute. A before-and-after comparison shows how the mlm_dtypes dictionary changed.

In the updated mlm_dtypes dictionary, the nominal columns “Embarked” and “Sex” are absent, and in their place we see names of dummy columns, such as “Embarked_C” and “Sex_male”, resulting from PandasTransformer(OneHotEncoder()). Also notice that the “nominal” key still contains the “Cabin” feature, which we chose to leave unprocessed at this point.

The “ordinal” key contains our binned versions of “Age” and “Fare”, as well as “Pclass”, which has been named in a way that clearly indicates the type of encoding applied.

Whenever we modify data, we simply call update_mlm_dtypes(), and mlm_dtypes will automatically update to reflect the current state of the dataset. The only real effort is to identify the mlm dtype for each feature from the outset, which is something we should do every time anyway.

Let’s bring things to a conclusion by leveraging the mlm_dtypes dictionary to quickly perform a little EDA on our new features. This time, we’ll cycle through all of our ordinal features:
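
    # one panel per ordinal feature in the updated dataset
    for feature in mlmachine_titanic.data.mlm_dtypes["ordinal"]:
        mlmachine_titanic.eda_cat_target_cat_feat(feature=feature)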

We covered a lot of ground, but we have only just begun to explore mlmachine’s functionality.

Check out the GitHub repository, and stay tuned for additional entries in this series.
