
Speeding up a sklearn model pipeline to serve single predictions with very low latency

Writing your own sklearn functions, part 3 (final, for now)

If you have worked with sklearn before, you have certainly come across the struggle between using dataframes or arrays as inputs to your transformers and estimators. Both bring advantages and disadvantages. But once you deploy your model, for example as a service, it will in many cases serve single predictions. Max Halford has shown some great examples of how to modify various sklearn transformers and estimators to serve single predictions with an extra performance boost, with potential response times in the low-millisecond range! In this short post we will build on these tricks and develop a full pipeline.

A few months ago Max Halford wrote an awesome blog post in which he described how we can modify sklearn transformers and estimators to handle single data points at higher speed, essentially using one-dimensional arrays. When you build sklearn model pipelines, they usually work with numpy arrays and pandas dataframes at the same time. Arrays often provide better performance, because the numpy implementations of many computations are highly performant and often vectorized. But it also gets trickier to control your transformations using column names, which arrays do not have. If you use pandas dataframes you might get worse performance, but your code may be more readable, and column names (i.e. feature names) stick with the data through most transformers. During data exploration and model training you are mostly interested in batch transformations and predictions, but once you deploy your trained model pipeline as a service, you might also be interested in single predictions. In both cases service users will send a payload like the one below.

Slightly helpful illustration – image by author using draw.io

Imagine a service where we estimate the weight of a fish based on some size measurements (in reference to the fish market dataset introduced later); a request might look as follows:

{
    "species": "Bream",
    "length": 24.5,
    "height": 12.3,
    "width": 4.12
}

or alternatively ["Bream", 24.5, 12.3, 4.12], and the model may return a weight estimate, for example {"weight": 630.0}.

In his blog post Max Halford showed how you can add transform_single and predict_single methods to transformers and estimators to process single data points with higher performance. Depending on the complexity of the pipeline, the absolute amount of saved time might not be huge. But slow responses add up in the overall latency of your service infrastructure, and short timings take pressure off the app, especially if it sits in the critical path. We can also save on infrastructure cost, since we can run the service on smaller hardware, i.e. smaller and fewer pods. Moreover, avoiding dataframe coercion frees up memory on the serving instances. Last but not least, we save time that we can spend on more sophisticated transformations and models – something that makes every data scientist happy!

Creating bare-bones transformers

But what is the price of cutting the response time? We can explore this with an example – not to advertise class inheritance here, but rather to sketch how this could work.
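Here is a minimal sketch of what such a transformer could look like. The class name and its behavior (a simple standardization) are illustrative assumptions, not code from the original post:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class BarebonesTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # store the fitted values as plain numpy arrays
        self.means_ = np.asarray(X).mean(axis=0)
        self.stds_ = np.asarray(X).std(axis=0)
        return self

    def transform(self, X):
        # batch path: dataframes or 2d arrays
        return (np.asarray(X) - self.means_) / self.stds_

    def transform_single(self, x):
        # single path: a plain 1d list or array, no dataframe coercion
        return (np.asarray(x) - self.means_) / self.stds_

We can then fit it and transform single data points: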

barebones_transformer = BarebonesTransformer()
barebones_transformer.fit(data)
barebones_transformer.transform_single([1.0, 2.5])

On the one hand, we risk losing training/inference parity. What does this mean? As we can see above, there are essentially two different code paths our data can take through the transformation: the code path for single predictions is only used for inference on single data points, not during training, where we normally transform in batches, i.e. dataframes or arrays. Therefore we need to put extra effort into making sure that both transformation paths implement the same transformation, and therefore produce the same results. This can, for example, be achieved by adding some extra unit tests.

On the other hand, we might lose some of the validation sklearn performs internally, i.e. when the parent transform method is called. Therefore we need to make sure our payloads are properly validated before they are passed to the model, to keep the model from crashing unexpectedly.

The same ideas also apply to estimators and the predict method. In the end it is like sending a letter via truck (plus insurance): it works and it is a safe option, but it might be excessive, and a mail carrier on a bike might be more appropriate and faster.

Different transportation types for different payloads – image by author using draw.io

If we can live with both drawbacks, we can save some time per request by spending some time adapting existing transformers for single data points.

Now that we have seen how we could get this working, let us evaluate the performance of the pandas- and numpy-based approaches using some toy data and sklearn’s SimpleImputer, a transformer that imputes missing data, for example using the mean. We will use the quite robust pd.isna to check for missing values in our 1d array:

import pandas as pd
import numpy as np
np.random.seed(47723)
# truncate decimals for better printing
np.set_printoptions(precision=3, suppress=True)
pd.set_option('display.precision', 3)
n = 1000
data = pd.DataFrame({
    'num1': np.random.normal(0, 1, n),
    'num2': np.random.normal(0, 1, n)
})
# remove 10% of the data
data[np.random.rand(*data.shape) > 0.9] = np.nan
data.head()
##     num1   num2
## 0  0.897 -1.626
## 1  1.370  0.279
## 2    NaN -0.652
## 3  1.379 -0.164
## 4  0.450    NaN

The SimpleImputer stores the fitted imputation values in self.statistics_ (by convention, fitted attributes always end with an underscore):

from sklearn.impute import SimpleImputer
simple_imputer = SimpleImputer(strategy='mean')
simple_imputer.fit_transform(data)
## array([[ 0.897, -1.626],
##        [ 1.37 ,  0.279],
##        [ 0.071, -0.652],
##        ...,
##        [-0.233,  0.741],
##        [ 0.071, -0.627],
##        [-1.056, -0.622]])
simple_imputer.statistics_
## array([0.071, 0.016])

We can use these values in our transform_single method to fill missing values:
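A rough sketch of how this could look (the original implementation may differ in its details):

class BarebonesSimpleImputer(SimpleImputer):
    def transform_single(self, x):
        # use object dtype so numeric and categorical columns both work
        x = np.asarray(x, dtype=object)
        mask = pd.isna(x)
        x[mask] = self.statistics_[mask]
        return x

barebones_simple_imputer = BarebonesSimpleImputer(strategy='mean')
barebones_simple_imputer.fit(data)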

Timing the minimal transformers

Let us now evaluate the performance improvement. We will make use of timeit and some simple helper functions to measure the timings in milliseconds:

from timeit import timeit
def time_func_call(call: str, n: int = 1000):
    t = timeit(call, globals=globals(), number=n) / n
    t_ms = np.round(t * 1000, 4)
    return t_ms
time_func_call('barebones_simple_imputer.transform(data)')
## 3.0503
time_func_call('barebones_simple_imputer.transform_single([1.2, np.nan])')
## 0.0701

We will define another helper function which compares and pretty-prints multiple function call timings:

from typing import List
def time_func_calls(calls: List[str]):
    max_width = np.max([len(call) for call in calls])
    for call in calls:
        t_ms = time_func_call(call)
        print(f'{call:{max_width}}: {t_ms:.4f}ms')
    return

We can now apply this to multiple and single data points in the form of dataframes and numpy arrays:
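For example (the exact set of calls is an assumption; timings vary by machine):

time_func_calls([
    'simple_imputer.transform(data)',
    'simple_imputer.transform(data.to_numpy())',
    'barebones_simple_imputer.transform(data)',
    'barebones_simple_imputer.transform_single([1.2, np.nan])',
])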

So the single-data-point transformation outperforms the other implementations. Let us quickly check out the OneHotEncoder, another very helpful transformer, which encodes categorical variables as dummy variables. We will again define some toy data:

n = 3000
data = pd.DataFrame({
    'cat1': np.random.choice(['a', 'b', 'c'], n),
    'cat2': np.random.choice(['x', 'y'], n)
})
data.head()
##   cat1 cat2
## 0    a    x
## 1    b    x
## 2    b    y
## 3    a    x
## 4    b    y

The OneHotEncoder stores the learned categories in a list in self.categories_, from where we can pick them up and use them to encode the categorical variables.
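A sketch of such an encoder (again, the exact implementation in the original post may differ; with handle_unknown='ignore', an unknown value simply yields all zeros for its column):

from sklearn.preprocessing import OneHotEncoder

class BarebonesOneHotEncoder(OneHotEncoder):
    def transform_single(self, x):
        # compare each value against the fitted categories of its column
        return np.concatenate([
            (categories == value).astype(int)
            for value, categories in zip(x, self.categories_)
        ])

We can fit it and encode a single data point: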

barebones_one_hot_encoder = BarebonesOneHotEncoder(sparse=False, handle_unknown='ignore')
barebones_one_hot_encoder.fit_transform(data)
## array([[1., 0., 0., 1., 0.],
##        [0., 1., 0., 1., 0.],
##        [0., 1., 0., 0., 1.],
##        ...,
##        [0., 0., 1., 0., 1.],
##        [1., 0., 0., 1., 0.],
##        [0., 1., 0., 1., 0.]])
barebones_one_hot_encoder.categories_
## [array(['a', 'b', 'c'], dtype=object), array(['x', 'y'], dtype=object)]
barebones_one_hot_encoder.transform_single(['b', 'x'])
## array([0, 1, 0, 1, 0])

Let us benchmark the different cases again:
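For example (again, the exact calls are an assumption):

time_func_calls([
    'barebones_one_hot_encoder.transform(data)',
    "barebones_one_hot_encoder.transform_single(['b', 'x'])",
])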

The encoder now only needs around 0.02 ms (milliseconds) instead of 0.5 ms, an improvement of roughly a factor of 25. Now let us plug this all together and measure the overall performance improvement of a common pipeline. We will fetch the fish market dataset, which contains size measurements and species of fish, whose weight we want to predict.
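One way to load the data (the file name and the local CSV are assumptions; the original post may fetch it differently):

# assuming the fish market CSV (e.g. from Kaggle) is available locally
fish = pd.read_csv('fish.csv')
fish.columns = [col.lower() for col in fish.columns]
x = fish.drop(columns=['weight'])
y = fish['weight']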

The data looks as follows:

x.head()
##   species  length1  length2  length3  height  width
## 0   Bream     23.2     25.4     30.0  11.520  4.020
## 1     NaN     24.0     26.3     31.2  12.480  4.306
## 2   Bream     23.9     26.5     31.1  12.378  4.696
## 3   Bream     26.3     29.0     33.5  12.730  4.455
## 4   Bream     26.5     29.0     34.0  12.444  5.134
y.head()
## 0    242.0
## 1    290.0
## 2    340.0
## 3    363.0
## 4    430.0
## Name: weight, dtype: float64

If we want to apply imputation and one-hot encoding to our data, we need to use a ColumnTransformer to dispatch the transformations to the correct columns. We therefore need some minor modifications to be able to use the transform_single method:

  • implement transform_single analogously to transform, e.g. using self._iter
  • implement an identity transformer with a transform_single method, which can be passed to handle the remainder, i.e. the remaining columns (see the sketch below)
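The original post uses self._iter for this; here is a rough alternative sketch that iterates over the public transformers_ attribute instead (assuming a sklearn version that sets feature_names_in_ when fitted on a dataframe):

from sklearn.compose import ColumnTransformer

class IdentityTransformer(BaseEstimator, TransformerMixin):
    # passes data through unchanged, e.g. to handle the remainder columns
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

    def transform_single(self, x):
        return x

class BarebonesColumnTransformer(ColumnTransformer):
    def transform_single(self, x):
        # x: a single row (1d array-like) in the training-data column order
        x = np.asarray(x, dtype=object)
        names = list(getattr(self, 'feature_names_in_', []))
        parts = []
        for _, transformer, columns in self.transformers_:
            if transformer == 'drop':
                continue
            # columns may be names (when fitted on a dataframe) or indices
            idx = [c if isinstance(c, (int, np.integer)) else names.index(c)
                   for c in columns]
            parts.append(np.asarray(transformer.transform_single(x[idx])))
        return np.concatenate(parts)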

Building a fast pipeline

If we want to use the bare-bones transformers and estimators in a pipeline, we have to modify the pipeline itself as well, by adding a predict_single method similar to [predict](https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/pipeline.py#L382-L408), which uses the transform_single methods of the transformers and calls predict_single on the model, as Max also describes in his post:
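Following Max’s approach, a sketch could look like this:

from sklearn.pipeline import Pipeline

class BarebonesPipeline(Pipeline):
    def predict_single(self, x):
        # funnel a single row through all transform_single steps,
        # then call predict_single on the final estimator
        for _, transformer in self.steps[:-1]:
            x = transformer.transform_single(x)
        return self.steps[-1][1].predict_single(x)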

We can now construct our pipeline. We will impute the categorical variable with the most frequent value and the numeric values with the mean (not the cleverest imputation here, since a strong relation between the variables exists and a conditional mean or nearest-neighbour approach would be better). We will then one-hot encode the categorical variable and train a linear model on the data:
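A sketch of how the pieces could be plugged together (the step names and the BarebonesLinearRegression subclass are illustrative; the dot-product predict_single follows Max’s post):

from sklearn.linear_model import LinearRegression

class BarebonesLinearRegression(LinearRegression):
    def predict_single(self, x):
        # a plain dot product instead of the full predict machinery
        return self.intercept_ + np.dot(self.coef_, x)

barebones_pipeline = BarebonesPipeline([
    ('impute', BarebonesColumnTransformer([
        ('cat', BarebonesSimpleImputer(strategy='most_frequent'), ['species']),
        ('num', BarebonesSimpleImputer(strategy='mean'),
         ['length1', 'length2', 'length3', 'height', 'width']),
    ])),
    ('encode', BarebonesColumnTransformer(
        [('one_hot', BarebonesOneHotEncoder(sparse=False, handle_unknown='ignore'), [0])],
        remainder=IdentityTransformer())),
    ('model', BarebonesLinearRegression()),
])
barebones_pipeline.fit(x, y)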

Now let us apply the pipeline to our data to benchmark the performance of single predictions:
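For example (the exact calls are an assumption):

time_func_calls([
    'barebones_pipeline.predict(x.head(1))',
    'barebones_pipeline.predict_single(x.to_numpy()[0])',
])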

Let us finally verify that both predictions are identical. Running predict still uses the full-fledged sklearn code path, as opposed to our lightweight {transform,predict}_single methods:

batch_predictions = barebones_pipeline.predict(x)
batch_predictions[0:5]
## array([285.874, 418.604, 363.433, 417.214, 459.909])
single_predictions = [barebones_pipeline.predict_single(x_i) for x_i in x.to_numpy()]
single_predictions[0:5]
## [285.873, 418.603, 363.433, 417.214, 459.909]
np.all(np.isclose(batch_predictions, single_predictions, atol = 0.0001))
## True

Conclusion

We saw that we can speed up our pipeline by a factor of 20 to 25 for single predictions (2.4 ms down to 0.1 ms). The more transformations we add, the more valuable the speedup becomes, and the clearer the tradeoff will be. We have seen how we can use custom transformers or adjust existing ones to speed up single-data-point transformations and predictions, at the price of extra time spent on engineering (especially if the transformation is more complex), and extra care spent on training-inference parity, unit tests and data validation.

Remark: profiling transformers

If you are trying to find bottlenecks in your transformers, I recommend line_profiler and memory_profiler. They are hard to apply to a whole pipeline at once (you have to pass each individual function to them), but work well on individual transformers. You can use the profiler in the following fashion, with magic:
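For example, in an IPython session (profiling our imputer’s transform_single; the target function is just an illustration):

%load_ext line_profiler
%lprun -f barebones_simple_imputer.transform_single barebones_simple_imputer.transform_single([1.2, np.nan])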

or without magic:
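from line_profiler import LineProfiler

# a sketch with the same illustrative target function:
# wrapping the function records per-line execution times
profiler = LineProfiler()
profiled_call = profiler(barebones_simple_imputer.transform_single)
profiled_call([1.2, np.nan])
profiler.print_stats()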


Originally published at https://blog.telsemeyer.com.

