Shortening model deployment with TensorFlow

How to make ML model deployment and serving easier with TensorFlow

Dimitre Oliveira
Towards Data Science


Image generated by Stable Diffusion from prompt “Cool machine learning and AI stuff”.

Usually, for real-life machine learning applications, the end goal is to deploy the model in production to be used by customers. But a system powered by machine learning involves much more than just the model prediction; two other main steps are preprocessing and postprocessing.

Preprocessing covers all the steps that precede the actual prediction. For image classification it could be normalization: vision models usually require input pixels to be between 0 and 1. In the case of text models, preprocessing could be tokenization or removing white space and punctuation. Preprocessing can take numerous forms, but it comes down to transforming the inputs so that the model can make a reliable prediction.
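To make this concrete, here is a minimal sketch of what such a step could look like for an image model (the 224 x 224 target size and the 255 scale factor are assumptions for this example, not values from any specific model):

import tensorflow as tf

def preprocess_image(image):
    # Resize to the model's expected input size (assumed to be 224 x 224 here)
    image = tf.image.resize(image, (224, 224))
    # Scale 8-bit pixel values from [0, 255] into the [0, 1] range
    return tf.cast(image, tf.float32) / 255.0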

Preprocessing pipeline for an NLP task

Postprocessing, on the other hand, covers all the steps required to transform the model’s outputs into the desired form. In classification tasks the model usually outputs the probabilities for each class, but the end user normally doesn’t care about those, so we need to convert the outputs into labels; in text sentiment analysis those labels are usually “positive” and “negative”.
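As a sketch, a postprocessing step for binary sentiment analysis could be as simple as thresholding the predicted probability (the 0.5 threshold and the label strings are assumptions for this example):

def postprocess_prediction(probability, threshold=0.5):
    # Map the model's sigmoid output to a human-readable label
    return "positive" if probability >= threshold else "negative"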

Postprocessing pipeline

The challenge begins when preprocessing and postprocessing are part of a complex system that may already be in active use and needs to be maintained.

Imagine the following scenario: you are part of a company that offers ML products for healthcare, including an API that customers can use to send images and predict whether an x-ray shows a health issue. Initially, the engineers created a baseline with good results using a ResNet-50 model, so the company started with it. After a few weeks of research, the scientists came up with an improved version based on MobileNetV2, so the team switched models, and a few more weeks later they came up with an even better model based on EfficientNetB3. Great! In a single month you improved your results three times and your customers are happier. But what was the challenge here?

During that single month, you had to adjust your systems to work with three different kinds of models, each expecting its inputs in a different format. To exemplify, let’s suppose that the team used TensorFlow-based models and loads them using the Keras applications module. There you can see that the pre-trained ResNet-50 expects inputs of size 224 x 224 x 3 (width x height x channels), with channels in BGR format and each color channel zero-centered with respect to the ImageNet dataset. MobileNetV2 also expects images of size 224 x 224 x 3, but expects the pixels to be in the -1 to 1 range. Finally, EfficientNetB3 expects images of size 300 x 300 x 3 and already does image scaling as part of the model.
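The Keras applications module makes this mismatch concrete: each architecture ships its own preprocess_input function, and EfficientNet’s is effectively a pass-through because the scaling happens inside the model:

# Converts RGB to BGR and zero-centers each channel w.r.t. ImageNet statistics
from tensorflow.keras.applications.resnet50 import preprocess_input as resnet50_preprocess
# Scales pixel values to the [-1, 1] range
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input as mobilenetv2_preprocess
# Pass-through: EfficientNet models rescale their inputs internally
from tensorflow.keras.applications.efficientnet import preprocess_input as efficientnet_preprocess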
In this scenario, you had an initial pipeline with a specific preprocessing routine, then had to change it to support the second model, and then change it again for the third one. Now imagine that you need to swap models more than once a week, serve more than one model in parallel, roll back to previous models, or evaluate old models against new ones. It becomes very easy to mix up preprocessing routines, lose code, or forget important details, especially because the company is likely providing multiple services for different customers. This is just to showcase how easily things can get out of hand.

So what can we do? While managing such a system is indeed a complex task, we can make things much easier by fusing, or embedding, the preprocessing and postprocessing logic into the models themselves.

Regular pipeline vs pipeline with processing logic embedded

Let's go through a practical example to showcase this concept.
Here I will build a simple system using TensorFlow and the “Cleveland Clinic Foundation for Heart Disease” dataset; the objective is to classify whether a patient has heart disease. This example is based on the Keras code example “Structured data classification from scratch”.
You can follow along with the Colab notebook, which contains the full code.

To simplify, we are going to use only a subset of the features. Here is some information about them:

Age: age in years (numerical)
Sex: 1 = male; 0 = female (categorical)
CP: chest pain type (0, 1, 2, 3, 4) (categorical)
Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect (categorical)
Target: diagnosis of heart disease (1 = true; 0 = false) (binary target)

I chose those features specifically because they have different data types with different preprocessing requirements.

TensorFlow preprocessing layers

TensorFlow has a built-in way to handle different data types: the preprocessing layers. One big advantage they have over regular preprocessing steps is that you can combine them with models or TensorFlow datasets to optimize the end-to-end pipeline, which also makes deployment much easier.

For regular use cases, these layers need to be adapted to the data to learn how to process it. The “adapt” step is similar to how we need to learn the mean and standard deviation of a feature before being able to normalize it, and this is exactly what we will do in the first case.

Normalization (age feature)

For the numeric feature “age” we will apply feature normalization; in other words, we will shift and scale inputs into a distribution centered around 0 with a standard deviation of 1, which is exactly what the Normalization layer does.
First, we create a TensorFlow dataset that loads the data from a pandas dataframe and maps it into the format of raw features and labels.

raw_data = tf.data.Dataset.from_tensor_slices(
    (dict(data[["age", "sex", "cp", "thal"]]), data.target))

After this, we can create the preprocessing layers and let them learn how to process the features from this dataset, using the adapt method.

age_preprocessing = L.Normalization(name="age_preprocessing")
age_preprocessing.adapt(raw_data.map(lambda x, y: x["age"])
                                .map(lambda x: tf.expand_dims(x, -1)))
print(f"Age mean: {age_preprocessing.mean.numpy()[0]:.2f}")
print(f"Age variance: {age_preprocessing.variance.numpy()[0]:.2f}")
---------- Outputs ----------
Age mean: 54.27
Age variance: 81.74

What is happening here is that I take the raw_data dataset and extract only the “age” feature with the first lambda map, then expand its dimensions to fit the input format expected by the preprocessing layer.
After adapting to the data, the Normalization layer has learned that the mean of the “age” feature is 54.27 and the variance is 81.74. Based on that, it can now normalize new data for that feature.
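As a quick sanity check (the value 60 is just an arbitrary input), the adapted layer should compute (x - mean) / sqrt(variance):

print(age_preprocessing(tf.constant([[60.0]])))
# Expected: roughly (60 - 54.27) / sqrt(81.74) ≈ 0.63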

IntegerLookup (sex feature)

Next, we can learn to preprocess the “sex” feature, which is a categorical numeric feature, meaning that each value actually represents a different category (1 = male; 0 = female). For this specific case we also want the layer to handle unknown or invalid values: normally, if the input is different from 1 or 0, the layer would throw an error and no output would be given, but here those values will be routed to a dedicated out-of-vocabulary (OOV) slot, represented by the “-1” token, and the model will still return a prediction. Whether this behavior is desirable depends on the context.

sex_preprocessing = L.IntegerLookup(output_mode="int", name="sex_preprocessing")
sex_preprocessing.adapt(raw_data.map(lambda x, y: x["sex"]))
sex_vocab = sex_preprocessing.get_vocabulary()
print(f"Vocab size: {len(sex_vocab)}")
print(f"Vocab sample: {sex_vocab}")
---------- Outputs ----------
Vocab size: 3
Vocab sample: [-1, 1, 0]

Here output_mode="int" means that the output of this layer is an integer: the index of the value in the learned vocabulary.
After adapting to the data, the IntegerLookup layer has learned that the “sex” feature has a vocabulary of size 3 (including the OOV token) and assigns [-1, 1, 0] as the possible values, with the -1 OOV token at index 0. Note that the main benefit of this layer here is handling unknown values, which are routed to the OOV slot instead of raising an error.
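A quick sanity check of the adapted layer (99 stands in for an arbitrary unknown value):

print(sex_preprocessing(tf.constant([[1], [0], [99]])))
# Expected indices: [[1], [2], [0]] - 1 and 0 map to their vocabulary
# indices, while the unknown 99 falls into the OOV slot at index 0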

IntegerLookup (cp feature)

For the “cp” feature which is another categorical numeric feature, we will also use the IntegerLookup layer similarly to the “sex” feature, but we will change the output format.

cp_preprocessing = L.IntegerLookup(output_mode="one_hot", name="cp_preprocessing")
cp_preprocessing.adapt(raw_data.map(lambda x, y: x["cp"]))
cp_vocab = cp_preprocessing.get_vocabulary()
print(f"Vocab size: {len(cp_vocab)}")
print(f"Vocab sample: {cp_vocab}")
---------- Outputs ----------
Vocab size: 6
Vocab sample: [-1, 4, 3, 2, 1, 0]

Here output_mode="one_hot" means that the output of this layer uses the one-hot encoding format; in this case, each output looks like “[0, 0, 0, 0, 1, 0]”. This format can give the model more information about the feature, especially when the feature does not have a boolean or ordinal nature.
After adapting to the data, the IntegerLookup layer has learned that the “cp” feature has a vocabulary of size 6 (including the OOV token) and assigns [-1, 4, 3, 2, 1, 0] as the possible values, which are then encoded as one-hot vectors.
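A quick check of the one-hot output (4 is a known category from the vocabulary above):

print(cp_preprocessing(tf.constant([[4]])))
# Expected: [[0, 1, 0, 0, 0, 0]] - the value 4 sits at index 1 of the vocabulary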

StringLookup (thal feature)

Finally, for the “thal” feature we will use the StringLookup layer, since this feature has string values that can come either as text or as numbers.

thal_preprocessing = L.StringLookup(output_mode="one_hot", name="thal_preprocessing")
thal_preprocessing.adapt(raw_data.map(lambda x, y: x["thal"]))
thal_vocab = thal_preprocessing.get_vocabulary()
print(f"Vocab size: {len(thal_vocab)}")
print(f"Vocab sample: {thal_vocab}")
---------- Outputs ----------
Vocab size: 6
Vocab sample: ['[UNK]', 'normal', 'reversible', 'fixed', '2', '1']

Similarly to the “cp” feature, we use output_mode="one_hot", which means the outputs for this feature will also have the one-hot encoding format; since the vocabulary is small, this option works well.
After adapting to the data, the StringLookup layer has learned that the “thal” feature has a vocabulary of size 6 (including OOV) and assigns [‘[UNK]’, ‘normal’, ‘reversible’, ‘fixed’, ‘2’, ‘1’] as the possible values. Here [UNK] is assigned to unknown (OOV) values, and the values are then encoded as one-hot vectors.
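Again a quick check, this time with an unknown string to show the OOV handling (“banana” is just an arbitrary out-of-vocabulary input):

print(thal_preprocessing(tf.constant([["banana"]])))
# Expected: [[1, 0, 0, 0, 0, 0]] - unknown strings activate the [UNK] slot at index 0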

Combining layers and dataset

Now that all the preprocessing layers are adapted to our data, we can make them part of our dataset pipeline. The motivation is that the tf.data pipeline can take advantage of the preprocessing layers to make data ingestion faster and more efficient. Another option would be to make those layers part of the model and train it that way; the trade-offs are discussed in the “Preprocessing data before the model or inside the model” section of the Keras “Working with preprocessing layers” guide.

I will not go into details about this part since it's outside the scope of this article, but you can take a look at the “Datasets” section of the accompanying Colab notebook.
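Still, as a rough sketch of the idea (using the adapted layers from above; the notebook’s “Datasets” section has the actual implementation, and the batch size here is arbitrary):

def preprocess(features, label):
    # Apply each adapted layer to its raw feature
    return {"age": age_preprocessing(tf.expand_dims(features["age"], -1)),
            "sex": tf.cast(sex_preprocessing(tf.expand_dims(features["sex"], -1)),
                           tf.float32),
            "cp": cp_preprocessing(features["cp"]),
            "thal": thal_preprocessing(features["thal"])}, label

dataset = (raw_data.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
                   .batch(32)
                   .prefetch(tf.data.AUTOTUNE))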

Modeling

In a regular setup, after preprocessing the data you would end up with a model that looks like this:

age_input = L.Input(shape=(1,), dtype=tf.float32, name="age")
sex_input = L.Input(shape=(1,), dtype=tf.float32, name="sex")
cp_input = L.Input(shape=(len(cp_vocab),), dtype=tf.float32, name="cp")
thal_input = L.Input(shape=(len(thal_vocab),), dtype=tf.float32, name="thal")
concat_inputs = L.Concatenate()([age_input, sex_input,
                                 cp_input, thal_input])
x = L.Dense(32, activation="relu")(concat_inputs)
x = L.Dropout(0.5)(x)
output = L.Dense(1, activation="sigmoid")(x)

model = tf.keras.models.Model(inputs=[age_input, sex_input,
                                      cp_input, thal_input],
                              outputs=output)
model.summary()
---------- Outputs ----------
____________________________________________________________________
 Layer (type)                Output Shape    Param #   Connected to
====================================================================
 age (InputLayer)            [(None, 1)]     0         []
 sex (InputLayer)            [(None, 1)]     0         []
 cp (InputLayer)             [(None, 6)]     0         []
 thal (InputLayer)           [(None, 6)]     0         []
 concatenate (Concatenate)   (None, 14)      0         ['age[0][0]',
                                                        'sex[0][0]',
                                                        'cp[0][0]',
                                                        'thal[0][0]']
 dense (Dense)               (None, 32)      480       ['concatenate[0][0]']
 dropout (Dropout)           (None, 32)      0         ['dense[0][0]']
 dense_1 (Dense)             (None, 1)       33        ['dropout[0][0]']
====================================================================
Total params: 513
Trainable params: 513
Non-trainable params: 0
____________________________________________________________________
Diagram for the regular model

This model would have all the problems we discussed at the beginning: for deployment, we would need to keep track of the exact preprocessing parameters used during training, which requires a lot of effort from the maintainers. Thankfully, TensorFlow allows us to embed the preprocessing logic into our models.

Combining model and preprocessing

It is fairly simple to combine the preprocessing layers with our model; in fact, we could combine basically any TensorFlow operation that can be turned into a TensorFlow graph. Let’s see what the new model looks like:

age_input = L.Input(shape=(1,), dtype=tf.int64, name="age")
sex_input = L.Input(shape=(1,), dtype=tf.int64, name="sex")
cp_input = L.Input(shape=(1,), dtype=tf.int64, name="cp")
thal_input = L.Input(shape=(1,), dtype=tf.string, name="thal")
# Preprocessing
age_processed = age_preprocessing(age_input)
sex_processed = tf.cast(sex_preprocessing(sex_input),
                        dtype=tf.float32)
cp_processed = cp_preprocessing(cp_input)
thal_processed = thal_preprocessing(thal_input)
# Model prediction
output = model({"age": age_processed,
                "sex": sex_processed,
                "cp": cp_processed,
                "thal": thal_processed})
# Postprocessing
label_postprocess = label_postprocessing(output)
model = tf.keras.models.Model(inputs=[age_input, sex_input,
                                      cp_input, thal_input],
                              outputs=label_postprocess)
model.summary()
---------- Outputs ----------
____________________________________________________________________
 Layer (type)                Output Shape   Param #   Connected to
====================================================================
 sex (InputLayer)            [(None, 1)]    0         []
 age (InputLayer)            [(None, 1)]    0         []
 cp (InputLayer)             [(None, 1)]    0         []
 sex_preprocessing           (None, 1)      0         ['sex[0][0]']
 thal (InputLayer)           [(None, 1)]    0         []
 age_preprocessing           (None, 1)      3         ['age[0][0]']
 cp_preprocessing            (None, 6)      0         ['cp[0][0]']
 tf.cast (TFOpLambda)        (None, 1)      0         ['sex_preprocessing[0][0]']
 thal_preprocessing          (None, 6)      0         ['thal[0][0]']
 model (Functional)          (None, 1)      513       ['age_preprocessing[0][0]',
                                                       'cp_preprocessing[0][0]',
                                                       'tf.cast[0][0]',
                                                       'thal_preprocessing[0][0]']
 label_postprocessing        (None, 1)      0         ['model[0][0]']
====================================================================
Total params: 516
Trainable params: 513
Non-trainable params: 3
____________________________________________________________________
Diagram for the combined model

Note that I also included a postprocessing layer, which is responsible for mapping the model output (0 or 1) to an actual label. I have omitted the code that creates it because it is similar to what we did before, but it is also included in the Colab notebook.
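For reference, here is one possible way to build such a layer (a sketch only; the actual version is in the notebook, and the negative label string below is an assumption, since only the positive one appears in the outputs later on):

# Threshold the sigmoid output and map it to a human-readable label
label_postprocessing = L.Lambda(
    lambda x: tf.where(x >= 0.5,
                       tf.constant("Have heart disease"),
                       tf.constant("Does not have heart disease")),
    name="label_postprocessing")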
You can see that the models are essentially the same, but the second one has all the preprocessing logic embedded into its graph. The advantage of this approach is that you can save and load the model, and it will have everything it needs to run inference. Let’s see how they differ:

Inference for the first model

sample = {"age": 60, "sex": 1, "cp": 1, "thal": "fixed"}sample = {"age": age_preprocessing(sample["age"]),
"sex": sex_preprocessing(sample["sex"]),
"cp": cp_preprocessing(sample["cp"]),
"thal": thal_preprocessing(sample["thal"])}
print(model.predict(sample))
---------- Outputs ----------
0

Inference for the second model

sample = {"age": 60, "sex": 1, "cp": 1, "thal": "fixed"}print(model.predict(sample))
---------- Outputs ----------
"Have heart disease"

As you can see, with the second approach, all that your inference service needs to do is load the model and use it.
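For instance, the whole deployment contract can be reduced to a save and a load (a minimal sketch; the path name is arbitrary):

# Saving the combined model packages pre- and postprocessing along with it
model.save("heart_disease_model")
# The serving side only needs to load the model and call it
serving_model = tf.keras.models.load_model("heart_disease_model")
print(serving_model.predict(sample))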

Conclusion

In this article, we discussed some of the challenges we face during model deployment related to the preprocessing and postprocessing of models. We also looked at a way to dramatically reduce the cognitive load and maintenance effort of deploying and serving models: embedding all of that logic into the model itself, using TensorFlow.

Note: All images, unless otherwise noted, are by the author.

