Machine Learning Model Deployment Using Spark

Spark making ML deployment easier

Charu Makhijani
Towards Data Science



A while ago, I wrote a post about Productionizing Machine Learning Models, where I covered strategies for deploying machine learning models into production. This article is about one of the most widely used approaches: batch prediction using Spark. With this approach, you can schedule a job to run predictions at a specific time and write the output to a database, file system, streaming platform, or any other persistence layer.

Batch Prediction Solutions

Nowadays many solutions and tools are available for batch prediction. The simplest is writing Python code and scheduling it with cron, but there are two problems with this approach:

1. Each prediction run incurs the overhead of pickling/unpickling the model.

2. We don't take advantage of scikit-learn's optimizations, such as vectorized prediction over a whole batch.

Another solution is to use orchestration tools like Airflow and Prefect for batch predictions. Cloud and managed solutions such as MLflow and Amazon SageMaker are also available for batch predictions.

But if you want to take advantage of scikit-learn's optimizations and don't want to rely on external tools, Spark is a promising solution for batch predictions.

Batch Prediction using Spark

When I talk about deploying a machine learning model using Spark, many people assume it will be done with Spark MLlib, and yes, that is indeed a solution. But I am more inclined towards using scikit-learn and TensorFlow for machine learning, as they are more flexible and provide more robust methods. So I prefer creating my ML models with scikit-learn and TensorFlow pipelines, deploying the models over Spark, and then using batch prediction for scoring/forecasting.

Spark not only provides scheduling and monitoring capabilities for batch predictions, but it is also scalable for more complex models and datasets. Moreover, if you come from a Data Engineering background and have a solid experience in Spark and Hadoop, then you don’t need any third-party solutions for ML model deployments. In this post, I will explain how you can deploy ML models over Spark and perform predictions using parallel processing.


Spark Batch Prediction Steps

Batch prediction using Spark is a seven-step solution, and the steps are the same for both classification and regression problems:

1. Create an ML model, pickle it, and store the pickle file in HDFS.

2. Write a Spark job and unpickle the Python object.

3. Broadcast this Python object to all Spark nodes.

4. Create a PySpark UDF that calls the predict method on the broadcast model object.

5. Create the list of feature columns on which the ML model was trained.

6. Create a Spark DataFrame for prediction with one unique id column and the features from step 5.

7. Add a prediction column to the DataFrame from step 6 by calling the UDF on the feature columns.

And it's done!!

Now you have a Spark DataFrame with a unique id, all feature columns, and a prediction column. Once implemented, this process can be reused for many ML model predictions.

Deep Dive

Now let's see the actual code implementation (in Python) for the above seven steps.

In step 1, we create an ML model and persist it in a pickle file.
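Below is a minimal sketch of this step. The training data is synthetic stand-in data and the HDFS path is illustrative; the real features come from the credit card fraud dataset in the GitHub repo linked below.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in training data; in the real project, X_train/y_train come from
# the credit card fraud dataset (see the GitHub link below).
X_train, y_train = make_classification(n_samples=1000, n_features=6,
                                       random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Persist the trained model as a pickle file.
with open("rf_fraud_model.pkl", "wb") as f:
    pickle.dump(rf, f)

# Copy the pickle to HDFS so the Spark job can read it (illustrative path):
# $ hdfs dfs -put rf_fraud_model.pkl /models/rf_fraud_model.pkl
```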

Here rf is a Random Forest model trained for credit card fraud detection. If you want to see how I created this random forest prediction model, please refer to the GitHub link.

In steps 2 & 3, we will create a Spark job, unpickle the Python object, and broadcast it to the cluster nodes. Broadcasting the Python object makes the ML model available on multiple nodes for parallel processing of the batch.
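A sketch of these two steps, assuming the illustrative HDFS path from step 1. SparkContext.binaryFiles reads the raw pickle bytes, and broadcast ships one read-only copy of the model to each executor:

```python
import pickle

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fraud-batch-prediction").getOrCreate()
sc = spark.sparkContext

# binaryFiles yields (path, bytes) pairs; we expect a single pickle file here.
model_bytes = sc.binaryFiles("hdfs:///models/rf_fraud_model.pkl").first()[1]
rf = pickle.loads(model_bytes)

# Broadcast the unpickled model so every executor holds one read-only copy
# instead of receiving it with every task.
broadcast_model = sc.broadcast(rf)
```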

In step 4, we'll create a PySpark UDF to predict fraud.
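A row-at-a-time sketch of the UDF (a vectorized pandas_udf is an alternative that lets scikit-learn score whole batches at once); broadcast_model comes from the previous step:

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Collect the feature values for one row, score them with the broadcast
# model, and return the prediction as a double.
@udf(returnType=DoubleType())
def predict_udf(*features):
    return float(broadcast_model.value.predict([list(features)])[0])
```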

In steps 5 & 6, we'll create the feature list and a Spark DataFrame with unique ids.
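A sketch of these steps; the feature names, id column, and input path below are placeholders for whatever schema your model was trained on:

```python
# Feature columns, in the same order the model saw them during training.
feature_cols = ["f1", "f2", "f3", "f4", "f5", "f6"]

# Scoring DataFrame: one unique id column plus the feature columns.
# Any Spark source works here (Parquet, Hive, JDBC, CSV, ...).
score_df = (
    spark.read.parquet("hdfs:///data/transactions.parquet")
    .select("transaction_id", *feature_cols)
)
```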

Finally, in step 7, we'll call the PySpark UDF on the DataFrame from step 6 and generate predictions from the feature columns.
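The final step, sketched with an assumed Parquet sink; any persistent layer works:

```python
from pyspark.sql.functions import col

# Apply the UDF across the feature columns to add the prediction column.
pred_df = score_df.withColumn(
    "prediction", predict_udf(*[col(c) for c in feature_cols])
)

# Persist the scored batch; swap in whatever sink your pipeline needs.
pred_df.write.mode("overwrite").parquet("hdfs:///output/fraud_predictions")
```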

Using these steps, any batch prediction model created with scikit-learn can be deployed on a Spark cluster.

For TensorFlow models, there are a few more steps involved, which I'll cover in my next posts.

In this approach, steps 1–4 and step 7 are the same for all models. If you pickle the feature columns from step 5 together with the model in step 1 and unpickle them later, the process becomes fully automated and can be used for productionizing any batch prediction model built with scikit-learn. I have used it many times to automate complete machine learning batch predictions, as sketched below.
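One possible way to bundle the feature list with the model (the dict keys here are my own naming, not taken from the original code):

```python
import pickle

# Step 1, revised: pickle the model and its feature column list together.
with open("rf_fraud_model.pkl", "wb") as f:
    pickle.dump({"model": rf, "feature_cols": feature_cols}, f)

# In the Spark job (model_bytes as read in steps 2 & 3), unpickle both:
bundle = pickle.loads(model_bytes)
broadcast_model = sc.broadcast(bundle["model"])
feature_cols = bundle["feature_cols"]
```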

Conclusion

Productionizing ML models with Spark batch prediction is a simple, fast-to-implement solution for many problems. Depending on how frequently the input data changes, batch predictions can be scheduled once or many times a day. Batch predictions using Spark are stateless and can also be combined with real-time predictions running in parallel. Spark batch jobs can be scheduled on YARN with logging and monitoring functionalities. And since Spark is fully scalable and you can manage the cluster on your own, batch prediction is very fast.

To access the complete code for batch prediction using Spark, please refer to the GitHub link.

If you are interested in learning about model deployment strategies and real-time deployment of ML models, refer to my post on Productionizing Machine Learning Models.

Thanks for the read. If you like the story please like, share, and follow for more such content. As always, please reach out for any questions/comments/feedback.

Github: https://github.com/charumakhijani
LinkedIn:
https://www.linkedin.com/in/charu-makhijani-23b18318/
