
The first time I looked at a data set that contained over a billion rows of data, I thought my COUNT(*) was wrong.
It wasn’t.
That didn’t change the fact that I still had to analyze the data. The business had data (lots, apparently), the business needed insight from that data, and the business had no idea how to get it. So I swallowed the lump in my throat (it was my first real data science job) and did what any junior data scientist would have done at the time: port the data set into the enterprise analytics behemoth…SAS Enterprise Guide (EG; now I’m dating myself).
My anxiety didn’t stop there. Never mind the massive, vertically scaled memory on the server that backed our point-and-click SAS EG software; it still wasn’t robust enough to handle the overly complex cluster analysis I was attempting. Memory errors abounded. So I took the next step any junior data scientist would take at this stage of a project: I started Googling.
And this was my first introduction to distributed data science.
Since then, I have spent a considerable part of my career supporting big data analytics on different distributed frameworks. And though my anxiety has eased, the data sets keep getting larger (thanks, IoT: https://www.iotacommunications.com/blog/iot-big-data/ …psht), and the world of distributed data science continues to change and evolve rapidly.
So today I want to provide my way of thinking about how those evolving frameworks fit in the context of distributed data science workflows. Let’s get to it!
Distributed Data In Vs. Distributed Data Out
The biggest distinction to make when considering distributed data science frameworks is between exploratory data analysis (EDA) and data science deployment: distributed data in versus distributed data out. Let’s start with EDA.
To be clear, I am taking a few liberties with the acronym "EDA" by lumping in all the things data scientists need to do before finalizing a trained model for deployment on new data. The typical pipeline includes accessing data sources, feature engineering and discovery, and model training.
In 2001, Banko and Brill, two researchers at Microsoft, demonstrated that adding more training data improved model performance more than the choice of model type did. That idea has stuck with business leaders, even though there are several situations where it does not necessarily hold true.
Consequently, businesses are tasking their data scientists with leveraging ever larger data sets in their work, which means the data sets in the EDA phase of data science keep growing too.
In other words, nearly all data scientists will face problems in the EDA phase that would benefit from distributed frameworks.
The poster child for dealing with large data sets is the Apache Spark framework, which supports Python (the language of data science 😊) via the PySpark implementation. While Spark allows data scientists to wrangle large data sets by distributing them across cores (vertical) and servers (horizontal), we still run into issues when it comes to training models.
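To give a flavor of what that looks like, here is a minimal PySpark sketch of the data wrangling side of EDA. It assumes a running Spark session (local or on a cluster); the dataset path and column names are purely illustrative.

```python
# A minimal PySpark sketch of distributed EDA. Assumes a running Spark
# session; the dataset path and column names below are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("eda-example").getOrCreate()

# Lazily read a large, partitioned dataset (hypothetical path)
df = spark.read.parquet("s3://my-bucket/sensor-events/")

# A typical EDA-style aggregation, executed in parallel across the cluster
summary = (
    df.groupBy("device_id")
      .agg(F.count("*").alias("n_events"),
           F.avg("reading").alias("avg_reading"))
)
summary.show(10)
```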
To solve the model training problem, Spark introduced its machine learning library (MLlib, with its DataFrame-based API commonly referred to as Spark ML), which can train a limited set of model types across clusters on those massive data sets. In addition to Spark ML, the Spark ecosystem includes a number of other tools to support the full spectrum of "EDA" in the data science pipeline.
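And here is a hedged sketch of the model training side using the pyspark.ml API. The tiny in-memory DataFrame and column names are stand-ins for the large, distributed DataFrames you would actually train on.

```python
# A hedged sketch of distributed model training with pyspark.ml. The tiny
# in-memory DataFrame stands in for a large, distributed one; column
# names are made up.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("sparkml-example").getOrCreate()

train = spark.createDataFrame(
    [(1.0, 12.0, 0.3), (0.0, 3.0, 0.9), (1.0, 20.0, 0.1)],
    ["label", "n_events", "avg_reading"],
)

# Assemble feature columns into a single vector, then fit a model;
# both steps run across the cluster's executors.
assembler = VectorAssembler(inputCols=["n_events", "avg_reading"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(train)
model.transform(train).select("label", "prediction").show()
```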

Although Spark is open source, configuring it to run properly on a cluster of servers is not trivial, and enterprise security and patching protocols make it even harder to implement. For enterprise data scientists, this led to the rise of distributed data science platforms like Databricks and Watson Studio. Today these platforms are largely cloud-based, so the Spark implementation on the backend is fully managed, relieving IT teams of the need to figure out Spark configuration on internal server clusters.
Less popular, but Python-native, is the distributed framework Dask. Dask is a lighter-weight (in terms of compute resources) framework for parallelizing numeric computations. It works with common Python libraries like NumPy and pandas but lacks some of the data engineering functionality available in Spark, such as Spark SQL. More details on the differences between the two can be found here.
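For comparison, a minimal Dask sketch might look like the following; the file glob and column names are hypothetical, and the local cluster stands in for whatever distributed setup you have.

```python
# A minimal Dask sketch. The Client starts a local scheduler and workers
# (it could just as easily connect to a multi-node cluster); the file
# glob and column names are hypothetical.
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local cluster by default

# Many CSV files treated as one logical, partitioned dataframe
df = dd.read_csv("data/sensor-events-*.csv")

# Pandas-like operations build a lazy task graph; nothing runs yet
device_avg = df.groupby("device_id")["reading"].mean()

# .compute() triggers parallel execution and returns a pandas object
print(device_avg.compute().head())
```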
Like Spark, Dask also has a suite of ML models that can be parallelized using the Dask framework, called (not surprisingly) Dask-ML.
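A quick, hedged sketch of Dask-ML: its estimators mirror scikit-learn's API but accept Dask arrays and dataframes, so fitting is spread across partitions. The synthetic data below is purely for illustration.

```python
# A hedged sketch of Dask-ML: estimators mirror the scikit-learn API but
# accept Dask arrays/dataframes, so fitting is spread across partitions.
# The synthetic data here is purely for illustration.
from dask_ml.datasets import make_classification
from dask_ml.model_selection import train_test_split
from dask_ml.linear_model import LogisticRegression

# A large, chunked Dask array that never has to fit on a single worker
X, y = make_classification(n_samples=1_000_000, chunks=100_000)
X_train, X_test, y_train, y_test = train_test_split(X, y)

clf = LogisticRegression()
clf.fit(X_train, y_train)          # training runs across the distributed chunks
print(clf.score(X_test, y_test))
```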
Spark and Dask allow for both data engineering and model training, but there are other distributed frameworks meant purely for parallelizing model training, the most popular being TensorFlow and PyTorch. In fact, PyTorch just released version 1.9, which includes more support for distributed model training.
The key benefit to using these more model-focused frameworks is that they make it easier to train models in clusters that use GPUs.
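As a hedged example, TensorFlow's MirroredStrategy replicates a Keras model across whatever GPUs are visible and synchronizes gradients automatically; the architecture and random data below are placeholders.

```python
# A hedged sketch of GPU-parallel training with TensorFlow's
# MirroredStrategy: the model is replicated on each visible GPU and
# gradients are synchronized automatically. The architecture and random
# data below are placeholders.
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():  # variables created here are mirrored across GPUs
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

X = np.random.rand(10_000, 20).astype("float32")
y = np.random.randint(0, 2, size=(10_000, 1))
model.fit(X, y, epochs=2, batch_size=256)
```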
Newer to the distributed data science scene is Ray. Ray works seamlessly with Python and contains highly specialized libraries for two major use cases that other frameworks haven’t tackled as well: reinforcement learning and model serving. The former is a special case Ray solves for, letting data scientists leverage clusters to train reinforcement learning models. Ray also integrates with PyTorch and TensorFlow, making it an interesting framework to follow as it develops.
The fact that Ray also allows for seamless model serving with Ray Serve makes it a unique library for going from "EDA" to deployment.
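At its core, Ray's API is refreshingly simple: decorate a function with @ray.remote and calls to it are scheduled across the cluster's workers, returning futures. The scoring function below is a made-up placeholder.

```python
# A hedged sketch of Ray's core API: @ray.remote turns a function into a
# task that Ray schedules across workers. The scoring function and inputs
# are made-up placeholders.
import ray

ray.init()  # starts a local Ray instance, or connects to a cluster

@ray.remote
def score_batch(batch):
    # stand-in for real model inference
    return [x * 2 for x in batch]

batches = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
futures = [score_batch.remote(b) for b in batches]  # runs in parallel
print(ray.get(futures))                             # gather the results
```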
Moving to Distributed Data Out (AKA Distributed Deployments)
Working with data and training models is one thing; deploying those models to work with new data is another, and it can be managed with different distributed toolkits. The key consideration data scientists need to account for is whether they expect their deployments to run as jobs on large batches of data, as services in a more on-demand fashion, or as some hybrid of the two.
Batch deployments are situations where new data are refreshed in a data store and the model needs to score all of that new data upon refresh. In other words, model scoring often handles a load similar to the one experienced during model training. In these batch deployment scenarios, we can use the same backend frameworks the pipelines were built on, since the compute requirements will be similar. Thus, Spark or Dask can be used in deployment, with the scoring code triggered on a schedule or via an API call from any number of orchestration tools.
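A batch scoring job might look something like this hedged sketch: load a persisted Spark ML pipeline, score the freshly refreshed table, and write the results back. Paths, table names, and columns are illustrative; the scheduler simply runs the script after each refresh.

```python
# A hedged sketch of a batch scoring job: load a persisted Spark ML
# pipeline, score the refreshed table, and write results back. Paths and
# columns are illustrative; a scheduler (cron, Airflow, etc.) runs this
# script after each data refresh.
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("nightly-scoring").getOrCreate()

model = PipelineModel.load("s3://my-bucket/models/churn_pipeline")
new_data = spark.read.parquet("s3://my-bucket/refreshed/customers/")

scored = model.transform(new_data)
(scored.select("customer_id", "prediction")
       .write.mode("overwrite")
       .parquet("s3://my-bucket/scores/customers/"))
```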
Where things get more interesting is when data need to be scored on demand. For example, say you train a model that ingests sensor data from a phone via a data streaming service like Kafka and predicts the likelihood that someone has just fallen. The model needs to score small amounts of data very quickly, so it doesn’t require the same compute resources that large batch jobs do. Now imagine this model is ingesting data from 1 million smartphones. How best can we deploy our models to handle both the speed and compute requirements in this situation?
In instances like these, Docker containers and Kubernetes become valuable tools that allow both the model code and the infrastructure supporting it to be rapidly parallelized. Containers are like virtual environments, but they package the model code together with its runtime and system dependencies, so they run the same way regardless of the host machine. Kubernetes is a framework for managing and replicating containers across multiple machines.
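The code inside one of those containers can be as simple as a lightweight scoring service. Here is a minimal sketch using Flask and a hypothetical pre-trained model artifact; Kubernetes would then replicate this container to handle the load.

```python
# A minimal sketch of a lightweight scoring service you might package into
# a container and replicate with Kubernetes. The model artifact, feature
# payload, and port are assumptions; any web framework would do.
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

with open("fall_detector.pkl", "rb") as f:  # hypothetical pre-trained model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()            # e.g. {"features": [0.1, 0.4, ...]}
    proba = model.predict_proba([payload["features"]])[0][1]
    return jsonify({"fall_probability": float(proba)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```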
The downside to Kubernetes, like Spark, is its complexity to configure. But just as Databricks did for Spark, cloud platforms now offer fully managed Kubernetes or Kubernetes-like services. These are some of the fundamental concepts behind serverless functions like AWS Lambda, and they represent new options for deploying our data science in these unique situations.
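The same scoring logic translates almost directly to a serverless function. Below is a hedged sketch of an AWS Lambda handler; the model artifact and event shape are assumptions, and loading the model outside the handler lets warm invocations reuse it.

```python
# A hedged sketch of the same scoring logic as a serverless function.
# AWS Lambda invokes lambda_handler(event, context) per request and the
# platform handles scaling; the model artifact and event shape are
# assumptions.
import json
import pickle

with open("fall_detector.pkl", "rb") as f:  # hypothetical model bundled with the function
    model = pickle.load(f)

def lambda_handler(event, context):
    features = json.loads(event["body"])["features"]
    proba = model.predict_proba([features])[0][1]
    return {
        "statusCode": 200,
        "body": json.dumps({"fall_probability": float(proba)}),
    }
```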
With More Use-Cases Comes More Complexity
In this post, I have provided a way of organizing how to think about distributed data science frameworks and a few basics on how those popular frameworks operate within the data science pipeline. There are undoubtedly more complex hybrid scenarios, such as running a larger distributed compute environment like Spark inside containers across Kubernetes pods, parallelizing the already parallelized to manage greater data needs.
It is my hope that you find this way of organizing your thinking about these complex data science issues helpful and, as always, I welcome feedback.
Want to engage and learn more about data science, career growth, or poor business decisions? Join me.