Scale your pandas workflows by changing one line of code
Pandas is a library that needs no introduction in the field of Data Science. It provides high-performance, easy-to-use data structures and data analysis tools. However, when working with very large amounts of data, Pandas on a single core becomes insufficient, and people have to resort to distributed systems to increase performance. That improved performance comes at the cost of a steep learning curve. Most users probably just want Pandas to run faster and aren’t looking to optimize their workflows for their particular hardware setup. In other words, people want to use the same Pandas script for a 10KB dataset as for a 10TB dataset. Modin aims to provide a solution by optimizing pandas, so that Data Scientists spend their time extracting value from their data rather than on the tools that extract it.
Modin

Modin is an early-stage project at UC Berkeley’s RISELab designed to facilitate the use of distributed computing for Data Science. It is a multiprocess Dataframe library with an identical API to pandas that allows users to speed up their Pandas workflows.
Modin accelerates Pandas queries by 4x on an 8-core machine, only requiring users to change a single line of code in their notebooks. The system has been designed for existing Pandas users who would like their programs to run faster and scale better without significant code changes. The ultimate goal of this work is to be able to use Pandas in a cloud setting.
Installation
Modin is completely open-source and can be found on GitHub: https://github.com/modin-project/modin
Modin can be installed from PyPI:
pip install modin
On Windows, note that one of Modin’s dependencies is Ray. At the time of writing, Ray is not yet supported natively on Windows, so installing it requires WSL (Windows Subsystem for Linux).
How Modin speeds up the execution
On a Laptop
Consider a modern 4-core laptop and a dataframe that fits comfortably in its memory. While pandas uses only one of the CPU cores, Modin uses all of them.

Essentially, Modin spreads the work across all cores of the CPU, which increases utilisation and gives better performance.
On a Large Machine
On a large machine, the usefulness of Modin becomes much more pronounced. Imagine a server or some other very powerful machine: pandas will still utilise a single core, while Modin will again use all of them. Here is a performance comparison of read_csv with pandas and Modin on a 144-core computer.

The pandas timings scale linearly with file size, but that is because pandas is still using only one core. The green bars for Modin may be hard to see because they are so low.
Typically, 2 gigabytes takes about 2 seconds and 18 gigabytes takes a little under 18 seconds.
Architecture
Let’s have a look into the architecture of Modin.
DataFrame Partitioning
Modin partitions a DataFrame along both columns and rows, which gives it flexibility and scalability in both the number of columns and the number of rows supported.
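To make the idea of partitioning along both axes concrete, here is a minimal sketch using NumPy. It is not Modin's implementation: the helper name partition_2d and the grid sizes are assumptions chosen purely for illustration.

```python
import numpy as np


def partition_2d(values, row_parts, col_parts):
    """Split a 2-D array into a grid of row_parts x col_parts blocks.

    Hypothetical helper for illustration; Modin's real partitioning
    logic is more involved.
    """
    row_blocks = np.array_split(values, row_parts, axis=0)
    return [np.array_split(block, col_parts, axis=1) for block in row_blocks]


data = np.arange(6 * 4).reshape(6, 4)
grid = partition_2d(data, row_parts=3, col_parts=2)

# Each block is independent, so work such as a sum can be done per block.
block_sums = [[int(block.sum()) for block in row] for row in grid]

# Stitching the grid back together recovers the original array.
restored = np.block(grid)
```

Because every block can be handled on its own, a wide table and a long table can both be cut into pieces of manageable size.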

System Architecture
Modin is separated into different layers:
- The pandas API is exposed at the topmost layer.
- The next layer houses the Query Compiler, which receives queries from the pandas API layer and performs certain optimizations.
- The last layer is the Partition Manager, which is responsible for the data layout and for shuffling, partitioning, and serializing the tasks that get sent to each partition.
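The layering described above can be sketched as a toy in plain Python. All class and method names here (ToyDataFrame, QueryCompiler.column_sum, PartitionManager.map_reduce) are hypothetical and only mirror the three roles; they are not Modin's actual classes or signatures.

```python
class PartitionManager:
    """Bottom layer: owns the data layout (here, plain lists of rows)."""

    def __init__(self, rows, n_partitions):
        size = max(1, -(-len(rows) // n_partitions))  # ceiling division
        self.partitions = [rows[i:i + size] for i in range(0, len(rows), size)]

    def map_reduce(self, map_fn, reduce_fn):
        # In a real system each map_fn call could run on its own core;
        # this toy runs them one after another.
        return reduce_fn([map_fn(part) for part in self.partitions])


class QueryCompiler:
    """Middle layer: turns an API call into operations on partitions."""

    def __init__(self, manager):
        self.manager = manager

    def column_sum(self, col):
        return self.manager.map_reduce(
            lambda part: sum(row[col] for row in part),
            sum,
        )


class ToyDataFrame:
    """Top layer: the user-facing API."""

    def __init__(self, rows, n_partitions=2):
        self._compiler = QueryCompiler(PartitionManager(rows, n_partitions))

    def sum(self, col):
        return self._compiler.column_sum(col)


frame = ToyDataFrame([[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]], n_partitions=2)
total = frame.sum(1)
```

The user only ever touches the top layer, which is why the layers underneath can change how data is laid out without changing the API.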

Implementing pandas API in Modin
The pandas API is massive, which is probably one reason it has such a wide range of use cases.

With so many operations at hand, Modin followed a data-driven approach: its creators looked at what people use most in pandas. They went to Kaggle, did a massive scrape of the notebooks and scripts there, and worked out the most popular pandas methods, which are as follows:

pd.read_csv is by far the most used method in pandas, followed by pd.DataFrame.
Therefore, the Modin developers started implementing and optimizing methods in order of popularity:
- Currently, modin supports about 71% of the pandas API.
- This represents about 93% of usage based on the study.
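The gap between the two percentages comes from weighting each method by how often it is called. The sketch below shows the arithmetic with made-up call counts; the method list and numbers are hypothetical and are not the figures from the Kaggle study.

```python
from collections import Counter

# Hypothetical call counts, NOT data from the study.
usage = Counter({"read_csv": 500, "DataFrame": 300, "groupby": 150, "pivot": 40, "xs": 10})

# Implementing in order of popularity means starting from the top of this list.
implementation_order = [name for name, _ in usage.most_common()]
implemented = set(implementation_order[:3])

# Share of distinct methods covered versus share of actual calls covered.
api_coverage = len(usage.keys() & implemented) / len(usage)
usage_coverage = sum(usage[name] for name in implemented) / sum(usage.values())
```

With these invented numbers, covering 3 of 5 methods already accounts for 950 of 1,000 calls, which is the same effect the two bullets describe.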
Ray
Modin uses Ray to provide an effortless way to speed up pandas notebooks, scripts, and libraries. Ray is a high-performance distributed execution framework targeted at large-scale machine learning and reinforcement learning applications. The same code can be run on a single machine to achieve efficient multiprocessing, and it can be used on a cluster for large computations. You can find Ray on GitHub: github.com/ray-project/ray.
Usage
Importing
Modin wraps pandas and transparently distributes the data and computation, accelerating pandas workflows with a one-line code change. Users can keep using their existing pandas notebooks while experiencing a considerable speedup from Modin, even on a single machine. Only the import statement needs to change: import modin.pandas rather than plain pandas.
import numpy as np
import modin.pandas as pd
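If a script has to run on machines where Modin may not be installed, one option is a guarded import. This is only a sketch of a common Python pattern, not something the Modin project prescribes; the backend variable is a hypothetical name used here for bookkeeping.

```python
try:
    import modin.pandas as pd  # parallel drop-in, if Modin is installed
    backend = "modin"
except ImportError:
    import pandas as pd  # plain pandas as a fallback
    backend = "pandas"

# The rest of the script is identical for either backend.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
total = int(df["a"].sum())
```

Everything after the import uses the pd name, so the rest of the code does not need to know which library is behind it.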

Let’s build a toy dataset using Numpy consisting of random integers. Notice, we don’t have to specify partitioning here.
data = np.random.randint(0, 100, size=(2**16, 2**4))
df = pd.DataFrame(data)
df = df.add_prefix("Col:")
When we print out the type, it is a Modin dataframe.
type(df)
modin.pandas.dataframe.DataFrame
If we print the first 5 rows with the head command, it renders an HTML table just like pandas would.
df.head()

Comparisons
Modin manages the data partitioning and shuffling so that users can focus on extracting value from the data. The following code was run on a 2013 4-core iMac with 32GB RAM.
pd.read_csv
read_csv is by far the most used pandas operation. Let’s do a quick comparison of read_csv in pandas versus Modin.
- pandas
%%time
import pandas
pandas_csv_data = pandas.read_csv("../800MB.csv")
-----------------------------------------------------------------
CPU times: user 26.3 s, sys: 3.14 s, total: 29.4s
Wall time: 29.5 s
- Modin
%%time
modin_csv_data = pd.read_csv("../800MB.csv")
-----------------------------------------------------------------
CPU times: user 76.7 ms, sys: 5.08 ms, total: 81.8 ms
Wall time: 7.6 s
With Modin, read_csv performs up to 4x faster on a 4-core machine just by changing the import statement.
df.groupby
The pandas groupby is extremely well written and extremely fast. Even so, Modin comes out slightly ahead of pandas in wall time here.
- pandas
%%time
import pandas
_ = pandas_csv_data.groupby(by=pandas_csv_data.col_1).sum()
-----------------------------------------------------------------
CPU times: user 5.98 s, sys: 1.77 s, total: 7.75 s
Wall time: 7.74 s
- modin
%%time
results = modin_csv_data.groupby(by=modin_csv_data.col_1).sum()
-----------------------------------------------------------------
CPU times: user 3.18 s, sys: 42.2 ms, total: 3.23 s
Wall time: 7.3 s
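The wall times quoted above can be turned into speedup factors with a few lines of Python. The speedup helper is a hypothetical name; the inputs are the wall times from this one run, so the ratios describe this machine and file only.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Wall times taken from the %%time outputs above.
read_csv_speedup = speedup(29.5, 7.6)
groupby_speedup = speedup(7.74, 7.3)

print(f"read_csv: {read_csv_speedup:.2f}x, groupby: {groupby_speedup:.2f}x")
# prints "read_csv: 3.88x, groupby: 1.06x"
```

Wall time is the number to compare here, because the CPU times reported in the notebook for Modin are much smaller than its wall time and would overstate the gain.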
Default to pandas implementation
If you want to use a pandas API that hasn’t yet been implemented or optimized in Modin, Modin defaults to the pandas implementation. This keeps the system usable for notebooks that rely on operations Modin does not yet support, at the cost of a drop in performance because the operation now runs through pandas. When defaulting to pandas, you will see a warning:
dot_df = df.dot(df.T)

It returns a distributed Modin DataFrame once the computation is complete.
type(dot_df)
-----------------
modin.pandas.dataframe.DataFrame
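One way to picture this fallback is: gather the partitions into a single object, run the ordinary single-process operation, then partition the result again. The toy below sketches that idea in plain Python. ToyPartitionedFrame, default_to_single, and the warning text are all hypothetical, and the use of UserWarning is an assumption rather than a statement about Modin's internals.

```python
import warnings


class ToyPartitionedFrame:
    """Rows split into partitions; a stand-in for a distributed frame."""

    def __init__(self, rows, n_partitions=2):
        size = max(1, -(-len(rows) // n_partitions))  # ceiling division
        self.n_partitions = n_partitions
        self.partitions = [rows[i:i + size] for i in range(0, len(rows), size)]

    def default_to_single(self, name, func):
        """Run func on the gathered rows, then repartition the result."""
        warnings.warn(f"`{name}` defaulting to single-process implementation", UserWarning)
        gathered = [row for part in self.partitions for row in part]
        return ToyPartitionedFrame(func(gathered), self.n_partitions)


def transpose(rows):
    return [list(col) for col in zip(*rows)]


frame = ToyPartitionedFrame([[1, 2], [3, 4], [5, 6], [7, 8]])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flipped = frame.default_to_single("transpose", transpose)
```

The gather step is where the performance drop comes from in this toy: all the data passes through one process before it is split up again.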
Conclusions
Modin is still in its early stages and appears to be a very promising add-on to pandas. Modin handles all the partitioning and shuffling for the user so that we can focus on our workflows. Modin’s basic goal is to enable users to use the same tools on small data as on big data, without having to worry about changing the API to suit different data sizes.
Edit: Since this article was published, many people have asked me how Modin differs from Dask in what it offers. Here is a detailed comparison by the author:
Query: What is the difference between Dask and Modin? · Issue #515 · modin-project/modin
References
https://rise.cs.berkeley.edu/blog/modin-pandas-on-ray-october-2018/