
I think `pandas` needs no introduction. It is a great and versatile tool that data scientists use on a daily basis and will most likely continue to use. But there are also some potential challenges we might face when using `pandas`.
The biggest issue is connected to the volume of data, which can definitely become a problem in the age of big data. And while there are many tasks that do not involve such vast volumes of data, sooner or later we might run into one of them. In such a case, we can try a few tricks. First, we can try to optimize the data types of the variables stored in a DataFrame to make the data fit into memory. Alternatively, we can load only chunks of the whole data at a time. Both tricks are sketched below.
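To make these two tricks concrete, here is a minimal sketch, assuming a hypothetical `transactions.csv` file; the column names and chosen dtypes are purely illustrative.

```python
import pandas as pd

# 1) request smaller dtypes while reading to shrink the memory footprint
df = pd.read_csv(
    "transactions.csv",
    dtype={"product_id": "int32", "quantity": "int16", "category": "category"},
)

# 2) or process the file in chunks that comfortably fit into memory
total_quantity = 0
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    total_quantity += chunk["quantity"].sum()
print(total_quantity)
```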
Those solutions can often help, but sometimes they are simply not enough and we can end up running out of memory or with operations that become unbearably slow. And in exactly those cases, we might want to move away from pandas
and use a better tool for the job.
The goal of the article is not to provide a performance comparison of all the possible approaches. Instead, I want to present the possible alternatives to pandas
and briefly cover their potential use-case, together with their strengths and weaknesses. Then, you can choose which of the solutions fits your needs and you can dive deeper into the nitty-gritty details of the implementation.
Recap on threads vs. processes
Throughout the article, we will be mentioning running parallel operations using either threads or processes. I think it calls for a quick refresher:
- processes do not share memory, and each one runs on its own core. They are better suited for compute-intensive tasks that do not have to communicate with each other.
- threads share memory and are lightweight. In Python, due to the Global Interpreter Lock (GIL), two threads cannot execute Python code at the same time within the same process. As a result, only some operations (mostly I/O-bound ones, or those that release the GIL) can effectively be run in parallel using threads.
For more information on threads vs. processes, please refer to this article.
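To make the distinction a bit more tangible, below is a minimal sketch using the standard library's `concurrent.futures`; the CPU-bound workload is purely illustrative.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_bound(n):
    # a heavy, pure-Python computation
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # processes: separate memory, multiple cores -> good for CPU-bound work
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(cpu_bound, [5_000_000] * 4)))

    # threads: shared memory, but the GIL prevents running Python bytecode
    # in parallel -> little benefit for this CPU-bound function
    with ThreadPoolExecutor() as pool:
        print(list(pool.map(cpu_bound, [5_000_000] * 4)))
```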
The list of pandas alternatives
In this section, we cover the most popular (as of early 2022) alternatives to `pandas`. The order of the list is not a ranking from best to worst, or any ranking for that matter. I simply try to present the approaches and, while doing so, introduce some structure where there are logical bridges between the various solutions. Let’s start!
Dask – ~10k GitHub stars
Dask is an open-source library for distributed computing. In other words, it facilitates running many computations at the same time, either on a single machine or on many separate computers (cluster). For the former, Dask allows us to run computations in parallel using either threads or processes.

Dask relies on a principle called lazy evaluation. It means that the operations are not carried out until we explicitly ask for them (using the `compute` function). By delaying the operations, Dask creates a queue of transformations/calculations, so that they can later be executed in parallel.
Under the hood, Dask breaks a single large data processing job into many smaller tasks, which are then handled by `numpy` or `pandas`. Afterward, the library reassembles the results into a coherent whole.
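A minimal sketch of this lazy behavior (the file pattern and column names are hypothetical):

```python
import dask.dataframe as dd

# builds a task graph; no data is read yet
ddf = dd.read_csv("transactions_*.csv")

# still lazy – we only keep extending the graph
result = ddf.groupby("product_id")["quantity"].sum()

# only now are the tasks executed in parallel and the result returned as pandas
print(result.compute())
```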
Some key points about Dask:
- a good choice for extending a data processing workload from a single machine to a distributed cluster (possibly one with 1000s of cores). We can easily test some tasks on a local machine with a sample of the whole dataset, and then re-use the very same code to run on the full data on a cluster.
- the data does not have to fit into memory; it only needs to fit on disk.
- it builds on top of existing, well-known objects such as `numpy` arrays and `pandas` DataFrames – there is no need to discard the current approach and rewrite something from scratch.
- the API is very similar to that of `pandas`, with the exception that it is lazy.
- Dask provides a task scheduling interface for more custom workloads and integration with other projects. Additionally, it provides a lot of interactive graphs and visualizations of the task distribution for in-depth analysis and diagnostics.
- Dask is not only about data processing. Dask-ML (a separate library) provides scalable machine learning in Python, using Dask alongside popular machine learning libraries like `scikit-learn`, `xgboost`, `lightgbm`, and others. It helps with scaling both the data size and the model size. For example, many `scikit-learn` algorithms are written for parallel execution using `joblib` (it powers the well-known `n_jobs` parameter). Dask scales these algorithms out to a cluster of machines by providing an alternative `joblib` backend, as sketched below.
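As a rough illustration of that last point, here is a minimal sketch of swapping in the Dask backend for `joblib`, assuming `dask.distributed` and `scikit-learn` are installed; the model and data are purely illustrative.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

client = Client()  # local cluster by default; point it at a real cluster to scale out
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

model = RandomForestClassifier(n_estimators=200, n_jobs=-1)
with joblib.parallel_backend("dask"):
    model.fit(X, y)  # joblib tasks are now scheduled on the Dask workers
```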
Useful references:
Modin – ~7k GitHub stars
Modin is a library designed to parallelize `pandas` DataFrames by automatically distributing the computation across all of the system’s available CPU cores. Thanks to that, Modin claims to achieve a nearly linear speed-up with respect to the number of CPU cores on your system.
So how does that happen? Modin simply divides an existing DataFrame into different parts such that each part can be sent to a different CPU core. And to be more precise, Modin partitions the DataFrames across both rows and columns, which makes its parallel processing highly scalable for DataFrames of any size and shape.
The authors of the library focus on prioritizing the data scientists’ time over hardware time. That is why Modin:
- has the same API as `pandas` and hence comes with no additional learning cost for data scientists. Unlike most of the other libraries, it aims for full coverage of the `pandas` API. At the moment of writing, it offers >90% of `pd.DataFrame`‘s and >88% of `pd.Series`‘s functionalities. If some function/method is not implemented, Modin defaults to `pandas`, so in the end all commands are executed.
- is very simple to run and serves as a drop-in replacement for `pandas`. In fact, we just need to change one import line to `import modin.pandas as pd` (see the sketch after this list).
- offers fluent integration with the Python ecosystem.
- runs not only on your local machine, but also on Ray/Dask clusters. We have covered Dask before, so it only makes sense to mention Ray. Ray is a high-performance distributed execution framework. The very same code can be run on a single machine (with efficient multiprocessing) and on a dedicated cluster for large-scale computations.
- supports out-of-core mode in which Modin uses the disk as overflow storage for memory. This way, we can work with datasets far bigger than our RAM.
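A minimal sketch of the drop-in replacement in action (the file name is hypothetical); everything below the import is plain `pandas` syntax, executed in parallel by Modin:

```python
import modin.pandas as pd  # the only line that changes

df = pd.read_csv("transactions.csv")
summary = df.groupby("product_id")["quantity"].sum()
print(summary.head())
```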
So how is Modin different from Dask? There are a few differences worth mentioning:
- as opposed to Dask, Modin offers full `pandas` compatibility. For the sake of scalability, Dask DataFrames use row-based storage, which is why they cannot fully support all `pandas` functionalities. In contrast, Modin is designed as a flexible column store.
- Dask DataFrames need to be explicitly computed (as they are lazy) using the `compute` method. In Modin, all the optimizations on the user’s query are executed under the hood, without any input from the user.
- the number of partitions in Dask must be explicitly stated.
- Modin can run on top of Dask, but it was originally built to work with Ray.
For even more information about the differences, please see the links below.
Useful references:
- https://github.com/modin-project/modin
- Scaling Interactive Data Science Transparently with Modin: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-191.pdf
- https://rise.cs.berkeley.edu/blog/pandas-on-ray-early-lessons/
- a post describing the differences between Dask and Modin: https://github.com/modin-project/modin/issues/515#issuecomment-477722019
swifter – ~2k GitHub stars
`swifter` is an open-source library that tries to apply any function to a `pandas` DataFrame or Series in the fastest available manner. `apply` is an incredibly useful method, as it allows us to easily apply any function to a `pandas` object. However, it comes at a price – under the hood, the function acts as a for loop, which results in poor speed.
Other than vectorizing the function in the first place, there are already quite a lot of parallel alternatives mentioned in this very article. So where does `swifter` fit into all of this?
We have mentioned that it tries to apply the function in the fastest possible way. First, if possible, `swifter` vectorizes the function. If that is not possible, it estimates what will be faster: parallel processing using Dask/Modin or a simple `pandas` `apply`.
Key features of `swifter`:
- low learning curve – it’s a matter of adding `swifter` to the `apply` method chain. For example: `df["col_out"] = df["col_in"].swifter.apply(some_function)`
- as of now, `swifter` supports speeding up the following methods: `apply`, `applymap`, `rolling().apply()`, and `resample().apply()`
- it benefits from the potential of libraries such as Dask/Modin
- we should not blindly throw any function at `swifter` and hope for the best. When writing our UDFs, we should try to allow the function to be vectorized. One example would be using `np.where` instead of an if-else conditional flow (see the sketch below).
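For illustration, here is a minimal sketch of a vectorization-friendly UDF combined with `swifter`; the data and function are made up for this example.

```python
import numpy as np
import pandas as pd
import swifter  # noqa: F401 – registers the .swifter accessor on pandas objects

df = pd.DataFrame({"col_in": np.random.randn(1_000_000)})

def label(x):
    # np.where instead of an if-else block keeps the function vectorizable
    return np.where(x > 0, "positive", "non-positive")

df["col_out"] = df["col_in"].swifter.apply(label)
```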
Useful references:
- https://github.com/jmcarpenter2/swifter
- https://medium.com/@jmcarpenter2/swifter-1-0-0-automatically-efficient-pandas-and-modin-dataframe-applies-cfbd9555e7c8
- https://github.com/jmcarpenter2/swifter/blob/master/examples/swifter_apply_examples.ipynb
vaex – ~7k GitHub stars
Vaex is another open-source DataFrame library, which specializes in lazy out-of-core DataFrames.
Probably the biggest highlight of the library is that Vaex requires negligible amounts of RAM for inspection and interaction with a dataset of arbitrary size. That is possible thanks to a combination of lazy evaluations and memory mapping. The latter is a technique where you tell the operating system that you want a piece of memory to be in sync with the content on the disk. When some piece of memory is not modified or used for a while, it will be discarded so that RAM can be reused.
In practice, when we open a file with Vaex, no data is actually read. Instead, Vaex only reads the file’s metadata: location of the data on disk, the structure of the data (number of rows/columns, column names and types), the file description, etc.
That is why one of the requirements of benefiting from Vaex is storing the data in a memory mappable file format, for example, Apache Arrow, Apache Parquet, or HDF5. If we fulfill this requirement, Vaex will open such a file instantly, regardless of how large it is, or how much RAM we have at our disposal.
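A minimal sketch of this workflow, assuming the data has already been converted to a memory-mappable format such as HDF5 (file and column names are hypothetical):

```python
import vaex

df = vaex.open("transactions.hdf5")   # opens instantly; only metadata is read

# filtering creates a shallow copy – no data is duplicated in RAM
high_volume = df[df.quantity > 100]

# a virtual column: only the expression is stored, values are computed lazily
df["revenue"] = df.quantity * df.price

print(df.revenue.sum())               # evaluated out-of-core, using all cores
```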
Key features of Vaex:
- API similar to `pandas`.
- easy to work with very large datasets – by combining memory mapping and lazy evaluations, Vaex is only limited by the amount of free hard-drive space we have.
- while libraries such as Dask focus on letting us scale out code from the local machine to clusters, Vaex focuses on making it easier to work with large datasets on a single machine.
- Vaex creates no memory copies, as filtered DataFrames are only a shallow copy of the original. This means that filtering costs very little memory. Let’s assume we have a 50GB file: many tools would require 50GB to read the file and then approximately the same again for the filtered DataFrame, while Vaex needs close to none.
- virtual columns are created when we transform existing columns of a Vaex DataFrame. They behave just like normal ones, with the key distinction that they use no memory at all. That is due to the fact that Vaex only remembers their definition and does not actually calculate the values. The virtual columns are lazily evaluated only when necessary.
- Vaex is really fast, as the evaluation of virtual columns is fully parallelized and uses C++ implementations of some popular column methods (`value_counts`, `groupby`, etc.). What is more, all of those work out-of-core, which means that we can process much more data than would fit into RAM, while using all available cores.
- The speed also comes from smart optimizations which allow us to calculate some statistics for multiple selections (without making a new reference DataFrame each time) with just one pass over the data. What is even better, we can combine those with `groupby` aggregation, while still passing over the data only once.
- Vaex can further accelerate the evaluation of functions by using Just-In-Time compilation via Numba, Pythran, or CUDA (a CUDA-enabled NVIDIA graphics card is required).
- Vaex adheres to the policy of going over the entire dataset only when it has to, and then with as few passes over the data as possible. For example, when displaying a Vaex DataFrame or column, Vaex reads only the first and last 5 rows from the disk.
- Vaex also provides very fast and memory-efficient string manipulations (almost all `pandas` operations are supported).
- there is also a `vaex.ml` library which implements some common data transformations, for example, PCA, categorical encoders, and numerical scalers. They come with the benefits of a familiar API, parallelization, and out-of-core execution. The library also provides an interface to several of the popular machine learning libraries such as `scikit-learn` or `xgboost`. By using it, we do not waste any memory while doing the data wrangling parts (cleaning, feature engineering, and pre-processing). This allows us to maximize the available RAM for training the model.
That was quite a lot of information. Let’s also briefly cover some differences between Vaex and the previously mentioned approaches.
- Dask is not fully compatible with `pandas`, while Modin aims to be; as a result, both libraries carry some of the baggage inherent to `pandas`. By deviating a bit more from the source (while still being quite similar), Vaex is less constrained in terms of its functionalities (memory-mapped style of querying, etc.)
- Dask and Modin scale to clusters, while Vaex tries to help users avoid the need for clusters altogether by memory-mapping files and using all the available cores of the local machine.
- the author of Vaex describes the relationship between Vaex and Dask as orthogonal. Dask (and Modin) focus mostly on data processing and wrangling, while Vaex also provides the ability to quickly calculate statistics on N-dimensional grids and has some features for easy visualization and plotting large datasets.
For a more in-depth comparison of Vaex and Dask, please see this article.
Useful references:
- https://github.com/vaexio/vaex
- https://vaex.io/
- https://towardsdatascience.com/dask-vs-vaex-a-qualitative-comparison-32e700e5f08b
datatable – ~1.5k GitHub stars
`datatable` is a Python library for manipulating 2-dimensional tabular data. It was developed by H2O.ai and its first user was Driverless.ai. In many ways, it is similar to `pandas`, with special emphasis on speed and big data (up to 100GB) support on a single-node machine.
If you have worked with R, you might already be familiar with the related package called `data.table`, which is R users’ go-to package when it comes to fast aggregation of large data. Python’s implementation attempts to mimic its core algorithms and API.
Speaking of the API, it is actually the "love it or hate it" difference between `datatable` and `pandas` (and R’s `data.frame`). In `datatable`, the primary way of carrying out all operations is the square-bracket notation, which was inspired by traditional matrix indexing. An example would be:
DT[i, j, ...]
where `i` is the row selector, `j` is the column selector, and `...` indicates additional modifiers which might be added. While this is already familiar, because it is exactly the same notation we encounter when indexing matrices or objects in R/`pandas`/`numpy`, there are also some differences.
One of them is that `i` can be anything that can be interpreted as a row selector: an integer, a slice, a range, a list of integers, a list of slices, an expression, a boolean-/integer-valued Frame, a generator, and many more. But that is still familiar and should not be a big issue.
The tricky part arrives when we want to carry out more advanced operations, as `datatable`‘s syntax deviates further from what most of us are used to. For example:
DT[:, sum(f.quantity), by(f.product_id)]
The snippet calculates the sum of quantity over products. The unfamiliar `f` is a special variable that has to be imported from the `datatable` module. It provides a shortcut way of referencing any column in a given Frame.
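Here is a minimal, self-contained sketch of the same kind of query; the Frame contents are made up for illustration.

```python
import datatable as dt
from datatable import f, by

DT = dt.Frame(product_id=[1, 1, 2], quantity=[5, 3, 7])

# sum of quantity per product – the same shape of query as the snippet above
result = DT[:, dt.sum(f.quantity), by(f.product_id)]
print(result)

# converting to pandas (or numpy) is straightforward
pdf = result.to_pandas()
```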
Key points about `datatable`:
- The DataFrames in `datatable` are called Frames and, like their cousins in `pandas`, they are columnar data structures.
- As opposed to `pandas`, the library offers a native-C implementation for all datatypes, including strings. `pandas` does so only for numeric types.
- It offers fast data reading from CSV and other file formats.
- When working with `datatable`, we should store data on disk in the same format as in memory. Thanks to that, we can use memory-mapping of the data on disk and work on out-of-memory datasets. This way, we avoid loading more data into memory than necessary for each particular operation.
- `datatable` uses multi-threaded data processing to achieve maximum efficiency.
- The library minimizes the amount of data copying.
- It is easy to convert `datatable`‘s Frames into `pandas`/`numpy` objects.
Useful references:
cuDF – ~4.5k GitHub stars
[cuDF](https://github.com/rapidsai/cudf) is a GPU DataFrame library and part of NVIDIA’s RAPIDS, a data science ecosystem spanning multiple open-source libraries and leveraging the power of GPUs. cuDF provides an API similar to `pandas` and allows us to benefit from the performance gains without going into the details of CUDA programming.
Key points about `cuDF`:
- `pandas`-like API – in many cases, we just need to change one line of code to start benefiting from the power of GPUs (see the sketch below).
- built using the Apache Arrow columnar memory format.
- `cuDF` is a single-GPU library. However, it can leverage a multi-GPU setup in combination with Dask and the dedicated `dask-cudf` library. With it, we are able to scale `cuDF` across multiple GPUs on a single machine, or multiple GPUs across many machines in a cluster.
- using `cuDF` requires a compatible NVIDIA GPU and some additional setup (updating the drivers, installing CUDA, etc.)
- we should keep in mind that the best performance is obtained as long as the data reasonably fits into the GPU memory.
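A minimal sketch of the `pandas`-like API, assuming a machine with a supported NVIDIA GPU and `cuDF` installed (the file name is hypothetical):

```python
import cudf

gdf = cudf.read_csv("transactions.csv")                 # data is loaded into GPU memory
result = gdf.groupby("product_id")["quantity"].sum()    # aggregation runs on the GPU
print(result.head())

# convert to pandas when a CPU-based library is needed downstream
pdf = result.to_pandas()
```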
Useful references:
- https://github.com/rapidsai/cudf
- https://docs.rapids.ai/api/cudf/stable/
- https://docs.rapids.ai/api/cudf/stable/user_guide/10min-cudf-cupy.html
- https://docs.rapids.ai/api/cudf/stable/user_guide/10min.html
pyspark
In contrast to the libraries before, we actually first need to take a step back and describe what Spark is.
Apache Spark is a unified analytics engine for large-scale data processing, written in Scala. It is basically the library for handling large datasets (think 100GB+) for data science. There are multiple reasons for its popularity, including the following:
- it can be up to 100x faster than Hadoop,
- it achieves high performance for static, batch, and streaming data,
- it uses a state-of-the-art DAG (Directed Acyclic Graph) scheduler, a query optimizer, and a physical execution engine.
Spark works in a master-slave architecture, in which the master is called the "driver" and the slaves are called "workers". When running a Spark application, the Spark driver creates a context that serves as an entry point to the application. All operations are then executed on the worker nodes, while the resources are managed by the Cluster Manager.
Spark comes with its own flavor of DataFrames. While they offer functionality similar to `pandas` DataFrames, the key differences are that they are distributed, lazily evaluated, and immutable (no overwriting of data is allowed).
That is enough of an introduction to Spark, let’s focus on the part that is most relevant for this article, that is, scaling DataFrames. For that, we can use PySpark, which is a Python API for Spark.
Key things to know about PySpark:
- PySpark is a general-purpose, in-memory processing engine that enables efficient data processing in a distributed fashion.
- massive speed improvement – running calculations on PySpark can be 100x faster than using traditional systems.
- it has a different API than `pandas` and does not integrate as well with other libraries (e.g., `matplotlib` for plotting). In general, it has a steeper learning curve than `pandas`.
- We should be mindful when using wide transformations (those that look at the entire data across all nodes, for example, ordering or using `groupby`), as they are computationally heavier than narrow transformations (those looking at individual data in each node).
- to use PySpark, there is quite some overhead to overcome – setting up Spark (locally or in a cluster), having a JVM (Java Virtual Machine) on our computer, etc. This might be a show-stopper when we do not have a cluster already running in our organization and setting one up would be overkill for some smaller experiments. Alternatively, we could use managed cloud solutions like Databricks.
- with PySpark we can easily process data from Hadoop HDFS, AWS S3, and many other file systems. This also includes processing real-time data using Streaming and Kafka.
- `MLlib` is Spark’s machine learning library, which we can access through PySpark. The API provided by `MLlib` is quite easy to use and supports many algorithms for classification, regression, clustering, dimensionality reduction, etc.
- Spark allows us to query DataFrames with both **SQL** and **Python**. This can come in handy, as some logic is often easier to write in SQL than to express through the exact PySpark API. As the two can be used interchangeably, you can pick whichever you prefer at a given point (both are shown in the sketch below).
- As Spark runs on a nearly-unlimited cluster of computers, there is effectively no limit on the size of datasets it can handle.
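A minimal local sketch showing both the DataFrame API and the SQL route, assuming `pyspark` is installed and a JVM is available; the data is made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

df = spark.createDataFrame([(1, 5), (1, 3), (2, 7)], ["product_id", "quantity"])

# Python (DataFrame) API – lazily builds a plan, executed on show()
df.groupBy("product_id").agg(F.sum("quantity").alias("total")).show()

# the same query expressed in SQL
df.createOrReplaceTempView("sales")
spark.sql(
    "SELECT product_id, SUM(quantity) AS total FROM sales GROUP BY product_id"
).show()
```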
Useful references:
- https://www.youtube.com/watch?v=XrpSRCwISdk
- https://towardsdatascience.com/the-most-complete-guide-to-pyspark-dataframes-2702c343b2e8
Koalas – ~3k GitHub stars
We have mentioned that the biggest pain point of PySpark is its syntax, which differs from `pandas` and has a rather steep learning curve. That is exactly what Databricks wanted to solve with Koalas. The goal of this library is to make data scientists more productive when interacting with big data by implementing the `pandas` API on top of Spark.
Some things to know about Koalas:
- you can have a single codebase that works both with `pandas` (smaller datasets) and with Spark (distributed datasets). You just need to replace the import statement from `pandas` to `koalas` (see the sketch after this list).
- Koalas supports Spark ≤ 3.1 and, starting with Spark 3.2, is officially included in PySpark as `pyspark.pandas`.
- Null values might be handled a bit differently. `pandas` uses NaNs (special constants) to indicate missing values, while Spark has a special flag on each value to indicate whether it is missing.
- Lazy evaluation – as Spark is lazy in nature, some operations (e.g., creating new columns) only get performed when Spark needs to print or write the DataFrame.
- it is easy to convert a Koalas DataFrame into a `pandas`/PySpark DataFrame.
- Koalas additionally supports standard SQL syntax via `ks.sql()`, which executes a Spark SQL query and returns the result in the form of a DataFrame.
- Koalas and PySpark have very similar performance results, as both use Spark behind the scenes. However, there can be some slight performance degradation when compared to pure PySpark. Most of the time, it is connected to the overhead of building the default index or the fact that some `pandas` and PySpark APIs share the same name but have different semantics (for example, the `count` method).
- Koalas DataFrames are externally slightly different from PySpark DataFrames. To implement the `pandas` DataFrame structure and its rich APIs that require implicit ordering, Koalas DataFrames have internal metadata that represents `pandas`-like indices and column labels mapped to the columns of the PySpark DataFrame. On the other hand, the PySpark counterparts tend to be more compliant with relations/tables in relational databases, and thus do not have unique row identifiers.
- Internally, Koalas DataFrames are built on top of PySpark DataFrames. Koalas translates `pandas` APIs into the logical plan of Spark SQL. The plan is then optimized and executed by the Spark SQL engine.
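A minimal sketch of the drop-in import, assuming Koalas is installed on top of Spark ≤ 3.1; on Spark 3.2+ the same API ships as `import pyspark.pandas as ps`. The data is made up for illustration.

```python
import databricks.koalas as ks

# pandas-like syntax, executed by Spark under the hood
kdf = ks.DataFrame({"product_id": [1, 1, 2], "quantity": [5, 3, 7]})
totals = kdf.groupby("product_id")["quantity"].sum()
print(totals)

# conversion back to pandas (collects the data to the driver)
pdf = totals.to_pandas()
```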
Useful references:
- https://github.com/databricks/koalas
- https://koalas.readthedocs.io/en/latest/
- https://koalas.readthedocs.io/en/latest/getting_started/10min.html
polars – ~5k GitHub stars
`polars` is an open-source DataFrame library that puts emphasis on speed. To achieve that, it is implemented in Rust with Apache Arrow as its memory model. Until recently, the Python wrapper for `polars` was called `pypolars`; however, it is now also called `polars` for simplicity.
Some key features of `polars`:
- the API is similar to `pandas`; however, it is actually a bit closer to R’s `dplyr`.
- `polars` has two APIs – eager and lazy. The former is very similar to that of `pandas`, as the results are produced right after the execution is completed. The lazy API, on the other hand, is more similar to Spark, where a plan is formed upon executing a query. The plan does not actually touch the data until it is executed in parallel across all the cores of the CPU when we call the `collect` method (see the sketch below).
- plots are easy to generate and integrate with the most popular visualization tools.
- `polars` is currently one of the fastest (if not the fastest) DataFrame libraries (according to this benchmark) and supports DataFrames that may simply be too big for `pandas`.
- the speed of `polars` comes from utilizing all available cores of your machine. What differentiates it from other solutions is that `polars` is written from the ground up with parallelization of DataFrame queries in mind, while tools like Dask parallelize existing single-threaded libraries (like `numpy` and `pandas`).
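A minimal sketch contrasting the eager and lazy APIs as they looked around the time of writing (file and column names are hypothetical; note that in recent `polars` versions `groupby` has been renamed to `group_by`):

```python
import polars as pl

# eager API – results are computed immediately, much like pandas
df = pl.DataFrame({"product_id": [1, 1, 2], "quantity": [5, 3, 7]})
print(df.groupby("product_id").agg([pl.col("quantity").sum()]))

# lazy API – builds a query plan that only runs when collect() is called
lazy_result = (
    pl.scan_csv("transactions.csv")        # nothing is read yet
    .filter(pl.col("quantity") > 0)
    .groupby("product_id")
    .agg([pl.col("quantity").sum()])
)
print(lazy_result.collect())               # plan is optimized and executed in parallel
```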
Useful references:
- https://github.com/pola-rs/polars
- https://pola-rs.github.io/polars-book/user-guide/introduction.html
- https://www.ritchievink.com/blog/2021/02/28/i-wrote-one-of-the-fastest-dataframe-libraries/
Pandarallel – ~2k GitHub stars
`pandarallel` (admit it, this one sounds a bit like a Pokémon) is an open-source library that parallelizes `pandas` operations on all available CPUs.
When we call a parallelized function using `pandarallel`, the following steps happen under the hood:
- the library initializes PyArrow Plasma shared memory,
- it creates one sub-process for each CPU, and then asks them to work on a part of the original DataFrame,
- it combines all the results in the parent process.
Some key points about `pandarallel`:
- the library lets you parallelize the following `pandas` methods: `apply`, `applymap`, `groupby`, `map`, and `rolling` (see the sketch below).
- if your CPU uses hyper-threading (for example, 8 cores and 16 threads), only 8 cores will be used.
- `pandarallel` needs twice the memory that a standard `pandas` operation would use. It goes without saying that the library should not be used if the data does not initially fit into memory with `pandas`.
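A minimal sketch of the workflow, assuming `pandarallel` is installed; the DataFrame and the function are made up for illustration.

```python
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=False)   # spawns one worker per CPU core

df = pd.DataFrame({"value": range(1_000_000)})

def slow_transform(x):
    return x ** 0.5 + 1

# swap .apply(...) for .parallel_apply(...) to distribute the work
df["out"] = df["value"].parallel_apply(slow_transform)
```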
Useful references:
https://github.com/nalepae/pandarallel
Terality
Terality is the new kid on the block when it comes to `pandas` replacements. It is a serverless data processing engine that makes `pandas` as scalable and fast as Apache Spark (think 100 times faster than `pandas` and capable of handling 100+ GB of data), with neither infrastructure requirements nor any code changes involved. Already sounds great! What’s the catch?
The biggest difference from the other libraries/approaches is that Terality is not open-source software. There are different types of subscriptions (including a free one to play around!) but overall you are charged per TBs of processed data. For more information on the pricing, please see this page.
Some things to know about Terality:
- 100% coverage of the `pandas` API.
- Terality provides two new methods on top of the existing `pandas` functionalities: `to_csv_folder` and `to_parquet_folder`. They allow us to easily split the original dataset into multiple smaller datasets. This feature is particularly useful when splitting the data into chunks and then analyzing each of them separately.
- as the project is not open-source, we cannot say much about the underlying architecture behind its fast performance. What we do know is that the Terality team developed a proprietary data processing engine, so it is not a fork/flavor of Spark or Dask.
- as Terality is hosted, there is no infrastructure to manage, and memory is virtually unlimited.
- it removes the scalability issues of `pandas`, the complexity of Spark (setup + different syntax), and the limitations of Dask/Modin.
- as of February 2022, you can use Terality in combination with Google Colab.
- Terality is auto-scalable – regardless of the data size, our operations will be handled automatically with great speed. There is no need to manually scale the processing capabilities to match the size of the dataset. All the infrastructure is managed on Terality’s side, including turning things off after you are done with your processing.
- Terality works best when combined with a cloud storage solution for your data (Amazon S3, Azure Data Lake, etc.). That is because the alternative would be to load a local file, which could take quite some time (depending on your internet speed).
- naturally, there can be constraints/limitations when it comes to using a third-party solution with potentially sensitive data, especially when talking about company data. In terms of security, Terality provides secure isolation and our data is fully protected both in transit and during computations. Please refer to this website for more information on security.
- since quite recently, Terality can also be deployed in your own AWS account. Thanks to such a self-hosted deployment, your data never leaves your AWS account. This feature can help you comply with data protection requirements and remove any doubts about the security of your data.
Useful references:
- https://www.terality.com/
- https://docs.terality.com/
- https://www.terality.com/post/terality-beats-spark-and-dask-h2o-benchmark
Conclusions
Don’t get me wrong, `pandas` is a great tool that I use and will continue to use on a daily basis. However, it can happen that it is simply not enough for a specific use case. That is why in this article, I have provided an overview of the most popular `pandas` alternatives.
Let me first answer a question that might be bobbing around in your head at this point: which solution is the best? As you might have guessed, the answer is: it depends. If you simply want to speed up your `pandas` code on your local machine, Modin is probably a good starting point. If you already have a Spark cluster running, you can try PySpark/Koalas. For calculating statistics on or visualizing massive datasets, Vaex might be a good fit. And if you want maximum speed without having to worry about setting up any infrastructure, Terality can be the way to go.
Last but not least, such an article would not be complete without some reference to a benchmark test of some sort. When it comes to pure speed, H2O.ai prepared such a benchmark test for most of the Python libraries that are used for data processing. To evaluate the libraries, they performed data aggregation and joining of 2 datasets. For those tasks, they have used datasets of varying sizes: 0.2 GB, 5 GB, and 50 GB.
Do you have any experience with the libraries mentioned in the article? Or did I miss a library that you know of? I would be very curious to hear about your experience! You can reach out to me on Twitter or in the comments.