The world’s leading publication for data science, AI, and ML professionals.

How Can Pandas Cope With Big Datasets?

Strategies to manage larger quantities of data

Photo by Kayla S: https://www.pexels.com/photo/a-panda-bear-in-the-cage-4444036/

Pandas is arguably the most popular module when it comes to data manipulation with Python. It has tremendous utility, contains a vast variety of features, and boasts substantial community support.

That being said, Pandas has one glaring shortcoming: its performance levels drop with larger datasets.

The computational demand of processing larger datasets with Pandas can incur long run times and may even result in errors due to insufficient memory.

While it might be tempting to pursue other tools that are more adept at dealing with larger datasets, it is worthwhile to first explore the measures that can be taken to handle vast quantities of data with Pandas.

Here, we cover the strategies users can implement to conserve memory and process vast quantities of data with Pandas.

Note: Each strategy will be demonstrated with a fake dataset generated by Mockaroo.

1. Load less data

Removing columns from a data frame is a common step in data preprocessing.

Oftentimes, the columns are omitted after the data is loaded.

For instance, the following code loads the mock dataset and then omits all but 5 columns.
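Since the original code isn't reproduced here, below is a minimal sketch of that approach. The column names and the in-memory CSV are assumptions standing in for the Mockaroo file:

```python
import pandas as pd
from io import StringIO

# In-memory stand-in for the Mockaroo file; the column names are assumptions.
csv_data = StringIO(
    "id,first_name,last_name,gender,job,age,weight,income,height\n"
    "1,Ada,Lovelace,Female,Engineer,36,61,72000,1.65\n"
    "2,Alan,Turing,Male,Scientist,41,74,68000,1.78\n"
)

# Load the entire dataset first...
df = pd.read_csv(csv_data)

# ...then throw away everything except the five columns we actually need.
df = df[["gender", "job", "age", "income", "height"]]
```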

While this is a feasible approach in most cases, it is wasteful: you spend memory loading data that is not even required. We can gauge the memory usage of the mock dataset with the memory_usage method.

Code Output (Created By Author)
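For reference, a minimal sketch of how memory_usage reports per-column byte counts (the toy columns are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [36, 41, 29],
    "income": [72000, 68000, 55000],
})

# memory_usage reports the bytes consumed by each column; deep=True
# measures object (string) columns accurately rather than just pointers.
per_column = df.memory_usage(deep=True)
total_bytes = per_column.sum()
print(per_column)
print(f"Total: {total_bytes} bytes")
```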

A preferable solution would be to omit unwanted columns during the data loading process. This will ensure that memory is only used for the relevant information.

This can be accomplished with the usecols parameter, which allows users to select the columns to include while loading the dataset.

Code Output (Created By Author)
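A sketch of the same load with usecols, again using an in-memory stand-in for the file:

```python
import pandas as pd
from io import StringIO

csv_data = StringIO(
    "id,first_name,gender,age,income\n"
    "1,Ada,Female,36,72000\n"
    "2,Alan,Male,41,68000\n"
)

# Only the listed columns are parsed and stored; the rest
# never take up any memory at all.
df = pd.read_csv(csv_data, usecols=["gender", "age", "income"])
print(list(df.columns))
```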

The inclusion of the parameter alone decreases memory consumption by a significant degree.

2. Use memory-efficient data types

A lot of memory can be saved just by selecting the appropriate data types for the variables in question.

If the user does not explicitly select the data type for each variable, the Pandas module will assign it by default.

Code Output (Created By Author)
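A quick way to inspect those defaults, with toy data standing in for the mock dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [36, 41],              # whole numbers -> int64 by default
    "height": [1.65, 1.78],       # decimals -> float64 by default
    "gender": ["Female", "Male"], # strings -> object by default
                                  # (newer pandas may use a string dtype)
})

print(df.dtypes)
```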

While this can be a convenient feature, the data types assigned to each column may not be ideal in terms of memory efficiency.

A key step in reducing memory usage lies in manually assigning the variables with the most memory-efficient data type.

Data Types For Numeric Variables

The Pandas module uses the int64 and float64 data types for numeric variables.

The int64 and float64 data types accommodate values with the greatest magnitude and precision, but in return they require the most memory.

Here is the overall memory consumption of the mock dataset.

Code Output (Created By Author)

Fortunately, variables that deal with numbers with smaller magnitudes or precision don’t need such memory-consuming data types.

For example, in the mock dataset, smaller data types will suffice for variables like age, weight, income, and height. Let’s see how the memory of the numeric data changes when assigning new data types for these variables.

Code Output (Created By Author)
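One way to sketch that conversion is with astype; the values and the before/after comparison below are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [36, 41, 29],
    "weight": [61.0, 74.0, 80.5],
    "income": [72000, 68000, 55000],
    "height": [1.65, 1.78, 1.80],
})

before = df.memory_usage(deep=True).sum()

# Downcast to the smallest types that still hold every value.
df = df.astype({
    "age": "int8",       # ages fit comfortably within -128..127
    "weight": "float32",
    "income": "int32",   # incomes exceed the int16 ceiling of 32,767
    "height": "float32",
})

after = df.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```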

The simple act of converting data types can reduce memory usage considerably.

Caution: Using a data type that doesn’t accommodate the variables’ values will lead to information loss. Be careful when assigning data types manually.

Note that the income column was assigned the int32 data type instead of the int8 data type since the variable contains larger values.

To highlight the importance of selecting the correct data type, let’s compare the original income values in the dataset with the income values with the int32 and int8 data types.

Code Output (Created By Author)
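A sketch of that comparison. With the NumPy-backed integer types, casting to a too-small type silently wraps out-of-range values rather than raising an error:

```python
import pandas as pd

income = pd.Series([72000, 68000, 55000])

as_int32 = income.astype("int32")  # values survive intact
as_int8 = income.astype("int8")    # values silently wrap around

print(as_int32.tolist())
print(as_int8.tolist())
```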

As shown by the output, choosing the wrong data type (int8 in this case) will alter the values and hamper the results of any subsequent data manipulation.

Having a clear understanding of your data and the range of values afforded by the available data types (e.g., int8, int16, int32, etc.) is essential when assigning data types for the variables of interest.

For memory efficiency, a good practice is to specify data types while loading the dataset with the dtype parameter.
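A sketch of specifying the types at load time; the in-memory CSV and column names are assumptions:

```python
import pandas as pd
from io import StringIO

csv_data = StringIO(
    "age,income,height\n"
    "36,72000,1.65\n"
    "41,68000,1.78\n"
)

# Compact types are applied as the file is parsed, so the full-size
# int64/float64 defaults never occupy memory at all.
df = pd.read_csv(
    csv_data,
    dtype={"age": "int8", "income": "int32", "height": "float32"},
)
print(df.dtypes)
```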

Data Types For Categorical Variables

Memory can also be saved by assigning categorical variables the "category" data type.

As an example, let’s see how the memory consumption changes after assigning the "category" data type to the gender column.

Code Output (Created By Author)
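A sketch of the conversion on a toy gender column; the repetition in the values is what makes it pay off:

```python
import pandas as pd

gender = pd.Series(["Female", "Male", "Female", "Male"] * 250)

before = gender.memory_usage(deep=True)

# "category" stores each unique string once, plus a small integer
# code per row, instead of a full Python string per row.
as_category = gender.astype("category")
after = as_category.memory_usage(deep=True)

print(f"{before} bytes -> {after} bytes")
```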

Clearly, the conversion yields a significant reduction in memory usage.

However, there is a caveat to this approach. Columns with the "category" data type consume more memory when they contain a large number of unique values, so this conversion is not viable for every variable.

To highlight this, we can examine the effect of this conversion on all of the categorical variables in the data frame.

Code Output (Created By Author)
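The cardinality effect can be sketched with two toy columns, one highly repetitive and one almost entirely unique (the names are made up):

```python
import pandas as pd

# Two unique values repeated many times: "category" shrinks it.
gender = pd.Series(["Female", "Male"] * 500)

# A thousand unique values: the per-row codes plus the stored unique
# strings end up costing more than the plain strings did.
names = pd.Series([f"name_{i}" for i in range(1000)])

gender_saves = (
    gender.astype("category").memory_usage(deep=True)
    < gender.memory_usage(deep=True)
)
names_grows = (
    names.astype("category").memory_usage(deep=True)
    > names.memory_usage(deep=True)
)
print(gender_saves, names_grows)
```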

As shown by the output, the gender and job columns use less memory after the conversion, but the first_name and last_name columns use more. This can be attributed to the large number of unique first and last names present in the dataset.

For that reason, exercise caution when assigning the "category" data type if your goal is to conserve memory.

3. Load data in chunks

For datasets that are too large to fit in memory, Pandas offers the chunksize parameter, which lets users decide how many rows should be imported at each iteration.

When this parameter is set, the read_csv function returns an iterator (a TextFileReader object) instead of an actual data frame.

Code Output (Created By Author)
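A sketch of what the iterator looks like; the in-memory CSV stands in for a genuinely large file:

```python
import pandas as pd
from io import StringIO

# 10 rows standing in for a file too large to load at once.
csv_data = StringIO("age\n" + "\n".join(str(20 + i) for i in range(10)))

# With chunksize set, read_csv returns a TextFileReader iterator
# instead of a DataFrame.
reader = pd.read_csv(csv_data, chunksize=4)
print(type(reader))

chunks = list(reader)
print([len(chunk) for chunk in chunks])  # 10 rows arrive as 4, 4, 2
```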

Obtaining the data will require iterating through this object. By segmenting the large dataset into smaller pieces, data manipulation can be carried out while staying within the memory constraints.

Here, we iterate through each subset of the dataset, filter the data, and then append it to a list.

After that, we merge all elements in the list together with the concat function to obtain one comprehensive dataset.

Code Output (Created By Author)
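A sketch of that loop, filtering each chunk as it arrives; the filter condition and column names are made up:

```python
import pandas as pd
from io import StringIO

csv_data = StringIO(
    "age,income\n"
    + "\n".join(f"{20 + i},{40000 + 1000 * i}" for i in range(10))
)

filtered_chunks = []
for chunk in pd.read_csv(csv_data, chunksize=4):
    # Filter each small piece before it accumulates in memory.
    filtered_chunks.append(chunk[chunk["age"] >= 25])

# Merge the surviving rows into one comprehensive data frame.
result = pd.concat(filtered_chunks, ignore_index=True)
print(len(result))  # ages 25..29 survive the filter
```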

Limitations of Pandas

Pandas may have features that account for larger quantities of data, but these are insufficient in the face of "big data", which can span many gigabytes or terabytes.

The module carries out its operations on a single CPU core. Unfortunately, performing tasks with one processor is simply infeasible once memory usage and computational demand reach a certain level.

For such cases, it is necessary to implement techniques like parallel processing, which entails running a task across multiple cores of a machine.

Python provides libraries like Dask and PySpark that incorporate parallel processing and enable users to execute operations at much greater speeds.

That being said, these tools specialize in handling data-intensive tasks and may not offer the same features as Pandas, so it is best not to rely on them unless it is necessary.

Conclusion

Photo by Prateek Katyal on Unsplash

While Pandas is mainly used for small to medium-sized data, it shouldn’t be shunned for tasks that use marginally larger datasets. The module possesses features that help accommodate greater quantities of data, albeit to a limited extent.

The Pandas library has stood the test of time and remains the go-to library for data manipulation, so don’t be too eager to jump to other solutions unless you absolutely have to.

I wish you the best of luck with your Data Science endeavors!

