Lately, I have been seeing a lot of articles about speeding up Python and Pandas. Indeed, I wrote one myself, demonstrating how to use RAPIDS and cuDF to get brilliant performance from a capable GPU.
But why do so many of us take to our keyboards to write about speeding up Python and Pandas? Is Python slow, or is it more about how we code? Those questions have been on my mind as I encounter fresh articles, so I wanted to run some experiments of my own to find answers. Read on.
Python is somewhat like the train tracks in the photo above. If a Python job runs on a single rail line, using two or more lines could be quicker. Modern CPUs are multi-core and offer four or more rail lines on which to shift workloads. I took a snapshot of my current system, and you can see it in motion below.
The system is chugging along on one rail line at 100%. If the job takes a minute, that is not terrible! Try waiting 10 minutes for Excel to update all the formulas in a workbook.
A workload that needs to be done
I started my review by defining a workload I wanted to get done using both the single-track and the multi-track approaches.
As in my last article, I used a large, publicly available dataset. The 7+ Million Company dataset [1] is licensed under Creative Commons CC0 1.0: "You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission."
The file is 1.1 GB and has 7+ million rows and 11 columns.
import pandas as pd
from multiprocessing import Pool
import time

path = "/home/david/Downloads/"
file = "companies_sorted.csv"
filePath = path + file
work = [filePath, filePath, filePath, filePath, filePath, filePath, filePath, filePath,
        filePath, filePath, filePath, filePath, filePath, filePath, filePath, filePath,
        filePath, filePath, filePath, filePath, filePath, filePath, filePath, filePath]
frames = []
My variable, work, is a list of 24 file paths. It happens to be the same file 24 times, almost 24.5 GB of data. 7.7 million rows and 11 columns × 24 gives roughly 185 million rows with 11 columns. Throw that at Excel and see! I have often had 25+ Apache or Nginx log files to process, so this is a valid scenario.
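As an aside (not how the script above does it, but equivalent), the same 24-item list can be built in a single, more compact line:

# Build the same work list: 24 references to the same file path
work = [filePath] * 24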
I hope you agree that the workload is significant and should take a while on a single or multi-core system.
The worker
When there is work to be done, we generally hire a bunch of workers to do it. So let us discuss my worker design.
def worktodo(file):
    # Read one CSV and summarise it by the 'size range' category
    df = pd.read_csv(file)
    df = df.groupby('size range').agg({'year founded': ['min', 'max', 'count'],
                                       'country': lambda x: x.nunique(),
                                       'current employee estimate': 'median'}).reset_index()
    cols = ['size range', 'min year', 'max year', 'count', 'country', 'employee estimate']
    df.columns = cols
    return df

def summary(frames):
    # Combine the per-file summaries and roll them up into one final report
    frame = pd.concat(frames)
    print(frame.shape)
    print(frame.groupby('size range').agg({'min year': 'min', 'max year': 'max',
                                           'count': 'mean', 'employee estimate': 'mean',
                                           'country': 'mean'}))
    print(" ")
The function worktodo (naturally, I have no imagination) receives a file path, reads the file into a Pandas dataframe, and then uses groupby to summarise it. I do a slight column cleanup and return the dataframe. We need to do this 24 times. Each job shrinks a 7.7 million row × 11 column input down to an 8 row × 6 column output.
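Before throwing all 24 files at it, a quick sanity check of the worker on a single file (reusing the filePath variable defined earlier) might look like this:

# Smoke test: summarise one file and check the output shape
single = worktodo(filePath)
print(single.shape)    # expect (8, 6): one row per size-range bucket
print(single.head())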
The function summary takes a list of data frames and makes a final summary of all the individual files.
The procedure
Having work to do and workers to do it is not sufficient. You also need desk procedures, or processes, for the workers to follow.
def singleCpu():
    # Baseline: process the 24 files one after another on a single core
    start_time = time.perf_counter()
    frames = []
    for todo in work:
        frames.append(worktodo(todo))
    summary(frames)
    finish_time = time.perf_counter()
    print(f"Program finished in {finish_time-start_time} seconds")
    del frames
I first defined a process for a single train track, or CPU: the single-threaded scenario, which is the most common approach.
Not much magic to report: iterate over the work to be done, pass each file off to the worker, and stash the returned summary in a list called frames. Once all the files are processed, we hand that raw output to the summary function and get the final product.
if __name__ == "__main__":
    print("-->xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")
    print(f"work to be done: {len(work)}")
    print("<------xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")

    # Run the single-core baseline first
    singleCpu()

    # Then the multi-processing version: Pool() spreads the work across all cores
    start_time = time.perf_counter()
    with Pool() as mp_pool:
        results = mp_pool.map(worktodo, work)
    summary(results)
    finish_time = time.perf_counter()
    print(f"Program finished in {finish_time-start_time} seconds")
For the multi-core case, I used multiprocessing and its Pool class. Two lines of code essentially chunk up the work and spread it across the available workers.
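If you prefer the standard library's concurrent.futures API, an equivalent sketch (reusing the worktodo and summary functions and the work list defined above) would look something like this:

# Alternative sketch using concurrent.futures instead of multiprocessing.Pool
import time
from concurrent.futures import ProcessPoolExecutor

if __name__ == "__main__":
    start_time = time.perf_counter()
    with ProcessPoolExecutor() as executor:           # defaults to one worker per CPU core
        results = list(executor.map(worktodo, work))  # same map-the-work-to-workers idea
    summary(results)
    finish_time = time.perf_counter()
    print(f"Program finished in {finish_time-start_time} seconds")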
Results
Having defined the workload, the workers, and the process to be followed, there remained nothing but to execute the mission and drive through the pile of work.
The highlights are:
- Twenty-four files to be processed: 24.5 GB and a massive 185 million rows.
- Two approaches were run: a conventional single-threaded loop and a multi-processing strategy using Pool().
- The concatenated dataframe is 192 rows by six columns (eight size-range buckets × 24 files); the final groupby rolls that up into one summary row per member of the size range variable.
- I captured some screenshots of the system monitor along the way.
From the terminal window (above), you will note that the conventional approach took 411.92 seconds, whilst the Pool() procedure took 140.03 seconds. That is roughly seven minutes versus 2.33 minutes, so not a 4× result when spreading from one core to four. There is always a driver process that has to coordinate the workers. Sound familiar? That is the language of Spark and distributed computing. Still, the code runs almost three times faster, which is good, right?
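If you want to see that coordination overhead for yourself, one option (a sketch, reusing worktodo, summary and the work list from the script above) is to time the same job with different pool sizes:

# Sketch: time the same workload with 1, 2 and 4 worker processes
import time
from multiprocessing import Pool

def timed_run(n_workers):
    start = time.perf_counter()
    with Pool(processes=n_workers) as mp_pool:
        summary(mp_pool.map(worktodo, work))
    return time.perf_counter() - start

if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"{n} worker(s): {timed_run(n):.2f} seconds")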
Some screenshots I captured of the system working…
Using all four cores
Using just that single core
But surely running four trains instead of one must be more expensive? No!
We have solar panels, which give us a dual electricity supply and some exciting analytics. During the experiment, I captured the power consumption. It was not very scientific, but the power draw roughly doubled during the run, going from about 0.1 kW to 0.2 kW. For reference, the power supply on my system is rated at 600 W.
So using one line, we burned, let's say, 0.1 kW for roughly seven minutes. Using four lines, we burned, let's say, 0.2 kW for 2.33 minutes.
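As a back-of-the-envelope check, and assuming those rough 0.1 kW / 0.2 kW readings and the measured run times are representative, the energy for each run works out as follows:

# Rough energy estimate: kW x seconds = kJ
single_core_kj = 0.1 * 411.92   # about 41.2 kJ
multi_core_kj  = 0.2 * 140.03   # about 28.0 kJ
saving = 1 - multi_core_kj / single_core_kj
print(f"Energy saving: {saving:.0%}")   # roughly 32%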
The multi-processing leg used 31.77% less energy than the single-CPU approach, broadly in line with the rough calculation above. We in data science and IT need to contribute to the 50% reduction in greenhouse gas emissions in the run-up to 2030. IT, deep learning, and data analytics are significant consumers of electricity, much of it still generated from fossil fuels [2].
Closing
So I have had these questions on my mind for some time now. Why do so many of us take to our keyboards to write about speeding up Python and Pandas? Is Python slow, or is it more about how we code?
For the first part, I think people will always write about techniques that improve our practice and offer insight into how things could be done differently. Is Python slow? Python is an interpreted language and will generally run slower than a compiled language. Still, you have to weigh the productivity you gain from using Python against that extra run time and keep your expectations realistic. Building the equivalent C++ or Java application takes far longer to develop for the sake of a run time that might be under a minute.
The question of how we code is coming into sharp focus. With greenhouse gas emissions and reduction targets, we need to look at IT and be efficient with our algorithms. This article shows a big difference in run time and energy use based on design choices (single- versus multi-core operations). Long-running ETL jobs, weeks of deep neural network training, and other tasks may soon become part of the ESG debate and come under scrutiny as part of greenhouse gas reduction challenges.
Code Repository
As always, you can find the script on my GitHub account.
ReadingList/parallelism.py at main · CognitiveDave/ReadingList
Not a Medium member? Why not join; at $50, it is good value.
References
[1] The 7+ Million Company dataset is licensed under Creative Commons CC0 1.0. You can request a copy directly from People Data Labs and read more about them here. The data is used here for loading in memory and creating tactical summaries.
[2] https://hpi.de/news/jahrgaenge/2020/hpi-startet-clean-it-initiative.html "The HPI bundles its research and teaching activities in the clean IT initiative. In this way, HPI contributes to developing climate-friendly digital solutions and AI applications through first-class training offers and the sustainable and energy-efficient use of IT systems and research contributions. These solutions support all areas of social and economic life, especially health, sustainable mobility and the promotion of equality."