A Case Study in Optimizing Programs Through Code Review

Continuous improvement of data science projects requires commitment to the grind

Dan Wilmott
Towards Data Science


Photo by Max Duzij on Unsplash

One of the more underappreciated traits of organizations with relatively mature data science teams is a strong culture of continuous improvement. For technical teams responsible for creating and maintaining complex software solutions, it's especially important to engage in practices that raise the overall quality of the team's code base. Whether it's processes already running in production or standardized packages and code for repeatable tasks, periodic, rigorous code review benefits stakeholders across the organization by reducing the likelihood of errors and security issues and by trimming unnecessary resource usage.

Reviewing code someone else has written, often with little documentation or context, can be difficult and unpleasant at times. But that’s how the sausage is made.

Here I illustrate an example where I was able to significantly reduce the peak memory usage and overall processing time required for a program to run. What's really cool is that I wasn't initially seeking to do that; it arose from an initiative designed to retool the program to perform better on a set of established KPIs. Sometimes the best things happen when you aren't expecting them.

First, some context. At a certain point, the program (written in Python) takes a random sample of a set of objects and records key metrics for that group. It does this many times. The objective is to identify a sample of objects that maximizes the values of the key metrics, to be used at a later time. While reviewing the code for correctness, I identified something that added significantly to overall runtime and memory usage and didn't need to be there.

The issue arose in how the results of each random sample were being stored. Since the samples were needed in later steps, the program initially used a for loop to append a pandas DataFrame of each iteration's sample to a list. While this served its purpose, the size of that list in memory grew as a function of two variables: the number of iterations in the for loop and the size of the sample taken in each one. For relatively small samples and iteration counts, this process runs fine. But what happens if you increase the sample size from, say, 1,000 to 100,000 and the number of iterations from 5,000 to 500,000? Required memory can balloon, and regardless of your tech stack, inefficient programs cost an organization real money through unnecessary compute resources and wasted time.

Quickly, let’s build an example set of data to illustrate the issue. We’ll use an example of shopper IDs with sales for a specific time frame; it could be sales specific to a set of products, sales channel, geographic location, etc. — use your imagination!

import pandas as pd
import numpy as np

# Set how many iterations to run & the size of the ID sample in each iteration
n_iters = 100000
smpl_sz = 500000

# Create a sample df with 10MM unique IDs
# - Generate some dummy sales data to play with later
id_samp = pd.DataFrame(range(10000000, 20000000))
id_samp.columns = ['Customer_ID']
id_samp['Sales_Per_ID'] = np.random.normal(loc = 1000, scale = 150, size = len(id_samp))
print(id_samp.head())
Image by author of output generated from above code in Spyder

As discussed, the program initially stored each sample as it was generated in a master list — which introduces sample size into the memory storage equation. This can get even worse with more expensive data objects, i.e. strings or user-created objects vs. integers.
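To put rough numbers on this, we can estimate the footprint of storing every sample. Here's a back-of-the-envelope sketch (it rebuilds the `id_samp` frame from above so it runs standalone; exact figures will vary by platform and pandas version):

```python
import pandas as pd
import numpy as np

# Rebuild the example frame from earlier so this snippet is self-contained
id_samp = pd.DataFrame({'Customer_ID': range(10000000, 20000000)})
id_samp['Sales_Per_ID'] = np.random.normal(loc = 1000, scale = 150, size = len(id_samp))

# Measure one 500,000-row sample, in megabytes (index + both columns)
one_sample_mb = id_samp.sample(n = 500000, axis = 'rows', random_state = 0) \
                       .memory_usage(deep = True).sum() / 1e6

# Storing a DataFrame per iteration multiplies that by the iteration count
total_gb = one_sample_mb * 100000 / 1000
print(f'~{one_sample_mb:.0f} MB per sample, ~{total_gb:.0f} GB across 100,000 iterations')
```

Even with cheap integer and float columns, holding one DataFrame per iteration quickly outstrips the memory of any reasonable machine, which is why the list-of-DataFrames approach breaks down at scale.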

# Original version
# Initialize data objects to store each iteration's sample & its info
metric_of_interest = []
samples = []

# In the loop, store each sample as it's created
for i in range(n_iters):
    samp = id_samp.sample(n = smpl_sz, axis = 'rows', random_state = i)
    samples.append(samp)
    # Do something with the sample & record add'l info as necessary
    metric_of_interest.append(samp['Sales_Per_ID'].mean())

The key here is that we don't have to store each sample as it's created in order to access it later. We can take advantage of an inherent property of the random sampling function: given a random seed, it is reproducible. I won't delve into the use of random seeds as a best practice; there are thorough explanations of the topic on Medium, and the NumPy and pandas documentation covers seeds and random states in detail. The main takeaway is that an integer value "picks the start" of the sampling process, so if you store that value you can reproduce the sample. In this way, we can eliminate the effect of sample size and data types on memory usage by optimizing our storage methodology.
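As a quick sanity check that a stored seed really does reproduce a sample, here's a minimal sketch (again rebuilding the `id_samp` frame from above so it runs standalone):

```python
import pandas as pd
import numpy as np

id_samp = pd.DataFrame({'Customer_ID': range(10000000, 20000000)})
id_samp['Sales_Per_ID'] = np.random.normal(loc = 1000, scale = 150, size = len(id_samp))

# Two draws with the same random_state return identical rows
samp_a = id_samp.sample(n = 5, axis = 'rows', random_state = 42)
samp_b = id_samp.sample(n = 5, axis = 'rows', random_state = 42)
print(samp_a.equals(samp_b))  # True

# A different seed yields a different sample
samp_c = id_samp.sample(n = 5, axis = 'rows', random_state = 43)
print(samp_a.equals(samp_c))  # False
```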

The plan, then, is to create a randomly selected group of integers, one value per loop iteration, to seed the samples. Below is one method for randomly selecting those integers; it can be achieved in many ways.

# Create a random sample of integers for use as ID sample random state seeds
# Here, we pull a sequence of numbers 50x greater than num iters to sample from
rndm_st_sd_pop = pd.Series(range(0, n_iters*50))
rndm_st_sd_samp = rndm_st_sd_pop.sample(n = n_iters, axis = 'rows')
del rndm_st_sd_pop
print(rndm_st_sd_samp.head())

Note: the index and random seed value are equal in this example

Image by author of output generated from above code in Spyder

Now, you can iterate through your random seed sample and feed each value as a parameter to the sampling function. Keep in mind that this parameter should exist for any sampling function worth using, regardless of language.

# Initialize data object(s) to store info for each iteration's sample
metric_of_interest = []

# In the loop, use the random state/seed to produce a sample you can easily reproduce later
for i in rndm_st_sd_samp:
    samp = id_samp.sample(n = smpl_sz, axis = 'rows', random_state = i)
    # Do something with the sample & record add'l info as necessary
    metric_of_interest.append(samp['Sales_Per_ID'].mean())

# Bind the info saved for each iteration to its respective seed value for easy lookup
sample_data = pd.DataFrame(
    {'Avg_Sales_Per_ID': metric_of_interest,
     'Smpl_Rndm_St_Seed': rndm_st_sd_samp})
sample_data.reset_index(inplace = True)
print(sample_data.head())
Image by author of output generated from above code in Spyder
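The payoff comes when you need one of those samples back, for instance the iteration that maximized the metric: look up its seed and re-draw. Here's a minimal end-to-end sketch at a smaller scale so it runs quickly (the `best_row` and `best_seed` names are mine, not from the original program):

```python
import pandas as pd
import numpy as np

# Rebuild the pieces from earlier at a smaller scale
id_samp = pd.DataFrame({'Customer_ID': range(10000000, 10100000)})
id_samp['Sales_Per_ID'] = np.random.normal(loc = 1000, scale = 150, size = len(id_samp))
n_iters, smpl_sz = 50, 1000

# Draw seeds, run the iterations, and record the metric per seed
rndm_st_sd_samp = pd.Series(range(0, n_iters*50)).sample(n = n_iters)
metric_of_interest = [
    id_samp.sample(n = smpl_sz, axis = 'rows', random_state = i)['Sales_Per_ID'].mean()
    for i in rndm_st_sd_samp
]
sample_data = pd.DataFrame({'Avg_Sales_Per_ID': metric_of_interest,
                            'Smpl_Rndm_St_Seed': list(rndm_st_sd_samp)})

# Look up the seed behind the best-performing iteration...
best_row = sample_data.loc[sample_data['Avg_Sales_Per_ID'].idxmax()]
best_seed = int(best_row['Smpl_Rndm_St_Seed'])

# ...and regenerate that exact sample on demand
best_samp = id_samp.sample(n = smpl_sz, axis = 'rows', random_state = best_seed)
print(best_samp['Sales_Per_ID'].mean() == best_row['Avg_Sales_Per_ID'])  # True
```

Because the re-drawn sample is bitwise identical to the one produced in the loop, nothing is lost by discarding the samples themselves and keeping only the seeds.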

In this case, a deep understanding of the packages and the specific language in use, coupled with a standardized process for code review, led to materially improved runtime performance. This was just one way to impact the performance and reliability of your code, and hopefully it inspires you to take a fresh look at something you wrote a long time ago. For more on formally reviewing code, there is a really nice discussion of the topic on Stack Overflow's blog, and a Stack Exchange community devoted to the topic as well.
