
How to Reduce Python Runtime for Demanding Tasks

Practical techniques to accelerate heavy workloads with GPU optimization in Python

Photo by Mathew Schwartz on Unsplash

One of the biggest challenges data scientists face is the lengthy runtime of Python code when handling extremely large datasets or highly complex machine learning/deep learning models. Many methods have proven effective for improving code efficiency, such as dimensionality reduction, model optimization, and feature selection; these are algorithm-based solutions. Another option in certain cases is to switch to a different programming language. In today’s article, I won’t focus on algorithm-based methods for improving code efficiency. Instead, I’ll discuss practical techniques that are both convenient and easy to master.

To illustrate, I’ll use the Online Retail dataset, which is publicly available under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. You can download the original Online Retail data from the UCI Machine Learning Repository. The dataset contains all the transactions occurring over a specific period for a UK-based, registered, non-store online retailer. The goal is to train a model that predicts whether a customer will make a repeat purchase, and the following Python code is used to achieve that objective.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from itertools import product

# Load dataset from Excel file
data = pd.read_excel('Online Retail.xlsx', engine='openpyxl')

# Data preprocessing
data = data.dropna(subset=['CustomerID'])  
data['InvoiceYearMonth'] = data['InvoiceDate'].astype('datetime64[ns]').dt.to_period('M') 

# Feature Engineering
data['TotalPrice'] = data['Quantity'] * data['UnitPrice']
customer_features = data.groupby('CustomerID').agg({
    'TotalPrice': 'sum',
    'InvoiceYearMonth': 'nunique',  # Count of unique purchase months
    'Quantity': 'sum'
}).rename(columns={'TotalPrice': 'TotalSpend', 'InvoiceYearMonth': 'PurchaseMonths', 'Quantity': 'TotalQuantity'})

# Create the target variable
customer_features['Repurchase'] = (customer_features['PurchaseMonths'] > 1).astype(int)

# Train-test split
X = customer_features.drop('Repurchase', axis=1)
y = customer_features['Repurchase']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline model training with default parameters
# (the grid search below re-trains with tuned settings)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Define different values for parameters
n_estimators_options = [50, 100, 200]
max_depth_options = [None, 10, 20]
class_weight_options = [None, 'balanced']

# Train the RandomForestClassifier with different combinations of parameters
results = []
for n_estimators, max_depth, class_weight in product(n_estimators_options, max_depth_options, class_weight_options):
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, class_weight=class_weight, random_state=42)
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)
    results.append((n_estimators, max_depth, class_weight, accuracy))

The code takes some time to run because 541,909 rows of data are processed. In industries like e-commerce or social media, data scientists often handle even larger datasets, sometimes billions or even trillions of rows with many more features, and often a mix of structured and unstructured data such as text, images, or videos, which further increases the workload. It’s therefore critically important to apply techniques that optimize code efficiency. I’ll stick to the Online Retail data to keep the explanations simple. Before introducing these techniques, I measured the time required for running the entire Python script, reading the Online Retail data, and training the machine learning model.

import time

# Function to calculate and print elapsed time
def time_execution(func, *args, **kwargs):
    start_time = time.time()
    result = func(*args, **kwargs)
    elapsed_time = time.time() - start_time
    return result, elapsed_time

# 1. Full Python code execution timing
def complete_process():
    # Load dataset from Excel file
    data = pd.read_excel('Online Retail.xlsx', engine='openpyxl')

    # Data preprocessing
    data = data.dropna(subset=['CustomerID'])
    data['InvoiceYearMonth'] = data['InvoiceDate'].astype('datetime64[ns]').dt.to_period('M')

    # Feature Engineering
    data['TotalPrice'] = data['Quantity'] * data['UnitPrice']
    customer_features = data.groupby('CustomerID').agg({
        'TotalPrice': 'sum',
        'InvoiceYearMonth': 'nunique',
        'Quantity': 'sum'
    }).rename(columns={'TotalPrice': 'TotalSpend', 'InvoiceYearMonth': 'PurchaseMonths', 'Quantity': 'TotalQuantity'})
    customer_features['Repurchase'] = (customer_features['PurchaseMonths'] > 1).astype(int)

    # Train-test split
    X = customer_features.drop('Repurchase', axis=1)
    y = customer_features['Repurchase']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Model training with parameter combinations
    results = []
    for n_estimators, max_depth, class_weight in product(n_estimators_options, max_depth_options, class_weight_options):
        clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, class_weight=class_weight, random_state=42)
        clf.fit(X_train, y_train)
        accuracy = clf.score(X_test, y_test)
        results.append((n_estimators, max_depth, class_weight, accuracy))

    return results

# Measure total execution time
results, total_time = time_execution(complete_process)
print(f"Total execution time for the entire process: {total_time} seconds")

# 2. Timing the Excel file reading
def read_excel():
    return pd.read_excel('Online Retail.xlsx', engine='openpyxl')

# Measure time taken to read the Excel file
_, read_time = time_execution(read_excel)
print(f"Time taken to read the Excel file: {read_time} seconds")

# 3. Timing the model training
# Pass the test split explicitly instead of relying on global variables
def train_model(X_train, y_train, X_test, y_test):
    results = []
    for n_estimators, max_depth, class_weight in product(n_estimators_options, max_depth_options, class_weight_options):
        clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, class_weight=class_weight, random_state=42)
        clf.fit(X_train, y_train)
        accuracy = clf.score(X_test, y_test)
        results.append((n_estimators, max_depth, class_weight, accuracy))
    return results

# Measure time taken to train the model
_, train_time = time_execution(train_model, X_train, y_train, X_test, y_test)
print(f"Time taken to train the model: {train_time} seconds")
Screenshot by the author

The entire process takes nearly 20 seconds, with almost 18 seconds spent on reading the data file.
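As a side note, for repeated measurements like these, a small context-manager helper built on time.perf_counter() (a higher-resolution, monotonic clock compared to time.time()) can be more convenient than the wrapper function above. A minimal sketch; the helper name and label are illustrative:

import time
from contextlib import contextmanager
import pandas as pd

@contextmanager
def timer(label):
    # perf_counter is a monotonic, high-resolution clock suited to benchmarking
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f} seconds")

# Usage: wrap any stage you want to measure
with timer("Excel read"):
    data = pd.read_excel('Online Retail.xlsx', engine='openpyxl')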


Solution One: Enable GPU and Set Memory Growth

Compared to CPUs, GPUs are ideal for handling large datasets and complex models, such as deep learning models, because they support parallel processing. However, developers sometimes forget to set memory growth, which causes the GPU to attempt to allocate all of its memory for the model at startup.

So what is memory growth, and why is it so important when using a GPU? Memory growth is the mechanism that allows the GPU to allocate memory incrementally, as needed, rather than reserving a large block of memory upfront. If memory growth is not set and the model is large, there may not be enough available memory, which can result in an ‘out-of-memory’ (OOM) error. And when multiple models run simultaneously, one model can consume all the GPU memory and prevent the others from accessing the GPU.
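For context, memory growth is not the only option: TensorFlow can alternatively place a hard cap on how much GPU memory a process may claim. A minimal sketch of both settings (the 4096 MB limit is an arbitrary example value, and either setting must be applied before the GPU is first used):

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Option 1: grow memory usage incrementally as tensors are allocated
    tf.config.experimental.set_memory_growth(gpus[0], True)

    # Option 2 (alternative; don't combine with option 1): cap the process
    # at a fixed amount of GPU memory, here 4096 MB as an example
    # tf.config.set_logical_device_configuration(
    #     gpus[0],
    #     [tf.config.LogicalDeviceConfiguration(memory_limit=4096)]
    # )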

In short, setting memory growth properly enables efficient GPU usage, enhances flexibility, and improves the robustness of the training process for large datasets and complex models. After enabling the GPU and setting memory growth, the code performs as follows:

import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
from itertools import product
import time

# Enable GPU and Set Memory Growth
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)

# Function to calculate and print elapsed time
def time_execution(func, *args, **kwargs):
    start_time = time.time()
    result = func(*args, **kwargs)
    elapsed_time = time.time() - start_time
    return result, elapsed_time

# Read Excel File
def read_excel():
    return pd.read_excel('Online Retail.xlsx', engine='openpyxl')

# Complete Process Function
def complete_process():
    # Load dataset from Excel file
    data = read_excel()

    # Data preprocessing
    data = data.dropna(subset=['CustomerID'])
    data['InvoiceYearMonth'] = data['InvoiceDate'].astype('datetime64[ns]').dt.to_period('M')

    # Feature Engineering
    data['TotalPrice'] = data['Quantity'] * data['UnitPrice']
    customer_features = data.groupby('CustomerID').agg({
        'TotalPrice': 'sum',
        'InvoiceYearMonth': 'nunique',
        'Quantity': 'sum'
    }).rename(columns={'TotalPrice': 'TotalSpend', 'InvoiceYearMonth': 'PurchaseMonths', 'Quantity': 'TotalQuantity'})
    customer_features['Repurchase'] = (customer_features['PurchaseMonths'] > 1).astype(int)

    # Train-test split
    X = customer_features.drop('Repurchase', axis=1)
    y = customer_features['Repurchase']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Model training with parameter combinations
    results = []
    n_estimators_options = [50, 100]
    max_depth_options = [None, 10]
    class_weight_options = [None, 'balanced']

    for n_estimators, max_depth, class_weight in product(n_estimators_options, max_depth_options, class_weight_options):
        clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, class_weight=class_weight, random_state=42)
        clf.fit(X_train, y_train)
        accuracy = clf.score(X_test, y_test)
        results.append((n_estimators, max_depth, class_weight, accuracy))

    # Return the results along with the split so the later timing step can reuse it
    return results, (X_train, X_test, y_train, y_test)

# Measure total execution time
(results, splits), total_time = time_execution(complete_process)
X_train, X_test, y_train, y_test = splits
print(f"Total execution time for the entire process: {total_time} seconds")

# Measure time taken to read the Excel file
_, read_time = time_execution(read_excel)
print(f"Time taken to read the Excel file: {read_time} seconds")

# Measure time taken to train the model
# Pass the test split explicitly instead of relying on global variables
def train_model(X_train, y_train, X_test, y_test):
    results = []
    n_estimators_options = [50, 100]
    max_depth_options = [None, 10]
    class_weight_options = [None, 'balanced']

    for n_estimators, max_depth, class_weight in product(n_estimators_options, max_depth_options, class_weight_options):
        clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, class_weight=class_weight, random_state=42)
        clf.fit(X_train, y_train)
        accuracy = clf.score(X_test, y_test)
        results.append((n_estimators, max_depth, class_weight, accuracy))

    return results

_, train_time = time_execution(train_model, X_train, y_train, X_test, y_test)
print(f"Time taken to train the model: {train_time} seconds")
Screenshot by the author

The time taken to train the model decreased significantly, from 1.9 seconds to 0.6 seconds. However, the time taken to read the Excel file didn’t decrease considerably. Another technique is therefore needed to make loading and processing the data more efficient: disk I/O optimization with data pipeline prefetching.
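As an aside, and separate from the tf.data approach below, another common disk I/O remedy is to pay the slow .xlsx parse only once by caching the data in a binary format such as Parquet; subsequent reads are then typically much faster. A hedged sketch (the cache file name is illustrative, and pyarrow or fastparquet must be installed):

import os
import pandas as pd

# One-time conversion: parse the Excel file once, then reuse the binary cache
if not os.path.exists('online_retail.parquet'):
    pd.read_excel('Online Retail.xlsx', engine='openpyxl').to_parquet('online_retail.parquet')

# Later runs read the Parquet cache instead of re-parsing the Excel file
data = pd.read_parquet('online_retail.parquet')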


Solution Two: Disk I/O Optimization with Data Pipeline Prefetching

Disk input/output can become a bottleneck when reading very large datasets. TensorFlow’s tf.data API optimizes input pipelines and improves data loading and processing efficiency by supporting asynchronous operations and parallel processing. This solution reduces data loading and processing time because it creates a continuous, optimized flow of data from disk to the processing pipeline, minimizing the delays associated with reading large datasets. The updated code for loading the Online Retail.xlsx data using tf.data is as follows:

import time
import pandas as pd
import tensorflow as tf

# Function to calculate and print elapsed time
def time_execution(func, *args, **kwargs):
    start_time = time.time()
    result = func(*args, **kwargs)
    elapsed_time = time.time() - start_time
    return result, elapsed_time

# Function to load and preprocess dataset using tf.data
def load_data_with_tfdata(file_path, batch_size):
    # The numeric columns of the Online Retail dataset; a float32 signature
    # is only valid for these, not for text columns like Description
    numeric_cols = ['Quantity', 'UnitPrice', 'CustomerID']

    # Generator that reads the Excel file lazily, on first iteration,
    # and yields one row at a time as a dict of float scalars
    def data_generator():
        data = pd.read_excel(file_path, engine='openpyxl')
        for _, row in data[numeric_cols].iterrows():
            yield {col: float(row[col]) for col in numeric_cols}

    # Create a tf.data.Dataset from the generator
    dataset = tf.data.Dataset.from_generator(
        data_generator,
        output_signature={col: tf.TensorSpec(shape=(), dtype=tf.float32) for col in numeric_cols}
    )

    # Apply shuffle, batch, and prefetch transformations
    dataset = dataset.shuffle(buffer_size=1000).batch(batch_size).prefetch(tf.data.AUTOTUNE)

    return dataset

# Load and preprocess dataset using tf.data.Dataset
file_path = 'Online Retail.xlsx'
batch_size = 32
dataset, data_load_time = time_execution(load_data_with_tfdata, file_path, batch_size)
print(f"Time taken to load and preprocess data with tf.data: {data_load_time} seconds")
Screenshot by the author

The time taken to set up data loading dropped significantly, from 18 seconds to 0.05 seconds. Note that tf.data pipelines are lazy: the generator (and the pd.read_excel call inside it) only runs once the dataset is iterated, and prefetching then overlaps that I/O with downstream computation instead of paying for it all upfront.

It’s important to define batch_size properly, because the amount of data processed in each step affects both memory consumption and computational efficiency. If no batch size is set, the pipeline processes one element at a time, which makes data loading and model training very inefficient. A batch size that is too large or too small can lead to inefficient training, memory errors, slower convergence, or suboptimal model performance.
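A quick way to sanity-check the pipeline and the chosen batch_size is to materialize a single batch and inspect its shape. This is also when the deferred Excel read actually happens, since the generator only runs on iteration. A small sketch, assuming dataset was built by load_data_with_tfdata above:

# Pull one batch; the generator (and pd.read_excel inside it) executes here
for batch in dataset.take(1):
    for col, tensor in batch.items():
        # Each tensor should have shape (batch_size,); the final batch may be smaller
        print(col, tensor.shape)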


Conclusion

GPUs are well suited to extremely large datasets and highly complex models, but without proper parameter settings their advantages are hard to realize. Enabling GPU memory growth optimizes GPU usage and prevents memory errors, and disk I/O optimization with data pipeline prefetching significantly reduces data loading and processing time. Together, these techniques provide practical, impactful solutions for overcoming challenges in day-to-day workloads.

