
Common Machine Learning Programming Errors in Python

Common Python Errors in Machine Learning

In this post I will go over some of the most common errors I come across in Python during the model building and development process. For demonstration purposes we will use height/weight data which can be found on Kaggle. The data contains gender, height in inches, and weight in pounds.

The most common errors I have come across in my experience are the following:

IMPORTS

  1. NameError
  2. ModuleNotFoundError
  3. AttributeError
  4. ImportError

READING DATA

  1. FileNotFoundError

SELECTING COLUMNS

  1. KeyError

DATA PROCESSING

  1. ValueError

We will build a simple linear regression model and modify the code to show how the above errors arise in practice.

First let’s import the data using pandas and print the first five rows:

import pandas as pd
df = pd.read_csv("weight-height.csv")
print(df.head())

As you can see the data set is very simple, with gender, height and weight columns. The next thing we can do is visualize the data using matplotlib:

import matplotlib.pyplot as plt
plt.scatter(df['Weight'],  df['Height'])
plt.xlabel("Weight")
plt.ylabel("Height")
plt.show()

Looking at the scatter plot of the weight vs height we see that the relationship is linear.

Next let’s define our input (X) and output (y) and split the data for training and testing:

from sklearn.model_selection import train_test_split
import numpy as np
X = np.array(df["Weight"]).reshape(-1,1)
y = np.array(df["Height"]).reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)

We can then define a linear regression model, fit to our training data, make predictions on the test set, and evaluate the performance of the model:

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print("R^2 Accuracy: ", reg.score(X_test, y_test))

The first error I’ll discuss is the NameError which occurs if, for example, I forget to import a package. In the following code I have removed "import numpy as np" :

from sklearn.model_selection import train_test_split
X = np.array(df["Weight"]).reshape(-1,1)
y = np.array(df["Height"]).reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)

If I attempt to run the script with that line of code missing I get a NameError saying that the name 'np' is not defined.

I would receive similar messages if I left out the import statements for matplotlib or pandas.
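To see the message without breaking a whole script, here is a minimal sketch that runs a code string in an empty namespace and reports any NameError. The helper name `run_snippet` is hypothetical and only there for illustration.

```python
def run_snippet(snippet):
    """Run a code string in an empty namespace and return any NameError message."""
    try:
        exec(snippet, {})
    except NameError as err:
        return str(err)
    return None


# 'np' was never imported inside the snippet, so the name lookup fails.
message = run_snippet("X = np.array([1.0, 2.0])")
print(message)

# Once the name is bound by an import, the same kind of line runs cleanly.
# (math stands in for numpy here so the sketch has no dependencies.)
fixed = run_snippet("import math as np\nX = np.sqrt(4.0)")
print(fixed)
```

The first call reports that 'np' is not defined; the second returns None because the alias exists before it is used.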

Another issue is accidentally trying to import a package that doesn’t exist due to misspelling, which results in a ModuleNotFoundError. For example if I misspell ‘pandas’ as ‘pandnas’:

import pandnas as pd
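One way to catch a misspelled package name before the import fails is to ask the import system whether it can find the module. This is a small sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` is an assumption for this example.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


for name in ("json", "pandnas"):
    status = "found" if is_installed(name) else "not found (check the spelling)"
    print(f"{name}: {status}")
```

Note that ModuleNotFoundError is a subclass of ImportError, so an `except ImportError` clause will catch both.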

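Or if I forget ‘pyplot’ in the matplotlib import, the import itself succeeds but the later call to plt.scatter raises an AttributeError, because the top-level matplotlib package has no scatter attribute: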

import matplotlib as plt

Similarly, if I forget the linear_model and model_selection submodules in the sklearn imports, I’ll get an ImportError:

from sklearn import LinearRegression 
from sklearn import train_test_split
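The failing import can be wrapped in a try/except to inspect the message. This sketch should behave the same whether or not scikit-learn is installed: if the package is missing altogether, the ModuleNotFoundError that results is still an ImportError and is caught by the same clause.

```python
try:
    # LinearRegression lives in sklearn.linear_model, not at the top level.
    from sklearn import LinearRegression
except ImportError as err:
    import_failed = True
    print(f"{type(err).__name__}: {err}")
else:
    import_failed = False

# The working form names the submodule explicitly:
#   from sklearn.linear_model import LinearRegression
#   from sklearn.model_selection import train_test_split
```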

In terms of reading files, if I misspell the name of the file I get a FileNotFoundError:

df = pd.read_csv("weight-heigh1t.csv")
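A typo in a filename is easier to spot when the error handler suggests the closest file that does exist. The following is a sketch using only the standard library; `suggest_filename` is a hypothetical helper, and it assumes the directory part of the path is correct.

```python
import difflib
import os


def suggest_filename(path):
    """Return the closest existing filename in the same directory, or None."""
    directory = os.path.dirname(path) or "."
    candidates = os.listdir(directory)
    matches = difflib.get_close_matches(os.path.basename(path), candidates, n=1)
    return matches[0] if matches else None


try:
    open("weight-heigh1t.csv").close()
except FileNotFoundError as err:
    # err.filename holds the path that could not be opened.
    hint = suggest_filename(err.filename)
    print(f"{err.filename} not found; closest match: {hint}")
```

The same try/except pattern applies to the pd.read_csv call, which raises FileNotFoundError for a missing path.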

Additionally, if I try to select a column from a pandas data frame that doesn’t exist I get a KeyError :

plt.scatter(df['Weight1'],  df['Height'])
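A quick check of the required column names before plotting or modelling turns a KeyError deep in the script into an explicit message up front. This is a minimal sketch; `missing_columns` is a hypothetical helper, and with pandas the `available` argument would be df.columns.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in their given order."""
    available = set(available)
    return [name for name in required if name not in available]


# Column names of the height/weight data set described above.
columns = ["Gender", "Height", "Weight"]

print(missing_columns(columns, ["Weight1", "Height"]))  # ['Weight1']
print(missing_columns(columns, ["Weight", "Height"]))   # []
```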

If I pass the pandas series for "Weight" and "Height" straight to the model without converting and reshaping them, I get a ValueError. sklearn estimators do accept pandas objects, but the input X must be 2-dimensional, and a single series is 1-dimensional. I frequently find myself forgetting this simple step:

X = df["Weight"]
y = df["Height"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)
reg = LinearRegression()
reg.fit(X_train, y_train)

Likewise, if I convert to numpy arrays but forget to reshape them into 2-dimensional arrays, I get the same ValueError:

X = np.array(df["Weight"])
y = np.array(df["Height"])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)
reg = LinearRegression()
reg.fit(X_train, y_train)
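The difference between the two cases comes down to array shape, which is easy to inspect directly. Below is a small sketch with made-up weight values; `as_feature_matrix` is a hypothetical helper that reshapes only when the input is 1-dimensional.

```python
import numpy as np


def as_feature_matrix(values):
    """Return a 2-D (n_samples, n_features) array, reshaping 1-D input to one column."""
    arr = np.asarray(values)
    return arr.reshape(-1, 1) if arr.ndim == 1 else arr


# Made-up values standing in for np.array(df["Weight"]).
weights = np.array([241.9, 162.3, 212.7, 220.0])
print(weights.shape)  # (4,)

X = as_feature_matrix(weights)
print(X.shape)  # (4, 1)
```

With pandas, selecting with a list of column names, as in df[["Weight"]], returns a 2-dimensional dataframe and avoids the reshape altogether.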

Another common cause of a ValueError is the train/test split, where I often forget the order of the returned arrays. The correct order is:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)

If I switch X_test and y_train:

X_train, y_train, X_test, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)

the variables named X_train and y_train end up with different numbers of samples, and fitting raises a ValueError.
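A cheap guard against a mixed-up unpacking order is to compare lengths right after the split. This sketch uses plain lists as stand-ins for the four arrays; `check_split` is a hypothetical helper, not part of sklearn.

```python
def check_split(X_train, X_test, y_train, y_test):
    """Raise early if features and targets do not line up after a split."""
    if len(X_train) != len(y_train):
        raise ValueError(f"train sizes differ: {len(X_train)} vs {len(y_train)}")
    if len(X_test) != len(y_test):
        raise ValueError(f"test sizes differ: {len(X_test)} vs {len(y_test)}")


# Toy stand-ins, listed in the order train_test_split returns them:
# X_train, X_test, y_train, y_test.
parts = ([1, 2, 3, 4], [5, 6], [10, 20, 30, 40], [50, 60])

X_train, X_test, y_train, y_test = parts
check_split(X_train, X_test, y_train, y_test)  # passes silently

X_train, y_train, X_test, y_test = parts  # names in the wrong order
try:
    check_split(X_train, X_test, y_train, y_test)
    error_message = None
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

The guard only catches the mistake when the train and test portions differ in size, which is the case for any split other than 50/50.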

Another issue I often come across when fitting a model to data for a specific category or population is not having enough data. Let’s filter our dataframe to replicate this issue, keeping only records where ‘Weight’ equals 241.893563. This results in exactly one row of data:

df = df[df['Weight'] == 241.893563]

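If we redefine X and y from the filtered dataframe and try to build our model, we get a ValueError on the line where we split our data, since one row cannot be divided into non-empty train and test sets: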

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print("R^2 Accuracy: ", reg.score(X_test, y_test))
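To check in advance whether a filtered data set is large enough to split, the expected train and test sizes can be estimated from the row count. The sketch below is an assumption about how a fractional test_size is rounded (test size rounded up, the remainder used for training), not sklearn's actual code; both helper names are hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Estimate (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


def can_split(n_samples, test_size=0.33):
    """Return True if both the train and the test set would be non-empty."""
    n_train, n_test = split_sizes(n_samples, test_size)
    return n_train > 0 and n_test > 0


print(split_sizes(1, 0.33))   # (0, 1): the training set would be empty
print(split_sizes(10, 0.33))  # (6, 4)
print(can_split(1), can_split(10))
```

Calling can_split(len(df)) after filtering gives a chance to skip or merge small categories before the split fails.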

And if we skip the split and try to fit on the filtered data directly, we get the following error:

#X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)
reg = LinearRegression()
reg.fit(X, y)

Finally, if the data has missing or infinite values the fitting will throw an error. Let’s redefine the weight column with ‘nan’ (Not a Number) values to generate this error:

df['Weight'] = np.nan
X = np.array(df["Weight"]).reshape(-1,1)
y = np.array(df["Height"]).reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)

reg = LinearRegression()
reg.fit(X_train, y_train)

We would get a similar error with infinite values:

df['Weight'] = np.inf
X = np.array(df["Weight"]).reshape(-1,1)
y = np.array(df["Height"]).reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, test_size = 0.33)
reg = LinearRegression()
reg.fit(X_train, y_train)
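Counting missing and infinite values before fitting makes this failure easy to anticipate. This is a minimal sketch with made-up values; `count_non_finite` is a hypothetical helper built on numpy's isnan, isinf and isfinite.

```python
import numpy as np


def count_non_finite(values):
    """Return (number of NaN values, number of infinite values) in the input."""
    arr = np.asarray(values, dtype=float)
    return int(np.isnan(arr).sum()), int(np.isinf(arr).sum())


# Made-up weights containing one NaN and two infinities.
weights = np.array([150.0, np.nan, np.inf, -np.inf, 180.5])
print(count_non_finite(weights))  # (1, 2)

# Keep only the rows that are safe to pass to a model.
clean = weights[np.isfinite(weights)]
print(clean.shape)  # (2,)
```

Whether to drop or impute the affected rows depends on the data; the check itself only tells you that a decision is needed.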

In this post we reviewed different errors that arise when developing models in Python. We reviewed errors related to importing packages, reading files, selecting columns and processing data. Solid knowledge of the different types of errors that arise when developing machine learning models is useful when productionizing machine learning code: it can prevent errors from occurring and inform the logic used to catch them when they do occur.

There are many more errors that I come across daily but the errors I’ve listed in this post are most common in my experience. I hope this post was useful. The code from this post is available on GitHub. Thank you for reading!

