
Revisiting a problem you worked on in the past and finding out that the code no longer works is frustrating. Making your data analysis experiments reproducible saves time for you and others in the long term. These tips will help you write reproducible pandas code, which matters when you work in a team or on personal projects that you plan to revisit or share with others.
To run the examples, download this Jupyter notebook.
Output Versions
To make your pandas experiment reproducible, start by outputting the system information and the versions of the Python packages you use. You (or your colleague) can thank me later. Some functions may be deprecated, broken or unavailable in older versions of a package. Note that the Python packages are intentionally printed with == so that you can use the output directly with pip to install them.
I use the following template to output system information and versions of packages.
import os
import platform
from platform import python_version
import jupyterlab
import numpy as np
import pandas as pd
print("System")
print("os name: %s" % os.name)
print("system: %s" % platform.system())
print("release: %s" % platform.release())
print()
print("Python")
print("version: %s" % python_version())
print()
print("Python Packages")
print("jupyterlab==%s" % jupyterlab.__version__)
print("pandas==%s" % pd.__version__)
print("numpy==%s" % np.__version__)
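If a package does not expose `__version__`, the standard library can look the version up instead. The following is only a sketch: the helper name `pinned_requirements` is made up for illustration, and it assumes Python 3.8 or newer, where `importlib.metadata` is available.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(packages):
    """Return 'name==version' lines for the given distribution names."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Keep a visible marker instead of failing on a missing package.
            lines.append(f"# {name} is not installed")
    return lines


print("\n".join(pinned_requirements(["pandas", "numpy"])))
```

The printed lines can be pasted into a requirements file in the same way as the template above.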

Say NO to automatic imports
Once I had this great idea that instead of writing import pandas as pd, import numpy as np, etc. each time, Jupyter Notebook could import them automatically.
Sounds great? Well, it didn’t end well. After a while, I forgot about my custom configuration and soon after I got questions like "Did you even run the code because it fails on the first line!", "How does this code work on your machine?".
Say NO to automatic imports.
Set Seeds
When using randomly generated data, setting the seed is a must. In case you are using a Machine Learning model, you should also set the random seed (if available) so that the model returns deterministic output.
Let’s look at an example. We set the random seed and output a random sample of 10 pseudorandom numbers. As expected, the second call returns different pseudorandom numbers than the first one. Note that pandas uses numpy under the hood, so we need to set the seed with numpy.
np.random.seed(42)
np.random.random_sample(10)
# Output
array([0.37454012, 0.95071431, 0.73199394, 0.59865848, 0.15601864,
0.15599452, 0.05808361, 0.86617615, 0.60111501, 0.70807258])
np.random.random_sample(10)
# Output
array([0.02058449, 0.96990985, 0.83244264, 0.21233911, 0.18182497,
0.18340451, 0.30424224, 0.52475643, 0.43194502, 0.29122914])
What happens when we set the same seed again? We reset the seed and get the same sequence of numbers as above. This is what makes the pseudorandom number generator deterministic.
np.random.seed(42)
np.random.random_sample(10)
# Output
array([0.37454012, 0.95071431, 0.73199394, 0.59865848, 0.15601864,
0.15599452, 0.05808361, 0.86617615, 0.60111501, 0.70807258])
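The same idea applies to pandas calls that accept a seed directly. As a small sketch (the DataFrame `df_demo` is made up for illustration), `DataFrame.sample` takes a `random_state` argument, and `np.random.default_rng` creates a local generator, so the result does not depend on the global seed being set earlier in the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative data, not part of the original examples.
df_demo = pd.DataFrame({"value": range(100)})

# A fixed random_state makes the sample repeatable on its own.
first = df_demo.sample(n=5, random_state=42)
second = df_demo.sample(n=5, random_state=42)
assert first.equals(second)

# A local generator keeps the seed next to the code that depends on it.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
assert np.array_equal(rng_a.random(3), rng_b.random(3))
```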
Commenting
A code block with 10 lines of pandas code can most probably be rewritten in 5 lines by a pandas expert, but readability suffers. We tend to do the same as we get better and better with the tool. We may know what the code does today, but will we remember in a month? Will a junior analyst understand this mumbo jumbo?
Probably not! When the code gets complex, you should write a comment or two. Not just when doing data analysis with pandas, but when coding in general.
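As an illustration, here is a compact method chain where a short comment on each step explains the intent. The `sales` data and column names are invented for this sketch.

```python
import pandas as pd

# Illustrative data, not part of the original examples.
sales = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south"],
        "amount": [10, None, 5, 7],
    }
)

top_regions = (
    sales.dropna(subset=["amount"])  # ignore rows with a missing amount
    .groupby("region")["amount"]
    .sum()  # total amount per region
    .sort_values(ascending=False)  # largest total first
)
```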
Safety checks
Write safety checks instead of comments, like "This part of the code doesn’t support Null values or duplicates". It will take the same amount of time, but the user will notice safety checks for sure as they will break execution in the case of a problem.
Let’s look at the example below.
df = pd.DataFrame(index=[1, 2, 3, 4, 4], data={"col1": ["a", "b", "c", "d", "d"]})

The DataFrame has a duplicated index value, 4. We can use the duplicated function to detect duplicated index values and then break execution with an assert statement.
assert len(df[df.index.duplicated()]) == 0, "Dataframe has duplicates"
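The comment quoted above also mentions Null values, so a check could cover both cases. This is a sketch under assumptions: the helper name `check_frame` is hypothetical, and it raises `ValueError` instead of using assert, because assert statements are skipped when Python runs with the -O flag.

```python
import pandas as pd


def check_frame(frame, column):
    """Raise ValueError if the index has duplicates or `column` has nulls."""
    if not frame.index.is_unique:
        raise ValueError("DataFrame index has duplicates")
    if frame[column].isna().any():
        raise ValueError(f"Column {column!r} contains null values")
    return frame


clean = pd.DataFrame(index=[1, 2, 3], data={"col1": ["a", "b", "c"]})
check_frame(clean, "col1")  # passes silently and returns the frame
```

Because the helper returns the frame, it can also be used in a chain with `clean.pipe(check_frame, "col1")`.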

Format the code
Jupyter Notebooks are notorious for unformatted, ugly code. The main reason is that early versions didn’t have code formatters. Once formatters became available, they weren’t trivial to install. But this is not the case anymore.
I use jupyterlab-code-formatter with JupyterLab and it works well. Here is the installation guide. Let me know in the comments if you need any help installing it.
Properly formatted code will increase the chance you won’t throw it away and start over.
Output the shape of a DataFrame
I find it a good practice to output the shape of a DataFrame after each transformation. This helps to spot bugs when there is an incorrect number of rows or columns after reading, merging, etc. This way we can find mistakes by only reviewing the outputs of the shape function without rerunning the notebook.
Let’s join the example DataFrame with itself on the index (join performs a left join by default). It has only 5 rows, so we expect that the new DataFrame will also have 5 rows.
df.shape
(5, 1)
The new DataFrame has 7 rows instead of 5.
df_new = df.join(df, lsuffix='_l', rsuffix='_r')
df_new.shape
(7, 2)
We see that the problem occurs because of a duplicated index. We spot a bug right away by using the shape function.
df_new
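One way to make the shape output a habit is a small helper that prints the shape and passes the DataFrame through, so it can sit between the steps of a chain. The helper name `log_shape` is made up for this sketch.

```python
import pandas as pd


def log_shape(frame, label):
    """Print the shape with a label and return the frame unchanged."""
    print(f"{label}: {frame.shape}")
    return frame


df = pd.DataFrame(index=[1, 2, 3, 4, 4], data={"col1": ["a", "b", "c", "d", "d"]})

df_new = (
    df.pipe(log_shape, "input")
    .join(df, lsuffix="_l", rsuffix="_r")
    .pipe(log_shape, "after join")
)
# prints:
# input: (5, 1)
# after join: (7, 2)
```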

Asking reproducible questions
There are cases when we have the input data, we know how the output should look, but we don’t know how to write steps in-between (we’ve all been there). The next step is usually to ask a question on Stack Overflow or Reddit or ask a colleague for help.
We can make collaborative problem solving much easier by writing reproducible questions:
- describe the core of the problem concisely and don’t dive too deep by copy-pasting half of the experiment,
- use small DataFrames that can be initialized in a single line (don’t reference local datasets),
- when using Slack, wrap the code in a code block with ```.
A good practice is to make DataFrame output copy-paste friendly. Never heard of it? Let me explain with an example.
df
# Copy-paste output
col1
1 a
2 b
3 c
4 d
4 d
pandas has a read_clipboard function, which does what the name suggests. We can copy the output from above, run the command below, and it will replicate the DataFrame. Try it for yourself. We can also change the separator to tabs (\t) or any other separator if necessary.
df = pd.read_clipboard(sep=r'\s\s+')
df

This copy-paste procedure doesn’t work with a MultiIndex, so try to avoid it when asking questions.
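Another way to share a small DataFrame is as a single line of code that rebuilds it. A minimal sketch, assuming the DataFrame is small enough to print: `to_dict(orient="split")` returns the index, columns and data as separate lists, and these can be passed straight back to the `pd.DataFrame` constructor.

```python
import pandas as pd

df = pd.DataFrame(index=[1, 2, 3, 4, 4], data={"col1": ["a", "b", "c", "d", "d"]})

# orient="split" keeps index, columns and data separate, so the
# duplicated index value survives the round trip.
spec = df.to_dict(orient="split")

# Print a one-liner that can be pasted into a question.
print(f"df = pd.DataFrame(**{spec!r})")

# The printed line rebuilds an identical DataFrame.
rebuilt = pd.DataFrame(**spec)
assert rebuilt.equals(df)
```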
Conclusion
These were a few tips to make your pandas experiments reproducible. Did you enjoy the post? Let me know in the comments below.