Jupyter Best Practices That Will Save You A Lot of Headaches

These simple techniques can save your project a lot of time on revisions and protect it from data loss

Brunna Torino
Towards Data Science


Having worked with Jupyter almost daily for the past three years, I have run into serious situations that set me back days, or even weeks, in my project completion time. These are simple habits I have learned that prevent serious data loss, time loss, and project-phase confusion.

Photo by Joshua Reddekopp on Unsplash

Save copies of your dataset — especially if they are large, time-consuming imports.

If you have a 75MB file that takes 10–15 minutes to import into your working Jupyter Lab or Notebook, make sure to save a copy as soon as the import is done:

import pandas as pd

really_large_file = pd.read_excel('largefile.xlsx')
copy = really_large_file.copy()

You can then choose to work on either the copy or the original data frame (there is no difference), but the important thing is that you can go back to one of them if you make an irreparable mistake and only notice it far down the line, when it is too late or too confusing to "undo cell operation".
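As a toy illustration (with a made-up three-row frame): DataFrame.copy() is deep by default, so mistakes on the working frame never touch the backup, and restoring is a one-liner:

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 20, 30]})
backup = df.copy()           # deep copy by default: new data, not a view

df["price"] = 0              # an "irreparable" mistake on the working frame
df = backup.copy()           # restore from the untouched backup

print(df["price"].tolist())  # [10, 20, 30]
```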

Save copies of your checkpoints as well.

If you have identified a rough number of significant phases in your project where you will change the dataset substantially, make sure to save the result under another name at the end of each operation, such as:

def some_operation_to_my_data(df):
    # some operation
    return df

new_df = some_operation_to_my_data(old_df)

This way you can always go back to your old data frame before the big operation if you need to cross-check any details or discard the operation completely.

What about memory?

There is an obvious trade-off here between saving copies and memory usage. If you find yourself with a heavy Jupyter session, remember to delete those copies as you go along, once you have validated them and are certain you will not need them anymore. Deleting them with del drops your reference so that Python's garbage collection can reclaim the memory:

del old_df
del really_large_file
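To see how much a copy actually costs before you delete it, pandas can report its size. A quick sketch, using a dummy 100,000 × 10 float frame invented for the demo:

```python
import gc

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((100_000, 10)))   # dummy data, roughly 8 MB of float64
copy = df.copy()

# memory_usage(deep=True) returns bytes per column; sum them for the total
mb = copy.memory_usage(deep=True).sum() / 1e6
print(f"copy holds about {mb:.0f} MB")

del copy       # drop the reference...
gc.collect()   # ...and nudge the garbage collector to reclaim the memory now
```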

Instead of saving copies from your checkpoints, you can also save them as files, freeing memory from the current Jupyter session:

def some_operation_to_my_data(df):
    # some operation
    return df

new_df = some_operation_to_my_data(old_df)
old_df.to_excel('checkpoint1.xlsx')
del old_df
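If you later need that checkpoint back, reloading it is a single read call. A minimal round trip, written with CSV here only to avoid the Excel engine dependency (the to_excel/read_excel pair works the same way):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
df.to_csv("checkpoint1.csv", index=False)   # persist the checkpoint to disk
del df                                      # free the in-memory copy

restored = pd.read_csv("checkpoint1.csv")   # bring it back when needed
print(restored["x"].tolist())               # [1, 2, 3]
```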

Label your cells, especially if you are still in the data exploration phase.

One of the most annoying things that can happen is to spend hours and hours investigating and understanding your dataset, come back the next morning or the next week, and no longer understand what you did or what you concluded. Even worse is when another colleague has to take over and you can no longer explain your work.

Label parts of your Jupyter Notebook by creating a cell between them, where you will label what you are doing for this next phase. Also include a conclusion cell at the end, with a short summary of what you understood and any questions you have left.

## DATA VISUALIZATION PART: HIST, HEATMAP, CORRELATIONS...

# CONCLUSIONS: NO SKEWNESS, NORMAL DIST, MULTICOLLINEARITY...

Work Chronologically

This tip goes along with the previous point, but it’s worth mentioning and emphasizing by itself.

Sometimes you want to run a quick operation on an earlier copy of your dataset, and you don't think you will keep it in the file anyway, so you pick a random cell and write the code there. We have all been there, but try to form the habit of organizing cells chronologically, placing each one where it would belong if you had to hand in the project today as it is, or explain what, how, and why you wrote the code that way.

It saves you time formatting everything for submission later on.

Opening Copious Amounts of Files

If you have a large folder containing dozens of files, it can be extremely tedious to import each one of them individually. What you can do instead is compress that folder into a .zip file and extract it in Jupyter:

import zipfile as zf
files = zf.ZipFile("ZippedFolder.zip", 'r')
files.extractall('directory to extract')
files.close()
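Once extracted, the files can be loaded in a loop instead of one import per file. A sketch assuming the archive contained CSVs (the folder name and the tiny fixture file are made up for the demo):

```python
import glob
import os

import pandas as pd

folder = "extracted_folder"   # hypothetical target you passed to extractall()
os.makedirs(folder, exist_ok=True)
pd.DataFrame({"a": [1, 2]}).to_csv(os.path.join(folder, "part1.csv"), index=False)

# read every CSV in the folder into a dict keyed by file name
frames = {os.path.basename(p): pd.read_csv(p) for p in glob.glob(f"{folder}/*.csv")}
print(sorted(frames))   # ['part1.csv']
```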

In the same way, if you need to download several files from Jupyter you can also do it in one line:

import shutil
# note: don't append .zip to the first argument; make_archive adds it for you
shutil.make_archive('output_filename', 'zip', 'directory_to_download')

Source for this code: Afshin Amiri

Save Code On A Separate File

Have you written a full algorithm and now you realized you actually have to change most of it or scrap it entirely?

Open a draft file for every main file of your project, and copy-paste any unwanted code you wrote in there. More often than not, you can reuse some of what you wrote previously in your new algorithm, or need to go back to see what didn't work in your previous code. It takes up very little extra space and can save you a lot of time and confusion.

Interrupt Kernel and Undo Cell Operation Button

Did you write a forever-looping algorithm by mistake? Do you suspect your current algorithm is not going to output what you expect? You don’t need to close everything or delete the cell. Look for the Interrupt Kernel button on the Jupyter menu bar and interrupt the currently running cell to save your project from data loss.

Did you accidentally delete a cell, or make a mistake with your dataset? You can look for "Undo Cell Operation" in the Jupyter menu bar to bring back the cell you deleted or undo the last cell operation on your data.

Revert to Checkpoint

Jupyter saves checkpoints of your notebook from time to time, and if you realize you need to revert your whole file to an earlier version, you can do that with the "Revert to Checkpoint" button. However, it is not the most reliable way of retrieving data and code, as the last checkpoint could have been three minutes ago or eight hours ago. I would always reach for the aforementioned tips before this one, but it is good to know it exists.

Create a Custom Template

If you almost always use the same imports and packages in your projects, such as:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
[...]

You can save those repetitive imports as a Jupyter notebook, and just click Duplicate instead of opening a new, blank file. It saves you a little bit of time and proves very convenient in the long run.

In Conclusion…

This is a non-exhaustive list of best practices I found to increase my productivity, efficiency, and professionalism while using Jupyter. For me, the most important benefit of implementing these is that I can always go back to an earlier version of my project (be it older code I thought wouldn't work, or an earlier version of a dataset) quickly and efficiently, which in turn makes me more willing, and less hesitant, to take risks, make mistakes, and try new ways of achieving my goals.

Do you have any other advice? Comment below and I will add them to the article!
