Notebooks, such as Jupyter or R Markdown, have long been among the most widely adopted tools for data scientists.
With a notebook, you can organize a data-related solution and have a nicely formatted description of your work. A clear advantage is having the possibility of directly documenting every aspect of a project using Markdown, alongside every piece of code.
However, not everyone shares the notebook in a way that helps others correctly reproduce the results. And we all know that reproducibility is a primary concern in any Data Science endeavor.
Below, I will talk about the 8 main points you have to consider to make sure your notebook is easy to follow and will be highly appreciated by any reader and yourself in the future.
1. Do not perform too many operations in each cell
Packing many operations into a single code cell is a bad practice. It makes the notebook less interactive (only the value of the last expression in a cell is displayed automatically) and can hide details (the reader may need more time to understand what each cell is doing).
Notebooks have practically unlimited cells. So instead of stacking up a lot of code in a single cell, make sure to create new cells freely to explore your data.
I’m not saying you should have one-line cells. Code that makes sense together should be in the same cell. Use common sense when deciding when to create a new code cell.
2. Have a clear structure with organized sections
The structure of your notebook matters more than you think.
With a clear structure for your notebook, readers will be able to:
- Understand the overall idea easily; and
- Go directly to the part of the project they’re most interested in.
So, aiming for simplicity (KISS principle), we can use the following structure:
1. Introduction 2. Data Wrangling 3. Exploratory Data Analysis 4. Modeling (optional) 5. Conclusion
Each of these five sections has a few key points it should always include, generally organized into subsections. Notebook structure is a crucial and often underrated topic. For a complete discussion, see the following post from my former colleague Ryan:
How to Structure your Data Science Notebook to be Easy to Follow
3. Annotate every aspect of your plots
Plots should have annotations everywhere. The goal is to make sure our plots are fully descriptive. In other words, they should be stand-alone, "readily interpreted".
For each visualization in the notebook, we have to ensure that it contains at least the following:
- Labels on the x-axis and y-axis, with text that properly describes the data, instead of using variable names.
- A title explaining what is depicted.
- A legend, if necessary.
- Units for each axis, e.g., "Temperature (Celsius)" on the y-axis label.
For a nice tutorial on annotating plots using Python, I recommend the Matplotlib tutorial.
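As a minimal sketch of the checklist above, the plot below sets a title, descriptive axis labels with units, and a legend. The temperature readings and city names are made-up values used only for illustration, and the `Agg` backend line is an assumption for running outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Hypothetical data, for illustration only.
hours = [0, 6, 12, 18, 24]
lisbon_celsius = [14.0, 13.0, 19.5, 17.0, 14.5]
oslo_celsius = [3.0, 2.0, 7.5, 5.0, 3.5]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(hours, lisbon_celsius, marker="o", label="Lisbon")
ax.plot(hours, oslo_celsius, marker="s", label="Oslo")

# A title explaining what is depicted.
ax.set_title("Air temperature over one day")
# Axis labels that describe the data and include units, not variable names.
ax.set_xlabel("Time of day (hours)")
ax.set_ylabel("Temperature (Celsius)")
# A legend, needed here because there are two lines.
ax.legend(title="City")

fig.tight_layout()
```

In a notebook, the figure is rendered below the cell once the cell finishes running.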
4. Make liberal use of Markdown cells
To truly take advantage of notebooks, we have to fill all gaps in understanding the code with Markdown cells. The idea is to explain every step, decision, and result.
While comments in the code cells help explain the code, Markdown cells should help the reader understand the overall goal of each step.
To create a Markdown cell, just hit ESC and then "m"; this turns a code cell into a Markdown cell. To learn more about this absurdly simple language, take a look at this cheat sheet. It’s all that you need to use Markdown.
5. Make sure there are no errors from "future" cells
Sometimes during the analysis, you add code to cells and execute them, and then, after that, you modify and run another cell that comes before them. This may obviously cause some inconsistencies.
For example, using variables defined in cells below the current cell will produce errors. Consider a straightforward case: we create a DataFrame df_cell_3 in the third cell but, running the code top-down, we try to access df_cell_3 in the second cell before it has been created.
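This situation can be sketched in a single script, with comments marking where each notebook cell would begin. The column names and values are made up for illustration, and the try/except is there only so the script keeps running; in a real notebook, the second cell would simply stop with a NameError.

```python
# Cell 1: imports
import pandas as pd

# Cell 2: runs before cell 3 in a top-down execution,
# so df_cell_3 does not exist yet.
try:
    df_cell_3.head()
except NameError as error:
    cell_2_error = error
    print(f"Cell 2 failed: {error}")

# Cell 3: the DataFrame is only created here (hypothetical data).
df_cell_3 = pd.DataFrame(
    {"city": ["Lisbon", "Oslo"], "temperature_celsius": [21.5, 12.0]}
)
```

The fix is to move the cell that creates df_cell_3 above the cell that uses it.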

Therefore, every time you finish some part of your project, restart the notebook and run all the cells in order (from top to bottom) to check for errors. Having the code cells executed in order before going forward will ensure everything is well set and no "future" variables/functions are used before their creation.
If you always restart the notebook to run cells in order, the numbers between the brackets on the left side of the cells will always be in increasing order. Having these numbers in a sequential order further shows everyone that the results are valid and well organized.
6. Functions are your friends
Unfortunately, using repetitive code in notebooks is still a widespread practice. I mean, it is just too easy to simply copy and paste code among cells, right?
But instead of using boilerplate code and pretending that code repetition is not a problem, we should create functions as much as possible.
By defining functions, we reuse code and adjust parameters when needed. This gives us the first draft of modular code that could be used in a real-world application in production.
Additionally, when writing functions, remember to include docstrings, a widely used convention that makes code easier to understand, reuse, and maintain. Follow this link to learn more about docstrings.
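As a small illustration of the idea, here is a sketch of a reusable function with a docstring. The function name `summarize`, its parameters, and the sample values are hypothetical, chosen only to show how one definition can replace the same few lines copied across several cells.

```python
from statistics import mean


def summarize(values, label):
    """Return basic summary statistics for a sequence of numbers.

    Parameters
    ----------
    values : sequence of float
        The numbers to summarize. Must not be empty.
    label : str
        A human-readable name for the data, kept in the result.

    Returns
    -------
    dict
        The label, the number of values, and their mean, minimum, and maximum.
    """
    if not values:
        raise ValueError("values must not be empty")
    return {
        "label": label,
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }


# The same function is reused with different inputs instead of pasted code.
morning = summarize([13.0, 14.5, 16.0], "Morning temperature (Celsius)")
evening = summarize([17.0, 15.5, 14.0], "Evening temperature (Celsius)")
```

Calling `help(summarize)` in a notebook cell displays the docstring, so readers do not need to read the function body to know how to use it.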
7. Double check packages and list them early on
A notebook is usually built after many iterations, trying different packages and functionalities. In the middle of the analysis, you often add a package and use it right away. When this happens, we often end up with an import statement in a code cell in the middle of the notebook.
So regarding importing code, a good practice is to identify all of the packages used and list them in a code cell at the beginning of the notebook. This way, the use of each package is immediately evident when opening the notebook.
More importantly, this also helps when transferring the code to production-level scripts, as the requirements file with the packages needed by the notebook will be more easily identified.
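One possible way to support this is sketched below: a first cell that, next to the import statements, prints the installed version of each package in a requirements-style format. The package list is a hypothetical placeholder, and the helper name `installed_versions` is my own; the version lookup uses the standard library's `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version

# Hypothetical list: replace with the packages your notebook imports.
REQUIRED_PACKAGES = ["numpy", "pandas", "matplotlib"]


def installed_versions(package_names):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions


package_versions = installed_versions(REQUIRED_PACKAGES)
for name, installed in package_versions.items():
    if installed is None:
        print(f"{name}: not installed")
    else:
        print(f"{name}=={installed}")
```

The printed `name==version` lines can be copied into a requirements file when the code moves to production scripts.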
8. One focus
It is tempting to write code to solve many problems in a single notebook. However, doing this might be confusing for a future reader, who will probably be yourself. So always define only one goal per notebook. You will know clearly what to expect, and your work will be more organized.
Of course, focusing on one thing at a time will increase the number of notebooks you have. But having two or more notebooks is not a problem. It is always better to create multiple notebooks than to overload one.
(BONUS) Export your notebook
If you intend to share your .ipynb file, remember to also export it to an .html file to make it easier for others to view the contents.
To create an HTML file for a notebook named My_Notebook.ipynb, you can add a code cell to that notebook that runs the export.
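A minimal sketch of such an export cell, assuming Jupyter's `nbconvert` command-line tool is installed and on the PATH. The helper names `build_export_command` and `export_to_html` are hypothetical; the actual export call is left commented out so the cell is safe to run anywhere.

```python
import subprocess


def build_export_command(notebook_path):
    """Return the nbconvert command that exports a notebook to HTML."""
    return ["jupyter", "nbconvert", "--to", "html", notebook_path]


def export_to_html(notebook_path):
    """Run the export and raise an error if nbconvert fails."""
    subprocess.run(build_export_command(notebook_path), check=True)


command = build_export_command("My_Notebook.ipynb")
print(" ".join(command))

# Uncomment to run the export (requires Jupyter and nbconvert):
# export_to_html("My_Notebook.ipynb")
```

Inside a notebook, the same command can also be run directly from a cell with the shell escape `!jupyter nbconvert --to html My_Notebook.ipynb`.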
Thanks for reading. Please let me know in the comments below if you have more best practices to create professional notebooks.