Why Data Scientists Should Use Jupyter Notebooks in Moderation

Jupyter notebooks were a game changer for data scientists across the globe. But should they be used indiscriminately?

Maurício Cordeiro
Towards Data Science

--

Photo by Elisa Ventur on Unsplash

Introduction

There’s no doubt that the launch of Project Jupyter, and its notebooks, back in 2015 changed the relationship between scientific programmers and their code. The first reason is the ease of connecting to different programming languages (kernels) and combining text with code snippets and outputs such as tables, graphs, and maps on a single page. This notebook feature made it possible, and simple, to implement the literate programming paradigm, first proposed by Donald Knuth in 1984.

Literate programming was first introduced by Knuth at Stanford University, with the objective of bringing program logic closer to human language. It combines code with natural-language text.

The second reason is the interactive nature of Jupyter Notebooks. The ability to experiment with data and see the results of each typed command makes it ideal for data scientists and researchers, whose focus is on data analysis rather than development.

With interactive notebooks, it is no longer necessary to write a long, error-prone script with dozens (or hundreds) of lines of code that only displays its results at the end of the processing. Depending on the objective, you don’t even need to bother declaring functions or designing classes. You can just declare your variables on demand and focus on the results.

Bottom line: Python and Jupyter became a standard for data scientists. This is confirmed by the increasing number of courses and job positions that require these skills.

But now you may be asking yourself: if it is so good (and a game changer), why should I be careful about how I use it?

To answer this question I will tell a little story.

Old-fashioned programming

When I started developing my research at the university, I had been away from coding for at least 10 years and barely knew that Python existed. I used to code in Pascal, C, and a little Fortran, which were the main scientific languages used in universities when I graduated (I know, it was a long time ago). I didn’t know about Jupyter either, or about the thousands of different Python packages out there, which can be overwhelming.

Photo by bert b on Unsplash

So I started the way I was used to: I bought two Python books (yes, yes… I still use books) and installed the basic Python interpreter and a good, free IDE. A quick web search pointed me to the PyCharm Community Edition.

As I didn’t have the benefit of the quick visualization provided by Jupyter, I created a pipeline to preprocess all the input data and to test different processing combinations. In the end, it generated all the graphs and outputs I needed for my research. I was forced to write good, easily reproducible code; otherwise, I would not have been able to analyze everything. As I was working with high-resolution satellite images, the amount of data was huge.

It took me some time to develop everything, but once the work was done, I could focus on experimenting with the algorithm in different areas of the globe, with different coverage, etc. In the end, I was happy with the results, and my first scientific paper and public Python package (water detection software for satellite images) were published. You can check them out in this GitHub repository here.

Notebook “Programming”

Having passed my first research “checkpoint”, I was open to learning new tools to improve my skills, so I installed JupyterLab (the newer incarnation of the classic Notebook).

My life changed. I remember thinking at the time, “Why didn’t I try this before?”

Photo by Myriam Jessier on Unsplash

I was astonished by the endless possibilities for testing, documenting, and quickly visualizing everything I was doing. I even tested a recent package that turns the notebook into a (kind of) development environment. This tool, called nbdev, makes it easier to export modules, create packages, and even document the whole thing. The best of both worlds, I thought.

However, after months of work on another topic, having achieved pretty good results with my machine learning research, my supervisor said the words I feared: “Great results! Let’s try it on different sites to validate the results.” Different sites? Validate the results? By tomorrow???

I was not prepared for that. To achieve the initial results, I had run a bunch of different machine learning tests, with different algorithms, different preprocessing normalizations, etc. But the focus was on the results, not on developing a complete processing chain; I was still “experimenting”. So the code was not modularized, it was difficult to reproduce an old experiment, and I could never find the notebook with the correct version of the implementation that actually worked, etc., etc., etc.

So, just reproducing the results for a new location was a real pain. It took a lot of time and made me very inefficient. For the supervisor making the request, that is not easy to understand. He only thinks, “…but you have already developed it, you’ve shown me the results, all I am asking now is to push the same button”. Well… kind of.

The truth is that after some time I had done a lot of different tests, experiments, and coding, that’s true. But I had no modularized code ready for publication or for sharing with other researchers. I had just that… a bunch of disconnected notebooks, with duplicated functions, weird names, etc.

In the end, it seemed that I was not as efficient as before. I hadn’t built anything. I had no software to deliver. And that feeling was awful.

I have already written about the reasons why scientific software is not well designed in this story: 7 Reasons Why Scientific Software are Not Well Designed. And I believe that the indiscriminate use of Jupyter Notebooks by scientist “programmers” will make this problem even worse.

The insight from Kaggle

During the time I was an avid notebook user, I also participated in some Kaggle competitions to improve my skills in Deep Learning (the best way to learn from other DL practitioners, in my opinion). One nice thing they always do after a competition finishes is interview the winners.

So there was this interview with a Russian guy (I don’t remember which competition it was). He was asked about the development environment he used, and he answered: “I don’t use Jupyter notebooks. All I do is through plain old IDEs”. That changed my mind. I was hearing that from the winner of a competition with thousands and thousands of competitors, most of whom were probably still attached to their Jupyter notebooks.

That story made me rethink some misconceptions I had. The truth is that I was less efficient with notebooks than I had been at the start, using PyCharm (or Spyder, or VS Code, or any other IDE).

What I want to point out here is this: because of the freedom notebooks give us, we need to double our commitment to keeping the code clean, reproducible, and organized. And sometimes that is just not feasible.

Image by author.

The solution?

Now, what works best for me in my data science journey is to develop with the IDE and Jupyter at the same time, but for different purposes. I write the functions and classes in the IDE, inside a new package I create, then I use the notebook just to call the package and visualize the results. This way, in the end, I have a “ready to go” package that can be shared with other researchers.
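As a minimal sketch of the IDE side of this split (the package name maskproc and the function below are hypothetical, just for illustration), the module holds the logic that the notebook will later call:

# maskproc/processing.py - written and debugged in the IDE
import numpy as np

def binarize(image: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return a boolean mask that is True where the image exceeds the threshold."""
    return image > threshold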

For this setup to work, we need to pay attention to the following points:

  • Create a new (empty) package and install it in editable mode with pip (the -e option). This way, the source code remains in its original folder structure and you can continue developing it (a minimal package skeleton is sketched after this list).
# run from the project root, where setup.py (or pyproject.toml) lives
cd project_folder
pip install -e .
  • Use the %autoreload extension in the Jupyter Notebook. This lets you update the package in the IDE and check the results in the notebook without restarting the kernel.
# on the very first cell of the notebook
%load_ext autoreload
%autoreload 2
  • Optionally, you can attach your IDE’s debugger to the Jupyter kernel. In PyCharm, this is done from the Run menu (image below).
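For reference, a minimal package skeleton that supports this workflow could look like the sketch below (folder and file names are hypothetical); an editable install only needs a setup.py or pyproject.toml at the project root:

# project_folder/
# ├── setup.py             (or pyproject.toml)
# ├── notebooks/           <- analysis notebooks live here
# └── maskproc/
#     ├── __init__.py
#     └── processing.py    <- functions and classes written in the IDE

# setup.py - the bare minimum for "pip install -e ." to work
from setuptools import setup, find_packages

setup(
    name="maskproc",
    version="0.0.1",
    packages=find_packages(),
    install_requires=["numpy"],
)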

I am currently working on a new mask processor, and this is an example of what my setup looks like now. I have all the benefits of the IDE (code completion, argument checking, etc.), the debugger runs normally, and, in the end, the package is “ready” for deployment. Another advantage of using the Jupyter Notebook just for displaying the results is that it can serve as the user manual for the new package. A sketch of the notebook side of this setup follows.
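Here is a minimal sketch of the notebook side, again using the hypothetical maskproc package from above: the notebook only loads data, calls the package, and displays the result.

# first cell: enable hot-reloading so edits made in the IDE are picked up
%load_ext autoreload
%autoreload 2

# following cells: call the package and visualize the output
import numpy as np
import matplotlib.pyplot as plt
from maskproc.processing import binarize

image = np.random.rand(256, 256)       # stand-in for a real satellite band
mask = binarize(image, threshold=0.6)  # the logic lives in the package, not in the notebook

plt.imshow(mask, cmap="gray")
plt.title("Water mask (toy example)")
plt.show()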

Conclusion

Just to be clear: I am not advocating against the use of Jupyter Notebooks or any other interactive environment, such as R or MATLAB. I understand their advantages, especially for research and data science work, where the focus is on data analysis and experimentation instead of code production.

However, we must keep in mind what the expectations are. Normally, even the simplest data analysis we do should be reproducible and easily shareable with colleagues.

If we use notebooks just to take advantage of the many packages that already exist in the community and to display the results, that’s fine. But for a new piece of code, a new kind of processing, or even a simple automation of an existing process, they can be counterproductive in the end.

And you? What is the best environment setup for you as a data scientist? Leave your comments and insights.

See you in the next story.

Stay Connected

If you liked this article and want to keep reading and learning from these and other stories without limits, consider becoming a Medium member. You can also check out my portfolio at https://cordmaur.carrd.co/.

--

Ph.D. Geospatial Data Scientist and water specialist at Brazilian National Water and Sanitation Agency. To get in touch: https://www.linkedin.com/in/cordmaur/