Computational Notebooks and how to set them up

See if this one sounds familiar. It is Sunday morning. You have just been given a huge dataset – let's say vast usage data for the last six months for one of your products. The dataset has one Excel file per day, with each XLSX being about 4MB in size. And you are told you have… four days to present your initial findings to the CEO. You see, it is not just data-scientists who deal with large amounts of data. On occasion, product managers do as well.
How would you go about executing the task? You need three resources: the right people, the right skills and the right tools. Let's talk about tools first.
Method in Madness – Expository Data Analysis
Fact is, data-science projects have a certain cadence, a life-cycle if you will. While other more complex methodologies exist, for now, observe this reasonably simple flow from Harvard’s introductory course on data-science:

Consider the arrows here for a moment. Consider how they go back and forth like a data-enabled game of snakes-and-ladders (or వైకుంఠపాళి, as we used to call it in my childhood). Also consider the final exhortation in the graphic: can we tell a story?
Data-science approaches can sometimes be non-deterministic. You will need to iterate till you get the results you're looking for, and you will need to narrate them quickly. Because of this, there is a premium on rapidly executing basic tasks while being able to explain those steps coherently. I would argue that, in many real-world cases, exploratory data analysis is actually expository data analysis: you don't just explore datasets, you also explain what they make possible – the opportunities, the anomalies and so on. This can be tough, especially if you struggle with version control or don't have the right tools.
Computational Notebooks
Most data-scientists typically use Computational Notebooks for their data exploration and exposition. As this piece in Nature puts it, you need the equivalent of laboratory notebooks for scientific computing. Much like a genetic scientist (for instance) would paste, say, DNA gels alongside lab protocols, data-scientists "paste" data, write code and explanations, and generate graphs and other visualisations to document their computational methods.
This is a paradigm that drives coding, exploration and documentation. Using computational notebooks, data-scientists execute code, see what happens, modify and repeat iteratively, whilst documenting their thoughts, their data and their conversations with each other (and their product managers!). This builds more powerful connections within teams and between topics, theories, data and results.
Tools
Computational notebooks can be those in RStudio, for those who prefer to use R. Some teams also use Databricks workspaces, which leverage cloud clusters and can speak directly to data-lakes in SQL, among other languages.
The most popular computational notebook, though, is Jupyter, whose usage in GitHub repos skyrocketed from 200,000 notebooks in 2015 to 2.5 million in 2018 (source). At this point, it is fair to say that Jupyter is the de facto standard tool for data exploration. Jupyter notebooks integrate exposition with computation: they have article-like features such as formatting for headings, bulleted lists and such, while also letting you embed code within text (or vice-versa, if you will). The clearest advantage, though, is that the execution of the code (the "kernel") is separate from the code itself. So you could type your code in a web-browser, but link it to a kernel that executes on a super-computer cluster somewhere.
Problems with Jupyter
Jupyter notebooks may be the most popular computational notebook since sliced bread, but they have their detractors. The most prominent of them is Joel Grus, who presented 144 slides to this effect at the JupyterCon in 2018.
A summary of his talk:
- State: For notebooks to work well, you must execute computational cells in order. If you don't, things break. This is not immediately obvious to users. (A short sketch after this list illustrates the problem.)
- Not Modular Code: Jupyter notebooks don’t force users to write modular code.
- Not writing unit tests: Most data-scientists skip writing unit-tests or otherwise ignore principles of test-driven development. Jupyter notebooks make it easy to keep skipping them.
- No IntelliSense/ AutoComplete: Most well-established software tools have features that help you write code, through what used to be called IntelliSense. Jupyter notebooks don't have that.
- Incompatible kernels: You could easily write a notebook against one particular kernel, but execute it against another. This can cause significant confusion.
- Reading other notebooks: Reading notebooks made by others is often a pain, as you aren’t sure what version of the libraries they have used.
And to these, I will add one more:
- Jupyter kernels are disconnected from your shell: What this means is that the libraries you installed from your command line may not be the libraries the running Jupyter kernel is actually using. At its worst, you may have to shut down your local Jupyter server before installing the right version of the libraries.
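To make the point about state concrete, here is a minimal, hypothetical sketch; the cells, numbers and variable names are illustrative (mine, not from Grus' talk):

# Cell 1: load the data
daily_usage = [4, 8, 15, 16, 23, 42]

# Cell 2: apply a one-off 10% correction factor (this overwrites the list in place)
daily_usage = [x * 1.1 for x in daily_usage]

# Cell 3: report the average
print(sum(daily_usage) / len(daily_usage))

Run the cells in order 1-2-3 and you get one answer. Accidentally re-run Cell 2 before Cell 3 and you silently get another, because the kernel's hidden state no longer matches what the notebook shows on screen.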
In a nutshell, there's a tendency for many Jupyter notebook users to disregard well-established principles of software engineering, such as modular design, state management or testing. As a result, data-science teams find it harder to quickly reproduce or replicate results.
The Solution
As a software architect in a past life, I couldn’t help but nod in agreement with Grus’ criticisms. At the same time, as a product manager, I see the value of Expository Data Analysis, the need to rapidly iterate and explain results without always relying on data-scientists. Or even if I do, I want to be able to read their work quickly.
Luckily though, there is a solution.
Step 0: Install conda
If you have already dabbled in Python, you may want to skip this one. The general approach is to install conda for most Python-related tasks. Conda's regular installation installs everything – the latest Python libraries, the Jupyter Notebook server, VS Code, Spyder and lots more – in one single setup. There are nuances in there: between 32-bit and 64-bit conda, between Python 2.7 and Python 3.x, and even between Anaconda and Miniconda. This article (among others) goes into depth. Make sure you have installed Visual Studio Code along the way.
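Once the installer finishes, a quick (and entirely optional) sanity check from the terminal confirms everything is reachable:
❯ conda --version
#Prints conda's version, confirming it is on your PATH
❯ conda info --envs
#Lists the environments conda knows about (just "base" on a fresh install)
❯ code --version
#Confirms the Visual Studio Code command-line launcher is available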

But wait. Don’t launch Jupyter notebook yet! Look out for step 1.
Step 1: Create an environment
Depending on where you are, here's where it starts to get heavy. This has been called a very challenging topic, but the right way to crunch data in Python is by setting up a virtual environment. In fact, let me emphasise that: all good Python code-bases must must MUST have a virtual environment set up. There are many who ignore this at their own peril.
Why? Python is all about using multiple libraries simultaneously. This brings with it some very specific challenges:
- Knowing what libraries to use: Even a simple analysis typically requires multiple libraries. Without an exhaustive list, you will miss out on something and the code will break.
- Knowing what versions of libraries to use: The pace of development in the Python world is rapid indeed, so the latest version of a library may well not behave like the version you originally used.
This is particularly so when you're sharing your notebooks with others. These are the battle-scars my team and I have developed: calluses from opening a Jupyter notebook and realising that the otherwise well-crafted code-snippets the team had so painstakingly developed over the past many months don't work on your laptop. And it is already Sunday afternoon.
The right response is (no, not to panic, but) to create virtual environments from the start. Before you start a Jupyter notebook, you should create a folder for the project and then create a virtual environment within it.
Here’s how you do it:
- (Highly recommended, but feel free to skip) If you are on Windows,
- Install the all-new Windows Terminal via Windows Store.
- Add Anaconda PowerShell to Windows Terminal.
- (Bonus: prettify it using Powerline, Nerd Fonts, Cascadia Code etc. )
- If you are not on Windows, or have skipped the step above, open up the Anaconda Prompt. Otherwise, open up the Anaconda PowerShell profile in Windows Terminal. Then navigate to where you want your project folder to live.
I would strongly suggest creating the environment in a specific folder location. If RainInStraits is your intended folder name (well, it is for me), you'd do the following (needless to add, you don't have to type the comments starting with #):
❯ mkdir RainInStraits
#This generates the folder
❯ cd RainInStraits
#Navigate to the newly created folder
❯ conda create --prefix ./envs jupyter matplotlib numpy
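#Create a virtual environment in the ./envs sub-folder and install jupyter, matplotlib and numpy into it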
The last command does something very interesting: it creates an environment as a sub-folder and installs all the libraries I have listed into that environment. This is good. Now you are ready to start your Jupyter notebook.
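One optional aside, and a small assumption about your workflow on my part: if you also want to use this environment directly from the terminal (say, to install more packages into it later, as I do further below), you can activate it by its path:
❯ conda activate ./envs
#Activates the environment living in the ./envs sub-folder
This is not strictly required; VS Code will let you pick the environment by its path in the next step anyway.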
Step 2: Create a Jupyter notebook in… Visual Studio Code
Yes, you read that right. You want to fire up Visual Studio Code first (you did install it back in Step 0, didn't you?).
- Open up the folder you have created earlier:

- Then, hit Ctrl + Shift + P (or Cmd + Shift + P on a Mac). This will open up the Command Palette. Select Jupyter: Create New Blank Notebook

- Almost there! Click on the kernel selector on the right side of the screen:

- Select the virtual environment you have just created. You will be able to select it based on its path.

- As good hygiene, do the same for the interpreter at the bottom left of your screen, and again, select the right environment:

- You are set!
Step 3: Using Jupyter notebook for Expository Data Analysis
This is where, perhaps, data-scientists may violently disagree with me. I believe the key thing about a Jupyter notebook is that it is a notebook – it is a piece of text and should be treated as such. In other words, it has a heading, an executive summary, explanations and verbal content that should make it easy for non-data-scientists and non-coders to read.
Jupyter notebooks have the ability to embed Markdown text in cells, but where they really shine is when you make the switch from text to graphs. That the graphs have to be produced via code is incidental – something we shouldn't forget.
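To illustrate the kind of mix I mean, here is a generic sketch (purely illustrative data, not the actual notebook shown below), using only the numpy and matplotlib we installed into the environment earlier. A Markdown cell carries the heading and the narrative; the code cell only produces the supporting picture:

import numpy as np
import matplotlib.pyplot as plt

# Illustrative stand-in for six months of daily usage counts
rng = np.random.default_rng(42)
daily_usage = rng.poisson(lam=120, size=180)

plt.plot(daily_usage)
plt.title("Daily usage, last six months (illustrative data)")
plt.xlabel("Day")
plt.ylabel("Sessions")
plt.show()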
Here’s an example of one of my notebooks, where I am exploring volatility of returns and such. Lots of notes – some math as well – and finally some Python code.

I’ve selected this virtual environment and executed that piece of code to get:

D’oh! I had forgotten to install the pandas package in the virtual environment. In the "regular" Jupyter setup, I’d have to stop the local server, install the package and then restart the server. But with virtual environments and Visual Studio Code, I switch over to the open Windows Terminal window, install pandas and switch back.
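If your environment lives in ./envs as above, that install is a single command along these lines (plain conda install pandas also works if you have activated the environment in that terminal):
❯ conda install --prefix ./envs pandas
#Installs pandas into this project's environment, leaving every other environment untouched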

The code executes perfectly!

Features Galore

- But wait, I didn’t mention the best part. You now have… AUTO-COMPLETE.
- When you generate graphs even via Matplotlib, you can SCROLL IN and OUT EASILY.
- You can explore data productively with the right extensions:
1. Excel Viewer: Read CSVs and Excel files in VS Code like you would in Excel:

2. SandDance: Visualisations on your CSV data:

- One more thing. (There is always a one-more-thing, isn’t there?) DEBUGGING.

You can look at all the active variables in the kernel in a single pop-up. You can not only check individual variables, but also expand data-frames: filter, sort, play with them. WUT.

And yes, you can hit F10 in a single cell and execute the code line by line.

Nifty! And this, young padawan, THIS is how we roll.
Exporting Environments
Now that you've set up your environment and tooling, you will want to share your work with others. This is where having dedicated virtual environments really shines. Before you share your *.ipynb file, navigate back to your open Windows Terminal window, make sure you are in the project folder you created earlier (the one containing the envs sub-folder), and type the following (feel free to change the name of the file from environment.yml to something more suitable):
❯ conda env export --prefix ./envs --from-history > environment.yml
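For reference, with the packages we installed above, the --from-history export would look roughly like this (the exact name, channels and the machine-specific prefix line at the end will depend on your setup):
name: envs
channels:
  - defaults
dependencies:
  - jupyter
  - matplotlib
  - numpy
  - pandas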
When you share your *.ipynb file, be sure to share this yml file as well. That way, your recipient will be able to restore your environment at their end as well. They would need to enter the following at their command-line:
❯ conda env create -f environment.yml
Note: In theory, you could simply run conda env export > environment.yml. However, that will list every library that was installed as a dependency of the ones you chose. Those dependencies can be OS-specific and may not play well across all platforms. Adding --from-history restricts the list to only the libraries you installed yourself, which lets conda choose the dependencies appropriate to your recipient's platform.
So there you have it, folks. With this set-up, you can:
- Monitor state: Line-by-line debugging and monitoring variables, yo.
- IntelliSense/ AutoComplete: It just works(tm).
- "Known" kernels: You have clarity on what kernels to use and how.
- Reading other notebooks: With the environment.yml shared alongside the notebook, your readers know exactly which library versions you used.
- Dependency Control: Have the right set of libraries, at the right versions, at all times – including when you share your notebooks with others.
To misquote Edsger Dijkstra, Jupyter notebooks may be considered harmful if you don't follow well-established principles of software engineering. But with the right tools, and the right discipline, they can actually aid your exploratory data analysis.
In the next few posts, I will further explain an even more productive way of sharing notebooks involving Docker containers, and will briefly touch upon the expository aspects of data analysis. For now, bask in the full glory that is a well-rounded toolkit to perform Expository Data Analysis.