It’s time to structure your data science project

A simple template to structure your data science project

Badr MOUFAD
Towards Data Science


[Illustration by author]

Why are notebooks so popular within the data science community?

When you dive into the field of data science, you quickly notice that notebooks are the commonly used tool to work on and share data science projects.

To some extent, it is a well-justified and well-founded choice. Notebooks bring together code, graphics, and text in a single fast, interactive “ecosystem”.

Let's take a step back.

To run a single line of Python code, there is nothing faster than executing it directly in the console. However, this becomes inconvenient the moment you have to run multiple lines of code. Just imagine how awkward and impractical it would be to define a function in a console.

On the other hand, Python scripts are a convenient way to create and execute long code. Yet, they are also impractical in the context of data science because they simply do not scale well. Indeed, whenever you want to write a standalone piece of code, you have to create a Python file. Take a moment to think about the number of notebook cells created in one data science project: it will give you an idea of how many Python files you would create and how challenging it would be to ensure their order of execution.

In short, notebooks let you run Python code as quickly as in a console while giving you the ability to write longer snippets of code. Consequently, they extend the console-based behavior and add interactivity to the mix.

Personally, I find that these are the core features that enabled Notebooks to gain such tremendous popularity among the data science community.

The “One notebook structure” works well until …

As a data science student, I used to submit my assignments and mini data science projects in the form of a notebook. Besides the advantages I mentioned above, it was convenient to structure my work in a single notebook since I was able to embed graphics and insert Markdown cells to write text/equations, and thereby elaborate on my reasoning.

It was not until I got involved in a research internship that I realized the limits of this “one notebook structure”. In particular, I noticed that this approach doesn't scale well as projects get bigger and bigger.

The moment I started fetching, cleaning, and exploring the data, the number of cells exploded, and I quickly got overwhelmed by the huge number of variables. Also, when trying new approaches, I noticed that I was constantly writing repetitive code. In the end, to avoid ambiguity between variables and to make sure that some notebook cells did not influence others, I commented out some (many) parts of the code. Honestly, I usually ended up shutting down the notebook kernel and restarting the environment.

In short, it was impractical to code an entire project in one single notebook where one fetches, explores, and cleans the data, then sets up, trains, and evaluates models.

It would be even more of a nightmare if one wanted to spot a bug or decided to change something in a terribly long and extremely chaotic notebook full of commented-out blocks of code.

From this unfortunate experience, I realized that the “one notebook structure” is not the right choice when dealing with big projects.

[meme by author]

Google it before taking action

I believe that data science sits at the interface between software programming and applied mathematics (probability, statistics, optimization, …). When working on data science projects, I often realize that I am coding more than anything else. Cleaning the data requires code; understanding the data requires making visualizations, which in turn requires code; …

And as I coded, I developed the so-called “google it” reflex. So, whenever I forget/doubt something or get stuck on an error, I simply google it.

“Some say a software engineer is just a professional Google searcher” — Fireship, How to “google it” like a senior software engineer.

Personally, it is one of the best lifesaving habits to develop: it spares you from loading your brain with detailed package documentation. It is better to learn how to google things efficiently than to remember them by heart.

Following that, I browsed the internet looking for a solution to my problem, specifically a project structure I could use to organize, and thereby scale up, my work. Obviously, I found what I was seeking: “cookiecutter data-science”, a cookiecutter template that aims to provide a standardized data science project structure.

The Motivation behind a standard project structure

I do admit that my unfortunate experience showed me how desperately I needed a project structure. Yet, after scrutinizing the template description and its GitHub repository, I became even more convinced of the importance of a standard structure and aware of the advantages it offers for both “you and others”.

On one hand, it enables an efficient organization of thoughts and code and hence speeds up the workflow. So, you won’t get lost in the “middle of the road” or stand speechless in front of a notebook wondering “What the hell was I thinking”.

On the other hand, it ensures code shareability and reproducibility. So, others will be able to jump quickly and directly to parts of code they are interested in. And more importantly, they won’t get into trouble while re-executing your code.

At once, I thought I had solved my problem and that all I needed was to adopt this project structure. However, it was not as easy as I thought. Indeed, it was difficult to get familiar with a project structure containing files and folders whose purpose I completely ignored, especially for the beginner I was back then.

In the hope of adapting this project structure to my use case, I spent hours of reflection as well as experimentation to understand it and adjust it to my needs. In the end, I settled on the following structure: the “Simple DS project”.

The “Simple DS project” structure

In this section, I will detail every component of the “Simple DS project” structure and emphasize the purpose of, and the motivation behind, each one of them.

“Simple DS project” structure [Illustration by author]
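For reference, in plain text the generated project boils down to the following folders and files, each detailed below:

data/
notebooks_exploration_cleaning/
notebooks_models/
py_scripts/
README.md
.gitignore
environment.yml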

The “data” folder

A fundamental objective of data science is to derive insights from data. Data is therefore an essential ingredient of any project.

The purpose of this folder is to gather all the project's raw data. It will also serve as a “bucket” where you save your preprocessed data so that you don't have to repeat the same operations every time.

The “notebooks_exploration_cleaning” folder

This folder will contain all notebooks related to exploration-cleaning of the data. But why mix exploration and cleaning in one folder and not split them into separate folders?

In every project I have worked on, I have never been able to explore the data independently from cleaning it (or vice versa). Indeed, it was during the exploration phase that I decided how to treat the data. For instance, it is after visualizing the proportion of missing values in the data columns that I decide how to handle them (e.g. remove them, or replace them with the median, …).

Finally, after exploring and cleaning your data, make sure to save it in the “data” folder so that you can use it later when building models.
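To make this concrete, here is a minimal sketch of what a cleaning cell in one of these notebooks might look like, assuming a hypothetical raw_data.csv file in the “data” folder and a median-imputation choice (file names and paths are illustrative):

import pandas as pd

# load the raw data (path relative to the notebooks folder; file name is illustrative)
df = pd.read_csv("../data/raw_data.csv")

# proportion of missing values per column, to decide how to handle them
print(df.isna().mean().sort_values(ascending=False))

# example decision: replace missing numeric values with the column median
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# save the cleaned data back to the "data" folder for the modeling notebooks
df.to_csv("../data/cleaned_data.csv", index=False)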

The “notebooks_models” folder

As its name suggests, this folder is dedicated to the notebooks where you build, train, and evaluate your models. I highly recommend putting each model in a separate notebook.

A typical structure of this folder would be model_1.ipynb, model_2.ipynb, … Or, if you want to give meaningful names to your files, there is nothing better than naming them after the model, for example linear_regression.ipynb, lasso.ipynb, ridge.ipynb, elasticNet.ipynb.
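For instance, the body of linear_regression.ipynb could be as short as the following sketch (the cleaned_data.csv file and the “target” column are illustrative placeholders):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# load the data cleaned in the exploration-cleaning notebooks
df = pd.read_csv("../data/cleaned_data.csv")
X, y = df.drop(columns="target"), df["target"]

# hold out a test set, fit the model, then evaluate it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))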

The “py_scripts” folder

Let’s imagine the following (recurrent) situation,

In exploration_1.ipynb you wrote some code to visualize a data column. Later, while trying another approach, you needed the same code in exploration_2.ipynb to make a similar visualization, so you copy-pasted it. In exploration_3.ipynb, you needed this visualization yet again, so you copy-pasted it once more, …

One major drawback of working simultaneously with multiple notebooks is writing repetitive code. The main purpose of the “py_scripts” folder is to overcome this problem. Indeed, it serves as a Python package where you put all the repetitive code. So, whenever you notice that you are copy-pasting a piece of code multiple times, just rewrite it as a function, put it into a Python module, and voilà! Whenever you need it, you can import it directly, just as you would import any other Python module.
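As an illustration, the repeated visualization from the scenario above could live in a hypothetical py_scripts/plots.py module:

# py_scripts/plots.py (module and function names are illustrative)
import matplotlib.pyplot as plt

def plot_missing_values(df):
    """Bar plot of the proportion of missing values per column of a DataFrame."""
    df.isna().mean().sort_values(ascending=False).plot.bar()
    plt.ylabel("proportion of missing values")
    plt.show()

Each notebook would then simply do from py_scripts.plots import plot_missing_values (depending on where the notebook runs, you may need an __init__.py file in py_scripts or to add the project root to sys.path).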

The “README.md” file

The README.md is a Markdown file meant to describe your project. It is here that you set the context of the project, mention its purpose, and state the guidelines for reproducing its findings.

When browsing projects, I always glance first at their README before diving deeper into their content. Therefore, always keep in mind that this file is at the forefront of your project.

The “.gitignore” file

When working on a big project, it becomes mandatory to use version control software, especially if the project has more than one contributor. Git combined with GitHub is one such solution that enables easy and efficient management of projects.

Troubles when not using version control [meme by author]

In particular, the “.gitignore” file contains the names of all files and folders that should not be tracked by Git and hence won't be synchronized with your GitHub repository. Typically, it contains the names of cache and build folders. Besides, you may include the “data” folder in it, especially if you are using large datasets that exceed 100 MB.
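As a sketch, a minimal “.gitignore” for this structure could contain entries such as:

__pycache__/
.ipynb_checkpoints/
data/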

Aside from that, if you are not yet familiar with Git and GitHub (and frankly, you should be), consider spending some time learning them. It will certainly mark an inflection point in your programming life.

The “environment.yml” file

The use of third-party packages such as numpy and scikit-learn is omnipresent in data science. So, how can you tell others which dependencies they need to install to run your code?

The “environment.yml” file is meant to answer this question: it lists all the Python packages that your Jupyter kernel needs to run your code and notebooks.
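For reference, the file is a plain YAML listing; a minimal, illustrative one could look like:

name: simple-ds-project
channels:
  - defaults
dependencies:
  - python=3.9
  - numpy
  - pandas
  - scikit-learn
  - matplotlib
  - jupyter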

In addition, this file is easy to create and use. To export your conda environment, just cd to your project directory in the Anaconda prompt and run the following command:

conda env export > environment.yml

By adding this file to your project, you ensure that others won't run into trouble executing your code, and more importantly, you guarantee the reproducibility of your findings. Indeed, others only need to run the following command to replicate your conda project environment,

conda env create -f environment.yml

activate it, and then run the code.

Getting Started with “Simple DS project”

Generate the project structure

It is easy to get started with the “Simple DS project” structure. All you need to do is install cookiecutter, a Python package that creates projects from project templates, by running

pip install cookiecutter

Afterward, execute the command below and follow the prompts to create the project and set its name

cookiecutter https://github.com/Badr-MOUFAD/cookiecutter-simple-DS-project.git

With that done, the “Simple DS project” structure will be generated on your local machine, ready for you to start working…

Note on Integrated Development Environment (IDE)

When working with this project structure, you will notice that you are constantly moving from one notebook to another, or from a notebook to a Python file, which I find a bit restrictive.

IDEs are tools that alleviate this restriction in the sense that they let you juggle easily between the project's directories and files. In fact, you have full visibility of the project through the directory tree shown in the sidebar. In addition, you can modify files while staying in the same environment.

In my case, I use VS Code combined with the Jupyter extension.

Recap and conclusion

“Simple DS project” is a template inspired by “cookiecutter data science”. It provides an entry-level structure to organize your work when dealing with a “bit bigger project”.

Also, it is not a structure to be followed literally. You can adjust and refine it based on your needs. Honestly, I constantly alter it according to the project I am involved in, adding or removing files and folders.

Remember, the way you organize your work keeps evolving as you work on projects. In my case, I noticed that I have recently started to converge more and more toward the initial structure, “cookiecutter data science”. When I rethink what prevented me from adopting it directly, I find that it was not the complexity of the structure itself, but rather my unfamiliarity with working on a project where I manipulate multiple folders and files with different extensions (Python scripts, notebooks, Markdown, …). I trust that the “Simple DS project” template will make this transition smooth for you.

Finally, to view a live version of the template, you can visit its GitHub repository. There, you will also find the source code.

