When working on a data science project, I like to have a few ground rules for how I am going to write its code. With a few simple principles in place, I can be confident that the code I write meets a minimum quality threshold.
In this article, I will show you 5 simple development rules to orient your data science workflow.
The Rules
In a nutshell, the rules are:
- Abstract scripts into functions and classes
- Make sure functions are atomic
- Write unit tests
- Use your favorite text editor in conjunction with Jupyter notebooks
- Make small and frequent commits
Now, let's go through them one by one.
1. Abstract scripts into functions and classes
Say you are working in a Jupyter notebook, figuring out how best to visualize some data. As soon as that code works and you don't think it will need much more debugging, it's time to abstract it! Let's look at an example:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import numpy as np
import pandas as pd

# generate 1,000 points from a standard normal distribution and plot them
synthetic_data = np.random.normal(0, 1, 1000)
plt.plot(synthetic_data, color="green")
plt.title("Plotting Synthetic Data")
plt.xlabel("x axis")
plt.ylabel("y axis")
plt.show()

Here, we plotted some synthetic data. Assuming we are happy with our plot, what we want to do now is abstract this into a function and add it to the code base of our project:
def plotSyntheticDataTimeSeries(data):
    plt.plot(data, color="green")
    plt.title("Plotting Synthetic Data")
    plt.xlabel("x axis")
    plt.ylabel("y axis")
    plt.show()
Great! Now we can keep this in our code base and call it whenever we want to produce this plot:
plotSyntheticDataTimeSeries(synthetic_data)

Simple, and good practice.
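If we find ourselves reusing this kind of plot for other series, we could take the abstraction one step further and turn the hard-coded title and color into parameters. Here is a minimal sketch of that idea (the function name and defaults below are just illustrative):

def plotTimeSeries(data, title="Plotting Synthetic Data", color="green"):
    # same plot as before, but reusable for any series, title, and color
    plt.plot(data, color=color)
    plt.title(title)
    plt.xlabel("x axis")
    plt.ylabel("y axis")
    plt.show()

plotTimeSeries(synthetic_data, title="Synthetic Data, Again")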
If, for whatever reason, we want to load the data and perform a couple of simple transformations on it, we might want a class that takes care of that for us.
Let's look at a simple example using the Big Mart Sales dataset:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import numpy as np
import pandas as pd
class DataPrep:
    def __init__(self, data_path, cols_to_remove=["Item_Fat_Content", "Item_Type", "Outlet_Size"]):
        self.data = pd.read_csv(data_path)
        self.cols_to_remove = cols_to_remove
        self.removeNans()
        self.removeCols()

    def removeNans(self):
        self.data = self.data.dropna()

    def removeCols(self):
        self.data = self.data.drop(self.cols_to_remove, axis=1)
        self.data = self.data.reset_index(drop=True)
data_path = "./BigMartSales.csv"
dp = DataPrep(data_path)
dp.data

And there we have it: the DataPrep class can now be used systematically to perform the transformations we want. This is clearly a toy example and does not cover the entire process of cleaning and preprocessing our dataset.
Here, we are only illustrating how to abstract code into functions and classes that can later be evolved and integrated into our production pipeline.
2. Make sure functions are atomic
In conjunction with the idea of abstracting your pipeline into functions and classes, we should also always aim for atomicity, meaning every function should do one thing. This rule stems from good practices in object-oriented programming and serves as a good guide for avoiding unnecessary complexity that can cost a lot of time down the line.
So, if we have a situation where we load our dataset, transform it and then plot it, we should write a function for each atom of this process.
Let's add a plotting step to the previous example, starting with what happens when we ignore this rule and do everything in a single function:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import numpy as np
import pandas as pd
def badExampleofNonAtomicFunction():
    # loading, cleaning, and plotting all crammed into a single function
    data_path = "./BigMartSales.csv"
    cols_to_remove = ["Item_Fat_Content", "Item_Type", "Outlet_Size"]
    data = pd.read_csv(data_path)
    data = data.dropna()
    data = data.drop(cols_to_remove, axis=1)
    data = data.reset_index(drop=True)
    data[data["Outlet_Identifier"]=="OUT049"]["Item_Outlet_Sales"].plot()
    plt.title("Sales for Outlet: OUT049")
    plt.show()
badExampleofNonAtomicFunction()

Here, a single function loads, cleans, and plots the data, so we keep no clear record of what we are applying to our dataset, and debugging this would be a nightmare.
Let's now look at how to make this code easier to debug. We can start from the class we constructed previously and simply write a plotting function outside that class:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import numpy as np
import pandas as pd
class DataPrep:
    def __init__(self, data_path, cols_to_remove=["Item_Fat_Content", "Item_Type", "Outlet_Size"]):
        self.data = pd.read_csv(data_path)
        self.cols_to_remove = cols_to_remove
        self.removeNans()
        self.removeCols()

    def removeNans(self):
        self.data = self.data.dropna()

    def removeCols(self):
        self.data = self.data.drop(self.cols_to_remove, axis=1)
        self.data = self.data.reset_index(drop=True)

def plotOutletSales(data, outlet_id):
    data[data["Outlet_Identifier"] == outlet_id]["Item_Outlet_Sales"].plot()
    plt.title(f"Sales for Outlet: {outlet_id}")
    plt.show()
data_path = "./BigMartSales.csv"
dp = DataPrep(data_path)
plotOutletSales(dp.data, outlet_id="OUT049")

Now we have something better: each step is a building block that adds a single piece of functionality to our pipeline.
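And because each building block does exactly one thing, reusing it is trivial. Plotting the sales of a different outlet, for example, is just another call (assuming the outlet identifier below exists in your copy of the dataset):

plotOutletSales(dp.data, outlet_id="OUT018")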
3. Write unit tests
Writing unit tests can definitely feel like an annoying step, but for code meant for production it is fundamental: it ensures that our code is robust enough for a real-world environment and helps prevent unnecessary bugs and issues when we deploy our pipeline.
Let's write some code to test the individual methods we wrote for the DataPrep class:
import unittest
import pandas as pd
import numpy as np

def removeNans(data):
    # drop any row that contains missing values
    data = data.dropna()
    return data

def removeCols(data, cols_to_remove):
    # drop the given columns and reset the index
    data = data.drop(cols_to_remove, axis=1)
    data = data.reset_index(drop=True)
    return data

class TestDataPrep(unittest.TestCase):
    def setUp(self):
        # each test starts from a fresh copy of the raw data
        data_path = "./BigMartSales.csv"
        self.data = pd.read_csv(data_path)

    def test_removeNans(self):
        data = removeNans(self.data)
        result = []
        for col in data.columns:
            result.append(data[col].isnull().sum())
        result = np.sum(result)
        self.assertEqual(result, 0)

    def test_removeCols(self):
        cols_to_check = ["Item_Fat_Content", "Item_Type", "Outlet_Size"]
        data = removeCols(self.data, cols_to_check)
        self.assertFalse(any(element in cols_to_check
                             for element in data.columns))

if __name__ == "__main__":
    unittest.main()
# Output
Ran 2 tests in ...s

OK
In this example, we test both functions by writing a class called TestDataPrep that runs checks on their expected results.
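As a side note, if we prefer to run these tests from inside a Jupyter notebook rather than as a standalone script, one option is to call unittest with exit disabled so it does not try to shut down the kernel:

# run the suite from a notebook cell; the first argv element is just a
# placeholder for the script name that unittest expects
unittest.main(argv=["first-arg-is-ignored"], exit=False)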
4. Use your favorite text editor in conjunction with Jupyter notebooks
I am a huge fan of Jeremy Howard, and I definitely recommend his video on using Jupyter notebooks for software development.
Development on data science projects can be hugely improved if we use our favorite text editor in conjunction with Jupyter notebooks, keeping the two in sync with these simple notebook commands:
%load_ext autoreload
%autoreload 2
Now, every time we edit something in our text editor of choice, that function or class is automatically updated in our Jupyter notebook.
By doing this, we can leverage the amazing interactive features of Jupyter notebooks along with the awesome text editing capabilities of our favorite editor (I use Visual Studio Code, but this is just a preference).
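As a concrete sketch of this workflow, imagine we keep our plotting code in a module next to the notebook, say plotting_utils.py (a hypothetical file name). After enabling autoreload, we import from it as usual, and every edit we save in the editor is picked up the next time a cell runs:

# plotting_utils.py, edited in our text editor of choice
import matplotlib.pyplot as plt

def plotOutletSales(data, outlet_id):
    data[data["Outlet_Identifier"] == outlet_id]["Item_Outlet_Sales"].plot()
    plt.title(f"Sales for Outlet: {outlet_id}")
    plt.show()

# in the notebook, after running the autoreload magics above
from plotting_utils import plotOutletSales  # picks up saved edits on the next call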
5. Make small and frequent commits
Just as with functions and development in general, the goal is to build a pipeline that we can debug easily.
For this, one should also use small and frequent commits rather than big ones that would make it much harder to go back and investigate when something goes wrong.
My recommendation would be to set a rule for what makes a commit, and once that threshold is reached we commit and continue development.
To help with this, we can plan our project ahead and foresee the steps we need to perform to finish it, trying to find a clear definition of progress that, when reached, would trigger a commit.
In doing this, we get a method for making deliberate, properly tracked progress in a data science project.
On The Importance of Good Practices
Everyone who works on serious data science projects needs a method that improves over time.
Individual practices can be added or changed depending on the type of project and the goals one has, but if there is one thing I want to leave you with, it is this: define your set of good practices, follow it, and improve it over time!
If you liked this post, follow me on Medium, subscribe to my newsletter, connect with me on Twitter, LinkedIn, Instagram and join Medium! Thanks and see you next time! 🙂