The Effect of Naming in Data Science Code
Even though there are tools that let you practice data science without coding, they are far from sufficient. Data scientists will keep writing and reading code, and reading code with poor readability is a horrible experience. This post focuses on the importance of naming entities (e.g. variables, functions) and on how easily good naming improves the quality of your code.
“There will be code”
“… I also expect that the number of domain-specific languages will continue to grow. This will be a good thing. But it will not eliminate code.”
wrote Robert C. Martin at the start of his book Clean Code: A Handbook of Agile Software Craftsmanship, in the opening section of the first chapter, titled “There Will Be Code”. We, data scientists, write and read code. Even though there are tools that help, we will still be writing and reading code. It is one of our core practices. This is how we analyze data, train models, predict outcomes, and much more. I strongly believe that there is no escape from code for a data scientist.
We can still practice data science without coding, at some level. From the early days until today, there have been graphical tools that allow analyzing data or practicing machine learning. One example of such tools is WEKA, a bundle of machine learning tools with a graphical user interface. According to Wikipedia, its development started in 1993. It allows users to conduct machine learning experiments and more without writing a single line of code. Then why do I insist that writing and reading code is fundamental for data scientists? Because there will be custom operations.
Graphical tools offer only so many operations. Unless you do the same tasks every day, there will come a time when the tools lack the operation you need. This can be an analysis, a machine learning model, or some other operation. You need control over your operations, and the ability to customize and extend them, at some level. The level depends on the task, libraries, or tools. As many problems require new or custom approaches, you will outgrow those graphical tools quite soon.
We seldom work alone. Organizations often have data science teams rather than a single data scientist. Even when that is not the case, data scientists work with people from other disciplines. This collaboration requires good code quality.
Moreover, if you are working for a company, your colleagues from other teams will need your code. To embed your models into a backend, a frontend, or another system, you will need code. A model that cannot be deployed or integrated would be useless for your company.
If you agree that there will be code in data science, let us talk about how to write good code for data science. Writing good code is a hard task. There are well-written texts explaining why it is necessary and how to achieve it. In this post, my goal is to focus on a tiny bit of that: the bit that requires no training or education, but will improve your code quality significantly. That bit is naming, and it will improve the readability of your code.
One Easy Trick: Renaming
Reading a piece of code that has poor readability is a horrible experience. Let us look at the simple example below:
import pandas as pd

df = pd.DataFrame({"f1": [42, 12, 5, 8, 15, 65],
                   "f2": [172, 155, 110, 120, 158, 168]})
df2 = df[df.f1 >= 18]
out = df2.f2.mean()
# out: 170.0
Try to guess what this code does. The author of this code knew the goal of each line, and of the code as a whole, while she was writing it. As a reader, however, you see a snippet that you need to decipher. Because the author did not pay attention to readability, you will spend more time and energy trying to understand this code. Furthermore, you will be more prone to making mistakes. Let us dissect this example to see why it is bad:
import pandas as pd
I know importing pandas as pd is very standard these days. All the data scientists who work with pandas will understand this shorthand. It is not the biggest problem in the code, but I believe that it can be improved. In addition, short (e.g. 2-character) names are troublesome for autocompletion (e.g. pd vs pdb).
df = ...
Again, probably because of the pandas tutorials, it is a widely applied practice to name a DataFrame df. However, this name hides what the data is in this context. What kind of data is it?
"f1":... "f2":...
These are the columns of our data frame. However, their names are not informative. What kind of data do those columns hold? Why are they numbered 1 and 2? Does the numbering have a purpose (e.g. first and second of something), or not?
df[df.f1 >= 18]
What is the meaning of 18 in this line? Is it some kind of magic number? Why are we keeping values greater than or equal to 18, and why are we filtering on the f1 column?
out = df2.f2.mean()
The same problem all over again. What is the significance of column f2? Why are we taking its mean?
Now, instead of explaining the purpose of the code, let me rewrite it in a more readable way. I will just rename the variables and keep the rest of the code the same.
import pandas

physical_data = \
    pandas.DataFrame({"age": [42, 12, 5, 8, 15, 65],
                      "height": [172, 155, 110, 120, 158, 168]})
adult_physical_data = physical_data[physical_data.age >= 18]
mean_height_of_adults = adult_physical_data.height.mean()
# mean_height_of_adults: 170.0
With renaming alone, you understand the goal of the snippet and the operation in each line at a glance. I believe that an explanation of the code is no longer necessary, as it is very clear.
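The rename-only rewrite still leaves one question from the dissection open: the literal 18. One possible next step is to give that number a name as well. The sketch below assumes that 18 stands for the minimum adult age; the constant MINIMUM_ADULT_AGE and the function mean_height_of_adults are illustrative names that do not appear in the original snippet.

```python
import pandas

# Assumption: 18 is the minimum age at which a person counts as an adult.
MINIMUM_ADULT_AGE = 18


def mean_height_of_adults(physical_data):
    """Return the mean height of the rows whose age is at least MINIMUM_ADULT_AGE."""
    adult_physical_data = physical_data[physical_data["age"] >= MINIMUM_ADULT_AGE]
    return adult_physical_data["height"].mean()


physical_data = pandas.DataFrame({"age": [42, 12, 5, 8, 15, 65],
                                  "height": [172, 155, 110, 120, 158, 168]})
print(mean_height_of_adults(physical_data))  # 170.0
```

With the constant named, a reader no longer has to guess why the filter uses 18, and a different threshold would only need to be changed in one place.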
data = …, But Which Data?
I believe that naming in data science code can be harder than in generic software code. We have fewer concepts that fit into object design, more abstract entities, and diverse collections. As an example, think about working with a table that has lots of columns: for instance, a table with columns about the customer as a person (e.g. name, age), customer behavior (e.g. purchases), geographic information (e.g. address of the purchase), and temporal information (e.g. time of the purchase). How would you name this table?
You can come up with different names for this table, but there are common bad ones. For instance, do not name it dt or df. Even though you will see df everywhere in the pandas documentation, you must understand that those are short code examples that do not belong to a project. Also, do not name it data. Yes, it is data, but would you name a variable that holds an age “integer”? “Data” as a name is very vague. Which data is that? What kind of information does it hold?
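As a sketch of one alternative, the example below builds a tiny version of such a table and names it after what each row represents. All column names and values are made up for illustration; other names could describe the same table equally well.

```python
import pandas

# Hypothetical data: one row per purchase made by a customer.
customer_purchases = pandas.DataFrame({
    "customer_name": ["Ada", "Ada", "Grace"],
    "customer_age": [36, 36, 45],
    "purchase_amount": [20.0, 35.5, 12.0],
    "purchase_city": ["London", "Paris", "New York"],
    "purchase_time": pandas.to_datetime(
        ["2021-03-01 10:15", "2021-03-04 18:30", "2021-03-02 09:00"]),
})

# Derived tables get names that say how they differ from the original.
total_spending_per_customer = (
    customer_purchases.groupby("customer_name")["purchase_amount"].sum()
)
purchases_in_paris = customer_purchases[
    customer_purchases["purchase_city"] == "Paris"
]
```

Here each name answers the question "which data?": customer_purchases says what a row is, and the derived names say which aggregation or filter was applied.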
Do not be afraid to use long names. Longer names usually carry more information about the entity, and they are not a burden. There are many decent IDEs with auto-complete features, like PyCharm, so you do not have to type the full name.
When coding, think about the next person who will read your code. Will she understand the main goal of the script and how each line serves that goal? More specifically for naming: will she quickly grasp the meaning of an entity from its name alone?
No Excuse for Bad Code
Using better names is not only for your teammates, and not only for yourself: both benefit from this habit. From your teammate’s perspective, she will read the code you wrote more easily and quickly. From your own perspective, you will write code more easily, because you carry less cognitive load when you do not have to memorize the meaning of your entities. As a result, your team and organization will benefit as well.
The habit of naming better might seem hard to build at first. You may not want to spend your time finding better names. However, this is a habit that pays back. You should practice it even if the code you write is a prototype or part of a tiny project. Such code may turn into a bigger project without you noticing. There are many cases where projects that were supposed to be very small kept being developed for years, and where projects designed as “fire and forget” ended up being very important for the organization. This is why even your shortest code should have good naming.
If you have never thought about better naming, I hope that after reading this you will try naming your entities better and see how it improves the quality of your code.