This is the second article in this series. For the list of articles in the series, please see the Previous Articles section.
Introduction
‘Code is read more often than it is written’ is a famous saying among software engineers. Irrespective of the objective (Data Science or not), the language (R, Python, or Scala), or the nature of the project, a clean code style makes the developer’s life easier. I personally believe in the CRUM philosophy: code should be easy to Collaborate on, Read, Understand, and Maintain. Turning your team into a clean code army is a challenging task, and this is where coding standards become important. A well-defined, customized coding standard followed by a team of engineers and data scientists helps ensure that we have maintainable software artifacts. But in a world that is a puddle of notebooks and .Rmd files, where is the space for coding standards? This article discusses coding standards and how to build your own.
Coding Standards
Coding standards are agreed-upon guides of best practices that a group of developers has decided to follow. The agreed guide tells us how to write code so that the project code is consistent across developers. Mature software development and product teams create and maintain such a standard as a reference. With the rise of multi-technology and multi-language development, it has become even more of a requirement, since people switch between languages. Data Scientists use the exploratory (visual) and development support in notebooks and similar technologies. In this world, the focus is more on the ‘model’ than on the code until it reaches deployment. In this article, we will examine the case of Python and R.
Python and PEP8
Pythonistas, in general, love to follow the PEP-8 coding standard. When Guido van Rossum created Python, readability was one of the fundamental design principles, and Python code is generally considered readable. But the language’s simplicity and powerful ‘batteries included’ library can still make even simple code hard to read. PEP-8 exists to improve the readability of Python code. Most of the time, experienced Python developers craft a custom Python guide for their teams. Key focus areas generally include:
· Naming Conventions
· Code Layout
· Indentation
· Comments
· Expressions and Statements
· General Programming recommendations
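As a small illustration of the first few focus areas, the sketch below applies PEP-8 naming, layout, indentation, and comment conventions to a made-up helper. The names `MAX_MISSING_RATIO`, `missing_ratio`, and `is_usable` are hypothetical and exist only for this example.

```python
"""Illustrative PEP-8 layout; every name in this module is hypothetical."""

# Constants use UPPER_SNAKE_CASE and live at module level.
MAX_MISSING_RATIO = 0.2


def missing_ratio(values):
    """Return the fraction of entries in ``values`` that are None."""
    if not values:
        return 0.0
    # Four-space indentation; comments explain intent, not syntax.
    missing_count = sum(1 for value in values if value is None)
    return missing_count / len(values)


def is_usable(values):
    """Return True when the missing ratio is within the allowed limit."""
    return missing_ratio(values) <= MAX_MISSING_RATIO
```

Functions and variables use lower_snake_case, top-level definitions are separated by two blank lines, and each function carries a one-line docstring.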
These focus areas are a starting point. There are a few more cases to handle for Data Science projects and software projects in general. The majority of Python-based projects now use Python 3.x. Type hinting is one of the critical features in Python 3.x, and it is advisable to make it part of the coding guidelines. One of the typical patterns (and anti-patterns) in Data Science projects is the use and abuse of lambda functions. Most of the time, lambdas make debugging in production harder. Guidelines on the use of lambdas in DataFrame manipulations should be enforced for the sake of traceability and debugging.
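To make the point about type hints and lambdas concrete, here is a minimal sketch using plain Python lists; the function `normalize_city` and the sample data are hypothetical. The same named function could be passed to pandas’ `Series.map` or `DataFrame.apply` in place of a lambda.

```python
from typing import Optional


def normalize_city(raw_name: Optional[str]) -> str:
    """Return a trimmed, title-cased city name, or 'Unknown' if missing."""
    if raw_name is None or not raw_name.strip():
        return "Unknown"
    return raw_name.strip().title()


# For comparison only: the same logic as an anonymous function.
# PEP 8 itself advises against binding a lambda to a name like this.
normalize_city_lambda = (
    lambda s: s.strip().title() if s and s.strip() else "Unknown"
)

raw_cities = ["  new york", "LONDON ", None, ""]
cleaned = [normalize_city(name) for name in raw_cities]
print(cleaned)

# A named function identifies itself in tracebacks and logs;
# a lambda only ever reports "<lambda>".
print(normalize_city.__name__, normalize_city_lambda.__name__)
```

The type hints document what the function accepts and returns, and the function name is what appears in a stack trace when something fails in production.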
Framework-driven patterns, such as those in Pandas and other frameworks, need attention. One example is specifying SQL data types in the to_sql API. Most of the time, a long dictionary is typed directly inside the to_sql call. For simplicity, this dictionary can be held in a variable; moreover, specifying the data types is a one-time requirement.
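The to_sql pattern might look like the sketch below. It assumes an in-memory SQLite database and a hypothetical `measurements` table; the constant `MEASUREMENT_SQL_TYPES` and the function `write_measurements` are invented for illustration. With a plain sqlite3 connection, pandas expects the type names as strings; with a SQLAlchemy engine, the dictionary values would be SQLAlchemy types instead.

```python
import sqlite3

import pandas as pd

# Hypothetical schema, declared once and reused by every writer.
MEASUREMENT_SQL_TYPES = {
    "sensor_id": "TEXT",
    "reading": "REAL",
    "sample_count": "INTEGER",
}


def write_measurements(
    frame: pd.DataFrame, connection: sqlite3.Connection
) -> None:
    """Write ``frame`` to the (hypothetical) ``measurements`` table."""
    frame.to_sql(
        "measurements",
        connection,
        if_exists="replace",
        index=False,
        dtype=MEASUREMENT_SQL_TYPES,
    )


connection = sqlite3.connect(":memory:")
frame = pd.DataFrame(
    {
        "sensor_id": ["a1", "b2"],
        "reading": [0.5, 1.25],
        "sample_count": [10, 12],
    }
)
write_measurements(frame, connection)

# Read back the declared column types to confirm the schema was applied.
column_types = {
    row[1]: row[2]
    for row in connection.execute("PRAGMA table_info(measurements)")
}
print(column_types)
```

Keeping the type mapping in one named variable means the to_sql call stays short and the schema is reviewed in a single place.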
A good starting point to craft your team’s Python coding standard is Google’s Python coding guide [1]. Another useful reference is ‘How to Write Beautiful Python Code With PEP 8’ by RealPython [2].
R and Code Style Guide
Before Data Science became the ‘sexiest job’ title, R was the favorite programming language of Statisticians, Data Miners, and ML researchers. There was significant adoption of R in the enterprise before the winter. Because much of the code was written by non-software professionals, coding standards were not widely enforced. Still, when enterprise-wide adoption started, some standards were in place. Three of them are the Tidyverse style guide [3], Hadley Wickham’s guide [4], and Google’s R style guide [5]. If we rely heavily on RShiny applications, it is better to consult the UI/UX team before finalizing the standard document.
IDE and Notebooks
In software engineering, an IDE plugin often serves as a virtual assistant for maintaining coding standards. But a notebook is not the same as an IDE environment. When we onboard Data Scientists and ML engineers, it is advisable to provide an orientation on the coding standards. We should include team-specific best practices as well as reusable components.
Most of the time, notebooks are filled with procedural code (no functions, classes, etc.). The code may be repeated several times; exploratory data analysis is a typical case. It is better to create reusable functions. This eliminates copy-paste errors and hair-pulling in a highly iterative model-building environment. In the long run, we may end up with a good collection or library to use across the enterprise.
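As a rough sketch of what such a reusable function could look like, the example below replaces copy-pasted summary cells with one helper. The function name `summarize_numeric` and the sample columns are hypothetical, and it uses only the standard library so that it runs anywhere.

```python
from statistics import mean, median
from typing import Dict, List, Optional


def summarize_numeric(values: List[Optional[float]]) -> Dict[str, float]:
    """Summarize one numeric column, ignoring missing (None) entries."""
    present = [value for value in values if value is not None]
    if not present:
        raise ValueError("column has no non-missing values")
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
    }


# Hypothetical columns; in a notebook these would come from the dataset.
ages = [34, 29, None, 41, 29]
incomes = [52000.0, None, None, 61000.0, 58000.0]
for name, column in {"age": ages, "income": incomes}.items():
    print(name, summarize_numeric(column))
```

Once a helper like this is stable, it can move out of the notebook into a shared module, which is the first step toward the enterprise-wide library mentioned above.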
Tools
There are some excellent tools to help in formatting and checking the coding standards. Flake8 is my favorite tool, and I use it along with VSCode. When coding with R and RStudio, I prefer to install ‘lintr’; along with this, I make sure that the editor is configured for static code analysis.
During project code review, depending on the team’s experience, I set an acceptable threshold for the lint score. So it is not always 10! If you are starting to enforce the standards, it is better to start from a point that is attainable for everyone and gradually raise the bar of expectation.
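One way to express a gradually rising bar is sketched below. It assumes a linter that reports a score out of 10 (Pylint does; Flake8 reports individual violations instead), and the starting score, step, and target are made-up numbers that each team would choose for itself.

```python
def required_lint_score(
    sprint_number: int,
    starting_score: float = 6.0,
    step_per_sprint: float = 0.5,
    target_score: float = 9.0,
) -> float:
    """Return the minimum lint score accepted in the given sprint."""
    if sprint_number < 0:
        raise ValueError("sprint_number must be non-negative")
    # Raise the bar a little each sprint, but never beyond the target.
    raised = starting_score + step_per_sprint * sprint_number
    return min(raised, target_score)


def passes_review(lint_score: float, sprint_number: int) -> bool:
    """Return True when the score meets that sprint's threshold."""
    return lint_score >= required_lint_score(sprint_number)
```

With these hypothetical defaults, a score of 7.2 is acceptable in sprint 2 (threshold 7.0) but not in sprint 4 (threshold 8.0).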
Next Steps
So far, we have discussed coding standards and pointers for getting started with them in Data Science and Machine Learning projects. Practical implementation in a diverse, skilled Machine Learning/Data Science team is not easy. One may have to overcome the hurdle of resistance by re-educating very patiently. In the forthcoming article, we will discuss Test-Driven Development for Model Builders.
Happy Model Building!!!
Previous Articles
[1] Software Engineering for AI/ML/Data Science Projects – https://medium.com/@jaganadhg/software-engineering-for-ai-ml-data-science-projects-bb73e556620e
References
[1] Google Python Style Guide, https://google.github.io/styleguide/pyguide.html
[2] How to Write Beautiful Python Code With PEP 8, https://realpython.com/python-pep8/#why-we-need-pep-8
[3] The Tidyverse Style Guide, https://style.tidyverse.org/syntax.html#control-flow
[4] Advanced R by Hadley Wickham, http://adv-r.had.co.nz/Style.html
[5] Google’s R Style Guide, https://google.github.io/styleguide/Rguide.html
Originally published at https://www.linkedin.com.