
ADOPTING TECHNIQUES TO BECOME BETTER ENGINEERS AND DEVELOPERS
Prepare for production-level work as a software engineer and data scientist
Becoming a reliable software engineer or data scientist and preparing for production-level coding requires a few techniques:
- Writing clean and modular code
- Code refactoring
- Writing efficient code
- Adding meaningful documentation
- Testing
- Logging
- Code reviews
These are all essential skills to develop and will help when implementing production solutions. Additionally, data scientists often work side-by-side with software engineers, and it is necessary to work well together. This means being familiar with standard practices and being able to collaborate effectively with others on code.
Clean and Modular Code
When data scientists first start coding, they often struggle to write code that is clean and modular, and this can remain true even after they have been coding for years. In industry, however, code may well end up in production. Production code is software running on production servers to handle live users and data of the intended audience, for example, the code behind products and services from Microsoft Office, Google, or Amazon. Ideally, code used in production should meet several criteria to ensure reliability and efficiency before it becomes public. First, the code needs to be clean. Code is clean when it is readable, concise, and simple.
Here is an example in plain English of a sentence that is not clean.
One could notice that your pants have been sullied, due to the pink color of your pants that appears to be similar to the color of a certain kind of juice.
This sentence is redundant and convoluted. Just reading it is overwhelming. It can be rewritten as:
It looks like you spilled strawberry juice on your pants.
That sentence accomplishes the same thing, but it is much more concise and clear.
Clean code is a characteristic of production-quality code that is crucial for collaboration and maintainability in software development. Writing clean code is very important in an industry setting because a team is continually iterating over its work, and clean code makes it much easier for others to understand and reuse. In addition to being clean, the code should also be modular, meaning it is logically broken up into functions and modules. Modularity is another essential characteristic of production-quality code; it makes code more organized, efficient, and reusable. In programming, a module is just a file. Just as code can be encapsulated in a function and reused by calling the function in different places, modules allow code to be reused by encapsulating it in files that can be imported into other files.
To get a better understanding of what modular code is, try to think of it as putting clothes away. We could just put all our clothes in a single container, but it would not be easy to find anything, perhaps because there are multiple versions of the same shirt or socks. It would be much better if we had a drawer for T-shirts, another for shirts, and another for socks. With this design, it is much easier to tell someone else how to find the right shirt, pants, or pair of socks. The same is true of modular code.
Splitting code into logical functions and modules makes it possible to find relevant pieces of code quickly. Pieces of code can often be generalized so that they can be reused in different places, which avoids writing unnecessary extra lines. Abstracting these details into functions and modules also improves the readability of the code. Thus, programming in a way that makes code easier for a team to understand and iterate on is crucial for production.
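As a minimal sketch of this idea, assume a hypothetical task of curving a list of test scores. The function names, variable names, and numbers below are illustrative assumptions, not taken from any real project. The repeated logic is generalized into two small functions that can be reused:

```python
# Hypothetical example: names and numbers are illustrative assumptions.

def add_points(scores, points):
    """Return a new list with `points` added to every score."""
    return [score + points for score in scores]


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


test_scores = [88, 92, 79, 93, 85]

# The same two functions are reused for every curve instead of
# copying the loop and the averaging logic for each case.
curved_5 = add_points(test_scores, 5)
curved_10 = add_points(test_scores, 10)

print(mean(test_scores))
print(mean(curved_5))
print(mean(curved_10))
```

If these functions lived in their own file, other files could import them rather than redefine them.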
Refactoring Code
It is easy to pay little attention to writing good code when starting on a new idea or task and to focus only on getting it to work. Typically, code gets a little messy and repetitive at this stage of development. Furthermore, it is hard to know the best way to write code before it is finished. For example, it can be challenging to know which functions would best modularize the steps in the code before we have experimented with it enough. Thus, going back to do some refactoring after achieving a working model is a must.
Code refactoring is a term for restructuring code to improve its internal structure without changing its external functionality. Refactoring gives us a chance to clean and modularize code after producing a working version. In the short term, this might seem like a waste of time, since we could be moving on to the next feature. However, allocating time to refactoring reduces the time it takes the team to develop code in the long run. Refactoring code consistently not only makes it much easier to come back to later, but also allows us to reuse parts for different tasks and learn reliable programming techniques along the way. The more we practice refactoring, the more intuitive it becomes.
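A small before-and-after sketch may make this concrete. The functions below are hypothetical and exist only for illustration. The point is that the refactored version has a cleaner internal structure while its inputs and outputs stay the same:

```python
# Hypothetical example: a first working draft and its refactored version.

# Before: repetitive logic written just to get things working.
def summarize_before(prices):
    total = 0
    for p in prices:
        total = total + p
    average = total / len(prices)
    highest = prices[0]
    for p in prices:
        if p > highest:
            highest = p
    lowest = prices[0]
    for p in prices:
        if p < lowest:
            lowest = p
    return average, highest, lowest


# After: the same inputs and outputs, expressed with built-ins.
def summarize_after(prices):
    """Return the (average, highest, lowest) of a non-empty list of prices."""
    return sum(prices) / len(prices), max(prices), min(prices)


prices = [4.5, 10.0, 7.25]

# Refactoring must not change external behavior, so both versions agree.
assert summarize_before(prices) == summarize_after(prices)
```

Keeping a check like the final assertion around while refactoring is one simple way to confirm that behavior has not changed.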
Efficient Code
It is essential to improve the efficiency of the code in addition to making it clean and modular in the refactoring process. There are two parts to making code efficient: reducing the time it takes to execute and reducing the amount of space it takes up in memory. Both can have a significant impact on a company or product’s performance. Therefore, it is important to practice this when working in a production environment.
However, how important it is to improve efficiency depends on context. Slow code might be acceptable in one case and not in another. For example, a batch data preparation process might not need to be optimized right away if it runs only once every three days for a few minutes. On the other hand, code used to generate posts to show on a social media feed needs to be relatively fast, since updates happen instantaneously. Moreover, spending time refactoring to clean or optimize code after it is working is essential, and it is crucial to understand how valuable this process is for a developer. Each time we optimize code, we pick up new knowledge and skills, which makes us more efficient programmers over time.
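One common way to reduce execution time is to choose a better data structure. The sketch below uses made-up data and hypothetical variable names to compare a list scan with a set intersection for finding items common to two collections. Actual timings depend on the machine, so none are claimed here:

```python
import time

# Hypothetical data: two lists of book IDs (illustrative, not real data).
recent_books = list(range(0, 6000, 2))
coding_books = list(range(0, 6000, 3))

# Slower approach: each `in` check scans the list element by element.
start = time.perf_counter()
common_slow = [book for book in recent_books if book in coding_books]
slow_seconds = time.perf_counter() - start

# Faster approach: set intersection relies on hash lookups.
start = time.perf_counter()
common_fast = set(recent_books) & set(coding_books)
fast_seconds = time.perf_counter() - start

# Both approaches find the same books.
assert set(common_slow) == common_fast

print(f"list scan: {slow_seconds:.4f}s, set intersection: {fast_seconds:.4f}s")
```

Whether such an optimization is worth the effort depends on how often the code runs and how large the data is.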
Documentation
Documentation is additional text or illustrated information that comes with or is embedded in the code of the software. It helps clarify complex parts of programs, makes code easier to read and navigate, and quickly conveys how and why different components of the program or algorithm are used. Documentation can be added at several levels of a program. First, line-level documentation uses in-line comments to clarify code. Second, function-level or module-level documentation uses docstrings to describe purpose and details. Finally, project-level documentation uses tools such as a README file to document the project as a whole and how all the files work together.
In-line Comments
Text following a hash symbol in code is an in-line comment. In-line comments are used to explain parts of the code and help future contributors understand it. Comments are used in different ways, and there are differences among great comments, okay comments, and even useless comments. One use of comments is to document the significant steps of complex code to help readers follow along. For example, with guiding comments on a function, future contributors do not need to understand every line of code to follow what the function does. Comments help readers understand the purpose of each block of code, and can even help them figure out individual lines of code or methods.
However, others argue that comments are often used to justify lousy code, and that code which requires comments to follow is a sign that refactoring is needed. Comments are valuable for explaining what code cannot, for example, the history behind why a particular method was implemented in a specific way. Sometimes an unconventional or seemingly arbitrary approach is used because of some undefined external variable causing side effects. These things are difficult to explain with code. For instance, the numbers used for detecting edges in an image may seem arbitrary, but the programmer may have experimented with different numbers and found that these were the ones that worked for the specific use case.
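The edge-detection case can be sketched as follows. The threshold values, the names, and the reason given in the comment are all hypothetical, chosen only to show a comment that records why seemingly arbitrary numbers were picked:

```python
# Hypothetical example: the threshold values are made up for illustration.

# Thresholds chosen by experimenting on sample images; values outside
# this range produced either too much noise or missing edges.
LOW_THRESHOLD = 50
HIGH_THRESHOLD = 150


def classify_gradient(magnitude):
    """Label a pixel's gradient magnitude as 'strong', 'weak', or 'none'."""
    if magnitude >= HIGH_THRESHOLD:
        return "strong"
    if magnitude >= LOW_THRESHOLD:
        return "weak"
    return "none"


print(classify_gradient(200))
print(classify_gradient(75))
print(classify_gradient(10))
```

The code itself shows what the thresholds are; only the comment can tell a future contributor where they came from.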
Docstrings
Docstrings, or documentation strings, are valuable pieces of documentation that explain the functionality of a function or module. Ideally, every function should have a docstring. By convention, a docstring is surrounded by triple quotes. The first line is a brief explanation of the function’s purpose. A single-line docstring is perfectly acceptable if one line of documentation is sufficient. However, if the function is complicated enough to warrant a longer description, a more thorough paragraph can be added after the one-line summary. The next element of a docstring is an explanation of the function’s arguments: list the arguments, state their purpose, and state what types they should be. Finally, it is common to describe the output of the function. Every piece of the docstring is optional. However, docstrings are part of good coding practice¹ and help others understand the code.
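A docstring with the elements described above might look like the sketch below. The function is hypothetical, and the "Args" and "Returns" section layout is one of several common conventions rather than a requirement:

```python
def population_density(population, land_area):
    """Calculate the population density of an area.

    Args:
        population (int): The number of people living in the area.
        land_area (float): The size of the area in square kilometers.

    Returns:
        float: The population density in people per square kilometer.
    """
    return population / land_area


# A docstring is available at runtime through the __doc__ attribute.
print(population_density.__doc__)
```

Calling help(population_density) in an interactive session displays the same text.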


Project Documentation
Project documentation is essential for helping others understand why and how the code is relevant, whether they are potential users of the project or developers who may contribute to the code. A significant first step in project documentation is a README file. It will often be the first interaction most users have with the project. Whether it is an application or a package, a project should come with a README file. At a minimum, this should explain what the project does, list its dependencies, and provide sufficiently detailed instructions on how to use it. It must be as simple as possible for others to understand the purpose of the project and quickly get something working.

Translating all ideas and thoughts formally onto paper can be a little tricky, but it gets better over time and makes a significant difference in helping others realize the value of the project. Writing this documentation² can also help improve the design of the code, and it lets future contributors know how to follow the original intentions.
Testing
Testing³ code is essential before deployment. It helps catch errors and faulty conclusions before they have any significant impact. Writing tests is a standard practice in software engineering. However, testing is a practice that many data scientists are not familiar with when they first start in industry. In fact, the insights data scientists come up with, which are supposed to inform business decisions and company products, are sometimes based on the results of untested code. This lack of testing is a common complaint from software developers working with data scientists. Without testing, code may fail with execution errors due to software issues, or, worse, it may dictate business decisions and affect products based on faulty conclusions. Today, employers are looking for data scientists with the skills to properly prepare their code for an industry setting, which includes testing it.
It is pretty obvious when a software program crashes: an error occurs, and the program stops running. However, many problems that can occur in the data science process are not as easily detectable as a functional error that crashes the program. All of the code can seem to run smoothly while we are entirely unaware that specific values are being encoded incorrectly, features are being misused, or unexpected data is breaking assumptions.
These errors are more difficult to find because we have to check the quality and accuracy of the analysis in addition to the quality of the code. Therefore, it is essential to apply proper testing to avoid surprises and have confidence in the results. In fact, testing has proven to have so many benefits that there is an entire development process based on it, called Test-Driven Development⁴, in which tests for a task are written before the code that implements it.
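The sketch below illustrates a silent error of this kind. The mapping, data, and function names are hypothetical. One version quietly produces a wrong encoding, while the other checks its assumptions about the data and fails loudly:

```python
# Hypothetical example: the mapping and data are illustrative assumptions.
SIZE_CODES = {"small": 0, "medium": 1, "large": 2}


def encode_sizes_silently(sizes):
    """Encode sizes, quietly mapping unknown labels to -1."""
    return [SIZE_CODES.get(size, -1) for size in sizes]


def encode_sizes_checked(sizes):
    """Encode sizes, raising an error on labels that break our assumptions."""
    unknown = sorted(set(sizes) - set(SIZE_CODES))
    if unknown:
        raise ValueError(f"Unexpected size labels: {unknown}")
    return [SIZE_CODES[size] for size in sizes]


raw = ["small", "Large", "medium"]  # "Large" has unexpected capitalization

# This runs without crashing, but one value is encoded incorrectly.
print(encode_sizes_silently(raw))

# The checked version surfaces the problem instead of hiding it.
try:
    encode_sizes_checked(raw)
except ValueError as error:
    print(error)
```

A test that feeds unexpected labels to the encoder would catch the first behavior before it influenced any analysis.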
Test-Driven Development
Test-driven development is a process of writing tests before writing the code that is being tested. This means the test fails at first, and we know we have finished implementing a task when the test passes. This way of developing code has a number of benefits that have made it a standard practice in software engineering. As a simple example, suppose we want to write a function that checks whether a string is a valid email address. We might think of a few factors to consider, such as whether the string contains an "@" symbol and a period, write a function that addresses them, and then test it manually in the terminal.
We try one valid and one invalid email address to make sure it works properly. We then try a few more valid and invalid email addresses, and one of them gives back the wrong result. Instead of going back and forth like this, we can create a test that checks for all the different scenarios⁵. This way, when we start to implement the function, we can run this test to get immediate feedback on whether it works in all the ways we have thought of. We can keep tweaking the function until the test passes; when it does, the implementation is done.
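The email example can be sketched as follows. The function name is_valid_email and its rules (exactly one "@", a non-empty local part, a period inside the domain) are simplifying assumptions for illustration and are far from a complete email validator. In a test-driven workflow, the test function would be written first and would fail until the implementation satisfied every case:

```python
# Hypothetical sketch: the rules below are deliberately simple and do
# not cover every valid or invalid email address.

def is_valid_email(address):
    """Return True if `address` passes a few basic email checks."""
    if address.count("@") != 1:
        return False
    local, domain = address.split("@")
    if not local or "." not in domain:
        return False
    # The domain must not start or end with a period.
    return not (domain.startswith(".") or domain.endswith("."))


def test_is_valid_email():
    """Written first in TDD; it fails until the function is implemented."""
    valid = ["jane@example.com", "j.doe@mail.example.org"]
    invalid = ["", "janeexample.com", "jane@example", "@example.com",
               "jane@@example.com", "jane@.com", "jane@example."]
    for address in valid:
        assert is_valid_email(address), address
    for address in invalid:
        assert not is_valid_email(address), address


test_is_valid_email()
print("All email tests passed.")
```

Whenever a new scenario is discovered, it can be added to the lists in the test, and the function can be adjusted until the test passes again.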
When refactoring or adding to the code, tests help us rest assured that the rest of the code did not break while we made those changes. Tests also help ensure that a function’s behavior is repeatable, regardless of external parameters such as hardware and time. Test-driven development for data science is relatively new, with a lot of experimentation and breakthroughs appearing.
Logging
A log is valuable for understanding the events that occur while running the program. Imagine that a model is running every night, and it is producing ridiculous results the next morning. Log messages can help to understand more about the cause, the context, and figure out how to address the issue. Since we are not physically there to see and debug when the problem occurred, it is essential to print out descriptive log messages to help in tracing back the issue and understand what is happening in the code.
Take a look at a few examples, and learn tips for writing good log messages.
Tip: Be professional and clear
BAD:
Hmmm... this isn't working???
BAD:
idk.... :(
GOOD:
Could not parse file.
Tip: Be concise and use normal capitalization
BAD:
Start Product Recommendation Process.
BAD:
We have completed the steps necessary and will now proceed with the recommendation process for the records in our product database.
GOOD:
Generating product recommendations.
Tip: Choose the appropriate level for logging
DEBUG:
Use this level for anything that happens in the program.
ERROR:
Use this level to record any error that occurs.
INFO:
Use this level to record all actions that are user-driven or system-specific, such as regularly scheduled operations.
Tip: Provide any useful information
BAD:
Failed to read location data.
GOOD:
Failed to read location data: store_id 8324971.
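These tips can be put together with Python's standard logging module. In the sketch below, the logger name, the function, and the data are hypothetical, and the configuration shown is only one reasonable setup:

```python
import logging

# Configure the root logger once, near the program's entry point.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("recommendations")  # hypothetical logger name


def read_location_data(store_id, records):
    """Return location data for a store, logging an error if it is missing."""
    logger.debug("Looking up store_id %s.", store_id)
    if store_id not in records:
        logger.error("Failed to read location data: store_id %s.", store_id)
        return None
    return records[store_id]


logger.info("Generating product recommendations.")
read_location_data(8324971, {})
```

Raising the configured level to logging.INFO would hide the DEBUG message while keeping the INFO and ERROR messages.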
Code Reviews
Code reviews⁶ benefit everyone on a team by promoting best programming practices and preparing code for production. Code reviews are a common practice⁷ at work, and for good reason. Reviewing each other’s code can help catch errors, ensure readability, check that standards are being met for production-level code, and share knowledge among a team. They are beneficial for the reviewer and the team. Ideally, a data scientist’s code is reviewed by another data scientist, since there are errors and standards to check for specifically in data science, for example, data leakage, misinterpretation of features, or inappropriate evaluation methods.
Here are some questions to consider while reviewing code.
Is the code clean and modular?
* Can I understand the code easily?
* Does it use meaningful names and whitespace?
* Is there duplicated code?
* Can you provide another layer of abstraction?
* Is each function and module necessary?
* Is each function or module too long?
Is the code efficient?
* Are there loops or other steps we can vectorize?
* Can we use better data structures to optimize any steps?
* Can we shorten the number of calculations needed for any steps?
* Can we use generators or multiprocessing to optimize any steps?
Is documentation effective?
* Are in-line comments concise and meaningful?
* Is there complex code that's missing documentation?
* Do functions use effective docstrings?
* Is the necessary project documentation provided?
Is the code well tested?
* Does the code have high test coverage?
* Do tests check for interesting cases?
* Are the tests readable?
* Can the tests be made more efficient?
Is the logging effective?
* Are log messages clear, concise, and professional?
* Do they include all relevant and useful information?
* Do they use the appropriate logging level?
Here are some tips on how to actually write a code review.
Tip: Use a code linter
This can save a lot of time in code review. A code linter can automatically check for coding standards. It is also a good idea to agree on a style guide as a team to handle disagreements on code style, whether that is an existing style guide or one created together incrementally as a team.
Tip: Explain issues and make suggestions
Rather than commanding people to change their code in a specific way, it goes a long way to explain the consequences of the current code and suggest changes to improve it. They will be much more receptive to the feedback if they understand the thought process and are accepting recommendations rather than following commands. They may also have done it a certain way intentionally, and framing the feedback as a suggestion promotes a constructive discussion rather than opposition.
BAD:
Make model evaluation code its own module - too repetitive.
BETTER:
Make the model evaluation code its own module. This will simplify models.py to be less repetitive and focus primarily on building models.
GOOD:
How about we consider making the model evaluation code its own module? This would simplify models.py to only include code for building models. Organizing these evaluation methods into separate functions would also allow us to reuse them with different models without repeating code.
Tip: Keep your comments objective
Try to avoid using the words "I" and "you" in review comments. Avoid comments that sound personal, so that the attention of the review stays on the code and not on the people involved.
BAD:
I wouldn't groupby genre twice like you did here... Just compute it once and use that for your aggregations.
BAD:
You create this groupby dataframe twice here. Just compute it once, save it as groupby_genre and then use that to get your average prices and views.
GOOD:
Can we group by genre at the beginning of the function and then save that as a groupby object? We could then reference that object to get the average prices and views without computing groupby twice.
Tip: Provide code examples
When providing a code review, save the author time and make it easy for them to act on the feedback by writing out code suggestions. This shows that we are willing to spend some extra time to review their code and help them out. It can also just be much quicker to demonstrate concepts through code rather than explanations.
Sample of code under review:
first_names = []
last_names = []
# Split each full name into a first and last name
for name in df['name']:
    first, last = name.split(' ')
    first_names.append(first)
    last_names.append(last)
df['first_name'] = first_names
df['last_name'] = last_names
BAD:
You can do this all in one step by using the pandas str.split method.
GOOD:
We can actually simplify this step to the line below using the pandas str.split method.
df[['first_name', 'last_name']] = df['name'].str.split(' ', n=1, expand=True)
References
¹ PEP 257 – Docstring Conventions
² Bootstrap GitHub
³ Getting Started Testing by Ned Batchelder
⁴ Four Ways Data Science Goes Wrong and How Test-Driven Data Analysis Can Help
⁵ Integration Testing
⁶ Guidelines for Code Reviews
⁷ Code Review Best Practices
Disclaimer: This article is based on the Python programming language, meaning that the code samples and documentation use Python for reference.