When I started my job in 2019, I joined the team alongside another developer to clean up and stabilize the codebase. The first three months were a planning phase in which we read the code, decided on the architecture moving forward, and set up the infrastructure that our team now runs on. Looking back, it was a vast undertaking that taught me so much about automation, code development, and collaboration. But the biggest lesson I learned from this experience was the importance of Clean Code.
One resource I used heavily during this project was a software book that I reference often. If you haven’t read it yet, Clean Code: A Handbook of Agile Software Craftsmanship by Robert C. Martin is a great book. I’ve spoken about this book previously in an article discussing my top 3 recommendations for data scientists, and I still stand by it. Clean Code is a perfect book for anyone who writes code. This book helped me identify the areas of the codebase that could be improved as I worked with my teammate to refactor 30+ repositories down to one software library. Our goal was to create a maintainable software library that was readable enough for any data scientist on the team to pick up the code and understand it.
After the first three months, the project took off and began to gain traction with different stakeholders who noticed what we were doing. Some began to see their vision coming true as the team matured, while others felt at first that we were only making their jobs harder. The team we most needed to convince was the group of software engineers who focused on the customer-facing application. When we began to release the library to our team, we made it clear to the software team that we were halting development on the 30+ repositories and would no longer support them. This meant they needed to begin to ingest our new library.
Reliability and readability became a considerable part of this work, especially with the software engineers beginning to ingest our codebase. We had to teach them to ingest, run, and produce results from our code. This meant we spent extra time on the small things. We focused on reliability and readability both to make it easier for other data scientists to pick up the code and add to it, and to reassure the software engineers that the code was stable. At the time, most of the Data Science team was unfamiliar with object-oriented programming and how to write clean, practical functions. Knowing this, the areas we focused heavily on were:
- Writing detailed and easy-to-read variable, function, and class names.
- Documenting every function, argument, and return statement in the library with descriptions detailing the items’ purpose. For functions or classes that needed it, we added detailed examples of how to use the code.
- Writing functions and classes that would reduce time to do specific analyses. If the team was commonly doing particular tasks, we spent time understanding those tasks and creating code in the library that would work for them.
- Developing a hosted website of documentation to share with the team. This webpage consisted of the library documentation, onboarding documentation, how-to guides, and architectural diagrams necessary for smooth team operations.
- Incorporating automation in the form of CI/CD pipelines to test the code for any issues that could arise from a pull request. This resulted in a stable codebase that could only be altered after a pull request had been opened, reviewed, and passed all its tests.
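To make the naming and documentation points above a little more concrete, here is a minimal sketch of what a function written in that style might look like. It is not taken from our library; the function name, its argument, and the order-value scenario are all hypothetical and chosen purely for illustration.

```python
from statistics import mean


def calculate_average_order_value(order_totals: list[float]) -> float:
    """Return the average order value for a collection of orders.

    Hypothetical example: the name, argument, and scenario are
    illustrative only and do not come from a real library.

    Args:
        order_totals: The total amount of each order, in dollars.

    Returns:
        The arithmetic mean of the order totals.

    Raises:
        ValueError: If order_totals is empty.

    Example:
        >>> calculate_average_order_value([10.0, 20.0, 30.0])
        20.0
    """
    # Fail early with a clear message instead of letting the
    # statistics module raise a less descriptive error.
    if not order_totals:
        raise ValueError("order_totals must contain at least one value")
    return mean(order_totals)
```

The name says what the function does, the docstring covers every argument and the return value, and the short example shows a new teammate how to call it without reading the body.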
As we finished restructuring the codebase and the software team began to ingest it, the benefits of cleaner and more reliable code became noticeable. We were no longer trying to pass code changes through emails, and we now had good visibility into who was making changes and when. We had processes in place to see how the code was developing over time and to revert changes if needed.
Through these updates, we developed a smooth onboarding process that reduced the time to onboard new people from over a month to less than a week. This was due to documentation and tutorials that walked them through all the steps necessary to do their job. As we began integrating more closely with the software team, we presented them with our architectural diagram and detailed documentation processes. This helped them determine how best to integrate their work with ours and plan out the next steps needed to consume our analyses. And lastly, data scientists could easily pick up a different part of the codebase and learn what it does relatively quickly. If something is unclear, they work with the original developer or a senior member of the team to make the code cleaner for the next person.
Final Thoughts
It may take a bit longer to refactor and clean your code before opening a pull request, but the biggest lesson I learned from this experience was the importance of clean code. Clean code and documentation can aid in the onboarding of new employees, and they help when significant process changes come into play, when new teams become integrated with yours, or when a developer wants to start working with a different part of the codebase than they are used to. Clean code isn’t for you, in the here and now. Clean code is for the next person who will need it when they start looking at that code. Clean code is for the future you, who will need to work on something you haven’t looked at in forever and need a refresher on. So the next time you are working on some code and are ready to open a PR, think: how can I clean this up and document it for the next person? What will be helpful in the future when I look back at this? Don’t leave a mess for someone else to pick up.