Reading List

The Last Mile in Shipping Data Science Projects Well

What are the best practices in data science documentation?

Elliot Gunn
Towards Data Science
3 min read · Apr 13, 2021

Photo by Sigmund on Unsplash

Data science projects increasingly rely on complicated tech stacks, and documentation that omits important information can become a failure point when a project changes hands. I experienced this firsthand during a data science internship. Our team was brought on to refine and extend an existing machine learning model in production. We spent much of our time on data cleaning, as the dataset was dirty and lacked a data dictionary. We also inherited a significant codebase from the previous team, without any documentation of their process or assumptions.

We ended up starting over from scratch; it was easier than trying to decipher every single line of code. This was an instructive experience for those of us who were relatively new to data science at the time, and we made it a priority to keep detailed documentation ourselves. While it was admittedly boring to stay on top of documentation as our codebase grew in complexity, it was the only way we could ensure that the project’s outcome was reproducible and legible to the next team inheriting our project.

Documentation is typically something we learn not in a course but through hands-on experience at a job or under a mentor’s direction. This can leave students unaware of the importance of proper and extensive documentation. However, as I combed through the TDS archives, I discovered several TDS authors who have taken the time to share the profession’s best practices. These posts teach you how to create clear and accurate documentation that provides a solid foundation for your project’s continued success during team transitions and hand-offs.

How do you build a “documentation-first culture”? Prukalpa shares how the data team at Atlan approached this challenge. It wasn’t an overnight process; the team brainstormed, set goals, iterated continuously, and implemented fixes to the documentation framework. Their success in incorporating documentation into the daily workflow suggests that it is possible for other data teams to adopt their method.

A README file is usually the first thing you look at when you explore a new open-source tool or wish to learn how a project works. Navendu Pottekkat’s guide to writing a kickass README breaks down the key components of a useful README. He argues that “If people don’t know what your software does, then they won’t use it or contribute to it and they will most likely find something more clear and concise in the sea of open-source software.” Navendu’s post goes to a level of detail that is unusual for README guides; he even includes tips for making a README visually appealing.

Adam Gajtkowski shares a short guide to using R Markdown and LaTeX to write professional data science documentation. To those who don’t use it regularly, LaTeX may come with a steep learning curve. Nonetheless, learning LaTeX is a worthwhile investment as it provides consistent formatting of both text and mathematical formulas.

If you’re still not convinced that documentation is essential, Admond Lee makes the case for its importance for data scientists. He argues that documentation ensures reproducibility and project completion, and suggests that it’s always a good idea for a peer to look it over for readability.

Do you have favourite resources on writing better documentation? What are some best practices that your team uses? Drop some links in the comments; I would love to learn more!
