Why data scientists still can’t code.
Disclaimer: not all data scientists write production-grade code, nor should they all have to; whether they should is ultimately down to context. But if they could, it would make the field a much better place.

Common knowledge would have you think a data scientist spends the majority of their time building models and evaluating them. This is a falsehood. For many data scientists, the majority of their time is spent developing the data pipelines that act as a requisite precursor to machine learning. Such pipelines do not come out of thin air, and barring the use of some third-party plug-and-play, drag-and-drop suite, they are coded. As such, it’s necessary that the code of these pipelines be up to scratch.
However, there’s a growing sense amongst techies that data scientists cannot code well. This view is at once valid, worrying, and wholly unsurprising. While much of the code written by data scientists is for EDA pieces, rapid prototyping, and throwaway analyses, if they’re doing their job properly a relatively significant chunk will see production release. Unfortunately, the code that reaches production is often written with the same disposable-analysis mindset and is nowhere near production grade. Indeed, sometimes the code is so bad that ops teams will refuse to ship it.
So, why can’t data scientists code well?
- Data scientists hail from a wide array of academic disciplines, many of which offer little exposure to the skills required for writing quality code: the principles of software engineering, programming paradigms, clean coding practices, testing, logging and instrumentation, folder structure conventions, version control, and much more if you want to dip into DevOps territory. With the possible exception of computer science, the disciplines from which data scientists hail typically award marks for code that produces the expected output, overlooking whether the code itself is spaghetti or well written. This sets a bad precedent, and it is a harmful mindset that carries over into the real world. (A short sketch after this list illustrates what the production-grade bar looks like in practice.)
- Since data science is such an ill-defined and all-encompassing term, spanning everyone from people with doctorates in deep learning to people who knock up Tableau dashboards, every man and his dog has decided to change their LinkedIn title while the term is still in vogue. What does this mean? Far from the duck-typing test of “if it walks like a duck and quacks like a duck, then it must be a duck”, management who lack technical know-how see a data scientist as what they are not, rather than what they are. The result is an overstretched skill set and low-quality implementations of problems that demand multi-disciplinary solutions.
- Network effects and a nascent market. From a technical standpoint, a data scientist’s strengths and weaknesses translate to a deeper or shallower understanding of the various subcomponents of the DS project pipeline. For more well-oiled and experienced teams, the pipeline is a machine with a well-defined separation of concerns, each subcomponent supervised by a respective expert who assumes responsibility for it. Such an operation will typically have capable developers who demand clean code; everyone must pick up their coding game. For less experienced teams, a slickly functioning pipeline is elusive in the best of cases: there are too many unknown unknowns. The field of data science in its current form is immature and not yet governed by healthy ways of working, so data scientists hire data scientists who mirror their own skill set and do little to shore up the weaker parts of the team’s pipeline. Since many data scientists are weak developers, you end up with weak developers hiring weak developers because they don’t know any better, and although a team might consist of many strong ML modellers, they won’t have the foggiest clue about writing high-quality, maintainable code.
- The universality of Jupyter Notebooks. As a staunch advocate of using Notebooks for DS, it pains me to list this as a leading cause. When used properly, Notebooks are a great medium for presenting EDA findings and a playground for rapid prototyping. However, best practices around their use are scarcely found in the wild, and I’ve witnessed far too many scrappy, half-finished Notebooks full of uncommented, inelegant, difficult-to-read code, always written by a single person (distributed version control is really difficult with Notebooks, since they’re basically giant JSON files; the second sketch below shows why). And unlike fully fledged IDEs, there are no code-cleanup prompts nudging you towards PEP 8 conformance.
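
To make the production-grade bar concrete, here is a minimal sketch of a single pipeline step written with the points above in mind: typed, validated, logged, and covered by a test. It assumes pandas and pytest; the function name, column names, and file layout are hypothetical examples rather than a prescription.

```python
# A hypothetical pipeline step, e.g. in src/features/sessions.py.
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def add_session_length(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `df` with a `session_length` column in seconds.

    Expects `session_start` and `session_end` datetime columns.
    """
    missing = {"session_start", "session_end"} - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    out = df.copy()
    out["session_length"] = (
        out["session_end"] - out["session_start"]
    ).dt.total_seconds()
    logger.info("added session_length for %d rows", len(out))
    return out


# A matching test, e.g. in tests/test_sessions.py, runnable with pytest.
def test_add_session_length():
    df = pd.DataFrame(
        {
            "session_start": pd.to_datetime(["2021-01-01 00:00:00"]),
            "session_end": pd.to_datetime(["2021-01-01 00:01:30"]),
        }
    )
    result = add_session_length(df)
    assert result["session_length"].iloc[0] == 90.0
```

Nothing here is clever; the point is that small, single-purpose functions with explicit contracts and tests are what let an ops team ship your pipeline with a straight face.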
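
And to see why version control struggles with Notebooks, here is a second minimal sketch that reads an .ipynb file directly as the JSON it is. The filename is a hypothetical example; the cell structure shown is the standard notebook format.

```python
# A notebook is plain JSON: every code cell stores its source alongside
# its outputs and execution count, all of which end up in version control.
import json

with open("analysis.ipynb") as f:  # hypothetical notebook file
    nb = json.load(f)

for cell in nb["cells"]:
    if cell["cell_type"] == "code":
        # The code you actually wrote...
        print("".join(cell["source"]))
        # ...and the run-dependent noise that changes on every execution,
        # which is what makes textual diffs and merges so painful.
        print("execution_count:", cell.get("execution_count"))
        print("outputs:", len(cell.get("outputs", [])))
```

Re-running a notebook with no code changes is enough to rewrite those fields, which is why two people editing the same Notebook so often ends in a merge conflict.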
Stay tuned for Part 2, featuring best practices for developing data science pipelines.

