Data science has seen many great developments in the last decade, but despite these achievements, many projects never see the light of day. As data scientists, we must not only show strong technical skills but also understand the business context, communicate effectively with stakeholders, and translate their questions into actionable recommendations that drive business value. Is this reasonable, or is the business looking for the new unicorns? In this blog, I will describe how the business has changed over the years, which will give a better perspective on what you may need to learn to successfully deliver data science projects.
A short introduction
More than a decade ago, companies recognized that mining data sets can yield information that increases revenue, optimizes processes, and lowers (production) costs. This has led to a new field with new roles: the data science field with data scientists. But the needs of the business keep changing over the years, and you have to understand those needs to know what you need to learn as a data scientist. In the next section, we will first zoom out to describe how the data science field has evolved over the last decade. This can help you to know (1) what was important to learn, (2) what is now important to learn, and (3) what can be important to learn for future endeavors. Let’s go back in time.
First, there were the scientific programmers.
Data science spans many disciplines, and its basis is built on statistics and mathematics that originate from decades of (academic) research and development. Many of the original core algorithms form the fundamentals of disciplines such as text mining, image recognition, sensor data analysis, and time series analysis. In the early days, these methods were published without the accompanying code. To apply a method, companies hired scientific programmers to do the challenging and time-consuming task of implementing it. But before writing a single line of code, there was usually a process of thinking about why the effort should be made and what kind of results could be expected. Over the last decade, this has changed dramatically because companies such as Google and Meta started open-sourcing their libraries. In addition, communities started developing open-source packages such as sklearn, scipy, and many more. Installing one is now just a single line of code.
The data science field is evolving fast, but what does the business need?
Nowadays, scientific programmers have become data scientists. However, something has changed. The business also needs data scientists who can communicate effectively with stakeholders, identify business opportunities, and translate technical insights into actionable recommendations that drive business value. This has led to a new kind of data scientist: the applied data scientist.
The Applied vs. Fundamental Data Scientist
The term "data scientist" is often used as a collective name for different roles in the data science field, from data steward, ML engineer, and data engineer to statistician and many more. But when we talk about the real data scientist, there are basically two types: the fundamental data scientist and the applied data scientist.
- The fundamental data scientist has strong knowledge of statistical and machine learning techniques to analyze complex data sets and derive insights. This person can tell you everything about the underlying data distributions and can easily create or alter algorithms and methods to solve the problem. These individuals usually excel in research and development environments and academic institutions.
- The applied data scientist focuses on applying existing techniques and methods to solve specific business problems or to develop data-driven products and/or services. Usually, people in these roles excel in one domain, such as text mining, image recognition, sensor data analysis, or time series analysis. Innovation is usually achieved by applying new methods to their domain-related data, not by creating new algorithms or methods.
Both roles have their own strengths and weaknesses. I have three tips that can help either of them to successfully deliver data science projects.
Tip 1: Learn the fundamentals of programming.
Thanks to great platforms such as Coursera, Udemy, YouTube, and Medium, there is plenty of material to learn programming fundamentals. Good habits to start with:
- Write your code in known styles, such as PEP 8.
- Write inline comments that explain what you do and why you do it.
- Write docstrings.
- Use sensible variable names.
- Lower your code complexity (like a lot).
- Write unit tests.
- Write documentation.
- Keep it clean.
Programming is one of the major challenges in the data science field. It is heavily underestimated, but it is one of the key components that can make or break a data science project on its way to production. Think about this: do you want to maintain a model or code that has no documentation, has no unit tests, and is written in spaghetti style? I don’t think so.
Each data science project requires reproducible code, and the step towards production requires maintainable code. In the end, each project is merely a bunch of lines of code that someone needs to get into production. Keep it clean. Keep it tidy.
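To make the checklist above a bit more concrete, here is a minimal sketch of what a small, clean function could look like. The function `min_max_scale` and its tests are hypothetical examples made up for this illustration; they are not taken from any particular project or library.

```python
"""Illustration of a docstring, sensible names, low complexity, and unit tests."""
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1].

    Args:
        values: Non-empty sequence of numbers.

    Returns:
        A list where the smallest input maps to 0.0 and the largest to 1.0.

    Raises:
        ValueError: If `values` is empty.
    """
    if len(values) == 0:
        raise ValueError("values must not be empty")

    lowest, highest = min(values), max(values)
    value_range = highest - lowest

    # A constant input has no spread; return zeros instead of dividing by zero.
    if value_range == 0:
        return [0.0 for _ in values]

    return [(value - lowest) / value_range for value in values]


# Unit tests in the plain-assert style that test runners such as pytest collect.
def test_min_max_scale_maps_extremes_to_zero_and_one():
    assert min_max_scale([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]


def test_min_max_scale_handles_constant_input():
    assert min_max_scale([3.0, 3.0]) == [0.0, 0.0]


if __name__ == "__main__":
    test_min_max_scale_maps_extremes_to_zero_and_one()
    test_min_max_scale_handles_constant_input()
    print("all tests passed")
```

The function does one thing, the names say what the values are, the only inline comment explains a why, and the tests document the expected behavior for whoever has to maintain the code later.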
Tip 2: A successful project needs more than a machine learning solution.
Data science projects usually start with lots of enthusiasm, but that can quickly fade away because projects need much more than a machine learning solution. A recently published article [1] describes very well what the most important technical steps are in a data science project. However, to go from an idea to production, you need more than technical skills. A summary of the steps that can help to increase the success of a project is as follows:
- Start with the end in mind. Know where and how to land your project in the organization or company at the start of the project. Data governance, ethics, and privacy are important to start with.
- Check which platform or infrastructure to collaborate on. This can, for example, be Git with CI/CD pipelines and cookie-cutter templates.
- Understand the domain. Before doing any analysis, you need a basic understanding of the domain you are working in. You need to know how to handle your data with respect to the field and context you are working in. There is no such thing as a one-size-fits-all data science solution.
- Do your data analysis correctly. This may seem trivial, but knowing how to pip install a package does not make you an expert. Do your own research and read articles. Avoid (complex) machine learning solutions that you cannot explain. Use separate train, validation, and test sets. Compare your results with baselines. Discuss your ideas and results with an experienced scientist and with people who have domain knowledge.
- Report your results. Be transparent. Tell a fact-based story. Do not generalize the story beyond the data. Describing the journey is more important than that single number that came out of the model.
- Write reproducible and maintainable code. Demonstrate that the results are reproducible, and the code is maintainable.
- Hand over the results. Once all steps are done, the results or product need to be handed over to the customer in a form that they can work with. Handing over your own laptop with the working code is not the solution.
If you look carefully at these steps, there is only one step (#4) where the data is analyzed and models are created. Let that sink in.
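The advice in step 4 to split your data and to compare against a baseline can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the toy data is synthetic, and the helper names (`three_way_split`, `majority_baseline`, `fit_threshold`) are hypothetical and written in plain Python for this post. In a real project you would more likely rely on the equivalent utilities in a library such as sklearn.

```python
import random
from collections import Counter


def three_way_split(rows, validation_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into train, validation, and test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


def majority_baseline(train_labels):
    """Return the most frequent training label; predicting it everywhere is the baseline."""
    return Counter(train_labels).most_common(1)[0][0]


def fit_threshold(train_rows):
    """Fit a deliberately simple model: the midpoint between the two class means."""
    class_zero = [x for x, label in train_rows if label == 0]
    class_one = [x for x, label in train_rows if label == 1]
    mean_zero = sum(class_zero) / len(class_zero)
    mean_one = sum(class_one) / len(class_one)
    return (mean_zero + mean_one) / 2


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Synthetic toy data (an assumption for illustration): one feature x in [0, 1),
# with label 1 when x > 0.5.
rng = random.Random(0)
rows = [(x, int(x > 0.5)) for x in (rng.random() for _ in range(200))]

train, validation, test = three_way_split(rows)
validation_labels = [label for _, label in validation]

# Baseline: always predict the majority class seen in the training set.
baseline_label = majority_baseline([label for _, label in train])
baseline_accuracy = accuracy([baseline_label] * len(validation), validation_labels)

# Model: fitted on the training set only, compared on the validation set.
threshold = fit_threshold(train)
model_accuracy = accuracy([int(x > threshold) for x, _ in validation], validation_labels)

print(f"baseline accuracy on validation: {baseline_accuracy:.2f}")
print(f"model accuracy on validation:    {model_accuracy:.2f}")

# The test set stays untouched until the very end, for one final check.
test_accuracy = accuracy([int(x > threshold) for x, _ in test], [y for _, y in test])
print(f"model accuracy on test:          {test_accuracy:.2f}")
```

The point is not the model itself but the routine: fit on the training set, make choices on the validation set, report once on the test set, and always show the number next to a baseline so that readers can judge what the model actually adds.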
Tip 3: Be smart, learn, and repeat.
Data science is a highly complicated and fast-evolving field where different specializations come together. Every data scientist has a different background, and continuous learning is part of the deal. This means that a personalized learn-and-grow path can be of great benefit. What it looks like depends on your degree or starting point, experience, domain knowledge, and background in math, statistics, programming, engineering, communication, and presentation skills. Discuss with your peers what you can improve, and make a personal roadmap of what to learn and how to learn it. Note that taking random data science courses may be interesting but can be misaligned with the mission of the company or even your personal growth path.
The ability to learn is a muscle everyone should keep practicing, and being a life-long learner is probably the best gift you can give to yourself.
There is always more to learn.
The path to success is not one particular course on the web that you need to take; it may require years, and likely decades, of dedication, hard work, and struggle. Invest in yourself, learn the fundamentals, go beyond shallow knowledge, specialize, and realize that success is the accumulation of many small steps, of which the modeling part is only one.
Let me try to break this down into sub-parts.
- Communication. You may create the most ingenious method, but you need to articulate complex technical concepts effectively to both technical and non-technical stakeholders.
- Problem-solving. You should be able to approach complex issues with a structured and systematic mindset. Think critically, analyze problems from multiple angles, and propose effective solutions. You can easily practice by helping the community on websites such as Stack Overflow.
- Mentoring. As you grow in your career and seniority, you should be able to mentor and coach developers. Provide guidance, share best practices, and help them enhance their technical skills.
- Adaptability. You should not stick to the one technique that you know but embrace new technologies, methodologies, and tools. You should be able to learn quickly and adapt to changing project requirements or industry trends.
- Time management. Manage your time effectively. Prioritize tasks, meet deadlines, and balance competing demands. Stay focused on delivering quality work.
Be Safe. Stay Frosty.
Cheers E.
References
- Michael A. Lones, How to avoid machine learning pitfalls: a guide for academic researchers, arXiv:2108.02497
- Tessa Xie, Data Science career mistakes to avoid, 2021
- Is data scientist becoming an obsolete job?, Data Science Central