Data Science Strategic Guide — Part 1

Get Smarter with Data Science — Tackling Real Enterprise Challenges

Take your Data Science Projects from Zero to Production

Dipanjan (DJ) Sarkar
Towards Data Science
17 min read · Dec 19, 2018


Introduction

The ‘Data Science Strategic Guide — Get Smarter with Data Science’ is envisioned as a series of articles that serve as a strategic guide to the essential challenges, pitfalls and principles to keep in mind when implementing and executing data science projects in the real world. We will also cover how you can get maximum value from data science and artificial intelligence by focusing on very real perspectives and staying far away from the hype. This should enable you to drive success in your own domain! The focus here is on real-world projects executed in industry; however, some of these principles also apply to research.

Source: https://xkcd.com/

Typically, most of my articles are hands-on oriented, targeted towards people building systems and actually doing data science. This guide, however, is aimed at a broader audience including executives, business leaders, architects, analysts, engineers and data scientists. In my opinion and experience, you need all of them to execute data science projects successfully and get the maximum value!

Are you tired of your data science projects stuck at just being proof-of-concepts?

Don’t you love it when your projects finally get pushed out to production and start working on real-world data?

Do you love providing actionable insights from data to drive business goals?

Do you want to build a successful and effective data science team?

Are you looking to build an effective data science and AI strategy for your team?

If you answered yes to at least one of these questions, this guide is for you! We will be covering the following major aspects in this guide through a series of articles:

  • Part 1 — Current Challenges and Potential Solutions
  • Part 2 — Building Effective Data Science Teams
  • Part 3 — Process Models for Data Science Projects
  • Part 4 — Effective Data Science Pipelines
  • Part 5 — Driving Success in the Industry

All opinions expressed in this guide are based on distilled personal experiences and by looking at industry trends and talking to industry experts. The intent of this guide is not to spread any bias or prejudice but to just give a clear idea of the core components to examine when executing a data science project in the enterprise. I still don’t consider myself to be an expert in this domain (there is so much to learn!) but I hope this guide helps you gain some useful perspective towards effective execution of data science projects.

In this particular article, we will focus on some very real challenges plaguing the industry with regard to executing data science projects and some potential solutions.

Current Challenges and Potential Solutions

Most data scientists (including myself) love the availability of ready-to-use tools, libraries and frameworks, and we have our own personal preferences when solving different problems. However, this ad-hoc usage of tools and methodologies, coupled with personal preferences, increases the effort required when we actually try to deploy and maintain data science project artifacts and assets.

Source: https://xkcd.com/

In this section we will look at some of the most important challenges and pitfalls that often prevent data science projects from ever coming out of the proof-of-concept phase, along with some guidance on how to tackle them.

Fragmented Technology Landscape

The technology landscape for data science tools is huge and getting bigger every day. Throw in big data, artificial intelligence and several more buzzwords and the landscape of tools, libraries and frameworks grows even larger. From a data scientist’s perspective, it’s all about using the best possible (or easiest to use and comprehend) tool for solving a problem. Methods, whether statistical, machine learning or deep learning, are all built on top of math, statistics and optimization, so as long as the implemented algorithms are the same, the specific tool or framework doesn’t matter that much to them. I’m sure you have heard the saying, ‘Don’t focus on tools and frameworks. Focus on the problem to be solved!’.

Source: https://xkcd.com/

This is not wrong. Typically a data scientist (like myself) might start by looking at the data, opening up R or Python and writing code to do some analysis or build a model. Based on usage patterns, we typically see two extremes among folks doing data science. Some like using programming languages like R, Python, Scala or Java, coupled with frameworks and libraries that enable them to do complex analyses with ease. Others prefer ‘no-code’, graphical drag-and-drop tools like KNIME, RapidMiner, Weka and so on.

A ‘use the tools I know best’ mentality can be destructive for a data science project due to the lack of standardization and minimal architecture, especially if the people who built a project leave the company after it is deployed. Building a proof-of-concept is perfectly fine, but when you need to move your project to production, also think about the following aspects: usage patterns, scalability, standardized architecture, integration with existing systems and maintainability.

Tendency to Follow the Herd (and Hype)

Let’s face it, we have all seen this happen in the industry, especially with all the hype around artificial intelligence, automated machine learning, citizen data scientists and more. A proper data science and AI strategy needs to start with C-level executives sitting down and defining their key business and strategic objectives, the availability and necessity of data, and how data science and analytics can help them achieve these goals. They would of course need to consult with domain experts and employees, maintaining a key balance between technology and business. Remember, data science in combination with existing business processes is what drives success: the two work in tandem, not in isolation.

Source: https://xkcd.com/

Running blindly after the hype of big data, data science or deep learning and hastily building strategies without a purpose can be detrimental to the time, money and effort invested in these areas. Strategies like ‘Train the entire workforce on AI’, ‘Adopt AI across all business verticals by a set year’ or ‘Create a unified data lake for all enterprise data’ are sure-shot red flags if there is no end goal or outcome defined as to why they are being pursued. This is not an exaggeration: check out this article on 3 Reasons Why Data Lakes Have Not Delivered Business Value after the huge hype of the past couple of years, and this article from CIO.com, which tells us,

“Data lakes will need to demonstrate business value or die”

Always tie back any strategy to well-defined and clear outcomes and key performance indicators which can be used to measure them. These do not have to be set in stone but shouldn’t be too ambiguous either. Not an easy task but needs to be done.

Lack of Reproducibility and Reusable Artifacts

Data science analyses and code need to be reproducible. How many times have we heard these lines: ‘The model was giving me 90% accuracy last week’ or ‘The code was working fine on my machine!’. Hey, even I’ve been there, guilty as charged! Something as simple as setting random seeds in your programming languages and frameworks goes a long way towards reproducible analyses. Beyond this, versioning your models, data and feature sets is essential if you want to track, over time, which model trained on what data gave what performance metrics, and then promote the best model or revert to a previous one. Virtual environments and containers are great ways of making sure you are not stuck in ‘package dependency hell’.
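To make the seeding point concrete, here is a minimal sketch (the helper name `set_global_seed` is purely illustrative) of seeding the random number generators a project depends on, so that two runs of the same analysis produce identical results:

```python
import random

def set_global_seed(seed: int = 42) -> None:
    """Seed the RNGs a project depends on so analyses are repeatable."""
    random.seed(seed)
    # If you also use NumPy, PyTorch or TensorFlow, seed them here too, e.g.:
    # np.random.seed(seed); torch.manual_seed(seed); tf.random.set_seed(seed)

set_global_seed(42)
first_run = [random.random() for _ in range(3)]

set_global_seed(42)
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # identical draws on both runs
```

The same discipline extends to versioning: keeping code in Git and tracking data and model versions alongside it makes the ‘which model, on what data’ question answerable months later.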

Source: https://xkcd.com/

Jupyter notebooks are a great way of enabling reproducible research and data analysis and are often the tool of choice for data scientists. I would definitely recommend anyone interested to check out this wonderful article, which talks about how notebooks can enable reproducibility in business.

Even Jake Vanderplas has a nice collection of videos and tutorials for reproducible data analysis with Jupyter notebooks in his article. Another interesting paper to check out is ‘Ten Simple Rules for Reproducible Research in Jupyter Notebooks’ by Adam Rule et al. which is available here. They talk about a set of rules to serve as a guide to scientists with a specific focus on computational notebook systems.

Source: https://arxiv.org/abs/1810.08055

But everything is not as perfect and rosy as it seems; there are several caveats and points to remember when working with Jupyter notebooks. I recommend everyone view Joel Grus’s wonderful talk, ‘I don’t like notebooks’, which was presented, ironically, at JupyterCon 2018!

Another important aspect here is reusability. Given that many data scientists like to work alone in an ad-hoc manner, there is a lack of proper collaboration. Projects and people churn at a more rapid pace than ever before, so if your project is built around a bunch of undocumented code artifacts and scripts lying around on different systems, you are in for a rude awakening when the developers of those artifacts suddenly leave. Besides this, there are often usage patterns that repeat across projects, like data extraction, ETL (extract-transform-load), feature engineering and even modeling on the same datasets. Developing reusable artifacts and components and making them easily accessible in a common repository will save a lot of time in the future, rather than data scientists re-creating them for each and every new project.
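As a small illustration of the reusability point, a shared repository of composable transformation steps (the function names below are hypothetical) beats every project re-implementing the same feature engineering from scratch:

```python
def scale_min_max(values):
    """Reusable feature-engineering step: rescale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def run_pipeline(data, steps):
    """Compose shared steps so new projects assemble, rather than rewrite, pipelines."""
    for step in steps:
        data = step(data)
    return data

print(run_pipeline([10, 20, 30], [scale_min_max]))  # [0.0, 0.5, 1.0]
```

Each documented, tested step in the common repository is one fewer undocumented script lying around on someone’s machine.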

Lack of Collaboration — Breaking The Silo Mentality

Data Scientists often like to work at their own pace and in an ad-hoc manner doing experiments and analyzing data. This helps with creative thinking and innovation. While this is not bad at all, we definitely need to start collaborating and sharing more about the work we are doing and also work in teams (sometimes pair programming works really well!).

Source: https://xkcd.com/

One of the best ways is to make sure there are regular meetings around the work being done in a project, with specific well-defined deliverables and tasks for everyone, and a mix of tasks where a data scientist can work alone as well as in tandem with other data scientists and even engineers. This helps foster knowledge sharing and collaborative work, and keeps the project from sinking if someone suddenly leaves.

A more critical pitfall to avoid is ‘The Silo Mentality’: a mindset present when certain departments or sectors do not wish to share information with others in the same company.

I am sure a lot of you have faced this situation before and face it even to this day. People and teams need to realize that it is only through sharing and collaboration that great things can be built, problems solved and projects made successful. A silo mentality leads to lost productivity and efficiency and reduced morale, and eventually projects end up getting scrapped. Collaboration, working towards a common goal, and proper incentives and motivation can go a long way towards fostering a healthy and productive work culture for data science. Encourage engineers, analysts, scientists and architects to work together on a project while staying clear about their own specific tasks.

Outsourcing Critical Data Science projects to Third Party Firms

Don’t get me wrong, there are a lot of pure-play analytics firms with the right expertise to provide key insights to companies looking to be more data-driven. However, if you are planning to invest long-term in reaping the benefits of data science for driving business decisions, build a good data science team internally (or several, depending on company size, verticals and so on). The benefits of this are manifold. You wouldn’t have to worry about firms taking you for a ride, or about consultants conveniently disappearing once their contract is up, typically at the end of the proof-of-concept or MVP (minimum viable product) stage, leaving you to suffer through the pain of maintaining their projects. You would also never get the same level of trust that you get from your own employees dedicated to the cause. That being said, if you have a lot of data science projects in the pipeline and enough budget to spare, feel free to offload less critical, short-term projects to these pure-play third-party analytics firms.

Besides this, please get over the much-hyped concepts of citizen data scientists and automated machine learning. Let’s face the facts here! Citizen data scientists can enable you to be more data-driven, but not everyone can solve tough problems for you. You need real data scientists building models and systems and churning out insights. Automated machine learning is not a silver bullet for solving every problem. Don’t be disillusioned by the idea that you can just connect a data source to one of these tools and start getting ready-made insights. Many companies have made this mistake and are now stuck with a suite of tools at their disposal, not knowing what to do with them. You need to know how to use these tools effectively for them to really cut down development time in your data science projects.

Technical Debt — Lack of Standards and Project Architecture

This is true for any technology and engineering company associated with building software and products. People doing data science projects often think, ‘These are software engineering methodologies and principles — we are well above this’. No, you are definitely not above all this, and failing to realize it can sink your really complex deep learning model, leading to yet another failed proof-of-concept that never makes it to production.

Source: https://xkcd.com/

Enable proper coding standards across teams. Build reusable assets where necessary, but don’t overdo it. A key aspect is to focus not just on tools and frameworks for data science but also on non-functional requirements (NFRs). Popular NFRs include scalability, maintainability, availability and so on. Just as we have enterprise architecture for software projects, we need architects, and data scientists working with them, to define a project architecture (solution, application and data) for each data science project that has an end goal of being deployed to production. A standard layered enterprise architecture is depicted in the following figure.

Layered Enterprise Architecture Hierarchy (Source: IBM)

A lot of this is taken from an excellent guide called ‘Architectural thinking in the Wild West of data science’, which talks about many of these challenges in data science projects as well as the potential advantages of defining a project architecture.

Typically, an enterprise architect defines standards and guidelines that are valid across the entire enterprise. A solution architect works within the framework that the enterprise architect defines; this role determines what technological components fit specific projects and use cases. Optionally, you also have application architects, who focus on components pertaining to the application within the framework of the solution architecture, and data architects, who define the data-related components. Often the solution architect fills both of these roles.

Usually, data scientists rarely interact with the enterprise architect and work more directly with the solution architect (and application/data architects). Data scientists should be able to envision an end-to-end solution architecture for their projects and even provide essential inputs to architects to improve and transform the enterprise architecture over time. Evolution is key, and since data science is an emerging and innovative field, you will need to step outside the regular software project architecture templates for data science projects.

Force-fitting Tools and Industry Standard Processes

Just because a process model works for specific projects or verticals doesn’t mean it will be effective for data science. A lot of enterprises, particularly companies with a larger employee base, tend to use a lot of processes in their daily operations and project execution. We used to have the waterfall model. Now we have Agile, Kanban, Scaled Agile, Scrum, Scrumban and so many more variations! The intent here is not to follow these process frameworks to the letter but to adapt them into something that works best for your team and the type of projects you run. Working on data science that ties directly into product development? Maybe focus more on an agile mindset driven by Scrum. Working on ad-hoc projects that are more exploratory or consulting in nature? Maybe Kanban or Scrumban will suit you better. The key idea is to start with a set of standards and have the flexibility to evolve over time. Be open to change, but change something for a reason!

Source: https://xkcd.com/

With regard to tools and frameworks, as we have mentioned before, the right tool or framework does depend on the problem to be solved, but always have a set of standards and guidelines established for usage, starting from defined workflows, patterns, coding standards and architecture. Also, don’t force-fit tools to problems. No, automated machine learning will not automatically give you the best model or insights from raw data. You still need to understand the business problem, the success criteria and the right way to use these tools to help solve your problems effectively. Just as you would never optimize every model for maximum accuracy regardless of the problem, don’t end up using a tool without a sense of why you want to use it.

The Gap between Data Science and Engineering

Working as data scientists, we often tend to ignore the fact that the pure data science or machine learning component is just a very small part of all the components that make up the overall project or system. The NIPS 2015 paper, ‘Hidden Technical Debt in Machine Learning Systems’, explains this perfectly in the following visual.

Major components of a Data Science Project / System

The ML Code component is the very small black box you see in the figure above. But data scientists are often more focused on developing algorithms, building models, and exploring and analyzing data. Thus, more often than not, they end up forgetting about the overall big picture, which inevitably leads to the project staying a proof-of-concept and never being deployed. A classic example is the lessons learnt from the Netflix Prize Challenge. The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for films based on previous ratings. If you check out the lessons they learnt, mentioned in their paper ‘Mining Large Streams of User Data for Personalized Recommendations’, they had to extract the two core algorithms from the 107 models in the winning solution! Then they had to scale these to handle more than 5 billion ratings. Besides this, they also mention not including several models due to the sheer engineering effort needed to actually put them into production.

“At Netflix, we evaluated some of the new methods included in the final solution. The additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.”

This clearly tells us that for the whole system to function properly, even data scientists need a holistic view of the problem and the end-to-end solution. For a data science project to get deployed in production, you need a combination of data science, data engineering, software engineering, architecture, infrastructure, monitoring and quality checks. Ignoring engineering aspects and saying ‘I just build machine learning models’ is no longer something any data scientist working in the industry should be saying. Learn to step up to the plate and focus on the engineering aspects of your project as and when needed, and leverage the engineering team to the maximum; constant collaboration drives good synergies.

Ignoring Quality Checks and Operations

Often, people working in data science teams think their work is done after training and evaluating their model on the ‘so-called’ test dataset. The lack of quality, monitoring and control checks can lead to a data science project collapsing even after deployment, and the period when the project is actually being used is often more critical than the development phase!

Source: https://xkcd.com/

Always remember to validate your models using effective strategies like held-out validation datasets and cross-validation. Watch out for issues like data leakage and target label leakage. Another point to remember: when tuning models, don’t repeatedly tune on the training data while checking performance on the test data, or you end up indirectly fitting your model to the test data! Always tie data science and machine learning evaluation metrics back to your business metrics and success indicators. A churn model might require more recall and a fraud detection model more precision, or even vice versa, depending on what the business needs! Remember to keep revisiting your project’s success criteria and indicators over time, because that is key to your projects getting deployed in production.
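To illustrate the cross-validation point, here is a minimal sketch of k-fold index generation (libraries like scikit-learn provide this via `KFold`, but the mechanics are simple): every sample serves as validation data exactly once, which gives a far more honest performance estimate than a single train/test split.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Distribute samples as evenly as possible across the k folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(kfold_indices(10, 5))
# 5 folds; every sample appears in exactly one validation fold
print(len(folds), sorted(i for _, val in folds for i in val))
```

In a real project you would shuffle (with a fixed seed!) before splitting, and use stratified folds for imbalanced targets.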

Are there other ways to evaluate and keep checking the quality of your deployed (or to-be-deployed) models? Offline and online scenarios typically warrant different quality and testing methodologies. A nice workflow for this is described in Netflix’s paper, ‘Mining Large Streams of User Data for Personalized Recommendations’.

Testing Online and Offline Models (Source: Netflix)

Model metrics, hypothesis testing and A/B testing are key tools you can leverage for quality checks. Besides this, you should not neglect operations. This means ensuring proper model performance monitoring systems are deployed so you can track model performance over time on live data. Remember, the project doesn’t end once the model is built and evaluated. Having a continuous integration and delivery (CI/CD) framework also helps to continually make sure your codebase is working fine and doesn’t break suddenly without anyone knowing about it.
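As a sketch of the A/B-testing point, the significance of a difference in conversion rates between a control and a variant can be assessed with a two-proportion z-test (the conversion numbers below are made up for illustration):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference between two conversion rates (pooled SE)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Control converted 120/1000 users; the variant converted 150/1000
z = two_proportion_z(120, 1000, 150, 1000)
print(round(z, 2))  # |z| > 1.96 would indicate significance at the 5% level
```

The same statistic feeds naturally into a monitoring dashboard: compute it on live traffic and alert when a deployed model’s metric drifts significantly from its offline baseline.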

Conclusion

I hope this article has given you some perspective on typical real-world challenges in the industry with regard to executing data science projects. The recommendations presented here are just general guidelines and are not set in stone. Since data science has evolved into a cross-industry discipline, each of you may find that specific aspects work better for you, but the overall set of challenges to keep in mind should be covered in this article. I sincerely wish all of you success in executing your own data science projects, and don’t hesitate to tell me what works best for you and whether there is anything I forgot to mention here!

What’s Next?

We have a lot of interesting content planned for the Data Science Strategic Guide Series! In the next article, we will be looking at how to build effective data science teams based on perspectives from industry experts, enterprises and your friendly neighborhood author — myself! Stay tuned for some interesting content!

Have feedback for me? Or interested in working with me on research, data science, artificial intelligence or even publishing an article on TDS? You can reach out to me on LinkedIn.
