Office Hours

Given the diverse range of skill sets required, data science projects are prime examples of collaborative technical work. However, collaboration in general has become much more difficult as the world navigates the global pandemic and most people continue to work from home for the foreseeable future. Increasing restrictions on immigration and H-1B visas are exacerbating the problem, preventing companies from employing skilled workers from outside the U.S.
The net-net is that collaborative data science from a distance is likely here to stay. So how can enterprises adapt to this new normal? Here are five best practices that make distanced collaboration in data science projects work.
Set yourself up for success
Building models requires a lot of time and effort. Data scientists can spend weeks just trying to find, capture and transform data into decent features for models, not to mention many cycles of training, tuning, and tweaking models to make sure they work. Yet despite all this, few models ever make it into production. According to VentureBeat AI, just 13 percent of data science projects make it into production, and in terms of delivering value to the business, Gartner predicts that only 20 percent of analytics projects will deliver business outcomes that improve performance. Why the high failure rate?
Because, at the highest level, data science and the business simply aren’t connected. To ensure a higher chance of success, data science teams should start each new project by answering three critical questions:
- Do we have a business problem with a clear path to value?
- Is the problem feasible for us to solve?
- Can the business make the necessary changes resulting from data science insights?
Increasing the success rate of data science projects requires a collaborative partnership between data science teams and decision-makers to ensure that models are appropriate and can be adopted. By standardizing these questions before development starts, data science teams can build a repeatable approach that identifies value drivers for potential business problems, gain business support during the development of their models, and work with the business to increase the chances that their model results are adopted.
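One way to make the approach repeatable is to encode the pre-flight questions in a lightweight checklist that every project fills out before work begins. The sketch below is illustrative only: the class name, fields, and the example project are assumptions, not a prescribed tool.

```python
# Minimal sketch of a standardized project kickoff checklist (names are illustrative).
from dataclasses import dataclass, field

@dataclass
class KickoffChecklist:
    """Captures the three pre-flight questions before any modeling work starts."""
    project: str
    clear_path_to_value: bool = False
    technically_feasible: bool = False
    business_can_act_on_insights: bool = False
    notes: dict = field(default_factory=dict)

    def ready_to_start(self) -> bool:
        # Only green-light the project when every question has a "yes".
        return all([self.clear_path_to_value,
                    self.technically_feasible,
                    self.business_can_act_on_insights])

checklist = KickoffChecklist(
    project="churn-prediction",
    clear_path_to_value=True,
    technically_feasible=True,
    business_can_act_on_insights=False,
    notes={"blocker": "retention team has no process to act on churn scores yet"},
)
print(checklist.ready_to_start())  # False: revisit with the business before building
```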
Architect and document collaborative projects
Executing a successful data science project involves a wide range of skill sets and job roles. Data science managers, data science teams, data engineers, deployment & validation engineers, IT administrators, data-to-business liaisons, and business customers are just some of the roles that play a part. Consider that you’re not only collaborating with people right now; you’re also collaborating with people in the future.
Data science is not like software engineering; it’s probabilistic and more research-based, which means that not every data science project will succeed. However, every project represents either a lesson learned or inspiration for a project six months from now.
One key to all of this is documentation: documenting not only the business context, the deliverable, and the code, but also the process, intermediate goals and gates, key players, and key insights.
Further, establishing a set of best practices across heterogeneous teams allows those teams to collaborate, even though they may be working on entirely different projects. These best practices include having a common set of well-defined project stages and a common directory structure in each project, as well as more thoughtful mandates, such as a common pre-flight checklist that must be executed before starting a project.
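A small scaffolding script is one way to enforce that shared structure. The stage and folder names below are assumptions for illustration; the point is that every project starts from the same layout and records its stages where reviewers can find them.

```python
# Illustrative sketch: scaffold a common directory layout and project stages
# so every team starts from the same structure (folder and stage names are assumptions).
from pathlib import Path

PROJECT_STAGES = ["ideation", "data-acquisition", "exploration",
                  "modeling", "validation", "delivery"]
STANDARD_FOLDERS = ["data/raw", "data/processed", "notebooks",
                    "src", "models", "docs", "reports"]

def scaffold_project(name: str, root: str = ".") -> Path:
    project_dir = Path(root) / name
    for folder in STANDARD_FOLDERS:
        (project_dir / folder).mkdir(parents=True, exist_ok=True)
    # Record the agreed stages so anyone can see where the project stands.
    (project_dir / "docs" / "STAGES.md").write_text(
        "\n".join(f"- [ ] {stage}" for stage in PROJECT_STAGES) + "\n"
    )
    return project_dir

scaffold_project("heart-disease-risk")
```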
Finally, slow down. Too often data scientists and data science leaders run their organizations in a mad dash for the end product, but they forget to build intellectual capital and organizational knowledge along the way. One thing I like to say is that data science needs to operate more like a life sciences research lab: If you are trying to create a vaccine for COVID-19, for example, and you fail five times, you’re not going to throw what you learned in the trash, right? That’s golden information. You know what doesn’t work, and that can lead you to what is going to work in a more streamlined fashion.
Standardize knowledge sharing
So those five failures we just talked about? That’s called "prior art," and it’s one of the most valuable assets in innovation in general and in data science specifically. Without a way to ensure that everyone in your company can access this information, your data scientists are going to be reinventing a lot of wheels. Different teams will unwittingly reproduce the same model without visibility into the work that others are generating.
Providing an easy way to share knowledge – code, files, projects, and models – across business units and other siloes eliminates that potential redundancy and also facilitates learning. For example, if I need to build a predictive heart disease application, I should be able to search my organization’s internal knowledge repository for "heart disease" and be able to view every data science project anyone else in the company has done previously on heart disease. In addition, I should be able to see the data sources they used, the modeling approaches they tried, the delivery form factors they considered, the software environment versions they used, the insights they generated along the way, and much more. And even if those projects aren’t exactly the same as mine, I can likely learn a lot from them.
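In practice, that kind of search only works if every project registers a small, consistent metadata record. Here is a hedged sketch of such a "prior art" index; the file name, field names, and example values are assumptions, and a real platform would offer richer search than simple keyword matching.

```python
# Hedged sketch of an internal prior-art index: each project appends a metadata
# record, and anyone can search past work by topic. Field names are assumptions.
import json
from pathlib import Path

REGISTRY = Path("knowledge_registry.jsonl")

def register_project(record: dict) -> None:
    with REGISTRY.open("a") as f:
        f.write(json.dumps(record) + "\n")

def search_projects(keyword: str) -> list:
    if not REGISTRY.exists():
        return []
    hits = []
    with REGISTRY.open() as f:
        for line in f:
            record = json.loads(line)
            if keyword.lower() in json.dumps(record).lower():
                hits.append(record)
    return hits

register_project({
    "name": "heart-disease-readmission",
    "tags": ["heart disease", "classification"],
    "data_sources": ["claims_2019", "ehr_extract_v3"],
    "approaches": ["logistic regression", "gradient boosting"],
    "environment": {"python": "3.8", "scikit-learn": "0.23.2"},
    "key_insights": "prior admissions dominate feature importance",
})
print(search_projects("heart disease"))
```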
Create reproducible units of work
How many times have you thought, "I’d like to run that project that I created six months ago and integrate it into something I’m doing now," and you can’t find it? Or when you do find it, how often do the code blocks no longer run due to changes to package versions or operating system upgrades? Probably too many to count. That’s exactly why you need to make sure that you create reproducible units of work – and in four areas in particular: code, files, environments, and compute. Kubernetes and Docker are the building blocks that make this kind of reproducibility possible. In the hands of a well-orchestrated data science platform, experiments and models have full provenance and can be reproduced with ease.
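Even without a full platform, you can start by capturing provenance alongside every model run. The sketch below records the code version, package environment, and compute details; it assumes the project lives in a git repository, and the output file name is arbitrary. It complements, rather than replaces, freezing the environment itself in a Dockerfile or lock file.

```python
# Minimal provenance sketch: capture code version, package environment, and
# compute details alongside a model run so it can be reproduced later.
import json
import platform
import subprocess
import sys
from datetime import datetime, timezone
from importlib import metadata

def capture_provenance(path: str = "provenance.json") -> dict:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Assumes the working directory is a git repository.
        "git_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip() or "unknown",
        "python": sys.version,
        "platform": platform.platform(),
        # Pin every installed package version, the same information a
        # Docker image or lock file would freeze for you.
        "packages": {d.metadata["Name"]: d.version for d in metadata.distributions()},
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return record

capture_provenance()
```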
Adopt clear validation and deployment frameworks
You’ve got your model created and ready to launch into the world, but there’s one final hurdle: code review. Simply making sure that code reviews are clearly defined and implemented is a good first step. Beyond that, implementing concise performance testing pipelines may seem like (and is) quite a bit of work, but putting in that effort on the front end will pay off in spades in customer satisfaction. This goes hand-in-hand with ensuring that projects meet service level agreements by conducting user acceptance tests.
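A performance gate in the review pipeline can be as simple as a pair of automated tests that must pass before deployment. The example below is a sketch under assumptions: the thresholds, file paths, and use of scikit-learn and joblib are placeholders to adapt to your own stack and SLA.

```python
# Hedged example of a pre-deployment gate: the candidate model must clear a
# minimum quality metric and an assumed latency SLA. Paths and thresholds are
# placeholders, run with pytest as part of the review pipeline.
import time

import joblib
import pandas as pd
from sklearn.metrics import roc_auc_score

MIN_AUC = 0.80              # quality bar agreed with the business up front
MAX_LATENCY_SECONDS = 0.1   # assumed per-prediction SLA

def test_model_meets_quality_bar():
    model = joblib.load("models/candidate.joblib")
    holdout = pd.read_csv("data/processed/holdout.csv")
    scores = model.predict_proba(holdout.drop(columns=["label"]))[:, 1]
    assert roc_auc_score(holdout["label"], scores) >= MIN_AUC

def test_model_meets_latency_sla():
    model = joblib.load("models/candidate.joblib")
    sample = pd.read_csv("data/processed/holdout.csv").drop(columns=["label"]).head(1)
    start = time.perf_counter()
    model.predict(sample)
    assert time.perf_counter() - start <= MAX_LATENCY_SECONDS
```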
Finally, monitoring your models to ensure that they actually work is incredibly important. Many times, models suffer from "data drift" in production: You built the model on historical training data, but in production, your model is seeing new data feeds. Are those inputs drifting or changing the outcome? Using real-time monitoring and alerts can help you keep track of data drift and correct the model before it drifts too far afield.
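One common drift check compares each production feature against its training distribution. The sketch below uses a two-sample Kolmogorov-Smirnov test; the significance threshold, file paths, and the print-based "alert" are assumptions, and a real monitoring setup would page a human or retrain automatically.

```python
# Illustrative drift check: compare each live feature against its training
# distribution with a two-sample KS test and flag large shifts.
import pandas as pd
from scipy.stats import ks_2samp

def detect_drift(train: pd.DataFrame, live: pd.DataFrame, alpha: float = 0.05) -> dict:
    drifted = {}
    for column in train.columns:
        if column in live.columns:
            stat, p_value = ks_2samp(train[column].dropna(), live[column].dropna())
            if p_value < alpha:
                drifted[column] = {"ks_statistic": round(stat, 3), "p_value": p_value}
    return drifted

# Example usage with whatever batch the production model saw most recently.
train_df = pd.read_csv("data/processed/training_features.csv")
live_df = pd.read_csv("data/monitoring/latest_batch.csv")
alerts = detect_drift(train_df, live_df)
if alerts:
    print(f"Data drift detected in {len(alerts)} feature(s): {sorted(alerts)}")
```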
Since it’s pretty clear that distanced collaboration is here to stay, why not bring your data science practice along with it as the world continues to rapidly evolve? Collaboration entails so much more than data scientists working together on a project; it involves collaborating with the business, proper documentation of code and processes, building a library of prior art, ensuring reproducibility, and architecting collaborative validation pipelines. Using the best practices outlined in this article can help you on your way to data science success.