Build Data Science Center of Excellence to Manage the Complete Model Life Cycle

Yu Zhou
Towards Data Science
4 min read · Mar 17, 2018


This wave of data technology revolution remains strong, and companies of all sizes are actively building or expanding their Data Science teams. In the search for the best way to organize Data Science talent, a hybrid model that pairs one central platform Data Science team with multiple specialized teams embedded in business units has emerged as the proven strategy. The article below, published on Towards Data Science, explains the hybrid model well.

With no standard name yet, the central platform Data Science team can be called the Center of Excellence, because it disseminates good practices and contributes to data infrastructure. Although this hybrid model has enabled many companies to achieve a great return on investment in Data Science, there is much more potential to push this organizational model to a higher level.

One potential area I see is granting a “reviewer” role to the Center of Excellence, allowing it to become more than an enabler. As organizations go from model poor to model rich, many new models are created and used every day across teams, and sometimes low-quality models get shipped and lead to bad consequences afterward. We now need a trusted place, such as the Center of Excellence, to take a hard look at the models around us, especially the mission-critical ones.

All Models Are Wrong

In one of my work projects, I was evaluating a model created by a colleague. I found that one input variable was barely contributing to our prediction: comparing the prediction results with and without that variable, I consistently got almost identical results. Because this input variable was the most computationally expensive feature in the project, I proposed removing it for the sake of good modeling and engineering practice. My colleague suggested keeping the variable, because a senior leader on our team had conceived that feature; it was politically favorable to keep it. In the end, the feature was kept.
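That kind of with/without comparison is a simple ablation check. Below is a minimal sketch of how it might look in a scikit-learn style workflow; the dataset, model choice, and column names are hypothetical placeholders, not the project described above.

```python
# Minimal ablation check: compare cross-validated performance with and
# without one candidate feature. The CSV, target column, and feature
# name are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("training_data.csv")       # hypothetical dataset
candidate = "expensive_feature"             # the costly input in question

y = df["target"]
X_full = df.drop(columns=["target"])
X_reduced = X_full.drop(columns=[candidate])

model = RandomForestRegressor(n_estimators=200, random_state=0)

with_scores = cross_val_score(model, X_full, y, cv=5, scoring="r2")
without_scores = cross_val_score(model, X_reduced, y, cv=5, scoring="r2")

print(f"R^2 with {candidate}:    {with_scores.mean():.4f}")
print(f"R^2 without {candidate}: {without_scores.mean():.4f}")
# Nearly identical scores suggest the feature adds little signal
# relative to its computational cost.
```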

Including a non-informative variable may be no big deal, but deviating from the pursuit of excellence is. We can walk into serious consequences when flawed models get deployed and decisions are made on wrong data. George Box famously said that “all models are wrong, but some are useful.” Models have strengths and weaknesses that must be carefully assessed, and mistakes are likely in today’s long and complex model life cycle.

Words and critiques from trusted reviewers can help us better assess the state of our models: whether the data quality is acceptable, whether the model assumptions and selections are valid, whether the performance metrics are thoroughly tested, whether everything is documented and reproducible, and whether the business interpretation is consistent with the formula. When office politics take effect, the role of reviewers becomes even more valuable. As the HBR article below puts it, “analysts who are too deeply embedded in business functions tend to be biased toward the status quo.”
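To make those criteria concrete, here is a small sketch of how such a review could be captured as a structured checklist; the field names and the pass rule are assumptions made for illustration, not an established standard.

```python
# Illustrative review checklist; fields mirror the criteria listed above.
from dataclasses import dataclass

@dataclass
class ModelReview:
    data_quality_acceptable: bool
    assumptions_and_selection_valid: bool
    metrics_thoroughly_tested: bool
    documented_and_reproducible: bool
    interpretation_matches_formula: bool
    reviewer: str = ""
    notes: str = ""

    def approved(self) -> bool:
        # The model passes review only when every criterion above is met.
        return all((
            self.data_quality_acceptable,
            self.assumptions_and_selection_valid,
            self.metrics_thoroughly_tested,
            self.documented_and_reproducible,
            self.interpretation_matches_formula,
        ))


review = ModelReview(
    data_quality_acceptable=True,
    assumptions_and_selection_valid=True,
    metrics_thoroughly_tested=False,   # e.g. only checked on one holdout split
    documented_and_reproducible=True,
    interpretation_matches_formula=True,
    reviewer="center-of-excellence",
)
print(review.approved())  # False until every criterion is satisfied
```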

The Center of Excellence

One day, a collaborator told me he had built a model that performed very well. Knowing his development process was complex and barely reproducible, I found it hard to believe his statement. Had he received a thumbs-up from a third-party reviewer, I would have immediately known something great had really been created. In software development, there are people who develop and people who test. The Data Science community lacks such a practice just yet (please let me know if I am wrong).

Giving the reviewer role to the Center of Excellence is a direction; how to operate the review in the real world depends on teams, leadership, and culture. Perhaps multiple teams should shoulder this role together: data people, domain experts, and engineers. Nevertheless, I think we want to institutionalize high Data Science standards in order to maximize Data Science impact and prevent undesirable outcomes. This step will make the current hybrid model more connected and better governed.

Taking one step further, model life cycle management is still an open space. How to register, monitor, and retire Data Science artifacts (be it a model, a feature, a package, or an extraction) needs thoughtful strategy, strong leadership, and community support. The reviewer role discussed above is only one aspect of model life cycle management. The Center of Excellence has a role to play in this open space.
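As a thought experiment only, here is a minimal sketch of what an artifact registry with register, monitor, and retire operations could look like; the lifecycle stages and the schema are assumptions, not an existing tool or standard.

```python
# Toy artifact registry: register, transition, retire, and monitor artifacts.
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class Stage(Enum):
    REGISTERED = "registered"
    IN_REVIEW = "in_review"
    PRODUCTION = "production"
    RETIRED = "retired"


@dataclass
class Artifact:
    name: str
    kind: str        # e.g. "model", "feature", "package", "extraction"
    owner: str
    stage: Stage = Stage.REGISTERED
    created_at: datetime = field(default_factory=datetime.utcnow)
    history: list = field(default_factory=list)


class Registry:
    def __init__(self):
        self._artifacts = {}

    def register(self, artifact: Artifact) -> None:
        self._artifacts[artifact.name] = artifact

    def transition(self, name: str, stage: Stage, note: str = "") -> None:
        artifact = self._artifacts[name]
        artifact.history.append((datetime.utcnow(), artifact.stage, stage, note))
        artifact.stage = stage

    def retire(self, name: str, note: str = "") -> None:
        self.transition(name, Stage.RETIRED, note)

    def monitor(self):
        # List everything that is still live and who owns it.
        return [(a.name, a.kind, a.owner, a.stage.value)
                for a in self._artifacts.values()
                if a.stage is not Stage.RETIRED]


registry = Registry()
registry.register(Artifact(name="churn_model_v2", kind="model", owner="growth-team"))
registry.transition("churn_model_v2", Stage.IN_REVIEW, note="submitted to CoE review")
print(registry.monitor())
```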


Data Scientist @Cloudability. I am interested in everything data science; all my posts are my own. Twitter: @yuzhouyz