OPINION

3 Data Foundation Problems That You Should Fix In 2021

Let's strengthen your company's data foundation together

Pathairush Seeda
Towards Data Science
8 min read · Dec 15, 2020

Data analytics, data science, and data engineering have grown enormously in popularity over the last few years, setting a new standard for the industry. Every company now feels the need to invest in or establish a data office within its organization.

In 2020, it became standard to have a prediction model for marketing leads, to improve your check-in process with facial recognition, or to consult an elegant dashboard before making a business decision.

Exceptional use cases always come first to build momentum for the analytics trend. Executives want to see results before investing a massive amount of money in a new direction.

The technical problems are hidden beneath those use cases. When we do analytics alone or in a small group without a proper working standard, it is easy to make mistakes without noticing.

After a few years of working in the data field, I believe 2021 would be a good year to fix the data problems in our daily work. So, here is the list of problems I have faced since I started my data journey.

Data quality

When it comes to data science or data analytics, we usually talk about machine learning, predictive models, or KPI monitoring dashboards. We seldom talk about the data quality issues underlying those products.

We can see the trend for these terms in Google Trends. It is pretty obvious that, even in 2020, data quality is still overshadowed by the other topics.

Fig 1 — Google Trends for the keywords "data quality," "data science," and "machine learning." Image by author.

Since the start of 2020, the question I have been asked most about data is, "Is the underlying data of this report/dashboard correct?" I always have to summarize the numbers and compare them with the operations team's existing report to ensure their validity.

As a result, the first data problem you should tackle in 2021 is data quality. Here are the sub-topics you should look at first.

Do you already have a data quality checking system? If not, this is the best time to think about it. I read the article about data quality at Airbnb, and I admire how they rebuilt their architecture at scale. I plan to use it as a best practice to strengthen my current company's data foundation.

Monitoring

The first element I would like to point out is the data quality monitoring system. The ideal product would be a dashboard showing all the critical data quality metrics.

To help everyone in the organization know the current quality of the data, a centralized dashboard or a data quality newsletter is the answer.

First, we can build the data quality system from its smallest elements: the metrics. We can start with crucial dimensions such as the timeliness, completeness, and correctness of the data.

These metrics can be derived from the existing data, for example, the percentage of data completeness or the number of tables that arrive later than the agreed SLAs, as in the sketch below.
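
As a rough illustration, here is a minimal sketch of two such metrics in Python. The column name, table names, and SLA hour are hypothetical placeholders, not a fixed standard.

```python
from datetime import datetime

import pandas as pd

# A minimal sketch of two data quality metrics: completeness and
# SLA timeliness. All names and thresholds here are placeholders.

def completeness_pct(df: pd.DataFrame, column: str) -> float:
    """Percentage of non-null values in a column."""
    return 100.0 * df[column].notna().mean()

def late_tables(arrival_times: dict, sla_hour: int = 6) -> list:
    """Names of tables whose data landed at or after the SLA hour."""
    return [name for name, ts in arrival_times.items() if ts.hour >= sla_hour]

df = pd.DataFrame({"customer_id": [1, 2, None, 4]})
print(completeness_pct(df, "customer_id"))  # 75.0
print(late_tables({"orders": datetime(2020, 12, 15, 7, 30)}))  # ['orders']
```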

After that, we can put those numbers together on a dashboard to show the current state and historical trends of our data quality.

Besides, the dashboard can be published together with the metadata to give all the necessary information in one place. It will give the whole organization both trust and a comfortable feeling when using the data.

Lastly, here is the phrase you should remember to keep yourself aware of data quality issues:

Garbage in, garbage out

Alerting

In addition to monitoring, the system should catch errors and alert the data owner and related parties so they can fix the problem.

This should be done proactively rather than waiting for the user to raise the issue.

The alert notifications can be categorized into levels, much like log levels in code, ranging from debug to critical. You should align the SLAs with the alert levels, as in the sketch below.
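
Here is a minimal sketch of what SLA-aligned alerting could look like, reusing Python's standard log levels. The mapping from checks to severity is a hypothetical example.

```python
import logging

# A minimal sketch of SLA-aligned alerting that reuses the standard
# log levels. The check name and severity mapping are hypothetical.

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("data_quality")

def alert(check_name: str, passed: bool, critical: bool = False) -> None:
    """Log a failed check at a level that matches its SLA severity."""
    if passed:
        logger.info("%s: OK", check_name)
    elif critical:
        logger.critical("%s failed, page the data owner", check_name)
    else:
        logger.warning("%s failed, notify the team channel", check_name)

alert("orders_table_completeness", passed=False, critical=True)
```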

It would be so sad if the data went wrong without anybody noticing.

To measure the performance of the data quality system, the number of errors caught by the system should be used as a performance indicator for data owners and related parties.

An excellent way to signal the state of the data is to certify each data set. A good example is Airbnb's MIDAS certification. The certification explicitly shows the data consumer that the data is in good condition.

ML Deployment

The usual product of data science is a predictive model. When I started my journey three years ago, I delivered the prediction model's results in CSV format, and the model was saved as a pickle file for manual scoring by a data scientist.

That’s a traditional way to get things done.

Luckily, technology has developed so fast that today we have many libraries/frameworks, such as MLflow, for deploying machine learning models.

At my previous company, the engineering team developed a platform for automating predictive model scoring. They combined frameworks like the one mentioned above with in-house knowledge to create a unique platform.

It makes a data scientist's life easier: it contains several useful features and reduces the steps needed to deploy a model.

It takes care of both the feature store and the predictive model's metrics. But it still has gaps to improve. Here they are.

Monitoring

A deployed model needs to be monitored. The performance of a model can change for many reasons. A good blog post from Databricks describes the problems you can face after deploying a model. I highly recommend you read it.

In summary, there are three issues that can make predictions look weird after model deployment:
1) data drift, 2) concept drift, and 3) upstream data changes.
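
As a rough illustration of catching the first issue, here is a minimal sketch of data drift detection using a two-sample Kolmogorov-Smirnov test from SciPy. The 0.05 significance level and the synthetic data are placeholder assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

# A minimal sketch of data drift detection: compare a feature's
# training distribution to its live distribution. The 0.05 threshold
# is a common but arbitrary choice; tune it to your tolerance.

def has_drifted(train_values, live_values, alpha: float = 0.05) -> bool:
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha  # small p-value: the distributions differ

rng = np.random.default_rng(42)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)
live = rng.normal(loc=0.5, scale=1.0, size=5_000)  # mean has shifted
print(has_drifted(train, live))  # True
```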

Those problems can silently affect the model's predictions. We should plan to prevent them because our models are used for something valuable, like recommending products to end customers.

If the model goes wrong, it can cause a huge loss to our business.

The consistency of the prediction results is also important. For monitoring, there are two parts to be concerned with here:
1) the data, and 2) the model.

For the data side, we need to ensure quality by implementing data quality checks as in the previous section. The system should alert you to a data quality defect before you notice a deviation in the model's predictions.

For the model part, we can use a library like MLflow to track model metrics such as AUC, precision, and recall. Then we can set thresholds that trigger an alert when those metrics deviate from the standard you set, as in the sketch below.
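
Here is a minimal sketch of that idea with MLflow's tracking API. The metric values and the 0.75 AUC threshold are hypothetical; in practice they would come from scoring the deployed model on fresh labeled data.

```python
import mlflow

# A minimal sketch of model metric tracking with MLflow. The run
# name, metric values, and AUC threshold are all placeholders.

AUC_THRESHOLD = 0.75

with mlflow.start_run(run_name="daily_model_check"):
    metrics = {"auc": 0.72, "precision": 0.64, "recall": 0.58}
    mlflow.log_metrics(metrics)  # stored with the run for dashboards

    if metrics["auc"] < AUC_THRESHOLD:
        # Route this into the same alerting channel as the data checks.
        print(f"ALERT: AUC {metrics['auc']:.2f} is below {AUC_THRESHOLD}")
```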

You can then trace back the reason behind the changes with the help of the data quality checking system.

Finally, a dashboard or report can be created from the logged metrics. It helps track figures such as the number of models that meet the performance standard or changes in the models' prediction distributions.

Re-training the model

Here is another problem from my experience: the longer a model is used, the more likely it is to become stale.

You need a re-training feature in the ML deployment system. Otherwise, a lot of hard work will be waiting for the data scientists at the end of the ML pipeline.

Re-training means adding new data, or shifting the period of data you use, so the model adapts to customers' current behavior.

Pro tip: it's hard to define when to re-train the model. You can simulate several re-training periods and see when the model's performance drops, as in the sketch below.
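
Here is a minimal sketch of such a simulation. `train` and `score` stand in for your own fitting and evaluation functions, and the time-ordered batches are assumed to exist.

```python
# A minimal sketch of simulating re-training periods. `train`,
# `score`, and `batches` are hypothetical stand-ins for your own
# fitting function, evaluation function, and time-ordered (X, y) data.

def simulate(batches, train, score, retrain_every: int) -> list:
    """Walk forward in time, re-fitting every `retrain_every` batches
    and scoring the current model on every batch in between."""
    scores, model = [], None
    for i, (X, y) in enumerate(batches):
        if i % retrain_every == 0:
            model = train(X, y)
        else:
            scores.append(score(model, X, y))
    return scores

# Compare, e.g., retrain_every=1, 3, and 6, then pick the longest
# interval whose scores stay above your acceptable floor.
```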

Testing

The last problem with ML deployment is the testing process. Testing is usually a must in software development: we expect the system to reproduce the same result every time.

But this is hard to do in ML deployment. Because the model extracts information from uncertainty in the data, it's difficult to produce a reproducible result after fitting the model.

We can do our best by testing other aspects of the deployment.

For example, we can test the types of the input data and output predictions, and check for unexpected behavior in the results, such as null values or predictions that don't make sense.

We can generate test cases for each category of model, such as classification or regression, as in the sketch below.
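
Here is a minimal sketch of such a test for a binary classifier, assuming a scikit-learn-style `predict_proba` interface. The random feature matrix is placeholder data.

```python
import numpy as np

# A minimal sketch of a deployment test for a binary classifier.
# The model is assumed to expose a scikit-learn-style predict_proba,
# and the feature matrix is random placeholder data.

def test_output_shape_and_range(model, n_features: int = 10) -> None:
    X = np.random.default_rng(0).normal(size=(100, n_features))
    proba = model.predict_proba(X)
    assert proba.shape == (100, 2), "unexpected output shape"
    assert not np.isnan(proba).any(), "null values in the predictions"
    assert ((proba >= 0) & (proba <= 1)).all(), "probabilities out of range"

# A regression model would get analogous checks on the prediction
# dtype and on whether values stay inside a sensible range.
```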

This helps improve the quality of the model's output and lowers the chance of errors in downstream consumer systems, such as a front end that displays the prediction value.

Data governance

The last topic I will go through is data governance. We have to govern what happens to the data, whether it's updated or deleted from the system, and we have to grant the right permissions to the right people. If not, our company may suffer a huge loss.

For instance, imagine an employee secretly selling company data for personal profit. That would make the company lose its reputation and credibility.

From my experience, I have seen employees still using flash drives or external hard disks to transfer data. It is convenient, but from a security point of view it is a terrible choice.

To solve this problem, we can apply the principle of least privilege. It's the same concept the IT department uses when granting us access to the company's servers: each person should get the minimum privilege level their work requires. We should do the same with any data, as in the sketch below.
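
As a rough illustration of the idea, here is a minimal sketch with hypothetical roles and datasets. In practice, this logic lives in your warehouse's or platform's grant system, not in application code.

```python
# A minimal sketch of least-privilege access checks. The roles,
# datasets, and permission sets are hypothetical examples.

PERMISSIONS = {
    "analyst":       {"sales_agg": {"read"}},
    "data_engineer": {"sales_agg": {"read", "write"},
                      "sales_raw": {"read", "write"}},
}

def can(role: str, dataset: str, action: str) -> bool:
    """Allow an action only if it was explicitly granted."""
    return action in PERMISSIONS.get(role, {}).get(dataset, set())

print(can("analyst", "sales_raw", "read"))         # False: never granted
print(can("data_engineer", "sales_raw", "write"))  # True
```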

If you still don't have a data governance framework in your company, it's time to think about one sooner rather than later.

Conclusion

We have reviewed several topics today. I think these are the ones we need to focus on in 2021.

The more data analytics projects we do, the more often we will be asked the same questions about these topics.

It would be best to show users the standards and best practices we have built to demonstrate our validity and reliability.

It is a time-consuming task, but it will be worth every second you spend on it.

Also, these problems are not something we can simply find the answers to on the internet. Can you imagine finding a ready-made solution for building a practical data governance framework online? It's hard, right?

But we can make it easier by sharing our best practices and knowledge back with the community. I give kudos to the analytics blogs on Medium from companies like Uber, Airbnb, and Netflix. They are inspiring and a good starting point for implementing something useful.

Let’s make our community better together.
