Why the year 2020 will prove to be a headache for Data Scientists

The effects of coronavirus will ripple through data science projects

Usman Gohar
Towards Data Science


Photo by Aaron: Unsplash

“Your model is as good as your data” is the most basic postulate in data science. Good data equals a good model! The coronavirus has impacted millions of lives around the globe, wreaked havoc on the airline industry and shattered equity markets globally. Depending on how quickly it is brought under control, the coronavirus will undoubtedly continue to affect the daily lives of many. Like everyone else, data scientists will also be affected. And no, I am not talking about working from home.
In economics, the term “response lag” denotes the time it takes for corrective measures to affect the economy once they have been implemented. Similarly, data scientists will feel the effects of this giant anomaly with a lag, the length of which will depend on numerous factors.
So how will it affect data scientists?

Forecasting will be tedious

By now, most teams have realized that their forecasting models have tanked, through no fault of their own. No matter how robust models are, they will always be susceptible to anomalies. It is simply the business we are in. The core power of data science is predicting everything except random events. If your model captures all the predictive information in the data and leaves out only the randomness, that is the best you can do.

But when, five years down the line, you are using data from 2020 and something looks funny, remember one word: coronavirus.

Industries that rely on forecasting models, such as retail and energy, are scrambling to tell their investors they won’t be able to meet quarterly targets. These are the short-term problems that companies are dealing with.

In the long term, data scientists will have to contend with the skewed data of 2020. A typical forecasting model requires at least 12 months of data, and often more depending on the application. This skewed data will find its way into every data science project that relies on historical time series affected by the current situation, for example in the travel, energy and financial industries. This will almost certainly affect the results of forecasting models, because this aberrant data is not simply an outlier you can whisk away.

Photo by Davisco: Unsplash

Is the whole dataset an outlier?

Traditionally, when some data points don’t conform to the properties exhibited by the rest of the dataset, we simply tag them as outliers and drop them from our models. However, considering that the coronavirus is now a pandemic and has already lasted more than a couple of months, we can’t simply throw out all of the affected data by labeling it an outlier. How do we handle this? How can we design our models in such uncertain situations? Or do we accept that this is beyond the scope of a data scientist’s work?
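To make the difference between a one-off outlier and a sustained disruption concrete, here is a minimal sketch in Python. It assumes the common interquartile-range (Tukey fence) rule as the outlier test, which is only one of many possible rules, and the monthly numbers are synthetic values made up purely for illustration. The function name `iqr_outlier_flags` is hypothetical.

```python
from statistics import quantiles


def iqr_outlier_flags(values, k=1.5):
    """Return one boolean per value: True if it lies outside the Tukey fences."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the whole series
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


# Synthetic monthly demand, invented for illustration only.
normal = [100 + (i % 12) for i in range(36)]            # three ordinary years
single_glitch = normal[:20] + [10] + normal[21:]        # one bad month
sustained_shock = normal + [40, 35, 38, 42, 45, 50]     # six disrupted months in a row

glitch_flags = iqr_outlier_flags(single_glitch)
shock_flags = iqr_outlier_flags(sustained_shock)

print(sum(glitch_flags))  # 1: a single point that is easy to drop
print(sum(shock_flags))   # 6: the entire disrupted half-year is flagged

# Dropping the flagged points removes a contiguous block of the time series.
kept = [v for v, flagged in zip(sustained_shock, shock_flags) if not flagged]
print(len(kept), "of", len(sustained_shock), "months remain")
```

In the first series, removing the flagged point costs one month. In the second, the same rule removes six consecutive months, which leaves a hole in the history rather than a cleaned dataset.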

A point to ponder

Events like these will always be random and outside the realm of data science. This raises an important question, though: how can data scientists cope with skewed data that reflects a random event, especially in forecasting? Will we have to wait for the data to build up again before we can look for clues to solve our problems? Or can we use earlier historical data from similarly bleak situations? All of these questions require discussion and brainstorming. But this is not limited to the coronavirus. There are numerous events in history (think recessions, wars) that have had similar effects on the data generated. Do we only use the data to read into that particular period? Or maybe use it as a special case? I am sure data scientists across the globe will have to answer these questions sooner rather than later. For now, this data is rich enough to yield important insights at a time of such uncertainty.
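One possible reading of “use it as a special case” is to keep the affected period in the dataset but mark it with a 0/1 indicator, so that a model can estimate an ordinary level and a separate shift for the flagged months. The sketch below is only an illustration of that idea, not a recommended method. The function name `fit_level_with_indicator` is hypothetical and the numbers are made up.

```python
from statistics import mean


def fit_level_with_indicator(values, is_special):
    """Least-squares fit of y = base + shift * flag, where flag is 0 or 1.

    With a single binary regressor the least-squares solution reduces to
    two group means, so no linear algebra library is needed here.
    """
    ordinary = [v for v, f in zip(values, is_special) if not f]
    special = [v for v, f in zip(values, is_special) if f]
    base = mean(ordinary)            # level in ordinary periods
    shift = mean(special) - base     # extra effect in flagged periods
    return base, shift


# Made-up monthly values: three disrupted months in the middle.
values = [100, 102, 98, 101, 99, 40, 38, 42, 100, 100]
flags = [False] * 5 + [True] * 3 + [False] * 2

base, shift = fit_level_with_indicator(values, flags)
naive_level = mean(values)  # what we get if the disruption is ignored

print(base, shift)     # ordinary level 100, shift of -60 in flagged months
print(naive_level)     # 82, pulled down by the disrupted months
```

Ignoring the disruption drags the estimated level down to 82, while the indicator keeps the ordinary level at 100 and isolates the drop. Whether such a flag is appropriate for a real project depends on the application and on how long the disruption lasts.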

Stay safe and keep healthy!

About the Author

Usman Gohar is a Data Scientist, Co-organizer of Data Science Minneapolis and a Tech Speaker. He is passionate about the newest research in data science and machine learning, and thrives on helping and empowering young individuals to succeed in data science. You can connect with him on LinkedIn, Twitter and Medium, and follow him on GitHub.
