
Is The Data Science Profession At Risk of Automation?

Can Quality Forecasts Really be Produced On-Demand? And What Does That Mean for the Data Science Profession?

Photo by Alex Knight from Pexels

The other day I read an article about how Uber wants to bestow the power of Data Science on every one of their employees. The following quote in particular stuck out to me:

"The grand vision of the forecasting platform is to provide forecasts at the push of a button. Absolutely no forecasting expertise [is] required. The only input that’s needed from the user is historical data, whether it’s in the form of a CSV file or a link to a query, as well as the forecast horizon. How far do you want to forecast? Everything else is done completely underneath the hood." – Franziska Bell, Uber’s Director of Data Science

That got me thinking: can forecasting really be commoditized to such a degree? There have also been efforts by Microsoft, Google, and Amazon to make their Machine Learning solutions more "drag and drop" for their respective cloud clients, so Uber is definitely not alone in its aspirations.

Bell’s quote leads to two contrasting conclusions – either Uber’s forecasting platform is super amazing, or Uber is being too cavalier about the challenges of forecasting the future. For fun, let’s run through each possibility:

Uber’s forecasting platform is amazing – what must Uber be able to do to produce forecasts at the push of a button, where the only required input is historical data of the target variable? They must be able to:

  1. Have data on, and know whether to include, any and all relevant features. You need exogenous variables to build a model, especially when you are trying to forecast something complicated. Not only must Uber have all the data readily available in advance of producing a forecast, it must also know which features to include and how to transform each feature.
  2. Compare and contrast various forecasting algorithms (linear regression vs. random forest vs. neural nets), and choose the optimal hyperparameters for each specific algorithm.
  3. Backtest the forecast (to mitigate the risk of blowing up when the model is taken out of sample), and communicate to the user what assumptions the model relies on and under what conditions it might break down. A sketch of what a simple backtest might look like follows this list.
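As a rough illustration of what points 2 and 3 involve, here is a minimal sketch of a rolling-origin backtest that compares two candidate algorithms on a generic demand series. Everything in it is an assumption on my part (the file name, the lag features, the candidate models, and the number of splits), not anything Uber has described about its platform.

```python
# A hedged sketch of a rolling-origin backtest comparing two candidate models.
# Assumes a CSV with "date" and "demand" columns; the file and columns are hypothetical.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

df = pd.read_csv("demand.csv", parse_dates=["date"], index_col="date")

# Simple lag features of the target, the kind of thing a push-button tool might auto-generate.
for lag in (1, 2, 3, 7, 14):
    df[f"lag_{lag}"] = df["demand"].shift(lag)
df = df.dropna()

X, y = df.drop(columns="demand"), df["demand"]

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

# Rolling-origin evaluation: each split trains on the past and tests on the chunk
# that follows, which mimics how the model would actually be used out of sample.
tscv = TimeSeriesSplit(n_splits=5)
for name, model in candidates.items():
    fold_errors = []
    for train_idx, test_idx in tscv.split(X):
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        preds = model.predict(X.iloc[test_idx])
        fold_errors.append(mean_absolute_error(y.iloc[test_idx], preds))
    print(f"{name}: mean MAE across folds = {np.mean(fold_errors):.2f}")
```

Even this toy version quietly makes a handful of decisions (which lags, which error metric, how many splits) that someone has to get right.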

That’s a lot to do on the fly. And if they can do it all, then kudos to them.

But what if Uber is being too cavalier – the opposite view is that the forecasting platform is just an ARIMA model or an LSTM that forecasts the future based on past observations of the target. For certain applications this is OK.

But using only lags of the target variable as features means potentially missing out on critical exogenous relationships, which would leave the model severely under-fit and liable to perform poorly.
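For a concrete sense of what a lags-only model looks like, here is a minimal sketch using statsmodels' ARIMA; the file name, column name, and (2, 1, 2) order are hypothetical placeholders, and a push-button tool would presumably choose the order automatically.

```python
# A hedged sketch of a univariate, lags-only forecast on a hypothetical daily demand series.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

series = pd.read_csv("demand.csv", parse_dates=["date"], index_col="date")["demand"]

# The model sees nothing but past values of the target: no pricing, weather,
# competitor activity, or any other exogenous driver.
model = ARIMA(series, order=(2, 1, 2)).fit()
print(model.forecast(steps=30))  # forecast the next 30 periods
```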


Photo by monicore from Pexels

Take it With a Grain of Salt

Personally I am skeptical of Uber’s "Forecasting as a Service" goal. I could understand if Uber was allowing its employees to forecast certain key business metrics "on demand" using prebuilt models – models that have been extensively researched and refined by their data science team. But I don’t think this is what Franziska Bell means. It seems like her objective is to be able to create a forecast of nearly anything and everything at the push of a button.

This is a really hard problem, potentially an impossible one. Let’s walk through each step of the forecasting process to get a better sense of what can and cannot be easily automated.


Clearly Defining the Problem – What Needs to be Forecasted?

Without a problem to solve, there is not much of a point to building a model and making forecasts with it. So the first step is to figure out: what is my problem, and which aspects of it can I forecast to bring more clarity to the issue?

This is often less obvious than it seems at first glance. Since we started with Uber, let’s continue to use it as our example. Say we are analysts at Uber and our job is to forecast San Francisco Uber demand over the next year. Can we just give the forecasting platform the historical time series of Uber demand and be done with it?

Probably not. What does our boss even mean by demand? It could be any of the following:

  • The number of riders over the next year.
  • The number of total rides over the next year, which would be riders multiplied by number of rides per rider.
  • The number of dollars paid by riders over the next year, which would be riders multiplied by number of rides per rider multiplied by average price per ride (a toy calculation follows this list).
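To make that decomposition concrete, here is a toy calculation with completely made-up numbers (they are not Uber figures); the point is simply that each broader definition of demand multiplies in another quantity that itself has to be forecast.

```python
# Toy numbers, purely illustrative and not actual Uber figures.
riders = 500_000          # forecast #1: riders over the next year
rides_per_rider = 40      # forecast #2: average rides per rider over the year
price_per_ride = 15.0     # forecast #3: average dollars per ride

total_rides = riders * rides_per_rider          # 20,000,000 rides
total_dollars = total_rides * price_per_ride    # 300,000,000 dollars

print(total_rides, total_dollars)
```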

So there is ambiguity around what exactly we need to forecast. And did you notice that we needed to forecast progressively more variables as we fleshed out the definition of demand?

Even just the number of riders itself is an interplay of many factors:

  • The number of available drivers – the number of drivers and number of riders have a circular effect on each other where the more drivers Uber employs, the more riders will use its platform (this is known as a network effect).
  • How the competitive landscape (Lyft, taxis, scooters, etc.) shifts over time. This includes the number of competitors, the marketing and pricing strategy of each individual competitor, etc.

So what seemed like a simple problem ends up being quite complicated and hard to automate. As we saw above, proper forecasting models are often ensembles of multiple individual models and forecasts. If we don’t take enough variables into account, our model will miss out on critical effects. And if we try to include too many models and/or forecasts in our ensemble, we will get lost in a maze of complexity.

Figuring out what to forecast is not easy, and an experienced data scientist can be invaluable as an architect responsible for fleshing out the individual components of the model, so that it walks the fine line between being too simple and being too complex.


Identifying Insightful Data (and Finding It)

Once we’ve identified the variables we want to forecast and drawn a neat flow chart for our ensemble of models, we are ready to roll, right? Wrong. First we need to figure out whether we have all the data that we need. In the rosiest scenarios, all of our data is available, cleaned, and ready to go in a database – but rarely do things work out like that in the real world.

Once we know what we want to forecast, we need to decide on the candidate set of features we will use to produce our forecast. Oftentimes, this data will not be readily available – rather, it is the job of the data scientist to figure out where and how to get the data, and, if it is impossible to observe directly, how to proxy it with what is actually available.

This step is also difficult to automate. Unless a company’s data lake is as vast and deep as Google’s, then they will need data scientists to intelligently and creatively scour the world for insightful data.


Building the Forecast – Feature Engineering and Choosing the Right Algorithm

This is the part that is probably easier to automate. Assuming that we successfully obtained and cleaned all our data (not easy to do), we are now ready to build the model.

While I would argue that an experienced data scientist or statistician would be an invaluable expert in picking the right model and correctly setting its parameters, I am also aware that a brute force, automated approach is definitely possible here.

You could even argue that we don’t need to run and test every model in order to choose the best one. Rather we can just assume that using XGBoost or a neural net will give us a good enough result, provided that they are properly trained and not overfit.
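As a sketch of that "strong default" approach, the snippet below fits a gradient boosted model on whatever feature matrix is at hand, reusing the hypothetical X and y from the backtesting sketch earlier; the choice of the xgboost library and the specific hyperparameter values are my own assumptions, not details of Uber’s platform.

```python
# A hedged sketch of the "just use a strong default learner" approach.
# Reuses the hypothetical X (features) and y (target) assembled in the earlier sketch.
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

# Hold out the most recent observations as a validation set (a time-ordered split).
split = int(len(X) * 0.8)
X_train, X_valid = X.iloc[:split], X.iloc[split:]
y_train, y_valid = y.iloc[:split], y.iloc[split:]

# Reasonable-but-arbitrary defaults; a careful workflow would tune these and
# watch for overfitting rather than trusting them blindly.
model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05, random_state=0)
model.fit(X_train, y_train)

print("validation MAE:", mean_absolute_error(y_valid, model.predict(X_valid)))
```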

Additionally, both of the aforementioned algorithms effectively automate the feature engineering process. For example, given enough neurons and layers, the neural network can easily capture any nonlinear effects between our features and the target. So there is no need to explicitly include logs and exponents of our features, or interactions between them.
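By contrast, a plain linear model only captures the nonlinearities you hand it explicitly. A minimal sketch of that manual feature engineering, using scikit-learn's PolynomialFeatures on the same hypothetical feature matrix, might look like this; the specific transforms are purely illustrative.

```python
# A hedged sketch of explicit feature engineering for a linear model,
# reusing the hypothetical feature matrix X and target y from the earlier sketches.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X_fe = X.copy()
X_fe["log_lag_1"] = np.log1p(X_fe["lag_1"])  # hand-picked log transform of one feature

# Pairwise interaction terms have to be added explicitly for a linear model.
linear_with_interactions = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
)
linear_with_interactions.fit(X_fe, y)
```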

Of course, there is a price to be paid for this automation: low interpretability – in other words, we have no idea what is driving our forecasts. For example, whereas in a linear regression, beta coefficient A tells us the exact impact that a 1 unit increase in Feature A will always have on our forecast, in a neural net we have no idea how increasing Feature A affects our forecast.
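To see what that interpretability buys you, here is a tiny sketch of reading coefficients off a fitted linear regression (again on the hypothetical X and y from above); there is no comparable one-line readout for a trained neural net.

```python
# A hedged sketch of coefficient interpretation on the hypothetical X and y from above.
import pandas as pd
from sklearn.linear_model import LinearRegression

lin = LinearRegression().fit(X, y)

# Each coefficient is the expected change in the forecast for a one-unit increase
# in that feature, holding the other features fixed.
print(pd.Series(lin.coef_, index=X.columns).sort_values())
```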

In today’s world of big and complicated data, model interpretability seems like a nice-to-have rather than a must-have. But I would argue that in cases where a simpler and more interpretable model does not cost you much (in terms of the accuracy of your predictions), it’s advisable to keep it simple.


Photo by Skitterphoto from Pexels

Knowing When Your Model Might Break

An under-appreciated risk of giving everyone the power to forecast is that people without prior forecasting experience lack a healthy respect for the havoc that an invalid or overfit model can wreak.

Behaviorally, when we see a quantitatively precise forecast, we are lulled into a false sense of security (we are comforted by the precision of the numbers and math). But an experienced data scientist would know to question the assumptions of the model and recognize under what conditions the model is likely to perform poorly.

This is another drawback of an uninterpretable model –

If we can’t see the key relationships that drive our forecast, it’s hard to know when we are in an environment where those relationships are no longer valid.

In my opinion, this is extremely hard to automate. There will always be a job for someone who understands both the benefits and risks of building models and producing forecasts.


Conclusion

Everything that can be automated seems like it eventually will be. Thus, we should not be surprised when certain aspects of data science and machine learning become automated at some point. Rather we should focus on the aspects of data science that are hard to automate and will continue to add value for the foreseeable future:

  • Understanding the key drivers of your business as well as what factors impact each of those drivers.
  • Knowing how to properly scope and design a model so that it is neither too simple and under-fit nor too complex.
  • Knowing how to dig up insightful data that can be used to feed data science models.
  • Building interpretable models that are also "good enough".
  • Being able to identify when and under what circumstances your model is likely to break down and produce poor forecasts.

Of course, these are just my thoughts. I am interested to hear yours as well. Cheers!


More Data Science and Analytics Related Posts By Me:

Got Data Science Jobs?

Understanding PCA

Understanding Random Forest

Understanding Neural Nets

Understanding A/B Testing

Fun with the Binomial Distribution

