
Want to Create More Impactful Data Science Projects for Your Portfolio?

Build hot data science, not cold

In Data Science as in life, being hot is typically preferred to being cold. After all, cold is associated with old things and death, and is literally defined by a lack of [molecular] motion. Hot, on the other hand, conjures images of fire, sweat, and motivation.

Yet, despite this clear case for hot over cold, most data science projects end up cold and static: models are trained, evaluated, reported on, and saved off, only to be forgotten in the recesses of the internet or our own personal hard drives.

Kaggle competitions work like this. We train models on data sets provided by whoever is hosting the competition (e.g., Netflix, Facebook, etc.). Those models are then evaluated against other models according to an outcome metric. The best outcome metric = the winner. That’s it. Done.

The data remain cold, so the projects are cold too.

Okay, I guess in some cases the winner may be able to deploy their model for the sponsoring business, but these cases are rare and happen outside the competition. Thus, we never get to evaluate model performance on new and incoming data. We never get to determine the true generalizability of these models.

Consider this scenario: two models are built to predict a business outcome. One model is built using a train-test-validation split of the training data. The features are picked based solely on their statistical significance in the model (or even worse, no feature selection process is used).

The second model is also built on a split of the training data, but it additionally includes features derived from the scientific literature, even though those features do not appear statistically significant in the training data.

The first model outperforms the second on the training data. It should. After all, it was optimized with the training data.

The second model, however, performs better on future data. Why? Because it included features that were not discoverable by the first model but were known from scientific research to be predictive of the outcome. Although these literature-derived features could not be surfaced from the training data alone (e.g., they lacked statistical significance or were unavailable without additional feature engineering), they remain important for accurately predicting outcomes as new data arrive over time.

In either case, learning how to build models on cold data is one thing, but learning how to create solutions that respond to incoming, hot data adds a whole new set of skills to your toolkit, and those skills make great additions to a developing data scientist’s project portfolio.

In this article, I examine the distinction between cold and hot data science projects and uncover some useful steps and considerations for turning your data science projects into hotter portfolio commodities. Interested? Read on then!

Hot Data Make Data Science Projects Cooler

The distinction between hot and cold data is certainly not new to the tech space. Indeed, AWS offers an S3 storage class labeled “Glacier.” I mean, how cold can you go? And now introducing…ZeroK storage! But I digress.

In terms of data, hot data are data that are used often, refreshed regularly, and/or needed quickly. As a result, hot data are also often more expensive to store, because they require faster storage and retrieval than cold data do.

At a minimum, hot data science requires setting up a process that responds to hot data. That is, making our data science portfolios cooler (slang intended) means setting up hot data stores that allow us to score models on new data.

But developing a hot data science solution requires more than just hot data stores. Hot data science requires different methods, methods that are often left up to MLOps teams but still benefit from a data scientist’s expertise to set up.

For example, hot data science needs to include evaluation metrics that continuously monitor model performance on new and incoming data. And what do we do when we observe model drift in our metrics? Indeed, monitoring model performance and looking for model drift is only part of the hot data science problem.
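To make the monitoring piece a little more concrete, here is a minimal sketch of a performance-based drift check, assuming you log predictions alongside the actual outcomes for each scoring run. The metric, the threshold, and names like `DRIFT_THRESHOLD` are illustrative placeholders, not a prescription:

```python
# Minimal sketch of performance-based drift monitoring.
# Assumes you log (prediction, actual) pairs for each scoring run;
# the metric, threshold, and names here are illustrative placeholders.
from sklearn.metrics import accuracy_score

DRIFT_THRESHOLD = 0.05  # tolerated drop vs. the baseline, chosen arbitrarily


def check_for_drift(y_true_recent, y_pred_recent, baseline_accuracy):
    """Flag drift when recent accuracy falls too far below the baseline."""
    recent_accuracy = accuracy_score(y_true_recent, y_pred_recent)
    drifted = (baseline_accuracy - recent_accuracy) > DRIFT_THRESHOLD
    return drifted, recent_accuracy


# Example: baseline accuracy was 0.90 at deployment; check the latest week of data
drifted, acc = check_for_drift(
    y_true_recent=[1, 0, 1, 1, 0],
    y_pred_recent=[1, 0, 0, 0, 0],
    baseline_accuracy=0.90,
)
if drifted:
    print(f"Model drift detected: weekly accuracy dropped to {acc:.2f}")
```

In practice you would swap in whatever metric matches your problem (AUC, RMSE, etc.) and decide what the check should trigger, which is exactly where the next idea comes in.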

A truly hot data science solution is one that allows for new models to be built and trained as new data and new features become available. The savvy data scientist may recognize that this approach is the basis of the champion-challenger method. Challenger models that are built with new and incoming data ensure that we not only address model drift but that we also have a mechanism for improving our models in a continuous fashion.

As your skills in setting up hot data science improve, you may even identify opportunities to collect data back from customers, users, or other sources that can be used to further train new models. Such scenarios are good for setting up retraining protocols as the amount of available training data expands, or for setting up online learning using reinforcement algorithms.

The possibilities are endless, but where does one start to demonstrate this skill set in their own portfolio of projects?

In the next section we look a bit deeper at some ideas for setting up your very own hot data science project to add to your portfolio.

Create data pipelines with hot data

As mentioned, a hot data science project at minimum requires setting up a hot data store pipeline. There are lots of ways to do this in a demonstration as part of a project portfolio. First, you need to identify a problem you want to tackle that can be addressed with hot data. And what greater source of hot data than the internet?

Say, for example, you are a content creator, and you want to see what topics are hitting the news each week to give you a strategy for what to focus on in your own content. One approach might be to set up Google Alerts so that the top new stories related to your topics of expertise land in your inbox. A small Python script pointed at that inbox can then run once per week to pull out the results and store them in a simple SQLite database on your local machine.
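As a rough sketch of that weekly pull, assuming a Gmail inbox accessed over IMAP with an app password and the standard Google Alerts sender address, the script might look something like this. The account details, sender filter, and table schema are placeholders, and parsing individual headline links out of the email bodies would take a bit more work than shown here:

```python
# Sketch of a weekly "hot data" pull: read Google Alerts emails over IMAP
# and store their subject lines in a local SQLite table.
# The address, app password, and sender filter below are placeholders.
import email
import imaplib
import sqlite3
from email.header import decode_header

IMAP_HOST = "imap.gmail.com"
USER = "you@example.com"            # placeholder
APP_PASSWORD = "your-app-password"  # placeholder; use an app password


def fetch_alert_subjects():
    """Return subject lines of unread Google Alerts emails."""
    subjects = []
    with imaplib.IMAP4_SSL(IMAP_HOST) as mail:
        mail.login(USER, APP_PASSWORD)
        mail.select("INBOX")
        _, data = mail.search(None, '(UNSEEN FROM "googlealerts-noreply@google.com")')
        for num in data[0].split():
            _, msg_data = mail.fetch(num, "(RFC822)")
            msg = email.message_from_bytes(msg_data[0][1])
            subject, encoding = decode_header(msg["Subject"])[0]
            if isinstance(subject, bytes):
                subject = subject.decode(encoding or "utf-8")
            subjects.append(subject)
    return subjects


def store_subjects(subjects, db_path="alerts.db"):
    """Append the pulled headlines to a simple SQLite table."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS alerts "
            "(pulled_at TEXT DEFAULT CURRENT_TIMESTAMP, subject TEXT)"
        )
        conn.executemany(
            "INSERT INTO alerts (subject) VALUES (?)", [(s,) for s in subjects]
        )


if __name__ == "__main__":
    store_subjects(fetch_alert_subjects())
```

Run on a weekly schedule, a script like this keeps the SQLite store warm without any manual effort, which brings us to automation.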

Setting up such a service will also require you to automate the code. Some options for scheduling include Task Scheduler on Windows machines or cron on Linux machines. For more advanced orchestration, check out Airflow.

Once you have a way to store and update the data store with new data, the next step is to build out the data science you will use to score the data. You can use a pre-trained model to tag news sources with known tags of interest, or you can use unsupervised topic modeling methods to discover key topics each week.
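For the unsupervised route, one possible sketch uses scikit-learn to pull the stored headlines and surface a handful of weekly topics. The table and column names follow the SQLite sketch above and are assumptions rather than a fixed schema:

```python
# Rough sketch of weekly topic discovery on the stored headlines using
# scikit-learn. Table and column names follow the earlier SQLite sketch
# and are assumptions, not a fixed schema.
import sqlite3

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer


def discover_topics(db_path="alerts.db", n_topics=5, n_top_words=8):
    """Fit a small NMF topic model on the stored alert headlines."""
    with sqlite3.connect(db_path) as conn:
        headlines = [row[0] for row in conn.execute("SELECT subject FROM alerts")]

    vectorizer = TfidfVectorizer(stop_words="english")
    doc_term = vectorizer.fit_transform(headlines)

    model = NMF(n_components=n_topics, random_state=0)
    model.fit(doc_term)

    terms = vectorizer.get_feature_names_out()
    topics = []
    for component in model.components_:
        top_terms = [terms[i] for i in component.argsort()[::-1][:n_top_words]]
        topics.append(top_terms)
    return topics


if __name__ == "__main__":
    for i, words in enumerate(discover_topics()):
        print(f"Topic {i}: {', '.join(words)}")
```

Swapping NMF for LDA, or the TF-IDF step for embeddings, is a matter of taste; the point is that the scoring step runs against the hot data store rather than a frozen CSV.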

Build champion-challenger scenarios with incoming data

Once we have built a pipeline that collects data on a scheduled basis, scores the data with a supervised or unsupervised model, and delivers those results to us via some mechanism (e.g., a set of saved images, a data set, a dashboard), the next step is to set up a process that can learn from new data.

In the case of unsupervised models such as topic models, we may want to identify if a new topic comes up. Alternatively, in the case of a pre-trained supervised model we may want to ensure our model is still accurately scoring new data. In both cases we need to build in the capability to train a new model as new data come in and compare it to the existing model.
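For the supervised case, a bare-bones version of that comparison might retrain a challenger on the expanded data and promote it only if it beats the current champion on a held-out slice of the newest data. The model objects and the F1 metric here are stand-ins for whatever your project actually uses:

```python
# Bare-bones champion-challenger comparison for a supervised model.
# The champion is assumed to be already fitted; the challenger is retrained
# on the expanded data. Model classes and the metric are placeholders.
from sklearn.base import clone
from sklearn.metrics import f1_score


def challenge(champion, challenger_template, X_train_new, y_train_new,
              X_holdout, y_holdout):
    """Retrain a challenger on new data and promote it only if it wins on the holdout."""
    challenger = clone(challenger_template).fit(X_train_new, y_train_new)

    champion_score = f1_score(y_holdout, champion.predict(X_holdout))
    challenger_score = f1_score(y_holdout, challenger.predict(X_holdout))

    if challenger_score > champion_score:
        print(f"Promoting challenger ({challenger_score:.3f} > {champion_score:.3f})")
        return challenger
    print(f"Keeping champion ({champion_score:.3f} >= {challenger_score:.3f})")
    return champion
```

For unsupervised topic models the same pattern applies, except the "win" condition is something you define yourself, such as a new topic emerging or a better coherence score on the latest batch of headlines.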

Conclusion

In this brief article, I examined a few ideas to inspire more exciting and impactful data science projects by focusing on the use of hot data. Notably, hot data science goes beyond just hot data, because it brings additional methods and considerations into the overall solution pipeline. Moreover, hot data science is more exciting because it treats data science as a living and breathing process, one that can use methods from its own discipline to create smarter data outputs that inform our own living and breathing needs.

Like engaging content about data science, career growth, life, or poor business decisions? Join me.

