
Whether you’re brand new to data science or the Chief Data Scientist at a large organization, you’ve probably played with perfectly crafted data sets to solve toy machine learning problems. Maybe you’ve used K-Means clustering to group flower species in the Iris data set. Or maybe you’ve tried out a logistic regression model to predict which passengers survived the Titanic voyage.
While these data sets are great for practicing the basics of machine learning, they don’t mirror the real-world data you’ll come across on the job. In reality, your data can have quality issues, might not be perfect for the task at hand, or may not exist yet. This means Data Scientists often need to roll up their sleeves and gather data – a challenge often not covered in today’s data science curriculum.
For new Data Scientists, collecting extensive amounts of data before diving into the problem at hand can feel daunting, since this stage lays the foundation for the entire machine learning project. However, with the right strategies, the process becomes much more manageable.
Throughout my 10+ years as a Data Scientist, I’ve encountered a wide variety of data collection strategies, and in this article, I’ll share five of my favorite tips to optimize your data collection process and set you on the path to creating a successful machine learning product.
1. Transform Data Collection Into Immediate Value for the User
A powerful starting point lies in offering tangible value right from the beginning. Let’s borrow an example from a major player in the automotive industry, Tesla. Their quest for a fully autonomous vehicle is a substantial goal that’s taken years to develop and has required a massive amount of data collection.
So, what did they do while amassing all of this data?

To make this data collection immediately valuable, back in 2018 they released an automatic windshield wiping system – not exactly a self-driving car, but an incremental benefit for users.
This is a great example of a company taking an agile approach by emphasizing incremental product development: think skateboard, then bike, then motorcycle, and finally, the car. Embracing this mindset lets you provide immediate value while simultaneously collecting data for larger projects.
In a recent project I worked on, our goal was to automate customer support email responses. However, to train a machine learning model to respond, we needed a labeled data set.
To tackle this challenge, we mined the client’s existing emails and created template responses. They then used these templates to respond to new requests which not only saved the client time but also helped to collect necessary labels to train a machine learning model.

It’s vital to quickly demonstrate the value of data labeling to the labelers themselves, as this task can be both tedious and time-intensive. In addition, it’s a chance to showcase an early return on investment (ROI) to upper management – a task that’s notoriously challenging in machine learning projects.
2. Make Data Labeling Invisible
Tesla ingeniously makes the process of data labeling for self-driving vehicles almost invisible. As you turn the steering wheel or press a pedal, you’re unknowingly labeling a dataset. In 2020, Elon Musk put it this way:

> Essentially, the driver when driving and taking action is effectively labeling – labeling reality – as they drive and [make] them better and better.
But how can this be applied elsewhere?
Returning to our email automation example, we could have had the client categorize all emails so we could then directly pull the categories, but adding an extra step to their process wasn’t feasible. Instead, we subtly labeled the data set using the email templates, making it a seamless and natural process.
As the client began using the templates, they quickly needed to customize them, which led to a challenge for us: we were labeling the data based on the original template text and now needed to track multiple versions of each template.
A creative solution to this problem involves adopting a watermark system similar to Genius’ approach for identifying their scraped content on Google.
By invisibly watermarking each template, we could uniquely label the data set without disrupting the user experience. Ideally, the system would have captured which template was used directly, but our setup lacked this feature, making watermarks a cost-effective and creative alternative.
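As a sketch of what such a watermark could look like (the encoding scheme and template IDs below are my own illustration, not the system we built), one option is to append the template ID to each response, encoded in zero-width characters that survive most edits:

```python
# Minimal sketch of invisible template watermarking with zero-width
# characters. Template IDs and the 16-bit encoding are hypothetical.
ZERO = "\u200b"  # zero-width space      -> bit 0
ONE = "\u200c"   # zero-width non-joiner -> bit 1

def watermark(text: str, template_id: int, bits: int = 16) -> str:
    """Append template_id, encoded as zero-width characters, to the text."""
    encoded = "".join(ONE if (template_id >> i) & 1 else ZERO
                      for i in range(bits))
    return text + encoded

def extract_id(text: str, bits: int = 16):
    """Recover the template ID from a (possibly edited) response."""
    marks = [c for c in text if c in (ZERO, ONE)]
    if len(marks) < bits:
        return None  # watermark stripped or never present
    tail = marks[-bits:]
    return sum(1 << i for i, c in enumerate(tail) if c == ONE)

reply = watermark("Thanks for reaching out! ...", template_id=7)
assert extract_id(reply) == 7
```

Because the marker characters render as nothing, the user sees an ordinary email while each response quietly carries its label.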
3. Offer Multiple Defaults to Users
As a Data Scientist, whether you’re crafting a new product or integrating machine learning into an existing one, leveraging existing production systems for data collection is key. This not only uncovers valuable user insights but also optimizes your approach.
One powerful way to uncover user insights leverages the psychology of user behavior. Users frequently gravitate toward the default option due to its convenience and familiarity. When presented with a choice, the default serves as a starting point that requires no additional decision-making effort.
Take for example the default web browser on an operating system. Most Apple users choose Safari, Windows users choose Edge, and Android users choose Chrome. Opting for an alternative browser reveals more about a user than sticking with the default.

Providing multiple defaults based on user interest enriches our understanding and allows us to group similar users. Think of apps like Pinterest, Netflix, or Twitter that ask for your interests during sign-up.
The data collected through these tailored defaults serves as a valuable resource for training machine learning models. These models can learn from users’ preferences and behaviors, enabling platforms to produce more accurate recommendations, predictions, and insights.
A few years back, I was on a project aiming to categorize documents. As I dug into the data, I quickly realized that users from different regions had their own unique ways of categorizing documents. So, I ended up building separate models for each region to fine-tune the predictions. This approach allowed me to incorporate users’ preferences and behaviors into my models and ultimately provide them with more accurate recommendations.
4. Carefully Select the Data to Label
While it might seem logical to attempt to label all available data to train a machine learning model, this approach is often inefficient and can lead to diminishing returns. The humans labeling your data have limitations in terms of time, attention, and expertise.

Labeling everything could mean spending effort on data points that are already well-understood or less informative. Additionally, the cost and time associated with labeling large amounts of data could outweigh the benefits of improving a model’s accuracy.
Instead, consider a selective approach like active learning to allow annotators to focus on the most valuable data points. Active learning is like having a smart student who knows which questions to ask the teacher to learn better.
It helps a machine learning model learn faster and perform well with less overall data. Just like a student learns by asking the right questions, active learning helps models learn by picking the most informative data to label.
In a recent project, I had a limited window of a couple of analysts’ time to label data for a model that would predict a document’s retention policy. To make the most of their time, I used an active learning technique called uncertainty sampling. This technique pinpointed data points the model struggled with, allowing us to focus on the most challenging examples and optimize the labeling process.
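A bare-bones version of least-confidence uncertainty sampling, assuming a fitted scikit-learn-style classifier that exposes predict_proba, can be as short as:

```python
# Minimal sketch of least-confidence uncertainty sampling: surface the
# pool samples whose top predicted class has the lowest probability.
import numpy as np

def least_confident_indices(model, pool, n=10):
    """Return indices of the n pool samples the model is least sure about."""
    probs = model.predict_proba(pool)   # shape: (n_samples, n_classes)
    confidence = probs.max(axis=1)      # probability of the top class
    return np.argsort(confidence)[:n]   # lowest confidence first
```

In a labeling loop, those indices go to the annotators, the new labels join the training set, the model is refit, and the selection repeats.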
5. Leverage Historical Data – Look Back to Move Forward
As you collect labels from users, you can compare them with historical data to uncover relationships. You might even create a predictive model to infer labels from this historical data.
The model learns from patterns it identifies in the historical data and then applies this learning to assign labels to new, unlabeled instances. This not only expands the value of your historical data set but also accelerates the process of collecting a labeled data set.
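One common way to implement this is pseudo-labeling, sketched here under illustrative assumptions (the confidence threshold and model choice are mine): train on the small user-labeled set, then keep only the historical rows the model labels with high confidence.

```python
# Hedged sketch of pseudo-labeling historical data: fit on the labeled
# set, then keep only confident predictions on the unlabeled history.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_label(X_labeled, y_labeled, X_historical, threshold=0.9):
    """Infer labels for historical rows, keeping high-confidence ones only."""
    model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    probs = model.predict_proba(X_historical)
    confident = probs.max(axis=1) >= threshold
    return X_historical[confident], model.predict(X_historical[confident])
```

The confident rows can then be folded into the training set, while the low-confidence remainder is a natural candidate for human review.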
Another powerful technique involves clustering your labeled data alongside historically unlabeled data to pinpoint the data that most needs labeling. Identifying clusters of data with no representation in the labeled data set ensures that you’re making the most of your limited resources.

Returning to the document retention policy model from earlier, I combined clustering and uncertainty sampling to choose an even more targeted sample for the analysts to label.
First, I grouped similar data points using clustering, and then I predicted the retention policy and confidence scores using a model. From there, I sampled 5–10 data points with the lowest confidence scores and sent this more focused data set to the analysts to label.
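The combined approach can be sketched as follows, assuming a fitted classifier and an unlabeled pool; the cluster count, per-cluster sample size, and the choice to take the least-confident points within each cluster are all illustrative rather than the exact production setup:

```python
# Sketch combining clustering with uncertainty sampling: cluster the
# unlabeled pool, then pick the least-confident points in every cluster
# so each region of the data contributes to the labeling batch.
import numpy as np
from sklearn.cluster import KMeans

def targeted_sample(model, pool, n_clusters=5, per_cluster=2):
    """Select low-confidence points from every cluster of the pool."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=0).fit_predict(pool)
    confidence = model.predict_proba(pool).max(axis=1)
    chosen = []
    for c in range(n_clusters):
        members = np.where(clusters == c)[0]
        ranked = members[np.argsort(confidence[members])]
        chosen.extend(ranked[:per_cluster])
    return np.array(chosen)
```

The returned indices form the small, targeted batch that goes to the analysts, so their limited time is spent on hard examples from every corner of the data.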
Conclusion
Embarking on the journey of data collection for machine learning projects might appear daunting, but it’s a challenge that can be conquered using the five strategies I’ve covered.
Summary of the Tips
- Transform data collection into immediate value for the user
- Make data labeling invisible
- Offer multiple defaults to users
- Carefully select the data to label
- Leverage historical data – look back to move forward
By weaving immediate value, seamless experiences, personalized defaults, selective labeling, and historical insights together, you’re not just collecting data – you’re setting yourself up for success.
Ensuring the model is trained on relevant, diverse, and representative data is key for reliable predictions, and optimizing your data collection process saves time, costs, and accelerates successful machine learning deployments.
References
- https://en.wikipedia.org/wiki/Iris_flower_data_set
- https://www.kaggle.com/competitions/titanic
- https://electrek.co/2018/01/01/tesla-releases-automatic-wiper-update-beta/
- https://www.pwc.com/us/en/tech-effect/ai-analytics/artificial-intelligence-roi.html
- https://electrek.co/2020/04/30/tesla-fleet-training-orders-of-magnitude-better-elon-musk/
- https://www.pcmag.com/news/genius-we-caught-google-red-handed-stealing-lyrics-data
- https://archive.uie.com/brainsparks/2011/09/14/do-users-change-their-settings/
- https://link.springer.com/article/10.1007/s10462-022-10246-w
- https://towardsdatascience.com/uncertainty-sampling-cheatsheet-ec57bc067c0b