
Always Remember Data Comes Before The Science

Different Ways to Acquire Data

Photo by Clay Banks on Unsplash

Without data, we can forget about doing the science part. Yet it amazes me how little we speak of data – don’t worry, I am not pointing the finger; I’m guilty too. Data is at the heart of everything we do as Data Scientists, but it’s not as fun to talk about as how BERT is pushing the boundaries for natural language processing (NLP) tasks, or whatever the new state-of-the-art (SOTA) architecture is.

This inherent FOMO (fear of missing out) about what’s to come in Artificial Intelligence (AI) and Data Science is much like what keeps so many people hooked on social media.

It’s not as fun to talk about the things that matter right now when we can daydream about how bright our future could be with the new, cool stuff.

At least, that’s my little theory as to why people in data fields don’t give enough attention to the thing that really matters. THE DATA!

It may sound crazy, but did you know that most bottlenecks in industrial projects occur as a result of the data? Don’t even get me started on the ethics side of things.

Maybe I live under a rock, I don’t know. All I am saying is that I rarely see practitioners in data-related fields discussing the actual acquisition of data. It’s a baffling observation, because data is one of the most important parts of the entire workflow – if I’ve simply been looking in the wrong places, I’m happy to be pointed in the right direction.

Note: I suggest everybody reading this article puts Coded Bias on their watch list – you can find it on Netflix.

What is Data Acquisition?

Data Acquisition, according to [Wikipedia](https://en.wikipedia.org/wiki/Data_acquisition), is "the process of sampling signals that measure real world physical conditions and converting the resulting samples into digital numeric values that can be manipulated by a computer".

In an ideal world, we’d have all the required datasets with enough data points for our problem (the actual amount varies per project). In such a case, we wouldn’t have to think about Data Acquisition. However, the world is typically not ideal; therefore, it’s imperative we have techniques in place to acquire data before we can even think about putting that data in a format ready to be processed.

Data Acquisition Techniques

Before we do any data acquisition, we need to know what phenomenon we wish to measure (e.g. light intensity, house prices, force). AI teams typically have lengthy discussions about what the business problem is, its different requirements, and what type of data they’d require before any practical activity begins.

Once the horn sounds to begin work, all data sources would need to be pulled into a location that is accessible to the Data Scientists working on the problem.

In the remainder of this article, I cover different ways to collect data when the ideal scenario doesn’t occur.

Public Data

Sometimes it’s possible to find public datasets that we could leverage for our tasks. Well-known places to find free public datasets include Kaggle Datasets, the UCI Machine Learning Repository, and Google Dataset Search.

If you’re able to find a dataset that’s suitable and similar to the problem you are solving, then great! You can build a model and evaluate it.
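To make that concrete, here’s a minimal sketch (my own illustration, not part of the original workflow) that loads one of scikit-learn’s bundled public datasets and fits a quick baseline – in practice, you’d swap in whichever public dataset you found:

```python
# A minimal baseline on a public dataset. scikit-learn's built-in
# breast cancer data stands in for whatever dataset you find.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=5000)  # max_iter raised so it converges
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```

Even a crude baseline like this tells you quickly whether the public data resembles your problem enough to be worth keeping.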

Data Scraping

If there aren’t any public datasets readily available, it’s likely we could find some relevant data source on the internet – for example, an FAQ page or a discussion forum that people use to interact with a business. We could scrape that data and then get human annotators to label it for us.

However, in many industrial settings, this strategy of gathering data from external sources does not suffice, since the data will not contain the nuances of a typical product (e.g. product names, product-specific user behavior). This means our external data may be very different from the data seen in a production environment – a perfect recipe for disaster.

Note: There are a few gotchas to note whenever you decide to work on web scraping tasks – read more about them in Web Scraping Challenges. Additionally, always make sure you respect the websites you scrape from by checking both the robots.txt file of the site and its terms of use (the terms are generally the more restrictive of the two, so it’s very important you check them).
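To make the robots.txt check concrete, here’s a minimal sketch using Python’s standard library plus the popular requests and BeautifulSoup packages – the URL and user agent are placeholders, not recommendations of any particular site:

```python
# Check robots.txt before scraping, then fetch and parse a page.
# example.com is a placeholder; swap in a site you have permission
# to scrape.
import urllib.robotparser

import requests
from bs4 import BeautifulSoup

url = "https://example.com/faq"
user_agent = "my-research-bot"

rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

if rp.can_fetch(user_agent, url):
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Grab all question headings as raw text to send to annotators.
    questions = [h.get_text(strip=True) for h in soup.find_all("h2")]
    print(questions)
else:
    print("robots.txt disallows fetching this URL - respect it.")
```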

Product Intervention

In the real world, AI models rarely (if ever) exist as a solo act. They are typically developed to serve users via a product or feature. Therefore, AI teams should make it a priority to work with the product team to collect more, and richer, data by building better instrumentation into the product. This strategy is referred to as product intervention in the tech world.

Generally speaking, product intervention triumphs when it comes to building AI applications in an industrial setting.
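What that instrumentation looks like varies wildly by product, but at its simplest it can be as humble as logging structured user events for later aggregation. Here’s a minimal, hypothetical sketch – every name in it is made up for illustration:

```python
# A minimal, hypothetical event logger: each user interaction is
# appended as one JSON line, ready to be aggregated into a dataset later.
import json
import time


def log_event(event_type: str, payload: dict, path: str = "events.jsonl") -> None:
    """Append a structured product event to a JSON-lines file."""
    record = {"timestamp": time.time(), "event": event_type, **payload}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


# e.g. record that a user clicked a search result - a weak label we
# could mine for relevance training data down the line.
log_event("search_result_click",
          {"query": "wireless mouse", "result_id": "SKU-123", "rank": 2})
```

The design choice that matters is logging events in a structured, machine-readable form from day one, so the data is usable for modeling without a painful parsing project later.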

Data Augmentation

The thing about most of these techniques is that they take time, and in the corporate world, time is money. Take, for instance, the idea of instrumenting products to collect data: if we were to instrument a product today, it could take 3 to 6 months to collect a decent enough dataset. To overcome this problem, we could apply Data Augmentation techniques to our small dataset to create more data.

Data Augmentation in data analysis is "a set of techniques used to increase the amount of data by adding slightly modified copies of already existing data or newly created synthetic data from existing data" [Source: Wikipedia].
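Here’s a small, dependency-free taste of what that can look like for text – random word swap and random word deletion, two simple tricks in the spirit of the EDA (Easy Data Augmentation) paper:

```python
# Two simple text augmentations: randomly swap two words, and randomly
# drop words. Crude, but they multiply a small labeled dataset cheaply.
import random


def random_swap(text: str, n_swaps: int = 1) -> str:
    """Swap n_swaps random pairs of words in the text."""
    words = text.split()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


def random_deletion(text: str, p: float = 0.1) -> str:
    """Drop each word with probability p, keeping at least one word."""
    words = [w for w in text.split() if random.random() > p]
    return " ".join(words) if words else random.choice(text.split())


original = "the delivery was late and the packaging was damaged"
print(random_swap(original))
print(random_deletion(original))
```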

In a future article, I will cover some ways to augment your data for NLP tasks.

Final Thoughts

For the data acquisition techniques listed here to work, we must first ensure we have a clean dataset to begin with – even if it’s a very small one. It’s not uncommon for data to come from heterogeneous sources, which could mean our early-stage production model is built on a combination of public, labeled, and augmented datasets, since we may not have a large enough dataset for our custom scenario at the start.
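In practice, pulling those heterogeneous sources together can be as simple as concatenating them while tagging where each row came from, so you can audit or re-weight each slice later. A minimal pandas sketch, where the file names are placeholders:

```python
# Combine public, scraped, and augmented data into one training set,
# keeping a 'source' column so each slice can be audited or re-weighted.
# File names below are placeholders for your own data.
import pandas as pd

sources = {
    "public": "public_dataset.csv",
    "scraped": "scraped_labeled.csv",
    "augmented": "augmented.csv",
}

frames = []
for name, path in sources.items():
    df = pd.read_csv(path)
    df["source"] = name
    frames.append(df)

training_data = pd.concat(frames, ignore_index=True)
print(training_data["source"].value_counts())
```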

Thank you for reading!

Connect with me on LinkedIn and Twitter to stay up to date with my posts about Data Science, Artificial Intelligence, and Freelancing.

Related Articles

It’s About Time We Broke Up Data Science

4 Data Related Books I’ll Be Reading In April

Deep Learning May Not Be The Silver Bullet for All NLP Tasks Just Yet

