Photo by @maxvdo

Artificial Intelligence and Training Data

What is Data, How can It be Used in Machine Learning and How is It Gathered?

Alex Moltzau
Towards Data Science
7 min read · Oct 18, 2019

After I asked for topics to explore in a Facebook group about artificial intelligence and deep learning, one of the most interesting suggestions that emerged was training data.

The question was:

“Can you try to write an article on importance of Training data and how to obtain them? For some models Training data is really hard to get like for custom speech to text, it’s hard for someone to get those many hours of speech data.”
— Sai Raghava

I have written about data in various articles; however, I have not yet written specifically about training data, so I will do my best.

First I will explain what training data is in the context of artificial intelligence, particularly machine learning techniques, and thereafter share a few thoughts on how to tackle such a problem.

Test data, in the general sense, is data which has been specifically identified for use in tests, typically of a computer program. Training data, by contrast, is the data that a machine learning model learns from.

“In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms work by making data-driven predictions or decisions, through building a mathematical model from input data.”

But first, what is data?

What is Data?

Although this should be a simple question, it is not.

At a recent conference I was talking to Gloria González Fuster who is a Professor of Law at Vrije Universiteit Brussel.

She has worked most of her professional life attempting to navigate the legal definitions of data and she has a background in communication sciences as well.

Much of the data feeding AI systems are what the law describes as ‘personal data’, because they relate to somebody who could be identified.

These persons are, in turn, called by the law the ‘data subjects’, while those deciding about the processing are the ‘data controllers’.

She proposes re-qualifying as ‘data makers’ all those participating in the making and the processing of these data.

She argues there are some major gaps in our understanding of data.

She asked the audience to close their eyes for a few seconds and try to imagine what data is.

Try closing your eyes and imagining what data is for a few seconds.

Then of course – play this video.

Gloria communicated a fun story regarding a woman who asked Tinder about the data that helped the algorithm define which men it should show her. She found out that this was a rather challenging process.

Recognising that data definitions and data ownership are complicated, we can move to a more binary definition.

Structured data refers to any data that resides in a fixed field within a record or file. This includes data contained in relational databases and spreadsheets.

Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner.
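A small Python sketch may make the distinction more concrete. The records, field names and the sentence below are invented for illustration and do not come from any real dataset.

```python
import csv
import io

# Structured data: every record has the same named fields (invented example).
structured_csv = """name,age,profession
Ada,36,engineer
Grace,45,admiral
"""

rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Because the fields are fixed, "who is older than 40?" is a lookup
# on a known column.
older_than_40 = [row["name"] for row in rows if int(row["age"]) > 40]

# Unstructured data: free text with no pre-defined data model (also invented).
unstructured_text = "Grace told me she joined the navy decades ago and is now 45."

# There is no 'age' field to query, so we have to scan the text and guess.
words = unstructured_text.replace(".", "").split()
numbers_found = [int(word) for word in words if word.isdigit()]

print(older_than_40)   # names read from a fixed column
print(numbers_found)   # numbers found by scanning free text
```

With the structured records the question maps directly onto a column. With the free text we only recover a number, and we still have to guess that it is an age.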

Input Data in Machine Learning

I have written previously about different types of learning frameworks. So I have summed up and generalised what this can look like in terms of data here.

Supervised learning infers a function from labeled training data.

Unsupervised learning helps find previously unknown patterns in a data set without pre-existing labels.

Reinforcement learning is concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
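The first two definitions place different demands on the data, which a rough sketch can show. The numbers and label names below are made up, and the two functions are deliberately tiny stand-ins (a nearest-mean classifier and a one-dimensional k-means with two clusters), not a recommendation for real projects. Reinforcement learning is left out because it needs an environment rather than a fixed data set.

```python
# Invented measurements: hours of speech per recording session.
values = [1.0, 1.2, 0.8, 5.0, 5.5, 4.8]

# --- Supervised: every value comes with a label supplied by a person. ---
labels = ["short", "short", "short", "long", "long", "long"]

def fit_centroids(xs, ys):
    """Learn one mean value per label from labeled training data."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Return the label whose mean is closest to x."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))

centroids = fit_centroids(values, labels)
prediction = predict(centroids, 4.0)

# --- Unsupervised: the same values without labels; look for two groups. ---
def two_means(xs, iterations=10):
    """A tiny one-dimensional k-means with k=2.

    Assumes the values are not all identical, otherwise one group is empty.
    """
    low, high = min(xs), max(xs)  # initial cluster centres
    for _ in range(iterations):
        group_low = [x for x in xs if abs(x - low) <= abs(x - high)]
        group_high = [x for x in xs if abs(x - low) > abs(x - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return group_low, group_high

group_low, group_high = two_means(values)

print(prediction)             # label predicted for a new value
print(group_low, group_high)  # groups found without any labels
```

The supervised half can name its answer because someone labeled the training data. The unsupervised half finds the same two groups but has no idea what to call them.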

The social scientist Max Weber describes ideal types when building models such as these: a construction of abstract, hypothetical concepts.

From an article by M. Tim Jones written in 2017 retrieved the 19th of October 2019.

These approaches are also being combined with each other and extended with other techniques.

Semi-supervised learning makes use of unlabeled data for training — typically a small amount of labeled data with a large amount of unlabeled data.

Data augmentation is a strategy that enables practitioners to significantly increase the diversity of data available for training models, without actually collecting new data.
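As a rough sketch of the idea, assume a recording is just a list of numbers. The function below is hypothetical and heavily simplified: it shifts the samples in time (wrapping around), changes the volume slightly and adds a little random noise. The parameter values are arbitrary choices of mine, not settings from any particular paper or library.

```python
import random

def augment(waveform, rng, noise_level=0.01, max_shift=2):
    """Return a slightly altered copy of a waveform (a list of floats)."""
    # Shift the signal in time by rotating the samples (wrap-around shift).
    shift = rng.randint(-max_shift, max_shift)
    shifted = waveform[-shift:] + waveform[:-shift] if shift else list(waveform)
    # Change the volume a little and add low-level random noise.
    gain = rng.uniform(0.9, 1.1)
    return [gain * sample + rng.gauss(0.0, noise_level) for sample in shifted]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
original = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]  # invented samples

# One real recording becomes six training examples without new collection.
augmented_set = [augment(original, rng) for _ in range(5)]
training_examples = [original] + augmented_set

print(len(training_examples))
```

Each augmented copy keeps the same length and the same label as the original, so the amount of training data grows while the amount of collected data stays the same.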

I previously wrote about a research article from Google AI that proposes combining these last two, semi-supervised learning and data augmentation.

The data used to build the final model usually comes from multiple datasets and is used to fit parameters such as the weights in artificial neural networks. However, it can also come from just a few datasets that are augmented or experimented on until you get a result. You can even start with little or no data and only a specific goal in mind, so the technical definition is a bit loose.
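One common convention, which I have not spelled out above, is to hold back part of the data so that the model is fitted on one portion and checked on another. The sketch below assumes a 70/15/15 split into training, validation and test sets. The percentages and the function name are my own arbitrary choices for illustration.

```python
import random

def split_dataset(examples, train_pct=70, validation_pct=15, seed=0):
    """Shuffle examples and cut them into training, validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = len(shuffled) * train_pct // 100
    n_validation = len(shuffled) * validation_pct // 100
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever is left over
    return training, validation, test

examples = list(range(100))  # stand-in for 100 labeled examples
train_set, validation_set, test_set = split_dataset(examples)

print(len(train_set), len(validation_set), len(test_set))
```

Only the training set is used to fit the parameters. The other two portions are kept aside to see how the model behaves on data it has not seen.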

Cartoon by xkcd

To Aggregate or Not to Aggregate, That is the Question

I would like to say that love is the answer to the question of aggregating data, but it is not.

Data aggregation is any process in which information is gathered and expressed in a summary form, for purposes such as statistical analysis. A common aggregation purpose is to get more information about particular groups based on specific variables such as age, profession, or income.
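In code, aggregation in this sense can be as simple as grouping records and summarising each group. The people and incomes below are invented, and the helper function is a hypothetical sketch rather than how any data aggregation company actually works.

```python
from collections import defaultdict
from statistics import mean

# Invented individual-level records.
people = [
    {"age": 25, "profession": "nurse", "income": 42000},
    {"age": 52, "profession": "nurse", "income": 50000},
    {"age": 31, "profession": "teacher", "income": 40000},
    {"age": 47, "profession": "teacher", "income": 46000},
    {"age": 38, "profession": "teacher", "income": 43000},
]

def aggregate(records, group_key, value_key):
    """Summarise value_key for each distinct value of group_key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {
        group: {"count": len(values), "mean": mean(values)}
        for group, values in groups.items()
    }

summary = aggregate(people, "profession", "income")

for profession, stats in summary.items():
    print(profession, stats["count"], stats["mean"])
```

The summary says something about nurses and teachers as groups, while the individual rows it was built from are no longer visible in the output.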

The answer is that aggregating data is a massive business and a growing industry that uses a variety of techniques to make it happen. If we return to part of the question:

…it’s hard for someone to get those many hours of speech data

We could take a step back or sideways to consider this.

…it’s hard for someone to get … data

It can be hard to get data. When you asked for data, you might not have expected a critical look at data acquisition; however, that is what you get. Here are two articles I have written to illustrate that point.

Databaiting (verb; deɪtəbeɪt.ɪŋ): to entice someone to submit their data by eliciting an emotional response.

One example I have given is spit: 23andMe was partly acquired by GlaxoSmithKline, one of the largest pharmaceutical companies in the world, which now uses your spit for drug testing (if you sent them your spit).

If we take the example of ‘speech data’, let us consider Alexa for a moment. Amazon's Alexa is collecting an unprecedented amount of speech data, and it is unclear whether Amazon will delete it completely even if you ask them to, although we should hope they comply at least somewhat with GDPR.

If you asked someone: “Hey can you give me all the information regarding the layout of your home and your living patterns?” It sounds intrusive.

If you said: “Hey I have an amazing new vacuum cleaner that can clean your floors without you needing to lift a finger.” Easier.

Let us take the example of speech.

“Hey just upload your voice here and you can get a realistic sounding version of your own voice to fool your friends.” Sure sounds nice, very deep.

To consider the difference, you can check out how Mozilla asks you to donate your voice.

You could ask your friends to donate their voices to your machine learning project. If you have 10–100 friends, that is data you could use. Data from within the field you are working in may be more relevant to the problem you are going to address.

You should, however, keep both technical debt and intellectual debt in mind. If you lose track of which training data has been used where in your IT infrastructure, it can be bad or even disastrous. Equally, if you do not understand the decision-making you have trained, perhaps because of an enormous data set, that could pose its own risks.
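One modest way to avoid losing track is to write down, for every training run, which data went into it. The sketch below is only an assumption about what such a record could contain: the dataset name, source description, dates and model name are all hypothetical, and the bytes stand in for a real file.

```python
import hashlib
import json
from datetime import date

def describe_dataset(name, raw_bytes, source, collected_on):
    """Build a small record describing one training dataset."""
    return {
        "name": name,
        "source": source,
        "collected_on": collected_on.isoformat(),
        "size_bytes": len(raw_bytes),
        # The hash changes whenever the data changes, so silent edits show up.
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
    }

def record_training_run(model_name, datasets, trained_on):
    """Tie a trained model to the exact datasets it was trained on."""
    return {
        "model": model_name,
        "trained_on": trained_on.isoformat(),
        "datasets": datasets,
    }

# Stand-in bytes; in practice these would be read from the actual file.
voice_clips = b"placeholder bytes for donated voice recordings"

dataset_record = describe_dataset(
    name="friends_voice_clips",            # hypothetical dataset name
    raw_bytes=voice_clips,
    source="voices donated by friends",    # hypothetical source description
    collected_on=date(2019, 10, 1),        # hypothetical date
)
manifest = record_training_run(
    model_name="speech_to_text_v1",        # hypothetical model name
    datasets=[dataset_record],
    trained_on=date(2019, 10, 18),         # hypothetical date
)

print(json.dumps(manifest, indent=2))
```

Keeping a record like this next to the model makes it possible to answer later which data was used, where it came from and how old it is.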

  • What data is or not can be less straightforward than we might think.
  • If you don’t think so — then unstructured and structured data is a decent practical distinction.
  • There are different forms of learning that may place different requirements on your training data.
  • Remember where your training data is, when it was used for training, and that it can become outdated.

These are some immediate thoughts resulting from an interesting question.

Another question is: how far are you willing to go in aggregating data?

How sure are you that your data is worth using?

Gif by @yipan

I hope you enjoyed this short article. This is day 137 of #500daysofAI. I write one new article about or related to artificial intelligence every day for 500 days.
