
Artificial Intelligence and Bad Data


Facebook, Google, and Twitter lawyers testified before Congress on how they missed the Russian influence campaign. Even though the ads were bought in Russian currency on platforms chock-full of analytics engines, the problematic nature of the influence campaign went undetected. "Rubles + US politics" did not trigger an alert, because off-the-shelf deep learning only looks for what it knows to look for, and, on a deeper level, it learns from really messy (unstructured) or corrupted and biased data. Understanding the unstructured nature of public data (mixed with private data) is improving by leaps and bounds every day. That’s one of the main things I work on. Let’s focus instead on the data quality problem.

Data can be wrong. Image taken from this post.

Here are a few of the many common data quality problems:

  • Data sparsity: We know a little about a lot of things, but have no clear picture of most of them.
  • Data corruption: Convert a PDF to text and print it. Yeah. Lots of garbage comes out besides the text.
  • Lots of irrelevant data: In a chess game, we can prune whole sections of the tree search, and more generally, in a picture of a cat, most of the pixels don’t tell us how cute the cat is. In totally random data, we humans (and AI) can see patterns where there really are none.
  • Learning from bad labeling: Bias of the labeling system, possibly due to human bias.
  • Missing unexpected patterns: Black swans, regime change, class imbalance, etc.
  • Learning wrong patterns: Correlation that is not really causation can be trained into an AI, which then wrongly assumes the correlation is causative (see the sketch after this list).
  • I could go on.
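To make that last point about wrong patterns concrete, here is a toy sketch in Python (all data invented on the spot) showing how two completely unrelated random walks can look strongly related. A model trained on this kind of "signal" will happily learn a pattern that does not exist.

```python
# Two independent random walks -- no causal link whatsoever -- will
# often show a strong correlation just by chance.
import numpy as np

rng = np.random.default_rng(seed=0)

walk_a = np.cumsum(rng.normal(size=1000))
walk_b = np.cumsum(rng.normal(size=1000))

r = np.corrcoef(walk_a, walk_b)[0, 1]
print(f"Correlation between two unrelated random walks: {r:.2f}")
# Depending on the seed, |r| often lands well above 0.5.
```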
Bad data is hard to analyze with off-the-shelf systems. Yep. AI is one tough business. Credit: Robert Taylor

We know that labelled data is really hard to come by for basically any problem, and even labelled data can be full of bias. I visited a prospective client on Friday that had a great data team but no ability to collect the data they needed from the real world because of ownership and IP issues. This "Rubles + US politics" example of good data that is missed by AI is not surprising to experts. Why? Well, AI needs to know what to look for, and the social media giants were looking for more aggressive types of attacks, like monitoring soldiers’ movements based on their Facebook profiles. Indeed, the reason we miss signals from good data in general is the huge amount of BAD data in real systems like Twitter. This is a signal-to-noise problem. If there are too many alerts, the alert system is ignored. Too few, and the system misses critical alerts.

It is not only adversaries like the Russians trying to gain influence. The good guys, companies and brands, do the same thing. Drip campaigns and guerrilla marketing are just as much a tactic for spreading influence in shoe sales as in political meddling in an election. So, the real reason we miss signals from good data is bad data. From simple predicate logic, we know that false premises can imply anything. So learning from data we know is error-riddled carries some real baggage.
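To put rough numbers on that signal-to-noise tradeoff, here is a back-of-the-envelope sketch. The base rate, detector accuracy, and post volume below are purely illustrative assumptions, not real platform figures.

```python
# Hypothetical numbers: even a 99%-accurate detector drowns in false
# alarms when the thing it looks for is rare (the base rate problem).
base_rate = 1e-5            # fraction of posts that are truly malicious
sensitivity = 0.99          # P(alert | malicious)
false_positive_rate = 0.01  # P(alert | benign)
posts = 100_000_000         # posts scanned per day (illustrative)

true_alerts = posts * base_rate * sensitivity
false_alerts = posts * (1 - base_rate) * false_positive_rate
precision = true_alerts / (true_alerts + false_alerts)

print(f"Alerts per day: {true_alerts + false_alerts:,.0f}")
print(f"Fraction of alerts that are real: {precision:.2%}")
# About a million alerts a day, of which roughly 0.1% are real. Raise
# the threshold to cut the noise and you start missing the signal.
```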

Let’s just agree that the data is wrong. Credit: (not original author)

One example of bad data is finding that your AI model was trained on the wrong type of data. Text from a chat conversation is not like text from a newspaper. Both are composed of text, but their content is very different. AI trained on the Wikipedia dataset or Google News articles will not correctly understand (i.e., "model") the free-form text we humans use to communicate in chat applications. Here is a slightly better dataset for that, and maybe the comments from the hackernews dataset too. Often we need to use the right pre-trained model or off-the-shelf dataset for the right problem, and then do some transfer learning to improve on the baseline. However, this assumes we can use the data at all. Many public datasets have even bigger bad data problems that cause the model to simply fail. Sometimes a field is used and sometimes it is left blank (sparsity). Sometimes non-numeric data creeps into numerical columns ("one" vs. 1). I once found an outlier in a large private real estate dataset where one entry among a million was a huge number entered by a human as a fat-finger error.
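Here is a minimal pandas sketch of spotting those three problems (sparsity, non-numeric strings in numeric columns, and fat-finger outliers) in a toy table. The column names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [250_000, 300_000, None, "one", 275_000, 2_750_000_000],
    "beds":  [3, None, 2, 3, None, 4],
})

# 1. Sparsity: how often is each field left blank?
print(df.isna().mean())

# 2. Corruption: coerce to numeric; strings like "one" become NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# 3. Fat fingers: flag entries wildly far from the median.
median = df["price"].median()
mad = (df["price"] - median).abs().median()
df["suspect"] = (df["price"] - median).abs() > 10 * mad
print(df)  # only the 2.75-billion "house" gets flagged
```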

Problems like the game of Go (AlphaGo Zero) have no bad data to analyze. Instead, the AI weighs more relevant data against less relevant data. Games are a nice, constrained problem set, but in most real-world data there is bias. Lots of it. Boosting and other techniques can be helpful too. The truth is that some aspects of machine learning are still open problems, and shocking improvements happen all the time. Example: capsule networks beating CNNs.

It is important to know when error is caused by bad things in the data rather than by improperly fitting to the data. And live systems that learn while they operate, like humans do, are particularly susceptible to learning wrong information from bad data. This is kind of like Simpson’s paradox: the data is usually right, and so fitting the data is a good thing, but sometimes fitting to the data produces paradoxes, because the method itself (fitting to the data) rests on the bad assumption that all data is ground-truth data. See this video for more Simpson’s paradox fun. And here is another link to Autodesk’s datasaurus, which I just love. It is totally worth reading in full.
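For a concrete taste of Simpson’s paradox, here is a tiny example with invented numbers: within every severity level, hospital A has the better survival rate, yet pooled together hospital B looks better, simply because A treats most of the severe cases.

```python
import pandas as pd

df = pd.DataFrame({
    "hospital": ["A", "A", "B", "B"],
    "severity": ["mild", "severe", "mild", "severe"],
    "treated":  [100, 900, 900, 100],
    "survived": [99, 720, 870, 75],
})

df["rate"] = df["survived"] / df["treated"]
print(df)  # within each severity level, hospital A wins

pooled = df.groupby("hospital")[["treated", "survived"]].sum()
pooled["rate"] = pooled["survived"] / pooled["treated"]
print(pooled)  # pooled, hospital B wins -- fitting the aggregate misleads
```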

These images are all RANDOM data. I want to drive home the point that "trends" can be found in lots of places, and the law of large numbers is not always there to come to the rescue. In the bar graph, it looks like something special happens at 7. It’s just random. In the pie chart with 3 colors, it looks like 1 is more prevalent than 2 and 3. Nope. Random. The pie chart with lots of slices is a case where we start to see the numbers averaging out, but that’s the point: sometimes your dataset has all sorts of garbage in it that you don’t know about.
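You can reproduce this effect with a few lines of Python: draw uniformly at random over ten categories and a "winner" appears at small sample sizes, then dissolves as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

for n in (30, 300, 30_000):
    draws = rng.integers(1, 11, size=n)            # uniform over 1..10
    share = np.bincount(draws, minlength=11)[1:] / n
    print(f"n={n:6d}  'winner'={share.argmax() + 1}  "
          f"its share={share.max():.3f}  (expected: 0.100)")
# At n=30 one category typically grabs ~20% of the draws by pure luck;
# at n=30,000 every share sits near the expected 10%.
```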

We talked about the fact that most real-world data is full of corruption and bias. That kind of sucks, but not all is lost. There are a variety of techniques for combating bad data quality, not the least of which are collecting more data and cleaning up the data you already have. More advanced techniques, like ensembles with NLP, knowledge graphs, and commercial-grade analytics, are not easy to get your hands on. More on these in future articles.

If you enjoyed this article on bad data and Artificial Intelligence, then please try out the clap tool. Tap that. Follow us on Medium. Share on Facebook and Twitter. Go for it. I’m also happy to hear your feedback in the comments. What do you think?

Happy Coding!

-Daniel [email protected] ← Say hi. Lemay.ai 1(855)LEMAY-AI
