The world’s leading publication for data science, AI, and ML professionals.

Tautologies in Data

Why data ≠ analysis

Data Science

As I researched some applications of data in precision medicine, I came across an interesting claim in "From Big Data to Precision Medicine," an article in Frontiers in Medicine. The authors state:

"However, ‘Big data’ no longer means what it once did. The term has expanded and now refers not to just large data volume, but to our increasing ability to analyze and interpret those data. Tautologies such as ‘data analytics’ and ‘data science’ have emerged to describe approaches to the volume of available information as it grows ever larger." [1]

Image Courtesy of Dhruv Weaver on Unsplash

We should probably start with understanding what a tautology even is. There are a few definitions depending on the situation, but they all have an element of circular reasoning or being true by their logical form. This can mean saying the same thing in different words, or creating arguments that are "true by definition" or in all possible scenarios.

Some examples may help…

  • "To depreciate in value" → "To depreciate" means "to lessen in value" so our initial statement reads "to lessen in value in value". So we are saying the same thing in different words.
  • "It is what it is" → This is true by definition.
  • "X = Y or X ≠ Y" → This is true in all possible interpretations.
  • "GPS system" → G.P.S. stands for global positioning system, so we don’t need to include "system" after "GPS".

So, let’s try to understand the authors’ argument from above.

I’ll try to paraphrase: "Because ‘Big Data’ has a new definition reflecting not just the size of available data, but also the ability to analyze it, the term ‘data analytics’ is now a tautology. Analysis is already encapsulated in ‘data’, so ‘analytics’ is repetitive."

However, I don't know if I agree that the definition of Big Data has changed to include analysis. And as a semantic issue, the authors don't claim that "Big Data science" is a tautology, but "data science" generally.

Here’s my take. Data by itself is passive, inert, waiting to impart information upon being analyzed. Analysis is active. We synthesize the knowledge from data through analysis. I don’t agree with the authors that "Big Data" includes analysis without that active exploration and understanding by data scientists.

Does this really matter? Probably not. I’m dissecting a single throwaway line from the introduction of this article, but I thought it was an interesting way to discuss a more interesting problem in data science.

As the volume of data grows exponentially, we have been inundated with new solutions to every problem using artificial intelligence, data science, autoML, and so much more. I think it’s easy to sit back and think that once we have enough data, we can easily solve X problem and forget that data ≠ analysis. We need to be active in our pursuits and remember that data in and of itself is not understanding.


Tautological Bias

Let's discuss a more prevalent issue featuring tautologies in data. Tautological bias occurs when we use a feature that is heavily or perfectly correlated with the target to predict that target. For example, suppose we want to predict what percentile a customer falls into, relative to other customers, based on their purchase history: a big spender lands in a high percentile because they spent more than, say, 78% of customers.

Maybe we have time_spent_in_store, num_days_shopped, num_items_purchased, total_spent, largest_purchase, … and more. We don’t have much time, so we just throw everything into a model and boom, we get an outstanding accuracy. Job done? Not quite…

Can you spot the tautological bias? We used a feature that perfectly predicts the percentile, because it’s what the percentile is built on! The percentile is basically just a ranking of all the customers by total_spent, so of course including total_spent in the model is going to perfectly predict percentile.
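A tiny synthetic sketch makes the leakage concrete (the feature names and distributions here are invented for illustration, not from a real dataset): because the percentile target is just a rank of total_spent, the two share an identical rank ordering, while an honest feature like num_days_shopped does not.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Hypothetical customer features (illustrative only)
num_days_shopped = rng.integers(1, 100, n)
avg_item_price = rng.uniform(5, 50, n)
total_spent = num_days_shopped * avg_item_price * rng.uniform(0.5, 1.5, n)

# The target: each customer's spending percentile, i.e. a scaled rank of total_spent
percentile = total_spent.argsort().argsort() / (n - 1) * 100

def rank(x):
    """Return the rank order of each element (0 = smallest)."""
    return x.argsort().argsort()

# Spearman-style check: correlate feature ranks with target ranks
corr_total = np.corrcoef(rank(total_spent), rank(percentile))[0, 1]
corr_days = np.corrcoef(rank(num_days_shopped), rank(percentile))[0, 1]

print(f"total_spent vs. percentile:      {corr_total:.3f}")  # exactly 1.0
print(f"num_days_shopped vs. percentile: {corr_days:.3f}")   # informative but < 1
```

The rank correlation of total_spent with the target is 1.0 by construction: the model isn't learning anything, it's just reading the answer back to us.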

This is cheating in several ways. Most obviously, we've cheated at prediction, because the answer was already in our inputs. But we've also cheated ourselves of a deeper understanding of the data! By including a feature that perfectly determines the target, we may miss that frequent shoppers spend more, that a majority of customers only spend a few minutes in the store, or that the largest purchase has no relation to overall spending.


Conclusion

I tackled two different cases where the concept of tautologies is important. First, at a high level, we should beware of promises that incredibly complex problems will be solved "once we've collected enough data". Data itself stores information, but without focused exploration and analysis, it is not a solution. Human ingenuity is difficult to replace with "one size fits all" solutions.

Second, beware of tautological biases. If your features are directly derived from the target, don't be surprised when your results look too good to be true! Always test for feature-target leakage and multicollinearity, and test specific hypotheses to get the most out of your data.
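One simple pre-modeling screen, sketched here with pandas (the column names and data are hypothetical), is to rank-correlate every candidate feature with the target before fitting anything; a feature sitting near |1| deserves suspicion as possible tautological leakage.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Illustrative features; by construction, total_spent alone drives the target
df = pd.DataFrame({
    "time_in_store": rng.uniform(1, 60, n),
    "num_days_shopped": rng.integers(1, 50, n),
    "total_spent": rng.gamma(2.0, 100.0, n),
})
df["percentile"] = df["total_spent"].rank(pct=True) * 100

# Screen each feature's rank correlation with the target;
# anything near |1| is a red flag to investigate before modeling
corr = df.corr(method="spearman")["percentile"].drop("percentile")
print(corr.sort_values(ascending=False))
```

This doesn't catch every form of leakage (a feature could be a nonlinear or multi-column function of the target), but it's a cheap first check that would have flagged total_spent immediately in the example above.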


Connect

I’m always looking to connect and explore other projects! You can follow me on GitHub or LinkedIn, and check out my other stories on Medium. I also have a Twitter!


Sources

[1] T. Hulsen, S. Jamuar, A. Moody, et al., From Big Data to Precision Medicine (2019), Frontiers in Medicine 6:34.