# Five Machine Learning Paradoxes that will Change the Way You Think About Data

Paradoxes are one of the marvels of human cognition that are hard to explain using math and statistics. Conceptually, a paradox is a statement that leads to an apparently self-contradictory conclusion based on the original premises of the problem. Even the best-known and best-documented paradoxes regularly fool domain experts because they fundamentally contradict common sense. As artificial intelligence (AI) looks to recreate human cognition, it’s very common for machine learning models to encounter paradoxical patterns in the training data and arrive at conclusions that seem contradictory at first glance. Today, I would like to explore some of the famous paradoxes that are commonly found in machine learning models.

Paradoxes are typically formulated at the intersection of mathematics and philosophy. A notorious philosophical paradox, known as the Ship of Theseus, questions whether an object that has had all of its components replaced remains fundamentally the same object. First, suppose that the famous ship sailed by the hero Theseus in a great battle has been kept in a harbor as a museum piece. As the years go by, some of the wooden parts begin to rot and are replaced by new ones. After a century or so, all of the parts have been replaced. Is the “restored” ship still the same object as the original? Alternatively, suppose that each of the removed pieces was stored in a warehouse and that, after the century, technology develops to cure their rotting and enable them to be put back together to make a ship. Is this “reconstructed” ship the original ship? And if so, is the restored ship in the harbor still the original ship too?

The fields of mathematics and statistics are full of famous paradoxes. To take one famous example, legendary mathematician and philosopher Bertrand Russell formulated a paradox that highlighted a contradiction in some of the most powerful ideas in set theory, formulated by one of the greatest mathematicians of all time: Georg Cantor. In essence, Russell’s paradox asks whether a “list of all lists that do not contain themselves” contains itself. The paradox arises within naive set theory by considering the set of all sets that are not members of themselves. Such a set appears to be a member of itself if and only if it is not a member of itself. Hence the paradox. Some sets, such as the set of all teacups, are not members of themselves. Other sets, such as the set of all non-teacups, are members of themselves. Call the set of all sets that are not members of themselves “*R*.” If *R* is a member of itself, then by definition it must not be a member of itself. Similarly, if *R* is not a member of itself, then by definition it must be a member of itself. What?

### Famous Paradoxes in Machine Learning Models

Like any form of knowledge building based on data, machine learning models are not exempt from cognitive paradoxes. Quite the opposite: because machine learning models try to infer patterns hidden in training datasets and validate their knowledge against a specific environment, they are constantly vulnerable to paradoxical conclusions. Here are a few of the most notorious paradoxes that surface in machine learning solutions.

#### The Simpson’s Paradox

Named after British statistician Edward Simpson, the Simpson’s Paradox describes a phenomenon in which a trend that is very apparent in several groups of data disappears or reverses when the data within those groups is combined. A real-life case of the paradox happened in 1973, when admission rates were investigated at the graduate schools of the University of California, Berkeley. The aggregate figures pointed to a gender gap in admissions that looked too large to be due to chance. The results of the investigation were surprising: when each school was looked at separately (law, medicine, engineering, etc.), *women were admitted at a higher rate than men!* However, the overall average suggested that men were admitted at a much higher rate than women. How is that possible?

The explanation is that a simple overall average doesn’t account for the relative size of each group within the overall dataset: the overall admission rate is an average of the per-school rates weighted by how many people applied to each school. In this specific example, women applied in large numbers to schools with low admission rates, like law and medicine. These schools admitted fewer than 10 percent of applicants, so the overall percentage of women accepted was very low. Men, on the other hand, tended to apply in larger numbers to schools with high admission rates, like engineering, where admission rates are about 50%. Therefore the overall percentage of men accepted was very high.
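To make the arithmetic concrete, here is a minimal sketch of the reversal. The applicant counts below are invented for illustration (they are not the real Berkeley figures); they only mirror the shape of the story above, with one low-admission school and one high-admission school.

```python
# Hypothetical (applicants, admitted) counts, invented for illustration only.
applications = {
    "law":         {"women": (900, 81), "men": (100, 7)},
    "engineering": {"women": (100, 55), "men": (900, 450)},
}


def rate(applicants, admitted):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applicants


# Admission rate per school and gender.
per_school = {
    school: {gender: rate(*counts) for gender, counts in groups.items()}
    for school, groups in applications.items()
}


def overall_rate(gender):
    """Pool every school together before computing the rate."""
    applicants = sum(groups[gender][0] for groups in applications.values())
    admitted = sum(groups[gender][1] for groups in applications.values())
    return rate(applicants, admitted)


overall = {gender: overall_rate(gender) for gender in ("women", "men")}

for school, rates in per_school.items():
    print(f"{school:12s} women {rates['women']:.1%}   men {rates['men']:.1%}")
print(f"{'overall':12s} women {overall['women']:.1%}   men {overall['men']:.1%}")
```

With these made-up numbers, women have the higher admission rate in each school, yet the pooled rate is far higher for men, simply because most women applied to the school that admits few people.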

In the context of machine learning, many unsupervised learning algorithms infer patterns from different training datasets that result in contradictions when combined across the board.

#### The Braess’s Paradox

This paradox was proposed in 1968 by German mathematician Dietrich Braess. Using an example of congested traffic networks, Braess explained that, counterintuitively, adding a road to a road network could slow down its overall flow (for example, by increasing the travel time of each driver); equivalently, closing roads could potentially improve travel times. Braess’s reasoning is based on the fact that, at a Nash equilibrium, drivers have no incentive to change their routes. In terms of game theory, an individual has nothing to gain from changing strategy if everyone else sticks to theirs. Here, in the case of drivers, a strategy is a route taken. In the case of Braess’s paradox, drivers will continue to switch routes until they reach a Nash equilibrium, even if that equilibrium is worse for everyone. So, counterintuitively, closing roads might ease the congestion.
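The following sketch works through a common textbook version of Braess’s network. All of the numbers are assumptions chosen for illustration, not taken from Braess’s paper: 4,000 drivers travel from Start to End, either via A or via B; two links take `flow / 100` minutes (they get congested), two links always take 45 minutes, and the new shortcut from A to B is assumed to take zero minutes.

```python
FIXED_MINUTES = 45.0  # assumed travel time on the wide, uncongested links


def congested(flow):
    """Assumed travel time (minutes) on a narrow link used by `flow` drivers."""
    return flow / 100.0


def route_times(n_top, n_bottom, n_shortcut):
    """Travel time of each route, given how many drivers use each one.

    top:      Start -> A (congested) -> End (fixed)
    bottom:   Start -> B (fixed)     -> End (congested)
    shortcut: Start -> A (congested) -> B (free) -> End (congested)
    """
    flow_start_a = n_top + n_shortcut
    flow_b_end = n_bottom + n_shortcut
    top = congested(flow_start_a) + FIXED_MINUTES
    bottom = FIXED_MINUTES + congested(flow_b_end)
    shortcut = congested(flow_start_a) + 0.0 + congested(flow_b_end)
    return top, bottom, shortcut


# Without the shortcut, drivers split evenly and nobody can do better.
before = route_times(n_top=2000, n_bottom=2000, n_shortcut=0)

# With the shortcut open, every driver ends up on it: it is still
# faster than either old route, so nobody has a reason to switch back.
after = route_times(n_top=0, n_bottom=0, n_shortcut=4000)

print("before (top, bottom, shortcut if it existed):", before)
print("after  (top, bottom, shortcut):              ", after)
```

Under these assumptions, the even split costs each driver 65 minutes. Once the shortcut opens, it looks like a 40-minute trip to any single driver, so everyone piles onto it, and the new equilibrium costs each driver 80 minutes while both old routes would now take 85.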

The Braess’s Paradox is very relevant in autonomous, multi-agent reinforcement learning scenarios in which the model needs to reward agents based on specific decisions in unknown environments.

#### The Moravec’s Paradox

Hans Moravec can be considered one of the greatest AI thinkers of the last few decades. In the 1980s, Moravec formulated a counterintuitive proposition about the way AI models acquire knowledge. The Moravec Paradox states that, contrary to popular belief, high-level reasoning requires less computation than low-level unconscious cognition. This is an empirical observation that goes against the notion that greater computational capability leads to more intelligent systems.

A simpler way to frame the Moravec’s Paradox is that AI models can do incredibly complex statistical and data inference tasks that are impossible for humans. However, many tasks that are trivial for humans, like grabbing an object, require expensive AI models. As Moravec writes, “it is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility”.

From the perspective of machine learning, the Moravec’s Paradox is very applicable to aspects of transfer learning that look to generalize knowledge across different machine learning models. Additionally, the Moravec’s Paradox teaches us that some of the best applications of machine intelligence will come as a combination of humans and algorithms.

#### The Accuracy Paradox

Directly related to machine learning, the Accuracy Paradox states that, counterintuitively, accuracy is not always a good metric to assess the effectiveness of predictive models. How is that for a confusing statement? The Accuracy Paradox has its roots in imbalanced training datasets. For instance, in a dataset in which category A is dominant, being found in 99% of cases, a model that predicts that every case is category A will have an accuracy of 99%, which is completely misleading because it never identifies a single case of any other category.

A simpler way to understand the Accuracy Paradox is to look at the balance between precision and recall in machine learning models. Precision measures what fraction of your predictions for the positive class are correct. It is calculated as True Positives / (True Positives + False Positives). Complementarily, recall measures what fraction of the actual positive cases your predictions capture. It is calculated as True Positives / (True Positives + False Negatives).
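As a small illustration of these formulas, the sketch below scores a “model” that always predicts the majority class on a made-up dataset of 1,000 cases, of which only 10 belong to the rare (positive) class. The dataset and the helper names are assumptions for this example. Reporting 0.0 when a ratio has a zero denominator is just one convention among several.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def safe_ratio(numerator, denominator):
    """Return 0.0 when the denominator is zero (a convention, not a rule)."""
    return numerator / denominator if denominator else 0.0


# Made-up imbalanced dataset: 10 rare positives, 990 dominant negatives.
y_true = [1] * 10 + [0] * 990
# A "model" that always predicts the dominant class.
y_pred = [0] * 1000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = safe_ratio(tp, tp + fp)  # TP / (TP + FP)
recall = safe_ratio(tp, tp + fn)     # TP / (TP + FN)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

The accuracy comes out at 99% even though recall is zero: the model never finds a single rare case, which is exactly the situation the paradox warns about.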

In many machine learning models, the balance between precision and recall is a better metric than accuracy. For instance, in the case of an algorithm for fraud detection, recall is the more important metric. It is obviously important to catch every possible fraud, even if it means that the authorities might need to go through some false positives. On the other hand, if the algorithm is created for sentiment analysis and all you need is a high-level idea of the emotions indicated in tweets, then aiming for precision is the way to go.
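One common way this trade-off shows up in practice is through the decision threshold applied to a model’s scores. The sketch below uses a tiny set of made-up labels and scores (not from any real model) to show how a stricter threshold favors precision while a more lenient one favors recall.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    predicted = [score >= threshold for score in scores]
    pairs = list(zip(labels, predicted))
    tp = sum(1 for y, p in pairs if y == 1 and p)
    fp = sum(1 for y, p in pairs if y == 0 and p)
    fn = sum(1 for y, p in pairs if y == 1 and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up ground truth (1 = positive) and model scores, for illustration.
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]

strict = precision_recall(labels, scores, threshold=0.75)
lenient = precision_recall(labels, scores, threshold=0.50)

print("strict  threshold 0.75 -> precision, recall:", strict)
print("lenient threshold 0.50 -> precision, recall:", lenient)
```

On this toy data the strict threshold flags only sure cases (perfect precision, but one positive is missed), while the lenient threshold catches every positive at the cost of one false alarm. A fraud detector would lean toward the lenient setting; a coarse sentiment summary could lean toward the strict one.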

#### The Learnability-Gödel Paradox

Saving the most controversial for last, this is a very recent paradox that was published in a research paper earlier this year. The paradox links the ability of a machine learning model to learn to one of the most debated results in mathematics: Gödel’s incompleteness theorems.

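Kurt Gödel was one of the brightest mathematicians of all time and one who pushed the boundaries of philosophy, physics and mathematics like few of his predecessors. In 1931, Gödel published his two incompleteness theorems, which essentially say that in any consistent axiomatic system rich enough to express arithmetic, some statements can be neither proved nor disproved. In other words, standard mathematical language is insufficient to settle every question that can be posed in it. A famous example of a statement that the standard axioms cannot settle is the continuum hypothesis, originally posed by Georg Cantor: Gödel showed that it cannot be disproved from the standard axioms of set theory, and Paul Cohen later showed that it cannot be proved from them either.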

In a recent work, AI researchers from the Israel Institute of Technology linked the continuum hypothesis to the learnability of a machine learning model. In a paradoxical statement that challenges all common wisdom, the researchers define the notion of a learnability limbo. Essentially, the researchers go on to show that, for the learning problem they study, if the continuum hypothesis is true, a small sample is sufficient to make the extrapolation. But if it is false, no finite sample can ever be enough. This way they show that the problem of learnability is equivalent to the continuum hypothesis. Therefore, the learnability problem, too, is in a state of limbo that can be resolved only by choosing the axiomatic universe.

In simple terms, the mathematical proofs in the study show that some learning problems are tied to the continuum hypothesis, which means that whether they are learnable can be neither proved nor refuted using standard mathematics. Although this paradox has very few applications to real-world AI problems today, it may become important to the evolution of the field.

Paradoxes are omnipresent in machine learning problems in the real world. You can argue that, because algorithms don’t have a notion of common sense, they might be immune to statistical paradoxes. However, given that most machine learning problems require human analysis and intervention and are based on human-curated datasets, we are going to live in a universe of paradoxes for quite some time.