Data Science | Statistics | Machine Learning

Types of Biases in Data

Biases in data that we should all be aware of to build a reliable and fair machine learning model

Swapnil Kangralkar
Towards Data Science
5 min read | Aug 26, 2021

Image created by the author

Machine learning models are increasingly used to make decisions or to inform them. For example, a model might influence whether a loan is approved or which candidate resumes pass screening for a job. Such decisions are consequential, and we need to be confident that our models don’t discriminate on the basis of ethnicity, gender, age, or similar factors. Machine learning models can contain unintentional bias that results in unreliable and unfair outcomes. Building and evaluating a good machine learning model requires more than calculating loss metrics. Before operationalizing a model, it is important to analyze your training data, and sometimes the source of that data, to look for biases.

In this article, we will look at different types of biases that can manifest in training data.

1. Reporting Bias

Reporting bias (also known as selective reporting) takes place when only a selection of results or outcomes is captured in a data set, so the data covers only a fraction of what happened in the real world. It reflects people’s tendency to report only part of the information available to them.

Types of reporting bias -

  1. Citation bias: occurs when your analysis is based on studies found in the citations of other studies.
  2. Language bias: occurs when you ignore reports not published in your native language.
  3. Duplicate publication bias: occurs when some studies are weighted more because they are published in more than one place.
  4. Location bias: occurs when certain studies are harder to locate than others.
  5. Publication bias: occurs when studies with positive findings are more likely to be published than studies with negative findings or no significant findings.
  6. Outcome reporting bias: occurs when there is selective reporting of certain outcomes. For example, you only report the quarters in which the company posts positive earnings.
  7. Time lag bias: occurs when some studies take years longer to publish than others, so their results are missing from any analysis done in the meantime.
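
To make publication bias concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: the number of studies, the sample size per study, the true effect of zero, and the rule that only results significantly above zero get "published". It is not taken from any real study.

```python
import random
import statistics

def simulate_studies(n_studies=2000, n_per_study=30, true_effect=0.0, seed=0):
    """Return the estimated effect (sample mean) of each simulated study."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_studies):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n_per_study)]
        estimates.append(statistics.mean(sample))
    return estimates

def published_only(estimates, n_per_study=30, z_threshold=1.96):
    """Keep only 'positive findings': estimates significantly above zero.

    Assumes a known standard deviation of 1, so the standard error of
    each estimate is 1 / sqrt(n_per_study).
    """
    standard_error = 1.0 / n_per_study ** 0.5
    return [e for e in estimates if e / standard_error > z_threshold]

all_estimates = simulate_studies()
published = published_only(all_estimates)

print(f"studies run:             {len(all_estimates)}")
print(f"studies published:       {len(published)}")
print(f"mean effect (all):       {statistics.mean(all_estimates):+.3f}")
print(f"mean effect (published): {statistics.mean(published):+.3f}")
```

Even though the true effect in this sketch is zero, the "published" subset shows a clearly positive average effect, because the filter only lets through studies that happened to land high. A dataset built from published results alone would inherit that distortion.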

2. Automation Bias

Automation bias is the tendency of humans to favor results or suggestions generated by automated systems and to ignore contradictory information from non-automated sources, even when that information is correct.

Read about a real-life example of automation bias here.

3. Selection Bias

Selection bias takes place when data is chosen in a way that is not reflective of real-world data distribution. This happens because proper randomization is not achieved when collecting data.

Types of selection bias -

  1. Sampling bias: occurs when randomization is not properly achieved during data collection.
  2. Coverage bias: occurs when data is not selected in a representative manner. For example, if you collect data by surveying only the customers who purchased your product, your dataset does not represent the people who did not purchase it.
  3. Participation bias (also called non-response bias): occurs when the data is unrepresentative because of participation gaps in the data collection process.

Suppose Apple launched a new iPhone and, on the same day, Samsung launched a new Galaxy Note. You send out surveys to 1000 people to collect their reviews. Instead of randomly selecting responses for analysis, you decide to use the first 100 customers who responded. This leads to sampling bias, since those first 100 customers are more likely to be enthusiastic about the product and to provide good reviews.

Next, if you decide to collect data by surveying only Apple customers and leaving out Samsung customers, you will introduce coverage bias into your dataset.

Lastly, you send the survey to 500 Apple and 500 Samsung customers. 400 Apple customers respond, but only 100 Samsung customers do. This dataset underrepresents Samsung customers, which is an instance of participation bias.
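
The survey example can be sketched in a few lines of Python. The counts (500 sent per brand, 400 and 100 responses) come from the example above; the function names are hypothetical, and the reweighting step is a common remedy known as post-stratification weighting that this article does not itself prescribe. Reweighting only helps under the assumption that the people who responded resemble those in the same group who did not.

```python
import random

def random_subset(responses, k, seed=0):
    """Pick k responses uniformly at random instead of the first k received."""
    rng = random.Random(seed)
    return rng.sample(responses, k)

def response_rates(sent, responded):
    """Fraction of surveyed people in each group who answered."""
    return {group: responded[group] / sent[group] for group in sent}

def poststratification_weights(sent, responded):
    """Weight per respondent so each group counts in proportion to 'sent'."""
    return {group: sent[group] / responded[group] for group in sent}

# Counts from the survey example in the text.
sent = {"Apple": 500, "Samsung": 500}
responded = {"Apple": 400, "Samsung": 100}

rates = response_rates(sent, responded)
weights = poststratification_weights(sent, responded)
total = sum(responded.values())
respondent_share = {g: responded[g] / total for g in responded}

print(rates)             # {'Apple': 0.8, 'Samsung': 0.2}
print(respondent_share)  # {'Apple': 0.8, 'Samsung': 0.2}
print(weights)           # {'Apple': 1.25, 'Samsung': 5.0}

# Sampling bias: response IDs are listed in arrival order (hypothetical IDs).
responses_in_arrival_order = list(range(total))
first_100 = responses_in_arrival_order[:100]             # earliest responders only
random_100 = random_subset(responses_in_arrival_order, 100)  # spread across arrivals
```

Samsung customers were half of the people surveyed but only 20% of the respondents, which is the participation gap. With the weights applied, each group again contributes the equivalent of 500 responses (400 x 1.25 and 100 x 5.0).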

4. Overgeneralization Bias

Image created by the author

Overgeneralization occurs when you assume that what you see in your dataset is what you would see in any other dataset meant to assess the same information, regardless of how small your dataset is.
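
One way to see why dataset size matters here is to draw many datasets from the same source and look at how much an estimate moves between them. The sketch below is illustrative only: the true rate of 0.5, the dataset sizes, and the function name are all assumptions.

```python
import random
import statistics

def estimate_spread(sample_size, true_rate=0.5, n_datasets=1000, seed=0):
    """Draw many datasets of the same size and return the (min, max, stdev)
    of the rate estimated from each one."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_datasets):
        hits = sum(rng.random() < true_rate for _ in range(sample_size))
        estimates.append(hits / sample_size)
    return min(estimates), max(estimates), statistics.stdev(estimates)

for size in (10, 100, 1000):
    low, high, spread = estimate_spread(size)
    print(f"n={size:>5}: estimates range {low:.2f} to {high:.2f} (stdev {spread:.3f})")
```

With small datasets, two samples of the same population can disagree widely, so a pattern seen in one of them is weak evidence about the next. The spread shrinks as the datasets grow.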

5. Group Attribution Bias

People tend to stereotype a whole group because of the actions of a few individuals within it. This tendency to generalize what is true of individuals to the entire group they belong to is termed group attribution bias.

Types of Group Attribution Bias -

  1. In-group bias: occurs when you give preference to members of a group to which you personally belong or with which you share common interests. For example, a manager writing a job description for a data scientist position believes that suitable applicants must have a master’s degree because the manager has one too, irrespective of the applicants’ work experience.
  2. Out-group bias: occurs when you stereotype individual members of a group to which you personally do not belong. For example, a manager with a master’s degree writing a job description for a data scientist position believes that applicants who do not hold a master’s degree lack sufficient expertise for the role.

6. Implicit Bias

Implicit bias occurs when assumptions are made based on one’s own personal experiences that do not necessarily apply more generally. People tend to act on the basis of prejudice and stereotypes without intending to do so.

For example, a computer vision engineer from North America labels the color red as a sign of danger. However, red is a popular color in Chinese culture, where it symbolizes luck, joy, and happiness.

Type of Implicit Bias -

  1. Confirmation bias or experimenter’s bias: the tendency to search for information in a way that confirms or supports one’s prior beliefs or experiences. For example, you train a model to rank sports cars by speed using some features, and the results show that Ferrari is faster than Ford. However, you remember watching a movie a few years ago in which Ford beats Ferrari, so you believe that Ford is faster. You keep retraining and rerunning the model until it gives you the result you believe.
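
The "keep rerunning until it agrees with me" pattern can be simulated with a short sketch. The speeds and noise level below are hypothetical numbers picked only for illustration, and `noisy_model_run` is a stand-in for a real train-and-evaluate cycle, not an actual model.

```python
import random

# Hypothetical "true" top speeds in km/h, chosen only for illustration.
TRUE_SPEED = {"Ferrari": 340.0, "Ford": 330.0}

def noisy_model_run(rng, noise=10.0):
    """One train-and-evaluate run: true speed plus random noise per car."""
    return {car: speed + rng.gauss(0.0, noise) for car, speed in TRUE_SPEED.items()}

def rerun_until_ford_wins(seed=0, max_runs=10_000):
    """Repeat runs until the result matches the prior belief.

    Returns the number of runs it took, or None if it never happened.
    """
    rng = random.Random(seed)
    for run in range(1, max_runs + 1):
        predicted = noisy_model_run(rng)
        if predicted["Ford"] > predicted["Ferrari"]:
            return run
    return None

def ford_win_rate(n_runs=10_000, seed=1):
    """Fraction of independent runs in which Ford is ranked faster."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_runs):
        predicted = noisy_model_run(rng)
        wins += predicted["Ford"] > predicted["Ferrari"]
    return wins / n_runs

print("runs needed to get the believed result:", rerun_until_ford_wins())
print("share of all runs where Ford wins:", ford_win_rate())
```

In this sketch Ford is the slower car and wins only a minority of runs, yet stopping at the first run that matches the prior belief always ends with "Ford is faster". A hedged takeaway is to fix the evaluation protocol, including random seeds and the number of runs, before looking at the results.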

Thank you for reading. Get in touch if you have further questions via LinkedIn.
