
The amount of structured and unstructured data generated has grown exponentially over the last few decades and will continue to do so for years to come. Big-data analytics could potentially overcome a challenge that corporations and governments have faced for centuries when making decisions: the lack of adequate data for formulating policies (e.g., targeting a policy at a particular social group) or gauging market and consumer expectations (e.g., recommendation systems). Descriptive as well as predictive modeling driven by the big-data paradigm can help decision-makers derive valuable insights for personal, commercial, or collective gains.
However, modern data collection processes and algorithms remain susceptible to data-mining biases. Without appropriate safeguards, big data can amplify the negative effects of existing social issues (e.g., racial discrimination) and render the findings worthless or even counterproductive [1], [2]. The purpose of this blog is to explore the potential sources of bias that can be introduced during the data-mining and feature-engineering stages of a big-data project.
Garbage in, garbage out – true for big data too
Big data has the potential to overcome the limitations of data availability, an issue that has plagued traditional statistical analysis for decades, but the value of such analytics remains vulnerable to poor data quality and poor application of the data. With the growing penetration of digital gadgets, it is easier than ever before to collect data of all kinds. For example, a smartphone can record in real time where you are, what you are doing, how you spend your time and money, whom you are meeting and for how long, and many other personal details. Notwithstanding the privacy and surveillance concerns, such information has already transformed the economy and the way we live.
Nonetheless, just like old-school data analytics, the success of the new data paradigm depends on the validity of the same assumptions: the target population includes all members associated with the problem being investigated; the sample reflects the statistical population; and the choice of modeling variables is appropriate. Violating any of these assumptions can produce biased conclusions, thereby lowering the value of machine learning for solving many real-world problems.
Figure 1 depicts a typical sequence of steps in a machine learning exercise and shows how mistakes at each step can reinforce existing problems. Careful use can provide effective solutions, but any mistake can perpetuate or worsen existing problems (disparities, discrimination, or something else).
![Image by author based on information in [2]](https://towardsdatascience.com/wp-content/uploads/2021/02/1yRKCQ5SNMpLVvSTxlEzJ-A.png)
Part I: Identifying target population
One of the early steps, if not the first, in a data analytics exercise is to ask which members would most likely have the information we need or be familiar with the issue we are trying to address. This group is our desired population (let’s call it the problem population) to be considered for sample collection. The representativeness of the training sample must be measured only against the problem population that has a direct connection to the problem.
For example, say a road maintenance department wants to use crowdsourcing (via smartphones) to monitor road conditions across a city [1]. While it may seem that everyone living in the area should be the target population, it is only the people traveling on the roads (as passengers or drivers) in the area who form the actual population of interest (our problem population). Accelerometer data from the devices of people who do not travel or use cars will not add any value. Similarly, for a hiring decision, all potentially qualified candidates (those meeting the mandatory qualification requirements) form the sampling population, rather than all graduates or everyone looking for employment. Another obvious illustration is the sampling population for predicting who would win an election. The opinions of most teenagers do not matter in an opinion poll because only voters (people over the age of 18 in most countries) constitute the population for an election survey. It does not matter how many younger people are surveyed; the survey outcome, or any prediction model based on such information, will be useless.
While completing this step may be trivial in most cases with an appropriately described problem, any unintended mistake could have huge implications for the accuracy and relevance of our work. In our case, after defining the target population as all motorists, the road department now has to devise a strategy to gather data from a representative, even if small, number of road users (excluding cyclists and pedestrians for the moment).
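To make this concrete, here is a minimal sketch (in Python, with entirely hypothetical column names and data) of how an analyst might trim a raw pool of poll respondents down to the problem population of eligible voters before any modeling begins.

```python
import pandas as pd

# Hypothetical poll responses: everyone who answered, regardless of eligibility.
respondents = pd.DataFrame({
    "age": [16, 22, 35, 17, 64, 41],
    "is_registered_voter": [False, True, True, False, True, False],
    "preferred_party": ["A", "B", "A", "B", "A", "B"],
})

# The problem population for an election survey is eligible voters,
# not everyone willing to answer the questionnaire.
problem_population = respondents[
    (respondents["age"] >= 18) & respondents["is_registered_voter"]
]

print(f"Kept {len(problem_population)} of {len(respondents)} respondents")
print(problem_population["preferred_party"].value_counts(normalize=True))
```

The filtering criteria here are invented for illustration; the point is that the eligibility rules should be written down and applied before any question of sample size or model choice arises.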
Part II: Ensuring representative training data
The value of any small- or big-data application depends on how accurately the training data describes the problem population. While big data implies a huge dataset, it rarely covers the entire problem population and usually captures only a portion of it, which increases the importance of representative sampling. A training dataset that does not reflect the composition of the underlying problem population will over-emphasize certain members and under-represent others. Wikipedia lists many sampling biases, including selection bias, exclusion bias, reporting bias, and detection bias, and the list goes on. In fact, a small but representative sample can produce more reliable results than a large, biased dataset [3].
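The claim in [3] is easy to illustrate with a quick simulation. The sketch below uses made-up numbers for the road-monitoring example: a large sample drawn only from smartphone owners misestimates the city-wide pothole rate, while a much smaller random sample lands close to the true value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: one million road users; 30% own a smartphone,
# and (by assumption) phone owners travel better-maintained routes.
n = 1_000_000
owns_phone = rng.random(n) < 0.30
potholes_per_trip = np.where(owns_phone,
                             rng.poisson(2.0, n),   # phone owners
                             rng.poisson(5.0, n))   # everyone else
true_mean = potholes_per_trip.mean()

# Large but biased sample: only smartphone owners report data.
biased = rng.choice(potholes_per_trip[owns_phone], size=100_000, replace=False)

# Small but representative sample: a random draw from the whole population.
representative = rng.choice(potholes_per_trip, size=1_000, replace=False)

print(f"True mean:           {true_mean:.2f}")
print(f"Large biased sample: {biased.mean():.2f}")
print(f"Small random sample: {representative.mean():.2f}")
```

The specific rates are arbitrary; what matters is that no amount of extra data from the biased channel corrects the gap, whereas even a modest random sample does.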
For example, in the road-monitoring project mentioned above, if the department relies on accelerometers in smartphones to collect the information, it is possible that low-income groups lacking such high-tech gadgets or adequate internet access fail to become part of the sample [1]. The biased data could result in unfair allocation of taxpayers’ money for road development and repairs, leading to systemic disparities in the quality of road infrastructure within a city. Similarly, poor access to the internet can limit which applicants are shortlisted for a position, or the choice of words in a resume can cause gender-biased hiring [4].
Part III: Choosing representative features
The choice of predictors also affects how different members of the population are treated by the model. Even when we correctly identify the problem population and draw a representative sample for training, certain disparities may be embedded in the input variables. Some features capture an incomplete picture of certain members of the population, and assigning greater weight to such biased predictors will produce an undesirable outcome. So, even if all population strata are well described by the sample, feature engineering can indirectly under- or over-represent certain groups and influence the fairness of the model.
For example, even a small correlation between apparently neutral features can produce unfair results. If aggressive drivers and the elderly both preferred red cars, a decision to charge a higher insurance premium for red cars in order to punish bad drivers would indirectly be biased against the elderly [2].
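A small simulation makes the red-car example tangible. All figures below are invented for illustration; the point is simply that a surcharge keyed on a proxy feature (car colour) falls disproportionately on a group (elderly drivers) that shares the proxy but not the behaviour being targeted.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical drivers: two unrelated groups that both happen to prefer red cars.
is_elderly = rng.random(n) < 0.20
is_aggressive = rng.random(n) < 0.10
prefers_red = rng.random(n) < np.where(is_elderly | is_aggressive, 0.60, 0.20)

drivers = pd.DataFrame({
    "elderly": is_elderly,
    "aggressive": is_aggressive,
    "red_car": prefers_red,
})

# A surcharge keyed on car colour (used as a proxy for aggressive driving)...
drivers["surcharged"] = drivers["red_car"]

# ...hits careful elderly drivers far more often than other careful drivers.
careful = drivers[~drivers["aggressive"]]
print(careful.groupby("elderly")["surcharged"].mean())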
As noted in [2], when "mis-represented groups coincide with social groups against which there already exists social bias such as prejudice or discrimination, even unbiased computational processes can lead to discriminating decision procedures". Such risks apply to the sampling as well as the feature-selection steps.
For example, hiring good employees depends on the definition of what counts as ‘good’, which may be based on the average length of past employment or a proven achievement record. Using tenure length as a feature can disproportionately exclude people working in industries with high turnover (or attrition) rates [1]. While tenure is a desirable proxy for employee loyalty, its unnecessary inclusion in the model may have a dominating influence and perpetuate the very disparities that big data is believed to be reducing. Therefore, despite the best efforts to define the population and gather good data, selecting the wrong features can introduce systemic biases into the results and contribute to the problem.
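One practical way to catch such an effect is a simple ablation check: score candidates with and without the suspect feature and compare how each group fares. The sketch below uses a synthetic applicant pool (all parameters hypothetical) in which underlying skill is identical across groups but tenure is systematically shorter for people from a high-turnover industry.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 10_000

# Hypothetical applicants: same skill distribution in both groups,
# but shorter tenures in the high-turnover industry.
high_turnover = rng.random(n) < 0.30
applicants = pd.DataFrame({
    "high_turnover": high_turnover,
    "skill": rng.normal(0, 1, n),
    "tenure": np.where(high_turnover,
                       rng.exponential(1.5, n),
                       rng.exponential(4.0, n)),
})

def shortlist_share(score):
    """Share of the top-10% shortlist that comes from the high-turnover group."""
    cutoff = score.quantile(0.90)
    return applicants.loc[score >= cutoff, "high_turnover"].mean()

# Ranking on skill alone vs. a score that also rewards tenure.
print("Skill only:     ", shortlist_share(applicants["skill"]))
print("Skill + tenure: ", shortlist_share(applicants["skill"] + 0.3 * applicants["tenure"]))
```

With skill alone, the group’s share of the shortlist matches its share of the applicant pool; once tenure enters the score, that share drops even though nothing about actual ability has changed.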
Final thoughts
Big-data analytics has the potential to overcome many biases arising from individual human decision-making and to present a rational, representative picture of the population. The purpose of this blog was to give an introductory overview of how biases can become part of a machine learning exercise. While mining data and building a machine learning model, analysts must be aware of the following three causes that can dilute the efficacy of the project:
· Failure to correctly identify the statistical (problem) population
· Using unrepresentative training data for predictive modeling
· Biased choice of features for predicting a target variable
As the common saying goes, ‘with great power comes great responsibility’: any carelessness in data mining and feature engineering could worsen existing problems and render the effort counterproductive. Only the combined effect of big data, its ease of access, and sensible data mining can revolutionize our daily lives towards an environmentally sustainable, socially equitable, and economically vibrant world.
References
[1] S. Barocas and A. D. Selbst, "Big data’s disparate impact," Calif. Law Rev., vol. 104, no. 3, pp. 671–732, Feb. 2016.
[2] E. Ntoutsi et al., "Bias in data-driven artificial intelligence systems – An introductory survey," WIREs Data Min. Knowl. Discov., vol. 10, no. 3, p. e1356, May 2020.
[3] J. J. Faraway and N. H. Augustin, "When small data beats big data," Stat. Probab. Lett., vol. 136, pp. 142–145, 2018.
[4] J. Manyika, J. Silberg, and B. Presten, "What do we do about the biases in AI?," Harvard Business Review, Oct. 2019. [Online]. Available: https://hbr.org/2019/10/what-do-we-do-about-the-biases-in-ai.