
Representation, Big Data, and algorithmic bias in Social Data Science



Representation and Bias

Pushing the boundaries of Social Data Science

Photo by NASA on Unsplash

The buzz-phrase of our time, "Artificial intelligence," has inspired all sorts of musings: from visions of utopian – and dystopian – futures freed from human labour, to more focused critiques of algorithmic discrimination. The reality is that the ‘good, the bad and the ugly’ of AI drives some of the most fundamental philosophical and societal questions we face today – and, consequently, much of our evolving tech legislation and policy.

As computing power continues to increase, algorithms have become ever faster at processing unfathomable amounts of information. They have taken over such diverse aspects of our lives that it is hard to keep track of the decision-making and ‘intelligent’ technologies they enable: search engine results, ad content, creditworthiness, recommendations for movies we might like, eligibility for a job, Uber routes and drivers, genetic predispositions – they even help doctors make diagnoses and compose music in a particular style. The list goes on.

A lot has been written about the opportunity cost of our now data-centric world, yet we are all still largely oblivious to when and how these algorithms are actually employed. Either their functioning is proprietary to corporations; or – and this is especially true for very complex decision-making systems like neural networks and deep learning algorithms – even the developers themselves cannot explain why certain inputs yield certain outcomes. Under the guise of being ‘objective’, ‘neutral’, or at least free from the quandaries of human bias, algorithms have become an ever-present aspect of our reality, often for good reason. Many other times, less so.

Most of the data that we generate from our digital interactions, which is then used to predict aspects of our behaviour, is used explicitly to expand profit margins: the Cambridge Analytica–Facebook scandal is an obvious case in point. Horror stories related to algorithms abound: résumé-screening tools that learned to be sexist (having been fed a history of male-dominated résumés); facial recognition technologies that are unacceptably bad at recognising people with dark skin; even an app, Faception, that claims it can detect terrorists or paedophiles based on their faces. Other, far less harmful, stories include algorithms that have persistently shown tampon ads to men on YouTube.

We are, perhaps blissfully, unaware of the extent to which algorithms can curtail some of our civil and political rights by exacerbating the already fraught inequalities built into our societies and institutions. Algorithms, especially machine learning algorithms (where a system ‘learns’ to identify patterns and make predictions based on a training dataset), tend to replicate the racial injustices of the society they are trying to make predictions about, as Hao argues for the MIT Technology Review.

Algorithms in and of themselves are not inherently unfair, of course; they follow code and are statistical tools. The bias stems from the datasets – collated by human hands – on which those algorithms are trained. These are often flawed and of low quality: incomplete, skewed or unrepresentative of the population they are meant to make predictions for. Even more comprehensive datasets can be insufficient to address all the issues: they can be collected without regard for privacy or ethical considerations (as was the case when Google was caught sending teams out to collect facial scans, without being upfront about what data they were collecting or what they were gathering consent for).
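
To make that point concrete, here is a minimal, entirely synthetic sketch – not drawn from any of the studies discussed here – of how a training set that under-represents one group can yield uneven error rates. The group labels, sample sizes and feature distributions are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_group(n, shift):
    """Synthetic data where the feature-label relationship differs slightly by group."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 2))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, n) > 1.5 * shift).astype(int)
    return X, y

# Training data: "group A" is heavily over-represented relative to "group B"
X_a, y_a = make_group(5000, shift=0.0)
X_b, y_b = make_group(250, shift=1.0)
model = LogisticRegression().fit(np.vstack([X_a, X_b]), np.concatenate([y_a, y_b]))

# Held-out evaluation with equal-sized samples from each group
for name, shift in [("group A (well represented)", 0.0),
                    ("group B (under-represented)", 1.0)]:
    X_test, y_test = make_group(2000, shift)
    print(name, "accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 2))
```

The model never sees a ‘group’ variable at all: the disparity emerges purely from who is, and is not, well represented in the data it learns from.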

Researchers at NYU and the AI Now Institute point out that predictive-analytics algorithms can also be fed what they call "dirty data": ostensibly ‘robust’ data collated from the historical practices of criminal justice and policing systems, practices that have many times been notoriously egregious, racially unjust and in violation of civil rights. The researchers specifically examine predictive policing systems in thirteen US jurisdictions, put in place either to allocate police resources or to try to forecast criminal activity. They conclude that any attempt to deploy these automated systems must be approached cautiously and, more importantly, alongside mechanisms that allow the public to know about, critique and reject them if necessary. As the Center for Democracy and Technology points out, algorithms that try to predict violent crime have very low accuracy rates anyway: the probability that a violent crime will occur is statistically low, and the data from which the algorithm can learn represents a very small proportion of the population.
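
The base-rate problem behind that last point can be shown with a few lines of arithmetic. The prevalence, sensitivity and specificity figures below are assumptions chosen only to illustrate the mechanism, not estimates for any real system.

```python
# All three numbers below are illustrative assumptions, not real figures.
prevalence = 0.01    # suppose 1% of the screened population will commit a violent crime
sensitivity = 0.90   # suppose the model flags 90% of the people who actually will
specificity = 0.90   # and correctly clears 90% of everyone else

true_positives = prevalence * sensitivity
false_positives = (1 - prevalence) * (1 - specificity)
precision = true_positives / (true_positives + false_positives)

print(f"Share of flagged people who would actually offend: {precision:.1%}")
# ~8%: even a seemingly accurate model flags mostly people who would never offend,
# because the event it is trying to predict is rare.
```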

We are then clearly faced with a question of representation. Who and what do these ‘Big Data’ sets stand for? What are they used for? What does it mean when we make what are, in the end, statistical assumptions about individuals in order to predict a recidivism score, or whether they should be hired for a job? Especially if these decisions are made by algorithms we do not fully understand, or that are not open to scrutiny? What we are asking, in the end, are also fundamental questions about ethics, equality and, therefore, democracy.

Photo by Emily Morter on Unsplash

Who gets to ask the important questions?

In an article published in the journal Information, Communication and Society, danah boyd and Kate Crawford called for a critical examination of ‘Big Data’ approaches to social research – and their assumptions – back in 2012. As they pointedly note, algorithms and ideas about ‘Big Data’ have changed the ways we think about what research entails, even what counts as knowledge; they represent what Leslie Burkholder (back in 1992) called a computational turn in thinking and research more broadly. boyd and Crawford define ‘Big Data’ as a "cultural, technological and scholarly phenomenon" that partly rests on the myth that large amounts of data offer a "higher form of intelligence" and knowledge that "was previously impossible" to obtain.

They make the case that the fear and hope surrounding our ideas about AI’s possibilities obscure the more subtle, but also more pertinent and practical, issues that arise when using these methods in social science research. They note that access to historical data on Twitter and Facebook is poor: research questions therefore can become – and are becoming – limited to the present, or at least the immediate past. Only the social media companies themselves have unrestricted access to all of the data. Although some of it can be purchased, the logistics involved in the transaction reinforce educational inequalities: top-tier universities (mostly located in what we understand to be the Global North) get to ask the questions they consider important, because only they can afford to do so. And it is only those trained in computational methods, scraping and APIs at these elite institutions who seem placed to answer them – and, at the moment at least, these scientists have also tended to be male.

Of course, some algorithms using large datasets have been shown to work efficiently, or at least as intended. There’s an entire field in biology, bioinformatics, dedicated to gaining insights from inordinately large biological datasets using computational and statistical methods. The field has been highly successful at finding associations at the molecular level, including between specific regions of DNA (genes) and predisposition to disease. Scientists have discovered (and genetic testing companies have profited from) one of the main genes associated with the risk of developing breast cancer, BRCA2, as well as a plethora of other genes associated with specific phenotypes – some of which can tell you things as mundane as whether you will hate the taste of cilantro.
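
For readers curious about what ‘finding an association’ means in practice, here is a hypothetical case–control sketch using a Fisher’s exact test; the carrier counts are invented for illustration and are not real BRCA2 data.

```python
from scipy.stats import fisher_exact

# Hypothetical counts of variant carriers among cases and controls (invented numbers)
#              carriers  non-carriers
cases =    [40, 960]   # people diagnosed with the disease
controls = [10, 990]   # people without the disease

odds_ratio, p_value = fisher_exact([cases, controls])
print(f"odds ratio ~ {odds_ratio:.1f}, p-value ~ {p_value:.1e}")
# A large odds ratio with a tiny p-value is what flags a variant as 'associated'
# with a phenotype; it is a population-level statement, not a prediction about you.
```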

We could, no doubt simplistically, conclude that most of the trouble with AI stems from applying it to social science questions that might have no ‘right’ answers. Where science studies ‘objects’, the humanities study ‘subjects’, we are often told. The reality is, as always, infinitely more complex.

Photo by Gayatri Malhotra on Unsplash

Allegheny County

The case of Allegheny County, home to Pittsburgh, comes to mind: it was the first county in the US to implement a predictive-analytics algorithm to offer a ‘second opinion’ on incoming calls alleging child abuse to the county’s child-protection hotline. The algorithm’s job is to help decide whether such a call requires an intervention. According to Dan Hurley, a science journalist writing for The New York Times, it takes into account over 100 ‘criteria’ or data points drawn from psychiatric services, jails, welfare benefits, and drug and alcohol treatment records, among many others. It is taxing and time-consuming for a call screener to find even a fraction of this information; the algorithm returns a decision in seconds. The data used to train the algorithm is (expectedly) biased against African-Americans – Erin Dalton, the county’s deputy director of human services, tells Hurley that black children are already being "over-surveilled" in the system. Yet for Dalton and Walter Smith Jr., deputy director of the county’s Office of Children, Youth and Families, who is black, the algorithm produces less bias than human screeners would. It is also used only to decide what to investigate – not which children to remove from their homes.

Importantly, the Allegheny algorithm, developed by two social scientists, Rhema Vaithianathan and Emily Putnam-Hornstein, is the property of the county – unlike other, similar algorithms deployed across the US that are the trade secrets of private companies and cannot be audited. Allegheny’s workings are open to scrutiny by academics and specialists, and were discussed with parents and even former foster care children at public meetings held before its adoption.

Photo by National Cancer Institute on Unsplash

Bio-data and Bias

Biology in particular is a science that shares some of the pitfalls involved in collating ‘Big Data.’ In an interview for this piece, Dr. Daniel Martin-Herranz, Chief Scientific Officer at the biotech start-up Chronomics – a company that uses machine learning algorithms to calculate biological age (vis-à-vis chronological age) from markers in your DNA – explains one of the least discussed biases in the literature on algorithmic discrimination: that "genetic datasets are skewed towards Caucasian populations, both because the price of genetic and epigenetic tests means certain groups are less likely to be able to afford them, and because the countries that have compiled wide-scale sequencing of their populations are majority white."

Dr. Martin-Herranz tells me that what this has meant is that "health technologies developed from these data are based on personalised genetic information biased towards white populations." He adds: "There are now several projects trying to correct for these biases, that is, that are trying to proactively sequence non-Caucasian genetic backgrounds. But there is still much work to be done."
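
A toy simulation can illustrate why this matters for something like an epigenetic clock. The sketch below is not Chronomics’ method – it is a generic penalised regression on synthetic ‘methylation’ values, where a small baseline shift stands in for ancestry-related differences absent from the training data; every number is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)

def simulate_cohort(n, baseline_shift):
    """Synthetic methylation values that drift with age; 'baseline_shift' stands in
    for population-level differences the training data never saw."""
    age = rng.uniform(20, 80, n)
    methylation = 0.5 + baseline_shift + 0.004 * age[:, None] + rng.normal(0, 0.02, (n, 50))
    return methylation, age

# Train the 'clock' almost exclusively on one reference population...
X_train, age_train = simulate_cohort(3000, baseline_shift=0.0)
clock = Ridge(alpha=1.0).fit(X_train, age_train)

# ...then evaluate it on a population with a slightly different methylation baseline
for label, shift in [("reference population", 0.0), ("under-represented population", 0.03)]:
    X_test, age_test = simulate_cohort(1000, shift)
    error = mean_absolute_error(age_test, clock.predict(X_test))
    print(f"{label}: mean absolute error ~ {error:.1f} years")
```

The model is perfectly well behaved on the population it was trained on, and systematically wrong on the one it was not – which is precisely the skew Dr. Martin-Herranz describes.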

Photo by British Library on Unsplash

AI and Coloniality

There are, indeed, several ways in which Dr. Martin-Herranz’s example of ‘coloniality’ – modern global inequality that reflects an extension of the power relations between coloniser and colonised countries – manifests itself in AI more broadly. Structural racism is just one aspect of this, as a recently published paper by Shakir Mohamed, a researcher at Google’s DeepMind, and his colleagues William Isaac and Marie-Therese Png shows. Collating the datasets that power natural language processing algorithms – the kind behind, for instance, Google’s Assistant – requires laborious manual labelling. This labour is known as ‘ghost work’, because it is rarely acknowledged openly as human work. Ghost work is mostly done by poorly paid workers – many of them working unpaid overtime – from former UK and US colonies where English is broadly spoken.

Latin America, Central Asia and Africa have also largely been left out of the ethical discussions surrounding AI governance: countries in these regions are falling further behind on legislation, while the Global North continues to benefit from norms conceived around the issues that concern its own populations. Hao, again for the MIT Technology Review, reminds us that Cambridge Analytica beta-tested its algorithms on the 2015 Nigerian elections before the US and UK votes of 2016. Her example points to what Ruha Benjamin, a sociologist at Princeton, tells us in an episode of the podcast ‘Reset’: that "racialized groups have been targeted and included in harmful experimentations" throughout history. Benjamin adds: "scientists and doctors have gone after the most vulnerable populations in order to hone various technologies and techniques." In AI, this has not fundamentally changed. It is one reason why scholars examining algorithmic bias point to the importance of letting those with fewer resources initiate their own AI projects, as a way to mitigate paternalistic approaches that make assumptions about these populations and their needs.

Photo by Joshua Sortino on Unsplash

Clearly, we need to extend awareness of the types of bias that can be implicit in data collection, and in the predictions that these data will then yield.

The answer lies not simply in auditing these systems: as Rebecca Heilweil suggests – and as the example in my research shows – you can only correct for the biases you are aware of. There are surely other ideological biases implicit in the NRC lexicon that need to be disinterred. Simply because an algorithm is tested for bias against women, for instance, does not mean it necessarily performs fairly for women of colour or transgender people.
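
As a rough sketch of how limited such audits can be, the snippet below probes a word–emotion lexicon for a handful of identity terms. The file name, file format and probe list are assumptions for illustration only; whatever is not on the probe list goes unchecked, which is exactly Heilweil’s point.

```python
from collections import defaultdict

def load_lexicon(path):
    """Assumes tab-separated lines of the form: word<TAB>emotion<TAB>0-or-1."""
    lexicon = defaultdict(set)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word, emotion, flag = line.rstrip("\n").split("\t")
            if flag == "1":
                lexicon[word].add(emotion)
    return lexicon

# Hypothetical local path to the lexicon file
lexicon = load_lexicon("NRC-Emotion-Lexicon-Wordlevel.txt")

# Any probe list is itself a choice: biases attached to terms you never
# thought to check will pass through the audit silently.
probe_terms = ["woman", "man", "gay", "muslim", "immigrant", "black", "white"]
for term in probe_terms:
    print(f"{term:>10}:", sorted(lexicon.get(term, [])) or "not in lexicon")
```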

Understanding these nuances requires knowledge of complex social theory and of intersectionality (how interconnected systems of discrimination – around sexuality and race, for instance – manifest themselves) – knowledge that those who work with and design AI systems may need to become more intimately acquainted with. Mohamed, Isaac and Png note that their examples are not entirely comprehensive, but what they underline is the overarching advantage that an understanding of coloniality and race theory brings to the table. There is clearly a need to collaborate on AI governance and policy in an interdisciplinary fashion, just as there is a need to value the insight gained from in-depth, contextual ‘small data’ research in social science, and its contribution to social theory.

AI in public policy is a mammoth of an ethical conundrum – without even getting into the issues of privacy involved (another elephant in the room). As the Center for Democracy and Technology notes in its entry on Artificial Intelligence and Machine Learning, "what matters is that automated systems give users enough information to assess how much trust they should place in its digital decisions."

Openness and transparency must become fundamental pillars of AI implementation – especially where algorithms are handling serious, and potentially detrimental, decisions that pertain to public institutions such as the courts, the criminal justice system or the police. I can critique the NRC lexicon because it is an open academic project, as is the Allegheny algorithm. Ideally, any algorithm that decides on our behalf, or that affects important areas of our life, public or otherwise – in banking, say, or in deciding our eligibility for a job – should be open to audit and in no way profit from our unawareness. Finally, for all this to fall into place, it is clear we need to continue to make AI democratisation, ethics and governance an issue of public interest and debate.

