Thoughts and Theory

Data Science, Meaning, and Diversity

Epistemic Humility and Standpoint Theory for doing data science better

Ismael Kherroubi Garcia
Towards Data Science
14 min read · Apr 10, 2021

--

Data science is, basically, the science of data. It seeks to generate actionable knowledge about the world by studying data. By “actionable knowledge”, I mean knowledge that can be drawn on by other researchers and policymakers to continue advancing how society interacts with the world and understands itself. Therefore, I assume that scientific inquiries generally seek to produce knowledge that can be acted on. However, the way in which we treat data in the context of data science will impact whether or not analyses result in actionable knowledge that is relevant to the real world. In this essay, I defend the need for epistemic humility and diversity — as it is proposed by standpoint theory — for data to be conducive to practical scientific knowledge.

This post is structured as follows. In section one, I introduce four features of data as it pertains to science at large. These four features respond to the following question: what makes data meaningful? I then expand on the four features as they appear in data science in particular. In section two, I spell out three problems inherent to data given the four features. In section three, I introduce epistemic humility and standpoint theory as providing tools for the mitigation of the identified problems. In section four, I defend two assumptions that underpin this essay: (i) that multidisciplinarity can be problematic and (ii) that data both relate to scientific expectations and represent phenomena.

1. Data in Data Science

In a previous post, I identified four features of data that can render it meaningful:

  1. Interpretation: data carry meaning by virtue of their being interpreted — there are no mind-independent data (in my account of data);
  2. Materialisation: for the interpretation of data, they must take some “tangible” form — this might be a .jpg file, a scribble in a notebook, a brick in a wall… — for their analysis and manipulation;
  3. Context: understanding data points requires that we engage with how they interact with other datapoints within their datasets and even outside of these — to learn that a certain neighbourhood has a relatively low income per capita is quite different to engaging with how its inhabitants have been subject to institutional racism (e.g.: Razai et al., 2021); and
  4. Metadata: information about the data, expressed in more intelligible language, that provides further details about the data themselves — for example, date of collection, original format, file size and so on.

The objective of those four features is for data to play a useful role in science. The requirement of “meaningfulness” responds to an account of science whereby it seeks to better understand the world not for the sake of it, but to help humankind better interact with one another and the environment. Whilst it is certainly true that scientific research can respond to less laudable desires, science generally advances human knowledge and, therefore, can enhance human livelihoods.

In the context of data science, the four features hold under scrutiny but raise questions that might be less relevant to other disciplines.

Interpretation in data science, for instance, raises an important question: who does the interpretation? Whilst science more broadly responds to this question quite easily (it is the research team who interpret the data), data science is very often a multidisciplinary endeavour. Data scientists are often part of what I call multidisciplinary epistemic groups (more on MEGs here). In these groups, data scientists engage with experts in other fields who are less knowledgeable in big data analytics. Conversely, the data scientists do not have the know-how to engage fully with the theoretical foundations of the other field. To this effect, interpretation of the data is provided both by the data science team and the experts of the field they are engaging with (let us call these other-field experts). This assumes a great deal of expertise on both teams’ part and not much overlap between the two. With this limitation in mind (and the assumption does not hold in a great deal of data science work), the other-field experts provide an interpretation of the “raw data” they hand over to the data scientists. The other-field experts categorise the raw data, label them and so on; they have the greater expertise to do so. The data scientists then “clean” the data and standardise them to conduct computational, mathematical and statistical analyses on them. Once this is done, they return a new form of data, with further insights, to the other-field experts. A new layer of interpretation is then offered up by the other-field experts, as they integrate the new insights into their field’s theoretical framework.

Materialisation of data is a critical aspect of data science because data must be collected and given tangible form before computational analysis can take place. Indeed, data science works with computational tools that might require the data to be in certain formats and meet other criteria. The question posed in a data science project also requires that the data meet certain specifications. With this in mind, in the context of data science, I understand data as quantifiable units of information. Data are information that have been extracted from the world (“collected” by the other-field experts) and abstracted for the purpose of their analysis through quantitative methods. Data are materialised for data scientists when they are quantifiable and have (possibly only) mathematical features. This has an impact on how data science understands the context of data.
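To make the idea of “quantifiable units of information” a little more concrete, here is a minimal sketch, in Python, of what this materialisation might look like in practice. The records, column names and coding choices below are entirely hypothetical; the point is only that every field ends up as a number (or a missing value) before any computational analysis can begin.

```python
# A minimal sketch of "materialisation": raw records are coerced into
# quantifiable units so that computational tools can operate on them.
# The column names and values are hypothetical, for illustration only.
import pandas as pd

raw = pd.DataFrame({
    "respondent_id": ["a17", "b42", "c08"],
    "age":           ["34", "29", "not stated"],
    "annual_income": ["£21,000", "£48,500", "£33,000"],
    "gender":        ["F", "M", "F"],
})

# Coerce each field into a numeric or categorical representation;
# anything that resists quantification becomes a missing value (NaN).
clean = pd.DataFrame({
    "age": pd.to_numeric(raw["age"], errors="coerce"),
    "annual_income": pd.to_numeric(
        raw["annual_income"].str.replace(r"[£,]", "", regex=True),
        errors="coerce",
    ),
    "gender": raw["gender"].astype("category").cat.codes,  # F -> 0, M -> 1
})

print(clean.dtypes)  # every column is now a quantifiable unit of information
```

Notice what has already happened by this point: the respondent’s identity has been dropped, “not stated” has become a blank, and gender has become an integer. This is exactly the abstraction discussed in the next paragraph.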

Context of data is lost by the nature of data science tools and methods. This may be a bold claim, but it acknowledges the importance of the data wrangling process that follows their collection. Data must be manipulated to meet the criteria of computational approaches. Consider a study of a country’s census, where individuals become values according to categories of gender, age, annual income and so on. Data science applied to a census has nothing meaningful to say about any one individual beyond providing some comparisons to averages; at no moment does data science tell us of people’s identities. This is not problematic (the census is not supposed to provide deep insights into individuals in the population), but it does highlight the distance between the data that data scientists study and the contexts from which they originate (much more on this was said by the authors of Data Feminism at The Alan Turing Institute). Just to add fuel to the fire, the design of data science methodologies and tools needn’t even rely on the collection of “real” data. Modern approaches to the creation of such tools rely on what is called “synthetic data”. Synthetic data are data that contain certain features we expect in “real data”. Synthetic data help artificial intelligence systems and machine learning models compensate for otherwise insufficient quantities of data, or for data that are difficult to anonymise. In short, synthetic data are artefacts created by reflecting, in individual (“unreal”) data points, parameters that appear in “real data”. The synthetic data, when analysed together, reflect distributions of those parameters similar to what we find in the “real data”. I do not wish to go into synthetic data any further (I couldn’t if I tried), but mention them just to emphasise that data science can work with truly abstract and decontextualised data.
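For readers who, like me, find synthetic data easier to grasp with an example: the sketch below shows the simplest possible version of the idea. It assumes a toy set of “real” observations, estimates a couple of distributional parameters from them, and samples new, “unreal” points that reproduce those parameters. Real synthetic-data generators are far more sophisticated; this only illustrates that no individual synthetic record corresponds to anyone in the world.

```python
# A minimal sketch of synthetic data generation, under the simplest
# possible assumption: the "real" data are summarised by a mean and a
# standard deviation, and synthetic points are sampled to match them.
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical "real" observations, e.g. ages from a small survey.
real_ages = np.array([23, 31, 45, 52, 38, 27, 60, 41, 35, 48])

# Parameters we expect the synthetic data to reflect.
mu, sigma = real_ages.mean(), real_ages.std()

# Generate synthetic ("unreal") data points with a similar distribution.
synthetic_ages = rng.normal(loc=mu, scale=sigma, size=1000)

print(f"real      mean={mu:.1f}, std={sigma:.1f}")
print(f"synthetic mean={synthetic_ages.mean():.1f}, std={synthetic_ages.std():.1f}")
```

None of the thousand synthetic ages belongs to a real person, yet the collection behaves statistically like the original sample; that is the sense in which data science can work with truly decontextualised data.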

Metadata might provide a chance for data science to overcome the problem of decontextualisation. After all, metadata are data about the data being analysed. However, metadata are often limited to describing a datapoint’s date of entry, size, format, and so on. Metadata could, theoretically, provide context; I have even suggested that metadata could include information about a project’s ethical considerations. The particular form of data materialisation in data science also emphasises the importance of metadata (here is my attempt at making this point in a creative way).
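As an illustration of what richer metadata might look like, here is a small, hypothetical record. The field names are my own invention, and the “ethical_considerations” field is simply one way the suggestion above could be put into practice.

```python
# A minimal sketch of a metadata record, assuming a simple dictionary
# format. The field names, file name and contents are hypothetical.
import json

metadata = {
    "dataset": "neighbourhood_income_2021.csv",   # hypothetical file
    "date_of_collection": "2021-03-15",
    "original_format": "csv",
    "file_size_bytes": 48213,
    "collected_by": "local council survey team",
    # Context that is usually lost once the data are abstracted:
    "ethical_considerations": (
        "Income figures reflect communities subject to long-standing "
        "structural inequities; see the project ethics review for details."
    ),
}

print(json.dumps(metadata, indent=2))
```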

2. Problematising Data in Data Science

The above four features of data in data science have raised three challenges for the field. Just to recapitulate:

  1. The multidisciplinarity problem relates to the need for data scientists to engage with other-field experts and vice versa. Neither of the two can fully engage with what the other has to say about data because of their respective degrees of expertise in their own fields;
  2. The abstraction problem arises from data needing to be cleaned and reshaped to fit the tools at the data scientist’s disposal; and
  3. The decontextualisation problem relatedly comes about from the mathematisation of data that might otherwise be meaningful to a much wider public.

These problems are all intricately linked. Consider the increasing specialisation and complexity of data science methodologies and tools. This means that fewer other-field experts can engage with them, so the multidisciplinarity problem is worsened; the data are further abstracted and manipulated to work with increasingly sophisticated tools; and their original context is left even further behind.

The overarching problem here is the detachment of what data science studies — data — from the reality that we expect science to help us understand. I call this the ontological problem of data science (and hope to write about it elsewhere). To partially overcome this problem, and the three above, I now introduce the need for more accurate science communication and for the “anchoring” of data.

3. Solving Data in Data Science

We have already seen that metadata — as a feature of data — might provide much needed context to the data that data scientists analyse. However, metadata are often only accessible to fellow scientists who know where to look for them. Metadata are also rarely used in the way I suggested above; their potential is seldom harnessed.

Another way to overcome some of the problems is through greater transparency regarding what data science can achieve. We may call this “epistemic humility”; roughly, the acknowledgement of the limitations of data science. This can manifest in three interrelated ways. Firstly, data scientists can acknowledge the particularly abstract nature of the data they analyse. This can help them communicate more clearly that their findings pertain to abstractions rather than to more tangible phenomena in the world. This, in turn, can help with the second form that epistemic humility can take in data science: scope limitation. As it stands, data science is seen as an all-powerful tool for advancing human knowledge in all fields. Whilst there is some truth to this, communicating the abstract nature of its findings can help other-field experts approach data science with more caution — data science is not, ultimately, a silver bullet. Thirdly, being more humble about one’s expertise in multidisciplinary collaborations can empower other-field experts to be more confident in their analyses. The stages of interpretation in such collaborations (as introduced in section 1) must sit with the corresponding experts when they are not trained in one another’s discipline. Whilst dialogue will be necessary to understand the scientific question that the other-field expert asks, and then again when the data scientists present their findings, a great deal of trust in one another’s expertise is necessary, and this trust is best bestowed from a position of humility.

A further solution for the three problems identified is what I call “anchoring”. Anchoring is the process whereby data are once again contextualised and “un-abstracted”, so to speak. The point of anchoring is for data scientists to consider the data they analyse as referents for more ordinarily salient realities. Anchoring is remembering that data are abstractions of more significant worlds and experiences. In a sense, “recontextualising” and “un-abstracting” do not quite capture what is going on here. Let us take a step back.

Ultimately, data are representations of phenomena in the world (this is not the case for synthetic data, as mentioned earlier on). They have been extracted in some way from the world and then abstracted to quantifiable units of information for the purpose of their analysis. Anchoring boils down to acknowledging that this process leaves us with data that are “only” representations. The process of anchoring is about remembering what the data represent; it allows for the data scientist to be considerate in their analyses. This increased consideration, in turn, allows for the benefits we already described when discussing more transparent science communication.

The sense in which I speak of anchoring is inspired by the case made by feminist epistemology. One such strand of feminist epistemology is standpoint theory. In short, and with no intent of providing an exegesis of this rich philosophical field, standpoint theory seeks to provide a voice for those who are traditionally marginalised. Standpoint theory acknowledges that there is a great deal of understanding about the world that is not present in scientific enquiry. This is precisely because of that marginalisation. However, this is not to say that anybody who is marginalised in society has access to scientifically valuable knowledge on the basis of their marginalisation. The marginalised voice that must be brought into the scientific community is that of somebody who has learnt of their oppressed situation, of the structures of injustice that have shaped their lives and world views. To this effect, standpoint theory in data science allows for two useful transformations.

Firstly, the socially aware marginalised voice can value the data they analyse as carrying more meaning when pertaining to their very communities and lived experience. For example, consider a scientific project that analyses data about some behavioural patterns amongst people of a specific marginalised community. This project will have a great deal to gain from engaging with researchers from that very community who have become aware of the unjust power structures that maintain the marginalised status of their community. Suddenly, the data that had been abstracted and quantified are treated with greater care. They represent significant phenomena which lose meaning if analysed by data scientists who do not recognise this.

Secondly, and just as importantly, by recognising the growing importance of data science within science at large, it should be clear that it has a great deal of power in a world that is so driven by theory. The results of artificial intelligence, machine learning and so on that feed into the technologies that we engage with in our day-to-day lives mean that data science holds a position of power in the world. Data science has the ability to perform and project current injustices into the future. It would be wrong to allow data science to recruit mostly those who have historically held positions of power. Indeed, the feminist epistemologist brings to our attention the performative nature of systemic unfairness, marginalisation and discrimination. If we allow data science to continually build on unjust assumptions and feed into unreflexive technologies that perpetuate the injustices of the world, it is hard to imagine data science providing very much knowledge that actually helps us better understand and engage with our surroundings.

4. Some nuance

A summary of the thoughts so far:

  1. Data become useful in science when interpreted, materialised and contextualised, and when clarified with metadata.
  2. Data in data science are extracted and abstracted from reality into quantifiable units of information.
  3. Data in data science raise the problems of multidisciplinarity, abstraction and decontextualisation.
  4. Epistemic humility and standpoint theory highlight the value of diversity in data science, insofar as it helps overcome the three problems.

There are two further underlying assumptions that are worth unpicking. Firstly, there is a question of multidisciplinarity and whether it is in fact true that data scientists and other-field experts do not have shared understandings. Secondly, it is implied throughout the text that data are “mere” representations of reality rather than anything more meaningful.

On the multidisciplinarity problem, it is true that I provided an oversimplification. However, it is worth clarifying that this was not done to undermine the expertise of either the data scientist or the other-field expert. The assumption arises from the literature on the sociology of science, which clearly demonstrates the need for academics and researchers to prioritise depth of expertise in their training over breadth of knowledge (see, inter alia, Poteete et al., 2010). Of course, this does not account for the many researchers who have a background in some field other than data science and then train in data science methods and tools. On this note, I must say that data science, in the way that I am considering it, stands as a field of its own rather than a set of tools and methods one picks up along the way. I should clarify that this is not to talk down to those who do become expert data scientists after receiving extensive training in some other field. However, I will maintain the position that these particular researchers are interdisciplinary in their training and, therefore, in their work. I do not have them in mind when speaking of “data scientists” in this essay. Ultimately, it was necessary for me to pin down some basic notion of multidisciplinarity for the purpose of advancing the arguments in this text. It remains an interesting question for sociologists and philosophers of science to pursue the precise nature of multidisciplinarity, transdisciplinarity and interdisciplinarity in data science.

Regarding the way I refer to data, it is not far off from Leonelli’s (2016a) relational account, whereby data “portability” and “materiality” are prominent aspects (see Leonelli, 2016b). By this account, data are also collected with the expectation of becoming evidence for claims about the world (ibid.: 197). However, I have spoken of data as “representing” phenomena, which Leonelli attributes to a different, representational account. By that account, materiality (a .pdf or .jpg format, say) is a feature rather than a core value of data. By the representational account, data carry meaning throughout time and space in virtue of their referencing phenomena in the world, regardless of their format. However, Leonelli explains that this account is problematic because it constrains what the data can refer to: they can only represent individual phenomena. This does not hold in a world of data science where data are abstracted and analysed through scientific models that may not represent the data’s initial phenomena. But this abstraction is precisely what I have problematised in this essay. Perhaps I ought to clarify that the case has been made not for avoiding the abstraction of data into new models tout court, but for acknowledging the source and representational value of data as much as possible.

5. Concluding

This essay has introduced data in science as being meaningful by virtue of having four properties: interpretation, materialisation, contextualisation and metadata. These four properties, in relation to data science, raised the problems of multidisciplinarity, abstraction and decontextualisation. I suggested that these three problems can be counteracted by epistemic humility, on the one hand, and diversity, on the other. Respectively, epistemic humility amounted to acknowledging the limitations of one’s expertise; and diversity was introduced as the sort prescribed by feminist standpoint theory. Two strong assumptions were then explored: the narrowness of scientists’ expertise, and this paper’s account of data. Expertise narrowness was defended on the basis of academic infrastructures, as studied by sociologists and philosophers of science. This essay’s account of data was then found to blur the boundaries between the relational and representational accounts, but the discussion helped clarify my overarching claim: that acknowledging the source and representational value of data in data science is important if we are seeking actionable knowledge.

Further Reading

On epistemic humility, see Medina, J. (2013) Active Ignorance, Epistemic Others, and Epistemic Friction, DOI: 10.1093/acprof:oso/9780199929023.003.0001

On standpoint theory, see Intemann (2010) 25 Years of Feminist Empiricism and Standpoint Theory: Where Are We Now?, DOI: 10.1111/j.1527-2001.2010.01138.x

On multidisciplinarity, I have a decent reading list — all of which I have obviously read — at the top of where I am writing my MSc dissertation, in the “Resources” section: https://hackmd.io/IqPngF_XS1uDwDg50sE9Zw?view#Resources

Thank Yous!

To Jonathan Schulte for the frustratingly rigorous philosophical input — I look forward to your thoughts on data!
To Maria Eriksson for the lengthy chats and making sure I do not get carried away by unwarranted epistemic arrogance!
To Arielle Bennett and Malvika Sharan for the insightful and extensive comments!

About That Epistemic Humility…

I have struggled significantly to write the above. Mostly because I am not a data scientist and because data science raises enormous questions I am simply incapable of wrapping my head around. Previous versions of this essay included mentions of “data science as a theory of ontology”, for example. Ludicrous things that boggle my mind even more than synthetic data! (I hope I did not get that bit too wrong and that it is useful to people entirely outside of data science).

At The Alan Turing Institute, I am surrounded by brilliant minds doing fascinating work, yet I can see how, more broadly, data science as a provider of solutions can be taken for granted.

I joked around recently about the increasingly fast advancement of science and, well, the need to pause and smell the flowers. I think what I am actually asking is for when data scientists have a chance to engage with Philosophy of Science. Or how philosophers of science can engage more meaningfully with Data Science.

What I am trying to say, really, is: I hope you — having reached the end of my essay — feel encouraged to question how things are traditionally done. Listen to panels about Doing Better in Data Science, watch documentaries about the powers that artificial intelligence and the likes respond to, and engage with people who know a lot more than me! But also thank you. I really do appreciate you taking the time to drop by!


Currently studying MSc in Philosophy of the Social Sciences at the LSE. Previously managed research governance at the UK’s national AI institute. Assoc CIPD.