Measuring Discourse Bias Using Text Network Analysis

Published in Towards Data Science, Oct 4, 2018

In this article I propose a method and a tool for measuring the level of bias in discourse, based on text network analysis. The measure is derived from the structure of the text and uses both quantitative and qualitative parameters of a text graph to identify how strongly biased the text is. It can therefore be used by humans and also be built into various APIs and AI systems to perform automatic bias analysis.

Bias: the Good and the Bad

Bias is commonly understood as inclination or prejudice towards a certain point of view. A discourse or text that has a bias may have a certain agenda or promote a certain ideology.

In the age of “fake news”, rising extreme ideologies and various misinformation techniques, it is important to be able to identify the level of bias in discourse, be it social network posts, newspaper articles or political speeches.

Bias is not necessarily a bad thing. Sometimes it can make an intention stronger, push an agenda forward, make a point, persuade, dissuade and transform. Bias is an agent of change; however, when there is too much of it, it can also be destructive. When we measure bias, we measure how ideologically charged a text is and how strongly it puts forward a certain point of view. In some contexts, such as fiction or highly charged political speeches, strong bias may be preferable. In others, such as news or non-fiction, strong bias may reveal an agenda.

To my knowledge, there are currently no tools that can measure how biased a text is. Various text mining APIs categorize texts based on their content and sentiment, but there are no instruments that can measure the level of inclination towards a certain point of view in a text. The instrument and the method proposed in this article can serve as a first step in this direction. The open-source online tool for text network analysis that I developed can already measure bias based on this methodology, so you are welcome to try it on your own texts and see how it works. Below I describe how the bias index works, along with some technical details.

Discourse Structure as a Dynamic Network

Any discourse can be represented as a network: the words are the nodes and their co-occurrences are the connections between them. The resulting graph traces the pathways of meaning circulation. We can make it more readable by pulling the more densely connected nodes together into clusters (using the Force Atlas layout algorithm) and marking each cluster with its own color. We can also make the more influential nodes (those with high betweenness centrality) bigger on the graph. You can read more about the technical details in this whitepaper on text network analysis.
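To make the idea of words as nodes and co-occurrences as connections more concrete, here is a minimal Python sketch. It is not the pipeline used by the tool: the tiny stopword list, the `window` parameter and the function name `build_text_graph` are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Illustrative stopword list (an assumption, far from complete)
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "it", "that"}

def build_text_graph(text, window=2):
    """Return {word: {neighbor: weight}} for words that co-occur
    within `window` consecutive tokens (stopwords removed first)."""
    tokens = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    graph = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        # Link the word to the tokens that follow it inside the window
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word][other] += 1
                graph[other][word] += 1
    return {w: dict(nbrs) for w, nbrs in graph.items()}

graph = build_text_graph("People listen when people speak with a powerful voice")
print(sorted(graph["people"]))  # ['listen', 'speak', 'when']
```

A wider window connects words that are further apart, which makes the graph denser. The right value depends on the kind of text being analyzed.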

For example, here’s a visualization of the TED talk by Julian Treasure called “How to Speak So People Will Want to Listen” made using this method. If you would like to explore the actual interactive graph, you can open it here.

From this graph we can clearly see that the main concepts are the notions of

“people”, “time”, “world”, “listen”, “voice” etc.

These concepts are the junctions for meaning circulation in that particular discourse. They connect the different communities of nodes (designated by distinct colors).

The algorithm works in a way that emulates human perception (following the landscape reading model, the idea of semantic priming, and common sense): if words are frequently mentioned in the same context, they will form a community in the graph. If they appear in different contexts, they will be pushed away from each other. If words are frequently used to connect different contexts together, they’ll appear bigger in the graph.
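The words that “connect different contexts together” are the ones with high betweenness centrality. As a rough illustration, the sketch below computes it with Brandes' algorithm on a small, unweighted toy graph. The graph, the word choices and the function name are assumptions, and the real text graphs are weighted, which this sketch ignores.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph
    given as {node: set_of_neighbors}. Returns unnormalized scores."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies, starting from the farthest nodes
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was counted from both of its endpoints
    return {v: c / 2 for v, c in score.items()}

# Two tightly knit word groups joined only through "listen" and "voice"
adj = {
    "people": {"speak", "listen"},
    "speak": {"people", "listen"},
    "listen": {"people", "speak", "voice"},
    "voice": {"listen", "sound", "world"},
    "sound": {"voice", "world"},
    "world": {"voice", "sound"},
}
scores = betweenness(adj)
# "listen" and "voice" both score 6.0; the other words score 0.0
```

In this toy graph the two bridging words get all the betweenness, so they would be drawn as the biggest nodes.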

As a result, the structure of a text network graph can tell us a lot about the structure of the discourse.

For example, if the graph has a pronounced community structure (several different communities of words), the discourse also has several distinct topics, which are expressed in the text. In our example we have at least 4 major topics:

people / listen / speak (dark green)
time / talk / register (light green)
world / sound / powerful (orange)
amazing / voice (pink)

If we analyze other texts in the same way, we will see that the resulting graph structures are different. For instance, here’s a visualization of the first chapter of the Quran:

Text network visualization of the Quran made using InfraNodus. The structure of the graph is less diversified and more centralized. There are only a few main concepts, the discourse circulates around them, and the rest of the text supports the main concepts.

It can be seen that it has a different network structure. It is much more centralized and less diversified. There are a few main concepts:

“god”, “people”, “believe”, “lord”, “give”

and the whole discourse circulates around these concepts. All the other notions are there to support the main ones.

We performed a similar analysis with the inauguration speeches of the US presidents from 1969 to 2013 and visualized the way their narrative changed over time:

Visualization of the US presidents’ inauguration speeches made using InfraNodus (TNA) and Gephi (visualization). The structure stays more or less the same over time; however, Obama’s speeches seem to have more distinct influential terms, indicating a more diversified discourse.

It can be seen that while the structure of the discourse stayed more or less the same over the years, the emphasized concepts changed with every address. This may indicate that the rhetorical strategy stayed the same while the content was transformed over the years. Obama’s speeches seem to have a higher number of distinct influential nodes, which may indicate a more diversified discourse.

Bias as a Conduit for Ideology in Networks

Now that we’ve shown how discourse can be represented as a network structure, we can discuss the notion of bias in the context of network science. We will use some ideas from epidemiology to demonstrate how a network’s topology affects the speed and reach of information propagation across the nodes.

A network can be seen as a representation of interactions that happen over time, a diagram of traces left by a dynamic process. If we study the topology of a network, we can gain a lot of insight into the nature of the dynamic processes it represents.

In the context of social sciences and health care, information about network structure can provide valuable insights for epidemiology: how fast a disease (a virus, an opinion or any other (mis)information) may spread, how far it may propagate, and what the best immunization strategies may be.

It has been demonstrated (Abramson & Kuperman 2001; Pastor-Satorras & Vespignani 2001) that as a network structure becomes more randomized, its epidemiological threshold decreases: diseases, viruses and misinformation can spread faster and to a higher number of nodes. In other words, as the community structure of a network becomes less pronounced and the number of connections increases, the network propagates information to more nodes, and this propagation occurs in highly pronounced oscillations (infected / not infected).

A figure from the study by Abramson & Kuperman (2001) showing the fraction of infected elements (n) over time (t) for networks with different degrees of disorder (p). The higher the degree of disorder, the more elements get infected and the more intense the oscillations become, but the time span of the infection is also relatively short.

At the same time, when the community structure is pronounced while the network is relatively interconnected (a small-world network), the “pockets” of nodes help maintain the disease in the network for a longer time. In other words, fewer nodes may become infected, but the infection might stay longer (an endemic state).

Representation of network structures: [a] random, [b] scale-free (better pronounced communities) and [c] hierarchical (less global connectivity) (from Stocker et al. 2001)

In another study performed on various social networks (Stocker, Cornforth & Bossomaier 2002) it has been shown that hierarchically flat (i.e. disordered) networks are not as stable as scale-free ones (i.e. the ones that have a more pronounced community structure). In other words, hierarchies may be good for passing down orders, but scale-free structures are better for maintaining a certain worldview.

As we can see, no single network topology can be considered preferable in all cases. It depends on the intention, the context and the situation. In some cases it is good if a network can propagate information to all of its elements relatively fast. In other cases stability may be preferable.

Overall, the topology of a network reflects how well it can propagate information, how susceptible it is to new ideas, and whether those ideas will take over the whole network for only a short time or remain for a longer period.

The same approach can be applied when we study bias. The assumption here is that a discourse network is a structure that propagates ideas.

If the discourse structure is centered around a few influential nodes and there is no pronounced community structure, it means that the discourse is quite homogeneous and the ideas around those nodes will propagate better than the ideas from the periphery. We designate such discourse as biased.

If, on the other hand, a discourse network consists of several distinct communities of words / nodes (a scale-free small-world network), it means that there are several distinct topics inside the text and each of them is given equal importance within the discourse. We call such discourse diversified.

A network’s community structure can be identified not only qualitatively using a graph visualization but also through the modularity measure (see Blondel et al. 2008). The higher the modularity (a pronounced structure usually scores above 0.4), the more pronounced the community structure is.
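For readers who want to see what the modularity measure actually computes, here is a minimal sketch of the standard modularity score Q for a partition that is already known. It does not find the communities (that is what community detection methods such as the one by Blondel et al. do), and the toy graph and function name are assumptions for illustration only.

```python
def modularity(adj, community_of):
    """Modularity Q of a given partition of an unweighted, undirected graph.
    adj: {node: set_of_neighbors}; community_of: {node: community_label}.
    Q = sum over communities of (internal_edges / m) - (community_degree / 2m)^2
    """
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # total number of edges
    internal = {}  # number of edges with both ends inside each community
    degree = {}    # sum of node degrees in each community
    for v, nbrs in adj.items():
        c = community_of[v]
        degree[c] = degree.get(c, 0) + len(nbrs)
        # Each internal edge is seen from both of its ends, hence the 0.5
        same = sum(1 for w in nbrs if community_of[w] == c)
        internal[c] = internal.get(c, 0) + 0.5 * same
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles of words connected by a single "listen" - "voice" link
adj = {
    "people": {"speak", "listen"},
    "speak": {"people", "listen"},
    "listen": {"people", "speak", "voice"},
    "voice": {"listen", "sound", "world"},
    "sound": {"voice", "world"},
    "world": {"voice", "sound"},
}
two_topics = {"people": 0, "speak": 0, "listen": 0, "voice": 1, "sound": 1, "world": 1}
one_topic = {word: 0 for word in adj}

q_split = modularity(adj, two_topics)  # 5/14, about 0.357
q_single = modularity(adj, one_topic)  # 0.0: a single community has no structure
```

Putting every word into one community always gives Q = 0, so a higher score means the partition captures more structure than a random arrangement of the same connections would.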

Another important criterion is the distribution of influence (via the most influential words / nodes) across different communities. For a discourse to be diversified, the most influential nodes should be distributed between the different communities. We use entropy to measure the dispersal of influence in the graph and take this into account when identifying the level of bias. We also check whether the top communities include a disproportionately high number of nodes (in which case the diversification score decreases) and take into account the number of components in the graph.
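One possible way to express the dispersal of influence as a number is a normalized Shannon entropy over the communities that the most influential words belong to. The sketch below is only an illustration of that idea: how many top words to take, and the normalization by the number of communities, are my assumptions here and not a description of the exact formula in the tool.

```python
import math
from collections import Counter

def influence_dispersal(top_nodes, community_of, n_communities):
    """Normalized Shannon entropy (0 to 1) of how the most influential
    nodes are spread over the graph communities.
    0 means they all sit in one community, 1 means a perfectly even spread."""
    counts = Counter(community_of[v] for v in top_nodes)
    total = sum(counts.values())
    entropy = sum((c / total) * math.log(total / c) for c in counts.values())
    return entropy / math.log(n_communities) if n_communities > 1 else 0.0

community_of = {"people": 0, "listen": 0, "time": 1, "world": 2, "voice": 3}

# Top words spread over all four communities
spread = influence_dispersal(["people", "time", "world", "voice"], community_of, 4)
# Top words concentrated in a single community
focused = influence_dispersal(["people", "listen"], community_of, 4)
```

Here `spread` comes out at (approximately) 1 and `focused` at 0, which matches the intuition that a diversified discourse distributes its influential words across topics.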

Therefore, we can identify the three main criteria we can use to identify the level of bias in discourse:

Community structure: how distinct the communities are and the % of nodes belonging to the top communities;
Influence distribution: how the most influential nodes / words are spread among different topics / graph communities;
Number of graph components: how connected the discourse is.
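Two of the quantities in this list are simple to compute once the graph and its communities are known: the number of connected components and the share of nodes in the largest community. Here is a small sketch, with function names and toy data that are assumptions made for illustration.

```python
from collections import Counter

def connected_components(adj):
    """Return a list of node sets, one per connected component.
    adj: {node: set_of_neighbors} for an undirected graph."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)  # visit neighbors not yet reached
        seen |= comp
        components.append(comp)
    return components

def top_community_share(community_of):
    """Fraction of all nodes that belong to the largest community."""
    sizes = Counter(community_of.values())
    return max(sizes.values()) / len(community_of)

# A graph in three disconnected pieces
adj = {"a": {"b"}, "b": {"a"}, "c": {"d"}, "d": {"c"}, "e": set()}
parts = connected_components(adj)

share = top_community_share({"a": 0, "b": 0, "c": 0, "d": 1, "e": 1})
print(len(parts), share)  # 3 0.6
```

A text graph that falls apart into several components contains groups of words that never co-occur with the rest of the text.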

The Index of Bias Based on the Discourse Structure

Based on the propositions and the criteria above, we propose the Index of Bias, which takes the discourse structure into account and has four possible values:

Dispersed (non-biased)
Diversified (locally-biased)
Focused (slightly biased)
Biased (highly biased)
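To show how the criteria could be combined into these four values, here is a purely illustrative decision rule. The exact cutoffs used by the tool are not given in this article, so every threshold below except the 0.4 modularity rule of thumb is a hypothetical placeholder, as is the function name `bias_label`.

```python
def bias_label(modularity, dispersal, top_share, n_components):
    """Map graph measures to one of the four index values.
    All cutoffs are hypothetical placeholders, NOT the tool's real thresholds.
    modularity: community structure score
    dispersal: 0..1 spread of influential nodes across communities
    top_share: 0..1 fraction of nodes in the largest community
    n_components: number of connected components in the graph"""
    HIGH_MODULARITY = 0.65  # assumed: very pronounced communities
    MIN_MODULARITY = 0.4    # rule of thumb for a pronounced structure
    MIN_DISPERSAL = 0.5     # assumed
    MAX_TOP_SHARE = 0.5     # assumed

    # Several components, or very distinct topics with spread-out influence
    if n_components > 1 or (modularity >= HIGH_MODULARITY and dispersal >= MIN_DISPERSAL):
        return "Dispersed"
    # Distinct but connected topics, none of which dominates
    if modularity >= MIN_MODULARITY and dispersal >= MIN_DISPERSAL and top_share <= MAX_TOP_SHARE:
        return "Diversified"
    # Some structure or some spread, but one topic takes the lead
    if modularity >= MIN_MODULARITY or dispersal >= MIN_DISPERSAL:
        return "Focused"
    # Weak community structure and concentrated influence
    return "Biased"

print(bias_label(0.5, 0.7, 0.3, 1))  # Diversified
print(bias_label(0.1, 0.1, 0.8, 1))  # Biased
```

The ordering of the checks matters more than the particular numbers: the rule first looks for fragmentation, then for balanced topics, and only falls back to Biased when neither community structure nor dispersal is present.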

The first value, Dispersed, describes a discourse that has a highly pronounced community structure (several distinct topics) whose communities are not very well connected, or a discourse that consists of several components (and hence has no bias). Our tests show that such graphs are usually produced for poetry, personal notes, schizophrenic tweets and various other creative endeavours. For example, here’s a visualization of Lord Byron’s poem “Darkness” (you can also check the interactive graph on InfraNodus):

Visualization of Lord Byron’s “Darkness” made using InfraNodus. The discourse structure is identified as Dispersed (see the Analytics pane to the right) because of the high modularity (0.68) and high influence dispersal (the most influential words are spread among the different communities and only 14% of the words are in the top community).

As we can see, the graph is visually quite sparse, and our tool has identified the discourse structure as Dispersed because the modularity measure is quite high (pronounced communities / topics) and the influential nodes / words are distributed fairly evenly among the main topics (80% dispersal and only 14% of words in the top community / topic). If you read the poem itself, you’ll see that it has quite a rich vocabulary and evokes a lot of diverse images without trying to push a specific agenda (except perhaps through poetic rather than rhetorical means).

The next value, Diversified, is a discourse that has a pronounced community structure but where the communities are well-connected. Usually it indicates a discourse that reflects several different perspectives and gives them a more or less equal standing on the global level (local bias). Many articles and talks that aim to present several points of view, research notes, newspaper headlines (taken from a variety of sources) and non-fiction pieces will have this structure. For example, here’s a visualization of the news headlines (with teasers) from the 4th of October 2018 (see the interactive visualization here):

Visualization of the news headlines and teasers (via RSS) made using InfraNodus for the 4th of October 2018 taken from NYT, WSJ, FT, The Guardian and Washington Post. As we can see the selection of news is ranked as Diversified as the modularity measure is relatively high and yet the topics are also connected to each other. The most influential words are spread among the main topical clusters / communities, which indicates that the selection of news was quite diverse.

We can see that the discourse structure is ranked as diversified, which means that there are several distinct topics that are developed within this discourse and yet they are connected on the global level.

The third value, Focused, indicates a discourse that has a soft bias towards a certain topic. This usually means that the discourse presents several perspectives but focuses on only one, developing it further. Discourse structures with the Focused score are characteristic of newspaper articles, essays and reports, which are designed to provide a clear and concise representation of a certain idea. For example, here’s a visualization of the previous three parts of this article:

The previous three sections of this article visualized as a text graph using InfraNodus. We can see that the discourse structure is ranked as Focused, indicating a slight bias. A community structure is present, but the communities are not very distinct. Almost all of the most influential words are concentrated in one community / topic: “network / structure / discourse” and then there’s a smaller topic with “text / bias / measure”.

Finally, the fourth type of discourse structure is Biased, which is characteristic of texts that have a weak community structure or none at all. The main ideas are concentrated together and all the other notions used in the text are there to support the main agenda. Such a discourse structure can usually be observed in highly ideological texts, political speeches and any other text that resorts to rhetoric to persuade people to take action. For example, here’s a visualization of The Communist Manifesto:

Text network visualization of the Communist Manifesto made using InfraNodus. The community structure is not pronounced and the most influential words belong to the two main topics and are highly interconnected. The rest of the discourse is subordinated to the main agenda (class struggle).

Afterword

In this article I proposed a measure of discourse bias based on the structure of text network visualization and various parameters that can be obtained from graph analysis.

It is important to note that I don’t claim (yet) that the propositions I made are scientifically sound. A full study on a much larger corpus of data is on its way (which you are welcome to join).

My experience shows that this index can be useful when studying texts, and it’s already implemented as a working feature in the InfraNodus text network analysis and visualization tool.

I therefore invite you to try it out for yourself and to send me any feedback, suggestions and propositions that you might have. Please also feel free to leave comments here; I’d be very curious to see what you think and how we can develop it further. InfraNodus is an open-source tool, so you are very welcome to join in and implement any propositions you might have as code.

Dmitry Paranyushkin is a researcher with Nodus Labs and the creator of the text network visualization tool InfraNodus.

You can follow me and contact me here on Medium, on Twitter via @noduslabs or through Nodus Labs.