
Temporal semantic network analysis

Extracting temporal dynamics from a large research corpus

Hands-on Tutorials

Introduction

Research knowledge has always been document-based (research papers, reviews), and platforms such as Semantic Scholar or arXiv aggregate this textual knowledge into large corpora where keywords or identifiers (DOI codes) let you access a specific paper or query the state of a specific field, giving researchers a single source of truth to survey the literature and use as a building base for their own research.

However, the rapid pace at which new papers are added to the corpus makes tracking research trends, or simply staying up to date, very hard, especially for a junior researcher who wants an overview of the state of their field in order to build upon it.

To tackle this problem, many researchers and scientific bodies volunteer to fill this gap by writing literature reviews that summarize the current state of research on a given topic. However, the number of these reviews is limited and cannot scale to the volume of the scientific corpus.

To address this problem, our research goal was to construct a text network that encodes a given text corpus (a list of scientific papers) and to use graph theory tools and techniques to infer the research context, extract the state of the corpus, and treat it as a temporal dynamic network in order to bring out temporal trends and patterns.

In this first article, we will try to answer the following questions:

  • How can we encode text corpus into a text network?
  • How can we use graph theory tools to extract research communities?
  • How can we characterize and infer the internal dynamics of a given research community?

From unstructured research corpus to a text network

Paper entity anatomy

A scientific corpus contains a set of papers that share a standard form (title, abstract, introduction, methodology, results, and discussion). Given this simple and unified structure, we define our paper entity as an object that contains the abstract (since it’s usually a summary of the paper’s content), a list of the main keywords, and the date of publication.

Paper entity anatomy (image by Author)

The choice of these elements can be justified by two main reasons:

  • Limit the volume of processed and encoded text for computational reasons
  • Select the most informative part of a given paper

Text to graph

After defining our paper entity, it’s time to encode it as a graph object. For this, we build upon the existing pipelines that transform raw text into a text graph.

1- Word stemming

The abstract text is first tokenized by splitting the raw text into a list of words. The tokenized words are then stemmed, keeping only the root of each word in order to reduce redundancy. For example, model, modelization, and modeled are all stemmed to model.
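As a rough illustration, here is a minimal sketch of this step; NLTK’s PorterStemmer is assumed, since the article doesn’t specify which stemmer was actually used.

```python
# Minimal sketch of the tokenization + stemming step.
# NLTK's PorterStemmer is an assumption; the original pipeline may use another stemmer.
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def tokenize_and_stem(abstract: str) -> list[str]:
    """Split the raw abstract into words, then reduce each word to its root."""
    tokens = re.findall(r"[a-zA-Z]+", abstract.lower())
    return [stemmer.stem(tok) for tok in tokens]

# e.g. "modeled" and "modeling" are both reduced to the stem "model"
```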

2- Stop word removal

The next step of our pipeline is to remove stop words, which act as connectors and don’t carry any scientific meaning (for example: and, or, …).
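A similar sketch for this step, assuming NLTK’s English stop-word list (the original pipeline may rely on a custom list):

```python
# Sketch of the stop-word removal step, assuming NLTK's English stop-word list
# (requires a one-time nltk.download("stopwords")).
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop connector words ("and", "or", "the", ...) that carry no scientific meaning."""
    return [tok for tok in tokens if tok not in STOP_WORDS]
```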

3- Construct abstract text graph

The next step is to convert the processed abstract text into an undirected graph, where words are graph nodes and their co-occurrences are the edges.

We construct graph edges by performing a 4-word window scan; in other words, if two words occur in the same window, we create an edge that links them and assign it a weight based on the distance between them. If the two words co-occur again, we add the new weight to the previous one.
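The sketch below shows one plausible implementation of this window scan with NetworkX; the article doesn’t give the exact weighting function, so 1/distance is assumed here.

```python
# Sketch of the abstract-graph construction: slide a 4-word window over the processed
# tokens and link co-occurring words. The weight 1/distance is an assumption; the
# article only states that the weight depends on the distance between the two words.
import networkx as nx

def build_abstract_graph(tokens: list[str], window: int = 4) -> nx.Graph:
    G = nx.Graph()
    G.add_nodes_from(tokens)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            other = tokens[j]
            if other == word:
                continue
            w = 1.0 / (j - i)                       # closer words -> stronger link
            if G.has_edge(word, other):
                G[word][other]["weight"] += w       # repeated co-occurrence: sum weights
            else:
                G.add_edge(word, other, weight=w)
    return G
```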

image by Author

4- Extract keyword graph

If we recall our paper object, we have mined a list of keywords that appear in the paper. Our goal now is to extract from our abstract graph a subgraph containing only these keywords as nodes.

image by Author

As for the new edges and their relative weights, we construct them from the shortest path between each pair of extracted nodes in the abstract graph.
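A possible sketch of this extraction step, reusing the abstract graph built above; converting co-occurrence weights into path distances as 1/weight is an assumption, consistent with the distance measure defined later in the article.

```python
# Sketch of the keyword-graph extraction: keep only the paper's keywords as nodes and
# connect each pair with an edge derived from their shortest path in the abstract graph.
import itertools
import networkx as nx

def build_keyword_graph(abstract_graph: nx.Graph, keywords: list[str]) -> nx.Graph:
    G = abstract_graph.copy()
    for _, _, data in G.edges(data=True):
        data["dist"] = 1.0 / data["weight"]         # strong co-occurrence -> short distance

    K = nx.Graph()
    present = [k for k in keywords if k in G]
    K.add_nodes_from(present)
    for u, v in itertools.combinations(present, 2):
        try:
            d = nx.shortest_path_length(G, u, v, weight="dist")
            K.add_edge(u, v, weight=1.0 / d)        # shorter path -> stronger keyword link
        except nx.NetworkXNoPath:
            continue                                # keywords in disconnected components
    return K
```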

image by Author

Finally, we have a fully functional pipeline that takes a paper as input and outputs the corresponding text graph along with its date of publication, which will be used in the next steps.

Temporal graph construction

As mentioned before, the research corpus contains a list of papers. Thus, to encode the whole corpus as a text graph, we merge the individual paper graphs into one big undirected graph.

But we should note that the research corpus is not static; it keeps evolving and changing as new papers are added. That’s why we choose to construct a temporal network, in order to preserve these dynamics.

To accomplish our goal, we choose a set of timestamps in order to create snapshots of our temporal graph, so that we can store our graph and also be able to use static graph theory tools on these snapshots for future analysis.

Each time interval is characterized by the batch of papers whose publication dates fall within it.

image by Author

So we take the collection of papers published in a given time interval, create their corresponding text graphs, and then merge them into our graph in order to update its state and thus create a new snapshot of our temporal graph. This update is done by adding the new nodes from the batch to the temporal graph and/or reinforcing the edge weights of existing nodes.
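A minimal sketch of one such update step, assuming each paper graph carries the weight attributes produced by the keyword-graph step:

```python
# Sketch of one update of the temporal graph: merge the batch of paper graphs from the
# current time interval into the running corpus graph, creating new nodes/edges and
# reinforcing the weights of edges that already exist.
import networkx as nx

def update_snapshot(previous: nx.Graph, paper_graphs: list[nx.Graph]) -> nx.Graph:
    snapshot = previous.copy()                      # keep the earlier snapshot intact
    for pg in paper_graphs:
        for u, v, data in pg.edges(data=True):
            if snapshot.has_edge(u, v):
                snapshot[u][v]["weight"] += data["weight"]      # reinforce existing link
            else:
                snapshot.add_edge(u, v, weight=data["weight"])  # new node and/or new link
    return snapshot
```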

This iterative procedure gives us a set of snapshots of our temporal graph, which shows us how it evolves over time in terms of knowledge creation (newly added nodes) or gap filling (new edges).

Temporal graph analysis: application to UM6P research corpus

In the last part, we have successfully prepared our pipeline that transforms a text corpus into a temporal text graph. Now it’s time to apply it to a real research corpus in order to test it, but also to do some data analysis and try to answer our introduction’s main questions.

To do that, we use the UM6P research corpus. UM6P, or Mohammed VI Polytechnic University, is a Moroccan university located at the heart of the future Green City of Benguerir. Its research departments focus mainly on sustainable development, mining, and agricultural sciences.

UM6P’s research corpus contains 260 research papers published between 2014 and 2020, parsed from the Web of Science website.

To prepare the text corpus for our graph pipeline, we construct a dataset containing the elements of our defined paper entity (paper’s abstract, list of keywords, and date of publication).

Samples from UM6P’s research dataset

Using the pipeline architecture presented earlier, we implement it in Python using NetworkX (a popular graph library) and NumPy for generic computations. The code can be found on my personal GitHub.

As for visualization, we use the D3.js library, written in JavaScript, which uses a force-directed layout algorithm to make the graph visually appealing.

We pass our text corpus through our temporal graph construction pipeline using a time interval of 3 months, which gives us 20 snapshots of our graph ready to be analyzed and processed.

We present below the last snapshot of our temporal graph. It’s a static network containing 1195 nodes (keywords in UM6P papers) and 3753 edges (links between them). With this visualization, it’s easy to see the fully evolved UM6P research corpus in one shot.

Snapshot of UM6P research graph at 12/2020

But still, the huge number of nodes doesn’t help us extract useful information about our research corpus; we need a filtered view of our network that contains only the main, important nodes. To this end, we chose node degree, a centrality measure that enables us to score our nodes.

Filtered graph snapshot (image by Author)

As you can see, not all nodes have the same importance. We can now extract the highly connected nodes (those with a high degree score), which refer to central keywords in our research corpus.

image by Author

This high-level analysis helped us detect the main nodes, which may be potential hubs in our text network. If we project those results onto our case study, the detected nodes may correspond to the main research subjects, while the presence of many hubs in our network can be explained by the research diversity at UM6P.

Naturally, this first observation drives us to ask about the research communities in our corpus. Once detected, these communities will not only show us the closely related keywords in our graph, but they will also help us extract the dynamics related to each community.

Graph communities extraction

A graph community can be defined as a group of nodes that are densely connected internally and have sparser connections with other groups. Moreover, we have already identified potential hub nodes that may form the relative heart of each community.

Problem definition

Given this observation, we formulate our community extraction problem as a K-means-style clustering problem, where the centroids are the main central keywords of our graph (those that carry a scientific meaning) and the distance measure is computed using the shortest path between nodes.

image by Author

Main keywords extraction

As mentioned previously, static graph theory gives us access to a set of centrality measures to weight nodes relative to their importance in the network, which allows us to pinpoint the central nodes in the text graph by computing a score for each node. To do so, we opt for 4 centrality measures: Degree centrality, Betweenness centrality, Eigen-centrality, and Closeness centrality (you can refer to my previous article on centrality measurements).

Centrality measures distributions

We then compute the correlation matrix between our 4 metrics in order to choose which centrality measure will be compatible with our hub nodes scoring.
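A sketch of this scoring step; pandas is assumed here for building the correlation matrix, which may differ from the original code.

```python
# Sketch of the node scoring: compute the four centrality measures with NetworkX and
# inspect how they correlate with each other.
import networkx as nx
import pandas as pd

def centrality_table(G: nx.Graph) -> pd.DataFrame:
    scores = {
        "degree": nx.degree_centrality(G),
        "betweenness": nx.betweenness_centrality(G),
        "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
        "closeness": nx.closeness_centrality(G),
    }
    return pd.DataFrame(scores)

# df = centrality_table(snapshot)
# df.corr()                      # correlation matrix between the four measures
# df["degree"].nlargest(6)       # candidate hub / topic nodes
```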

Correlation matrix of the centrality measures

Three of our centrality measures have a high positive correlation coefficient; closeness centrality, on the other hand, has a small one. This is to be expected: a high-degree node tends to be a hub, and is thus close only to a subset of nodes (in our case the highest node degree is 122, compared to 1500 nodes in our graph), so it remains far from the majority of the graph’s nodes and therefore has a small closeness centrality.

Since our goal is to extract the main nodes in our graph, i.e. potential hub nodes, we restrict our scoring to the first three correlated measures. This yields the 6 main nodes that make up our topic nodes: Soil, Plant, Waste, Phosphate, Material, and Cellulose. To verify our node ranking, we use the VoteRank algorithm, which outputs the most influential nodes in a network; it returns the same list of nodes at the top.
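NetworkX ships VoteRank as nx.voterank, so the cross-check can look roughly like this (a sketch, not necessarily the original code):

```python
# Cross-check of the hub ranking with the VoteRank algorithm.
import networkx as nx

def top_influencers(G: nx.Graph, k: int = 6) -> list:
    return nx.voterank(G, number_of_nodes=k)

# top_influencers(snapshot)  # per the article, the six topic nodes come out on top
```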

Distance measure formulation

Now we need to define a distance measure in order to cluster our network nodes by their proximity to our predefined centroids (topic keywords). To do so, we choose the total length of the shortest path as our distance, where the distance between two adjacent nodes is the inverse of their edge weight. For example, suppose we want to compute the distance between "Funghi", a keyword from our graph, and the Soil topic.

image by Author

Using Dijkstra’s algorithm, we can compute the shortest path between the two nodes and obtain the distance value. We repeat the procedure for all nodes in our graph, then assign each node to the topic node it is closest to (minimum distance).
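A sketch of this assignment step; the six topic node identifiers below are placeholders for however these keywords appear in the graph after stemming.

```python
# Sketch of the community assignment: convert edge weights into distances (1 / weight),
# compute Dijkstra distances from every topic node, and label each node with the topic
# it is closest to.
import networkx as nx

TOPIC_NODES = ["soil", "plant", "waste", "phosphate", "material", "cellulose"]  # placeholder ids

def assign_communities(G: nx.Graph, topics: list[str]) -> dict:
    for _, _, data in G.edges(data=True):
        data["dist"] = 1.0 / data["weight"]

    dist_to = {t: nx.single_source_dijkstra_path_length(G, t, weight="dist")
               for t in topics if t in G}

    labels = {}
    for node in G.nodes:
        reachable = {t: d[node] for t, d in dist_to.items() if node in d}
        if reachable:
            labels[node] = min(reachable, key=reachable.get)    # closest topic wins
    return labels
```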

This gives us our 6 communities, given our 6 research topics. This community separation will help us conduct a more personalized analysis on our graph by inferring each community’s dynamics or exploring how they interact. The illustration below presents our graph after adding community annotation to each node.

graph communities

Community dynamics Analysis

All the previous sections can be seen as tools that help us analyze a given research topic. In fact, we have successfully constructed our temporal graph, extracted our research topics, and finally divided our big network into small, distinct communities or subgraphs. From this, it’s easy to access the temporal evolution of each research topic. We can now define a set of metrics in order to characterize each community’s dynamics in terms of knowledge creation, connectivity, and activity.

image by Author

Community knowledge creation

We start our graph tracking by quantifying knowledge creation over time. To do that, we define new_k: a scalar value equal to the ratio of new nodes added to a given community subgraph over the total number of nodes in that subgraph:
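The formula image is not reproduced here; a plausible formalization of this definition, with V_t^c denoting the node set of community c at snapshot t, would be:

```latex
\mathrm{new\_k}_c(t) \;=\; \frac{\lvert V^c_t \setminus V^c_{t-1} \rvert}{\lvert V^c_t \rvert}
```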

We compute this metric for each snapshot of a given network; the plot below visualizes its evolution for each community’s subgraph.

We observe that each community shows the same pattern at the early timestamps: a high ratio of new knowledge, reaching for example 90% for the Soil topic or 50% for the Waste topic. This can be explained by the fact that in its first iterations our graph doesn’t yet contain many nodes, which gives the addition of new nodes more weight.

For the later timestamps, we see a rapid decrease in this ratio: each graph’s total number of nodes grows while new nodes are added at a roughly constant rate, so the ratio decreases over time until it stagnates at the last recorded timestamps, where the number of nodes is already high and the ratio settles at its stationary value.

We may observe some high variability in the ratio value for some communities like Phosphate and Waste. This can be explained by the fact that those communities contain a small number of nodes on average, compared to other communities, which makes the ratio have a relatively high variance.

Community connectivity evolution

Community connectivity is defined as the way knowledge (in our case, keywords) connects inside each community. Needless to say, this information is already encoded in our graph edges, since any two linked keywords are connected by a weighted edge.

We start by exploring how those connections evolve in terms of the number of edges, and then classify them by their role: edges created to connect new nodes, and edges that reinforce existing connections.

The graph below illustrates the evolution of the number of edges by research topic.

We observe that the Cellulose, Soil, Plant, and Material communities have a high number of edges, while for the other communities the number of edges stays small compared to the first group. This correlates with our previous observation that these latter topics have a small number of nodes, and thus need fewer edges to connect them. As for the evolution, we can clearly see an inflection point in the Soil community, where the slope of the edge count curve increases.

This general increase in the number of edges may be explained by two reasons:

  • Edges added to link new nodes to the graph.
  • Edges added to reinforce existing connections or connect existing nodes.

Our next task is to break the edges down by role, to understand how a community’s connectivity evolves qualitatively. The graph below shows this role division for the Plant research topic.
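A sketch of how this role split can be computed between two consecutive snapshots of a community subgraph; the exact rule used in the original code isn’t given, so the classification below is an assumption based on the two roles listed above.

```python
# Sketch of the edge-role split: a "connection" edge attaches at least one new node,
# while a "reinforcement" edge links nodes that already existed or strengthens an
# existing link.
import networkx as nx

def classify_new_edges(prev: nx.Graph, curr: nx.Graph) -> dict:
    counts = {"connection": 0, "reinforcement": 0}
    for u, v, data in curr.edges(data=True):
        if prev.has_edge(u, v):
            if data["weight"] > prev[u][v]["weight"]:
                counts["reinforcement"] += 1        # existing link made stronger
        elif u in prev and v in prev:
            counts["reinforcement"] += 1            # new link between existing nodes
        else:
            counts["connection"] += 1               # link that brings in a new node
    return counts
```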

Edge count evolution by edge role (plant)

We can see that the two types of edges don’t evolve at the same rate. In fact, a reinforcement edge is more likely than a connection edge, especially in the later timestamps, where the graph already contains a lot of nodes. But why?

Over time, researchers tend to reuse the same semantic vocabulary, or jargon, so fewer new nodes are added to our graph at later timestamps. For example, a paper published in 2020 will probably contain keywords that have already been used in previous papers. The decrease in new nodes, or keywords, results in fewer connection edges and more reinforcement edges.

This pattern was observed in all of our topics except the Waste research community, where we see a surge in connection edges in the middle of our experiment’s time range. This may be explained by new discoveries in the field or a pivot in the research agenda; this interpretation remains subjective until it is justified quantitatively.

Edge count evolution by edge role (waste)

Extracting attention shifts in research activity

We have seen how each community evolves over time in terms of knowledge creation and connectivity, and in the last part we saw that the temporal dynamics of our communities may differ; for example, the surge in connection edges in the Waste research topic was not observed in other communities. In this part, we dig deeper into each community’s dynamics by analyzing research activity, to see, for example, whether a research community tends to specialize by creating subfields, or sticks to the main subject, filling its scientific gaps or updating existing work with new methods or reviews.

This task is too complex to tackle in one shot, so we will stick to one aspect of it: tracking research activity and quantifying the level of specialization in a given research community.

  • Definition:

Quantifying the level of specialization may be hard, so we adopt this simple definition: a high level of specialization occurs when newly created knowledge doesn’t connect directly with the main subject of a given topic. For example, the artificial intelligence community is highly specialized, since new papers don’t contribute directly to AI but to its subfields, like deep learning or reinforcement learning, or to even more specialized ones such as specific network architectures (CNNs, language models, …).

Building on this example, if we imagine a graph that encodes the AI research corpus, new nodes mined from a paper on GANs would not be directly linked to the AI node but would sit near the deep learning node; we can therefore use this distance difference to quantify the level of specialization in a given community.

  • Methodology:

Using the last example, we construct a simple pipeline to track the level of specialization. We start by extracting each community’s subgraph from the big graph and storing its snapshots over the timestamps.

image by Author

The next step is to compute a diff graph from each pair of consecutive snapshots, which helps us identify the parts of our graph where change, or activity, has occurred.
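A minimal sketch of this diff computation between two consecutive snapshots of a community subgraph:

```python
# Sketch of the diff graph: keep only the edges that appeared or gained weight between
# two consecutive snapshots, i.e. the parts of the community where activity occurred.
import networkx as nx

def diff_graph(prev: nx.Graph, curr: nx.Graph) -> nx.Graph:
    D = nx.Graph()
    for u, v, data in curr.edges(data=True):
        old_w = prev[u][v]["weight"] if prev.has_edge(u, v) else 0.0
        delta = data["weight"] - old_w
        if delta > 0:
            D.add_edge(u, v, weight=delta)          # only the change is kept
    return D
```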

image by Author

For each diff graph, we assign a score to each node using centrality measures and also compute its distance from the topic’s main node using the shortest path.
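A sketch of this scoring step, using degree centrality as the node score (an assumption; it is one of the measures introduced earlier) and the dist edge attribute (1/weight) set up in the community-assignment step:

```python
# Sketch of the activity scoring on a diff graph: each active node gets a centrality
# score plus its shortest-path distance to the community's main node in the full snapshot.
import networkx as nx

def activity_table(diff: nx.Graph, snapshot: nx.Graph, topic_node: str) -> dict:
    scores = nx.degree_centrality(diff)
    dists = nx.single_source_dijkstra_path_length(snapshot, topic_node, weight="dist")
    return {n: (scores[n], dists.get(n, float("inf"))) for n in diff.nodes}
```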

image by Author
  • Results:

We do this transformation for each timestamp and summarize it in a temporal plot that tracks graph activity relative to the community central node. We use the Plant community as an example.

image by Author

We observe that research attention varies over time. For example, we can see distant activity at several timestamps, which means that research activity has shifted toward subfields of Plant research.

Zooming in on timestamp 3, for example, we can see that the majority of the activity is centered on the Extracellular and Photosynthetic nodes, which relate to the study of light energy conversion in plants. At other timestamps, the attention clusters near the main Plant node, as we see at timestamp 6.

We also observe that, most of the time, there is activity near the Plant node, which is expected since researchers tend to build upon it.

From this graph, it’s easy to see how researchers’ attention shifts over time, and we can use it as a metric to track the level of specialization in a given research field. To make it more explicit, we compute a barycenter node for each timestamp, using the distance from the main node weighted by the node score (node size in the plot). This lets us draw a line that we call the attention line. The variations of this attention line describe the shifts in researchers’ activity and give an idea of the level of specialization, since a barycenter far from the main node indicates a high level of specialization at a given timestamp.
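The barycenter computation isn’t spelled out in the article; a plausible sketch, reusing the (score, distance) pairs from the activity table above, is a score-weighted average of the active nodes’ distances to the main node:

```python
# Sketch of the attention line: for each timestamp, the barycenter is the score-weighted
# average distance of the active nodes from the main topic node.
def attention_value(activity: dict) -> float:
    finite = [(s, d) for s, d in activity.values() if d != float("inf")]
    total_score = sum(s for s, _ in finite)
    return sum(s * d for s, d in finite) / total_score if total_score else 0.0

# attention_line = [attention_value(activity_at_timestamp[t]) for t in timestamps]
```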

We move on to another example: the Waste research field, which studies the valorization of phosphate waste and sludge.

We can see research activity near the central node, then a shift in the last timestamps, which may be explained by a specialization shift. The main topics in the last timestamps are paper (stone paper) and roads.

To help verify our interpretation of the attention line, we add one last plot that tracks the ratio of new nodes connecting directly to the central node (direct edge).

We can easily observe the dependence between the attention line and the ratio plot. The early timestamps are characterized by a high ratio of new nodes connected directly to the central node, and correspondingly the attention line is dragged toward the central node (Waste). In the last timestamps, we see a strong drop in the ratio, matched by a shift in the attention graph where the attention line moves away from the central node. A simple explanation is that phosphate waste valorization is a new research field at UM6P: the first papers work on building a knowledge base, and later papers start specializing by exploring its applications, for example creating stone paper, a type of paper that can be made from phosphate waste.

Conclusion

Here, we have covered the basic steps of building a text graph from an unstructured research corpus and visualizing it, conducted a deeper analysis of its research communities’ dynamics, and quantified the level of specialization in a given topic by tracking researchers’ attention.

The aim of these steps is to automate the extraction of patterns and trends from text corpora, without the need for manual expertise. In fact, these building blocks can help produce insightful reviews of a given field.

Our next article will discuss the possibility of inferring external dynamics in terms of connections between communities, to see which are close to each other and which are not, in order to give researchers an idea about knowledge gaps that need to be filled.

