
Named Entity Recognition of IEEE Abstracts


Harvard IACS Capstone Report


Authors: Paulina Toro Isaza, John Alling, You Wu, Justin Clark

Advisors: Professor Christopher Tanner, Isaac Slavitt

https://www.capstone.iacs.seas.harvard.edu/

Link to poster presentation: https://drive.google.com/file/d/1D9d1OmeamAHF6QhdLSQJUZPS6kxDGxfR/view?usp=sharing


Problem Statement

The Institute of Electrical and Electronics Engineers (IEEE) publishes a staggering 30% of the world’s literature in electrical and electronic engineering, telecommunications, computer engineering, and related disciplines. These publications tend to contain characteristic entities, such as scientific concepts, technical products, and names of research organizations. Extracting these entities provides a high-level overview of this large corpus, enabling researchers to quickly discover papers with highly relevant or similar content and allowing business stakeholders to detect publication trends.

Our project aims to address two tasks:

Named Entity Recognition

We will identify named entities in IEEE Xplore abstracts using Named Entity Recognition (NER), a standard natural language processing task. The three entity types of interest are Methods, Products, and Organizations. Figure 1 illustrates the task on a randomly sampled abstract.

Figure 1. Sample text labeled with Method, Product, and Organization entities (Photo by the authors)

Coreference Resolution

We will extract unique entities from the discovered entities above by Coreference Resolution, an NLP technique that clusters expressions referring to the same entity.

Exploratory Data Analysis

IEEE Xplore is IEEE’s digital whitepaper repository that houses approximately 5M publications. These papers are spread across many fields of study, ranging from Computer Science to Virology. For this project, we’re only looking at the abstracts of these publications. We obtain the abstracts through snapshots of Microsoft Academic Graph (MAG) hosted on a relational database instance. MAG is a heterogeneous graph database of academic publications that includes publication metadata. In total, there are 4.78M IEEE abstracts on MAG. We were also provided with 1,300 abstracts that have been labeled with Methods, Products, and Organizations. We will call this the inherited dataset for the rest of this article.

Exploratory analysis of the datasets showed that the IEEE publications were heavily skewed in terms of the publication field of study (e.g. Computer Science, Political Science). Over 94% of all IEEE papers are in Computer Science, Engineering, Materials Science, and Physics. Figures 2 and 3 compare the distribution of IEEE publications by their fields of study in the MAG database and in the inherited dataset. The inherited dataset had oversampled from Engineering, and undersampled from less common fields, such as Art, History, and Philosophy.

Figure 2. Proportion of Top 5 fields across inherited abstracts vs. all IEEE abstracts (Photo by the authors)
Figure 3. Proportion of remaining fields across inherited abstracts vs. all IEEE abstracts (Photo by the authors)

We mapped labeled abstracts in the inherited dataset to their publication field of study (e.g. Computer Science, Medicine) and plotted the number of labeled entities in each field. Figure 4 shows evidence that the occurrence of each entity class differs significantly across disciplines. For example, the annotators tended to pick up on Methods in Mathematics, Biology, or Geology abstracts, while Products were more likely to occur in Computer Science or Chemistry abstracts.

Figure 4. Average number of annotated entities per abstract, by field of study (Photo by the authors)

Figure 5 shows the number of IEEE abstracts published every year from 1879 to 2019. There has been a sharp increase in the number of publications since 2000, and the trend isn’t slowing down. Rapid advances in the scientific research community mean that technical terms and language use in older papers can become obsolete quickly. To be forward-looking, we aim to train our models on newer abstracts in the database.

Figure 5. Number of IEEE publications every year (1879–2019) (Photo by the authors)

Our exploratory data analysis convinced us that the 1,300 inherited abstracts do not form a representative sample of the IEEE database. Larger fields of study were oversampled, and we found evidence that entities are not uniformly distributed across fields. This motivates the need to create samples that are representative across publication fields. We would also like to focus our analysis on newer abstracts, something the inherited dataset was not built to do. Furthermore, after looking through the annotated dataset and discussing with IEEE, we found the labeling scheme to be inconsistent and not fully reproducible. Hence, the IACS team decided to create a new set of labeled data.

New IACS Labeled Dataset

We had two goals for our labeled dataset:

  1. Be representative of IEEE abstracts by field of study
  2. Use a consistent, reproducible labeling scheme

For the first goal, we made sure to sample from the database such that the sample distribution of fields of study followed the actual distribution.

Table 1. Percentage of samples extracted for each field of study (Photo by the authors)

We also wanted to focus our analysis on newer abstracts, while still taking older abstracts into account.

Table 2. Percentage of samples extracted for each time period (Photo by the authors)
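As an illustration, a minimal stratified-sampling sketch is shown below. It assumes the abstract metadata has already been pulled from MAG into a pandas DataFrame with hypothetical field_of_study and year columns; the file path and year buckets are illustrative, not the exact periods in Table 2.

```python
import pandas as pd

# Hypothetical metadata table of IEEE abstracts pulled from MAG:
# one row per paper with "abstract", "field_of_study", and "year" columns.
papers = pd.read_parquet("ieee_abstracts.parquet")  # placeholder path

# Bucket publication years into periods (illustrative cut points).
papers["period"] = pd.cut(
    papers["year"],
    bins=[1878, 1999, 2009, 2019],
    labels=["pre-2000", "2000-2009", "2010-2019"],
).astype(str)

TARGET_SIZE = 1050  # size of the labeled IACS dataset

# Proportional stratified sampling: each (field, period) cell contributes
# abstracts in proportion to its share of the full corpus.
sample = (
    papers.groupby(["field_of_study", "period"], group_keys=False)
    .apply(lambda g: g.sample(n=max(1, round(TARGET_SIZE * len(g) / len(papers))),
                              random_state=42))
)
print(sample["field_of_study"].value_counts(normalize=True))
```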

For the second goal, we worked closely with IEEE to define the entity classes and created a set of annotation guidelines. However, as we will discuss later, this is a challenging task that has a significant impact on model performance.

In total, the IACS dataset has 1,050 abstracts labeled by 4 annotators.

Named Entity Recognition

Modeling Approach

We adopted BERT-based models for the named entity recognition (NER) task. BERT (Bidirectional Encoder Representations from Transformers) [1], as the name suggests, is a transformer-based language model that leverages bidirectional training. The model was developed through self-supervised training on a large amount of text, specifically 800M words from BooksCorpus and 2,500M words from English Wikipedia. It was trained on the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks, and by the end of this pre-training process BERT has acquired language processing capabilities that transfer to downstream supervised learning tasks such as NER.

Transformers involve two mechanisms: an encoder that reads the text input and a decoder that makes predictions for a task. BERT uses only the encoder mechanism to generate contextual relations between words. A sequence of words (tokens) is passed into the model, embedded into vectors, and processed by the network; the output is a sequence of vectors, one per token. Bidirectional training means that, in contrast to directional transformers that read input text sequentially from left to right, BERT conditions on both the left and right context of each token (Figure 6).
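As a concrete illustration of this encoder-only setup, the sketch below (assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint, which are not necessarily part of our exact pipeline) passes a sentence through BERT and prints one contextual vector per token.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "We used k-nearest neighbors to cluster our records."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per (sub-word) token, conditioned on both directions.
print(inputs.tokens())                  # ['[CLS]', 'we', 'used', ...]
print(outputs.last_hidden_state.shape)  # torch.Size([1, n_tokens, 768])
```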

Figure 6. Pre-training architecture of a left-to-right Transformer (Left) versus BERT (Right). BERT representations are jointly conditioned on both left and right context in all layers. [1] (Photo by Devlin et al.)

For the MLM task, BERT masks 15% of the input tokens and is trained to predict them. Since only this small percentage of tokens contributes to the loss function, BERT tends to converge more slowly than other approaches. For NSP, BERT is trained on pairs of sentences: half of the pairs are consecutive sentences from the corpus and half are random pairings. This allows the model to learn sentence continuity.

We evaluated two models: (1) BERT, which was pre-trained on generic texts from BooksCorpus and English Wikipedia, and (2) SciBERT [2], which was pre-trained on scientific texts from Semantic Scholar; 18% of that corpus is from the computer science domain and 82% from the broader biomedical domain. There is only a 42% overlap between the BERT-Base and SciBERT vocabularies, illustrating a substantial difference in frequently used words between scientific and general-domain texts.
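The vocabulary comparison is easy to reproduce from the publicly released checkpoints. The sketch below (assuming the Hugging Face transformers package, which was not necessarily what the original comparison used) computes the overlap between the two wordpiece vocabularies; the exact percentage depends on how overlap is measured.

```python
from transformers import AutoTokenizer

# Public checkpoints for the two pre-trained encoders we compare.
bert = AutoTokenizer.from_pretrained("bert-base-uncased")
scibert = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

bert_vocab = set(bert.get_vocab())
scibert_vocab = set(scibert.get_vocab())

# Overlap measured here as Jaccard similarity of the two vocabularies.
overlap = len(bert_vocab & scibert_vocab) / len(bert_vocab | scibert_vocab)
print(f"BERT-Base vocab size: {len(bert_vocab):,}")
print(f"SciBERT vocab size:   {len(scibert_vocab):,}")
print(f"Vocabulary overlap:   {overlap:.1%}")
```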

Multi-class to Multi-label scenario

For our NER task, we add a token-level classifier on top of the pre-trained BERT models. The classifier predicts an entity class (Methods, Products, Organizations, or None) for each word token. We started with a multi-class NER task, but later realized that our problem is a multi-label one: we should be able to assign more than one class to a word token. For example, "IEEE" in "IEEE Xplore" should be labeled as both Organization (since it is the name of an organization) and Product (as it is part of the official product name). We re-annotated our dataset and made some modifications to the original multi-class models. Instead of predicting a single class for each word token, we now predict an array of classes, where each element in the array is a binary flag denoting membership in that class. We also changed our loss function from cross entropy loss to binary cross entropy loss.
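A minimal sketch of such a multi-label token classifier is shown below: a linear layer on top of a pre-trained encoder, trained with a per-class binary cross entropy loss so that each token can receive any subset of the three entity classes (the None class corresponds to all flags being zero). The module structure and defaults are illustrative, not our exact implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

ENTITY_CLASSES = ["Method", "Product", "Organization"]  # "None" = all zeros

class MultiLabelNER(nn.Module):
    """Token-level multi-label classifier on top of a BERT encoder."""

    def __init__(self, encoder_name="allenai/scibert_scivocab_uncased",
                 num_labels=len(ENTITY_CLASSES), dropout=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)
        # BCEWithLogitsLoss = sigmoid + binary cross entropy per class,
        # so each token can belong to several classes at once.
        self.loss_fn = nn.BCEWithLogitsLoss()

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        logits = self.classifier(self.dropout(hidden))  # (batch, seq_len, 3)
        if labels is None:
            return logits
        # Score only real tokens (attention_mask == 1), not padding.
        mask = attention_mask.bool()
        return self.loss_fn(logits[mask], labels[mask].float()), logits
```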

Results

For the BERT-Base and SciBERT models, we used 60% of the labeled data for training, 20% for validation, and 20% for testing. We used Optuna to tune the model hyperparameters on the train and validation sets, and the best set of hyperparameters was then used to evaluate the model on the test set. Figure 7 plots the binary cross entropy loss over training steps.
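The tuning loop with Optuna looks roughly like the sketch below, where train_and_evaluate is a hypothetical helper that fine-tunes the model on the 60% training split with the sampled hyperparameters and returns the validation F1; the search ranges shown are illustrative, not the ones we actually used.

```python
import optuna

def objective(trial):
    # Illustrative search space; the actual ranges were chosen per model.
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [8, 16, 32]),
        "num_epochs": trial.suggest_int("num_epochs", 2, 6),
        "dropout": trial.suggest_float("dropout", 0.0, 0.3),
    }
    # Hypothetical helper: fine-tune on the training split and
    # return macro-F1 on the validation split.
    return train_and_evaluate(params)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print("Best hyperparameters:", study.best_params)
```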

Figure 7. Loss as a function of training step for BERT-Base (Left) and SciBERT (Right) (Photo by the authors)

Table 3 reports the F1 score for each entity class. We report 10-fold cross-validated F1 scores for BERT-Base and SciBERT models on the IACS multi-label dataset with 200 test samples.

Table 3. F1 scores for BERT-Base and SciBERT models for each entity class. Mean and standard deviation over 10 folds reported. (Photo by the authors)

Cross-validation results in Table 3 show that our model performance is generally consistent across runs, though the BERT-Base model has larger variance in its performance for the Organization class.
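The cross-validation itself is a standard loop; a sketch is shown below, where train_and_evaluate_fold is a hypothetical helper that fine-tunes on one fold's training split and returns per-class F1 on the held-out split.

```python
import numpy as np
from sklearn.model_selection import KFold

CLASSES = ["Method", "Product", "Organization"]

def cross_validate(abstracts, train_and_evaluate_fold, n_splits=10, seed=42):
    """Collect per-class F1 across folds and report mean and standard deviation."""
    fold_scores = []  # one dict of per-class F1 per fold
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kfold.split(abstracts):
        # Hypothetical helper: fine-tune on the training fold and return
        # {"Method": f1, "Product": f1, "Organization": f1} on the test fold.
        fold_scores.append(train_and_evaluate_fold(abstracts, train_idx, test_idx))
    return {
        cls: (np.mean([s[cls] for s in fold_scores]),
              np.std([s[cls] for s in fold_scores]))
        for cls in CLASSES
    }
```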

Coreference Resolution

Modeling Approach

Coreference resolution is the task of grouping expressions that refer to the same entity. These references come in many forms. For this project, we are interested in references to the same entity by name or abbreviation, such as "Massachusetts Institute of Technology" and "MIT", or "k-Nearest Neighbors" and "kNN".

While reviewing the IEEE abstracts, it became clear that many of these abbreviations were defined at least once in the text. Definitions were almost always of the form "We used k-nearest neighbors (kNN) to cluster our records," where the abbreviation is in parentheses after the phrase it abbreviates. In our annotation guidelines, we decided that these abbreviations should be selected as part of the entity when they appear. That meant it was a fairly easy task to identify this pattern in our annotated abstracts and parse out the abbreviation and the full form. We found 339 instances of this form in the 3,074 labels of the Method class.
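A simplified regex-based sketch of this parsing step is shown below; the pattern and the letter-counting heuristic are illustrative and looser than what a production parser would need.

```python
import re

# Matches an abbreviation in parentheses and captures up to six words
# immediately preceding it, e.g. "k-nearest neighbors (kNN)".
ABBREV_PATTERN = re.compile(r"((?:[\w-]+\s+){1,6})\(([A-Za-z]{2,10})\)")

def extract_abbreviations(text):
    """Return (full_form, abbreviation) pairs defined inline in the text."""
    pairs = []
    for match in ABBREV_PATTERN.finditer(text):
        words, abbrev = match.group(1).split(), match.group(2)
        # Keep roughly as many preceding words as the abbreviation has
        # capital letters: a rough alignment heuristic, not an exact rule.
        n = len([c for c in abbrev if c.isupper()]) or len(abbrev)
        pairs.append((" ".join(words[-n:]), abbrev))
    return pairs

print(extract_abbreviations("We used k-nearest neighbors (kNN) to cluster our records."))
# [('k-nearest neighbors', 'kNN')]
```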

For each entity class, we then took all of the labeled entities plus these parsed entities and compared pairwise to find entities and abbreviations that were similar to one another. The number of comparisons we need to perform scales roughly quadratically with the number of entity labels, so we needed a very performant method. We considered a Locality-Sensitive Hashing approach such as datasketch, but ultimately used a somewhat faster and exact solution [3] as implemented in the Python package SetSimilaritySearch. This implementation required each entity label to be represented as an unordered set, which we chose to construct through downcasing and a character-based 2-shingle. For example, "Markov" becomes ("ma", "ar", "rk", "ko", "ov"). We then used the above package to find all pairs whose Jaccard similarity was greater than 0.85. To turn all these pairwise comparisons into clusters, we created a network graph of the entity labels and added every discovered pair as an edge. To extract clusters, we simply extracted the graph’s connected components.
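Condensed into code, the procedure looks roughly like the sketch below. The entity strings are illustrative, and in the full pipeline the parsed full-form/abbreviation pairs contribute additional edges to the graph.

```python
import networkx as nx
from SetSimilaritySearch import all_pairs

def shingles(entity, k=2):
    """Downcase an entity and represent it as a set of character k-shingles."""
    s = entity.lower()
    return {s[i:i + k] for i in range(len(s) - k + 1)} or {s}

# Illustrative entity labels; in practice these are all labeled plus
# parsed entities of one class (e.g. Method).
entities = ["k-Nearest Neighbors", "k-nearest neighbors", "kNN",
            "support vector machine", "support vector machines"]
sets = [shingles(e) for e in entities]

# Exact all-pairs search for Jaccard similarity above the threshold.
graph = nx.Graph()
graph.add_nodes_from(range(len(entities)))
for i, j, sim in all_pairs(sets, similarity_func_name="jaccard",
                           similarity_threshold=0.85):
    graph.add_edge(i, j)

# Each connected component is a cluster of co-referring entity labels.
clusters = [[entities[i] for i in comp] for comp in nx.connected_components(graph)]
print(clusters)
```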

Figure 8. Illustration of our coreference resolution procedure (Photo by the authors)

Results

To assess performance, we reviewed all of the created clusters for each entity class and identified mis-clustered entity labels. As an example of mis-clustering, consider this cluster of Methods: ‘SA’, ‘simulated annealing’, ‘simulated annealing (SA)’, ‘spherical aberration’, and ‘spherical aberration (SA)’. It’s clear that "simulated annealing" and "spherical aberration" are not referring to the same process, but our method clustered them together because they are both abbreviated by "SA". Fewer than 4% of the entities were mis-clustered for each entity class, which we consider good performance. See Table 4 for more detail.

Table 4. Clustering performance for coreference resolution (Photo by the authors)

Discussion

Reviewing BERT model predictions

We reviewed the predictions made by our BERT-based models against the human labels. Figure 9 shows samples of good predictions made by our models. Both BERT-Base and SciBERT successfully identified the same entities and their corresponding classes as the human annotator.

Figure 9. Sample abstracts with good predictions by BERT and SciBERT (Photo by the authors)

When we looked at the failure modes of the models, we noticed that both BERT and SciBERT were especially prone to making mistakes around punctuation (e.g. dashes or brackets) and trailing "s" characters. These tokens appear frequently in the texts, so it is likely that they also appear more often among the model’s failures. It is also possible that the models have difficulty resolving the boundaries of entities. Figure 10 shows some errors made by the model.

Figure 10. Sample abstracts with errors made by BERT and SciBERT (Photo by the authors)

In the first example, both BERT-Base and SciBERT captured only part of the entity. In the second, the human annotator did not label anything in the abstract, but both BERT-Base and SciBERT picked up on "Information Centric Network (ICN)", which arguably could have been labeled as an entity.

BERT-Base predictions by Field of Study

One of our findings from exploratory analysis was that different fields of study (e.g. Computer Science, History) differ in their vocabulary, context, and language use. For this reason, we made sure our labeled datasets included examples across the humanities, social sciences, and sciences, so that the model would be applicable to any paper submitted to IEEE.

Figure 11. F1 scores by field of study for validation set (Photo by the authors)

Figure 11 above shows that the model does perform differently across the various fields of study. We see that for two fields of study, History and Political Science, there were no method entities and so the F1 score was 0. The same was true on the training set. This suggests that we would need to increase the number of training samples for such fields if we would like to better evaluate the model’s field-specific performance.

We also see trends depending on the annotator team’s expertise. The highest F1 scores for the method entity tend to be for fields in which the annotation team has substantial expertise: mathematics, engineering, and computer science. (Philosophy and geology are outliers.) This suggests that annotator expertise has an effect on model performance.

Because fields like computer science, engineering, mathematics, and materials science make up about 85% of all IEEE abstracts, and thus of all training and validation samples, these are the fields that have the most impact on model performance and metrics. The overall F1 score for Method is 0.64, while computer science, engineering, and mathematics have F1 scores of 0.65, 0.65, and 0.68, respectively. Meanwhile, materials science has an F1 score of 0.47.

Inter-Annotator Agreement

Even after defining the entity classes with IEEE and creating an annotation guideline, the team still found it difficult to label the abstracts. There were ambiguities (the guidelines don’t cover every case) and difficult jargon that the team grappled with. To quantify these difficulties, we measured how well the annotators agreed on their annotations of a common set of 50 abstracts.

To make our inter-annotator agreement (IAA) metric comparable to our model performance metric, we aligned entity labels with the token boundaries from our model’s tokenizer. This allowed us to compare one human annotator to another in exactly the same way we compare the model’s output to a human annotator. Following [4], we looked at all six possible pairs of our four annotators. For each pair, we arbitrarily picked one annotator as the "ground truth" and calculated the F1 score of the second annotator for each entity class against this ground truth. (The choice of ground-truth annotator does not matter: swapping the two annotators simply swaps precision and recall, leaving the F1 score unchanged.) This F1 score can be interpreted as the performance of the second annotator on the task of recovering the first annotator’s labels, which is effectively the same task the model is performing.
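A sketch of this pairwise calculation is below; annotations is a hypothetical mapping from annotator name to token-level binary label arrays over the shared 50 abstracts, aligned to the model tokenizer, and scikit-learn's f1_score does the per-class scoring.

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import f1_score

CLASSES = ["Method", "Product", "Organization"]

def pairwise_f1(annotations):
    """F1 agreement between every pair of annotators, per entity class.

    annotations (hypothetical): dict mapping annotator name to a 0/1 array of
    shape (n_tokens, n_classes) over the shared abstracts, aligned to the
    same tokenizer output used by the model.
    """
    scores = {}
    for a, b in combinations(sorted(annotations), 2):
        y_a, y_b = np.asarray(annotations[a]), np.asarray(annotations[b])
        # Treating one annotator as "ground truth" is arbitrary: swapping
        # them swaps precision and recall, so the F1 score is unchanged.
        scores[(a, b)] = {
            cls: f1_score(y_a[:, i], y_b[:, i], zero_division=0)
            for i, cls in enumerate(CLASSES)
        }
    return scores
```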

This process resulted in six F1 scores for each entity class, one for each pair of annotators. In Figure 12, we compare these six F1 scores with our best model’s F1 score. For the Organization and Method classes, our model performs better than some pairs of annotators and worse than others. The None class is much larger than the other classes, and the model and all annotator pairs perform approximately the same on it. For the Product class, our model performs slightly worse than the pair of annotators that agrees the least. As the standard deviations in Table 3 show, some runs of the model do beat that least-agreeing human pair on the Product class, though those runs then perform slightly worse on other classes.

Figure 12 illustrates that our model is performing on par with humans. There may be some small performance gains to be had for the Product class, but we have just about hit the performance ceiling imposed by the data quality. To take this model much further, it is likely the case that the task and class definitions would need significant refinement.

Figure 12. BERT model F1 performance relative to F1 agreement between pairs of annotators (Photo by the authors)

References

[1] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv:1810.04805 [cs], Oct. 2018, Accessed: Mar. 04, 2021. [Online]. Available: http://arxiv.org/abs/1810.04805.

[2] I. Beltagy, K. Lo, and A. Cohan, "SciBERT: A Pretrained Language Model for Scientific Text," arXiv:1903.10676 [cs], Sep. 2019, Accessed: Mar. 04, 2021. [Online]. Available: http://arxiv.org/abs/1903.10676.

[3] R. J. Bayardo, Y. Ma, and R. Srikant, "Scaling up all pairs similarity search," in Proceedings of the 16th international conference on World Wide Web – WWW ’07, Banff, Alberta, Canada, 2007, p. 131, doi: 10.1145/1242572.1242591.

[4] G. Hripcsak and A. S. Rothschild, "Agreement, the F-Measure, and Reliability in Information Retrieval," Journal of the American Medical Informatics Association, vol. 12, no. 3, pp. 296–298, May 2005, doi: 10.1197/jamia.M1733.

[5] J. Huang et al., "Few-Shot Named Entity Recognition: A Comprehensive Study," arXiv:2012.14978 [cs], Dec. 2020, Accessed: Mar. 04, 2021. [Online]. Available: http://arxiv.org/abs/2012.14978.

[6] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy, "SpanBERT: Improving Pre-training by Representing and Predicting Spans," Transactions of the Association for Computational Linguistics, vol. 8, pp. 64–77, Dec. 2020, doi: 10.1162/tacl_a_00300.

[7] H. Lee, Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu, and D. Jurafsky, "Stanford’s Multi-Pass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task," Accessed: Mar. 04, 2021. [Online]. Available: https://nlp.stanford.edu/pubs/conllst2011-coref.pdf

[8] J. Lee et al., "BioBERT: a pre-trained biomedical language representation model for biomedical text mining," Bioinformatics, vol. 36, no. 4, pp. 1234–1240, Feb. 2020, doi: 10.1093/bioinformatics/btz682.

[9] V. Stoyanov and J. Eisner, "Easy-first Coreference Resolution," in Proceedings of COLING 2012, Mumbai, India, Dec. 2012, pp. 2519–2534, Accessed: Mar. 04, 2021. [Online]. Available: https://www.aclweb.org/anthology/C12-1154.

[10] R. B. Tchoua et al., "Creating Training Data for Scientific Named Entity Recognition with Minimal Human Effort," Accessed: Mar. 04, 2021. [Online]. Available: https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=926228

