
Public Benchmarks for Medical Natural Language Processing

A general introduction to the canonical tasks and corresponding public datasets for measuring medical natural language processing

Source: https://unsplash.com/

The field of natural language processing (NLP) has evolved rapidly in recent years. Breakthroughs like the Transformer, BERT, and GPT have emerged one after another. Practitioners across industries are exploring how to leverage these exciting developments in their specific business domains and workflows [1]. One industry that stands to benefit greatly from advances in NLP is healthcare. The vast amount of free-text medical notes carries incredible data insights, which can inform better care provision, cost optimization, and healthcare innovation. To measure the efficacy of applying NLP to the medical field, we need good benchmarks. This blog post lists the canonical public benchmarks for the common tasks in medical natural language processing. The goal is to provide a starting point for healthcare machine learning practitioners to measure their NLP endeavours.

Entity/Relation Recognition

The task of entity/relation recognition is to detect and categorize the medical concepts in free text, along with the relations between them. It is a crucial step in extracting actionable insights from clinical notes and reports. The canonical dataset for this is Informatics for Integrating Biology and the Bedside (i2b2) [2]. The dataset contains de-identified patient reports from a few partnered medical organizations, with 394 training reports and 477 test reports. The labeled medical concepts are of three types: problems, treatments, and tests. The labeled relations include things like treatment improves problem, test reveals problem, problem indicates another problem, and so on.

Here is a concrete example:

1 The patient is a 63-year-old female with a 3-year history of bilateral
  hand numbness
2 She had a workup by her neurologist and an MRI revealed a C5-6 disc
  herniation with cord compression

-----------
# Lines are numbered. Words are indexed starting from 0.
-----------
# Entity || type
bilateral hand numbness 1:11-13 || problem
a workup 2:2-3 || test
an MRI 2:8-9 || test
a c5-6 disc herniation 2:11-14 || problem
cord compression 2:16-17 || problem
-----------
# Entity || relation || entity
an MRI 2:8-9 || test reveals problem || a c5-6 disc herniation 2:11-14
an MRI 2:8-9 || test reveals problem || cord compression 2:16-17
a c5-6 disc herniation 2:11-14 || problem indicates another problem || cord compression 2:16-17

Only a full recognition is considered correct: for an entity, both the start and end word indices must be accurate, and for a relation, the left entity, the right entity, and the relation type must all be accurate. The final evaluation metrics are precision, recall, and F1 score.
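To make the scoring concrete, here is a minimal Python sketch of exact-match entity evaluation (a simplified stand-in for, not a reproduction of, the official i2b2 evaluation script), where each entity is represented as a (line, start word, end word, type) tuple:

def exact_match_prf1(gold_entities, predicted_entities):
    # Only entities matching on all four fields count as true positives.
    gold, pred = set(gold_entities), set(predicted_entities)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(1, 11, 13, "problem"), (2, 8, 9, "test"), (2, 16, 17, "problem")}
pred = {(1, 11, 13, "problem"), (2, 8, 9, "test"), (2, 15, 17, "problem")}  # wrong span
print(exact_match_prf1(gold, pred))  # (0.667, 0.667, 0.667) up to rounding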

Semantic Similarity

Semantic similarity evaluates the semantic equivalence between two snippets of medical text. Clinical Semantic Textual Similarity (ClinicalSTS) [3] is a canonical dataset for this task. It contains 1642 training and 412 test de-identified sentence pairs. The equivalence is measured on an ordinal scale of 0 to 5, with 0 indicating complete dissimilarity and 5 indicating complete semantic equivalence. The final performance is measured by the Pearson correlation between the predicted similarity scores Y' and the human judgements Y, calculated by the formula below (the higher, the better):

r = \frac{\sum_{i}(Y_i - \bar{Y})(Y'_i - \bar{Y}')}{\sqrt{\sum_{i}(Y_i - \bar{Y})^2}\,\sqrt{\sum_{i}(Y'_i - \bar{Y}')^2}}
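In practice the correlation can be computed with an off-the-shelf routine; here is a minimal sketch with made-up scores (not the official evaluation script):

from scipy.stats import pearsonr

# Hypothetical gold similarity scores (0-5) and model predictions.
y_true = [3.0, 1.0, 4.5, 0.0, 5.0]
y_pred = [2.5, 1.5, 4.0, 0.5, 4.8]

r, _ = pearsonr(y_true, y_pred)
print(f"Pearson correlation: {r:.3f}")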

Here are two concrete examples:

# sentence1
# sentence2
# similarity score

minocycline 100 mg capsule 1 capsule by mouth one time daily
oxycodone 5 mg tablet 1-2 tablets by mouth every 4 hours as needed
3

oxycodone 5 mg tablet 0.5-1 tablets by mouth every 4 hours as needed
pantoprazole [PROTONIX] 40 mg tablet enteric coated 1 tablet by mouth Bid before meals
1

Natural language inference

Natural language inference evaluates how well a medical hypothesis can be derived from a medical premise. MedNLI [4] is such a dataset. It contains de-identified medical history notes from a group of deceased patients. The notes are segmented into snippets, and human experts were asked to write 3 hypotheses based on each snippet. The 3 hypotheses are

  1. a clearly true description
  2. a clearly false description and
  3. a description that might be true or false,

representing 3 relations of the premise-hypothesis pair: entailment, contradiction, and neutral. The dataset contains 11232 training pairs, 1395 development pairs, and 1422 test pairs.

Here is a concrete example:

# sentence1
# sentence2
# relation
Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4
Patient has elevated Cr
entailment

The final performance can be measured by the classification accuracy of the relations given the premise-hypothesis pairs.
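One common way to model this is sentence-pair classification with a pre-trained transformer. The sketch below uses the Hugging Face transformers API with a generic BERT checkpoint purely for illustration; the checkpoint, label order, and lack of fine-tuning are all assumptions, and a real system would first fine-tune on the MedNLI training pairs:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # illustrative checkpoint, not a MedNLI model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

labels = ["entailment", "contradiction", "neutral"]  # assumed label order
premise = "Labs were notable for Cr 1.7 (baseline 0.5 per old records) and lactate 2.4"
hypothesis = "Patient has elevated Cr"

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(labels[logits.argmax(dim=-1).item()])  # meaningful only after fine-tuning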

Medical multiple-choice question answering

Medical multiple-choice question answering emulates multiple-choice medical exams. MedQA [5] is the canonical dataset for this purpose. Its questions are collected from medical board exams in the US and China, where human doctors are evaluated by picking the right answer from a set of choices. It contains 61097 questions.

Here is a concrete example:

A 57-year-old man presents to his primary care physician with a 2-month
history of right upper and lower extremity weakness. He noticed the weakness
when he started falling far more frequently while running errands. Since then,
he has had increasing difficulty with walking and lifting objects. His past
medical history is significant only for well-controlled hypertension, but he
says that some members of his family have had musculoskeletal problems. His
right upper extremity shows forearm atrophy and depressed reflexes while his
right lower extremity is hypertonic with a positive Babinski sign. Which of
the following is most likely associated with the cause of this patient's
symptoms?

A: HLA-B8 haplotype
B: HLA-DR2 haplotype
C: Mutation in SOD1 [correct]
D: Mutation in SMN1
E: Viral infection

Mechanically, this task can be treated as a scoring problem where the input is the question paired with each candidate answer_i, and the output is a numeric score. The answer_i with the highest score becomes the final answer. Performance can be measured by accuracy on an 80/10/10 train/development/test split of the dataset. This creates a benchmark on which model and human expert performance are directly comparable.
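A minimal sketch of that scoring setup, with a hypothetical score_fn standing in for whatever model produces the per-answer score:

def pick_answer(question, candidate_answers, score_fn):
    # score_fn is a hypothetical callable mapping (question, answer) to a score.
    scores = [score_fn(question, answer) for answer in candidate_answers]
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return candidate_answers[best_index]

# Toy usage with a dummy scorer; a real scorer would be a trained model.
choices = ["HLA-B8 haplotype", "HLA-DR2 haplotype", "Mutation in SOD1",
           "Mutation in SMN1", "Viral infection"]
print(pick_answer("Which of the following ...?", choices, score_fn=lambda q, a: len(a)))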

Medical question answering

Medical question answering is the most complex form of medical NLP task. It requires the model to generate long-form free-text answers to a given medical question. emrQA [6] is a canonical dataset for this purpose, with roughly 400k question-answer pairs. Such a dataset would be very expensive to build relying only on human experts’ manual efforts. Therefore, emrQA was actually generated semi-automatically by

  • first polling medical experts on the frequently asked questions,
  • then replacing the medical concepts in those questions with placeholders, thus creating question templates,
  • and finally using annotated entity-relation datasets (such as i2b2) to establish the clinical context, fill in the questions, and generate the answers (a tiny sketch of this fill-in step follows the list).
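Here is a tiny sketch of the template fill-in step, using a made-up template and a toy entity annotation (the real emrQA generation pipeline is considerably more involved):

# Hypothetical question template; |treatment| is a placeholder for a medical concept.
template = "When did the patient last receive a |treatment|?"

# Toy annotation; in emrQA these come from annotated datasets such as i2b2.
annotation = {
    "treatment": "homograft replacement",
    "evidence": "08/31/96 ascending aortic root replacement with homograft with omentopexy.",
}

question = template.replace("|treatment|", annotation["treatment"])
answer = annotation["evidence"]
print(question)  # When did the patient last receive a homograft replacement?
print(answer)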

Here is a concrete example:

Context: 08/31/96 ascending aortic root replacement with homograft with
omentopexy. The patient continued to be hemodynamically stable making good
progress. Physical examination: BMI: 33.4 Obese, high risk. Pulse: 60. resp.
rate: 18

Question: Has the patient ever had an abnormal BMI?
Answer: BMI: 33.4 Obese, high risk
Question: When did the patient last receive a homograft replacement?
Answer: 08/31/96 ascending aortic root replacement with homograft with omentopexy.

Mechanically, this task can be seen as a language generation task where the input is the context plus the question, and the output is the answer. Final performance is typically measured on an 80/20 split of the dataset, using exact match and F1 score. Exact match measures the percentage of predictions that match the ground truth exactly. F1 score measures the "overlap" between the prediction and the ground truth: both are treated as bags of tokens, over which true/false positives and negatives can be counted.
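A minimal sketch of these two metrics, following the common SQuAD-style token-overlap definitions rather than any official emrQA script:

from collections import Counter

def exact_match(prediction, ground_truth):
    return int(prediction.strip().lower() == ground_truth.strip().lower())

def token_f1(prediction, ground_truth):
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # shared tokens
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("BMI: 33.4 Obese, high risk", "BMI: 33.4 Obese, high risk"))  # 1
print(token_f1("BMI: 33.4 Obese", "BMI: 33.4 Obese, high risk"))                # 0.5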

Conclusion

Researchers and practitioners continue to vigorously apply natural language processing (NLP) in the medical space. While it’s exciting to see the enthusiasm, it’s important to have public and reproducible benchmarks to measure the performance of such applications. This blog post lists the typical tasks, corresponding public datasets, and applicable metrics for this purpose, which can serve to quantify the potential improvement of new medical NLP applications.

References

[1] How to Use Large Language Models (LLM) in Your Own Domains https://towardsdatascience.com/how-to-use-large-language-models-llm-in-your-own-domains-b4dff2d08464

[2] 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3168320/

[3] The 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity: Overview https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7732706/

[4] MedNLI – A Natural Language Inference Dataset For The Clinical Domain https://physionet.org/content/mednli/1.0.0/

[5] What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams https://arxiv.org/abs/2009.13081

[6] emrQA: A Large Corpus for Question Answering on Electronic Medical Records https://arxiv.org/abs/1809.00732

