NLP: Preprocessing Clinical Data to Find Sections

Cleaning Medical Text

In this post, we will use healthcare chart notes (doctors’ scribbled notes) to model the topics that exist in clinical notes. Keep in mind that there is no fixed structure for writing these notes.

In a later story, we will summarize these notes.

NLP Tasks that will be covered over 4 articles:

  1. Pre-processing and Cleaning
  2. Text Summarization
  3. Topic Modeling using Latent Dirichlet allocation (LDA)
  4. Clustering

If you want to try the entire code yourself or follow along, go to my published Jupyter notebook on GitHub: https://github.com/gaurikatyagi/Natural-Language-Processing/blob/master/Introdution%20to%20NLP-Clustering%20Text.ipynb

DATA:

Source: https://mimic.physionet.org/about/mimic/

Doctors take notes on their computers, and 80% of what they capture is not structured, which makes processing the information even more difficult. Let’s not forget that interpreting healthcare jargon is no easy task either; it requires a lot of context. Let’s see what we have:

Image by Author: Text Data as Input

Things we immediately notice:

  1. This is plain text with no markup. If it did have markup, we could have used libraries such as Beautiful Soup.
  2. The lines are artificially wrapped with new lines (wherever you see a single \n).
  3. No typos… woohoo! But there are many acronyms and capitalized words.
  4. There’s punctuation like commas, apostrophes, quotes, question marks, and hyphenated descriptions like "FOLLOW-UP".
  5. A lot of the content is sequenced, hence the appearance of ‘1.’, ‘2.’ and so on. But notice how there is actually a single line break even before these numbers (as in point 2 above).
  6. See how the de-identified names are all replaced with the likes of ‘Last Name’, ‘First Name3’, or ‘Hospital Ward Name’. The good thing is that these are all in square brackets and easy to identify. These are also always followed by a parenthesis. Yay! Something we can remove at the very beginning!
  7. Dates may need handling (if they are not all in the same format).
  8. Notice some formatting residue, such as \n\n***\n\n or \n??????\t. We will need to handle all of this.

1. Preprocessing

a. Regex: We will clean our text using regex patterns as a first pass. Yes, we do need multiple iterations of cleaning!
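Here is a minimal sketch of what such a first pass might look like (the patterns and the clean_note name are illustrative assumptions based on the artifacts we spotted above, not the notebook’s exact rules):

import re

def clean_note(text):
    ## De-identified placeholders such as [**Last Name**] (illustrative pattern)
    text = re.sub(r"\[\*\*[^\]]*\*\*\]", " ", text)
    ## Separator residue like \n\n***\n\n
    text = re.sub(r"\n+\*+\n+", "\n", text)
    ## Unwrap single artificial newlines, but keep blank-line paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    ## Collapse runs of spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()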

Image by Author: The output of Regex Cleaning

b. Add Context and Lemmatize text: Notice how we talked about lemmatization and not stemming. It is important to understand the difference between the two.
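As a quick illustration of that difference using NLTK (assumes nltk.download("wordnet") has been run):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

## A stemmer chops suffixes; a lemmatizer maps to a real dictionary form
print(stemmer.stem("studies"), "|", lemmatizer.lemmatize("studies"))         ## studi | study
print(stemmer.stem("better"), "|", lemmatizer.lemmatize("better", pos="a"))  ## better | good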

In this section, we identify sub-topics within the chart by extracting CAPITAL words as "potential topics". Experts then manually tag each of these phrases as "T" or "F" for acceptance. Once we have that, we change the case of all other words. Wait, it’s not so easy; remember, we need to lemmatize.
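A sketch of the extraction step (the pattern and function name are my own illustration):

import re

def extract_candidate_topics(text):
    ## Runs of CAPITALIZED words (allowing hyphens), e.g. "FOLLOW-UP PLAN",
    ## become candidate sub-topic headings for experts to tag as "T" or "F"
    pattern = r"\b[A-Z][A-Z\-]+(?:\s+[A-Z][A-Z\-]+)*\b"
    return sorted(set(re.findall(pattern, text)))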

Image by Author: 124 subtopics identified

Now, we will change the case of all CAPITAL words to lower case, unless they are the topics identified above.

Find the lemma for each of the above topics by running it through a lemmatizer. We remove stop words within topics to catch more matches in the "capitalized list of words".
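In NLTK terms, that could look roughly like this (assumes nltk.download("wordnet") and nltk.download("stopwords"); the function name is illustrative):

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def lemmatize_topic(topic):
    ## Drop stop words inside the topic phrase, then lemmatize each remaining word
    words = [w for w in topic.lower().split() if w not in stop_words]
    return " ".join(lemmatizer.lemmatize(w) for w in words)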

So, what changed here?

for index in range(len(lemmatized_topics_index)):
    print("Original Topic: %s\nNew Topic Index: %s\nNew Topic value: %s" % (
        topics[index],
        lemmatized_topics_index[index],
        lemmatized_topics_values[index]
    ))
    ## you can remove this to see all pairs
    if index > 2:
        break
    print("\n\n")
Image by Author: The difference between lemmatization and stemming

Compare the stem of every topic and change it to lower case if it does not match the stem of a lemmatized topic.

Why lemmatize first and then match only on the stem? Because lemmatization is done based on the part of speech the word falls under, whereas stemming just tries to find the root of a word by stripping common suffixes (plurals, etc.).

So, we compare on the stemmed portion but keep the lemmatized word of the identified topic as the final topic. This also gives us standardized topics across different chart notes.
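A hedged sketch of that matching step, assuming lemmatized_topics holds the single-word lemmatized topics from above (names are illustrative):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
## Map each lemmatized topic’s stem back to its standardized lemmatized form
lemma_stems = {stemmer.stem(topic): topic for topic in lemmatized_topics}

def normalize_word(word):
    stem = stemmer.stem(word.lower())
    ## Keep the standardized topic if the stems match; otherwise just lower-case
    return lemma_stems.get(stem, word.lower())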

Image by Author: Cleaned Topics

Here is our final list of topics, with which analysts can parse and extract only the required information using spans in spaCy. Isn’t that wonderful!
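For instance, with a toy note (a minimal, self-contained spaCy example; the text and offsets are hand-picked):

import spacy

nlp = spacy.blank("en")  ## tokenizer only; no pretrained model needed
doc = nlp("DISCHARGE DIAGNOSIS : pneumonia . FOLLOW-UP PLAN : see PCP in two weeks .")

## Tokens 0..3 cover the "DISCHARGE DIAGNOSIS : pneumonia" section as a spaCy Span
section = doc[0:4]
print(section.text)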

In my next post, I will talk about Text summarization.

