In this post, we will be using healthcare chart note data (doctors' scribbled notes) to model the topics that exist in clinical notes. Keep in mind, there is no standard structure for writing these notes.
Doctors take notes on their computers, and 80% of what they capture is not structured, which makes processing the information even more difficult. Let's not forget, healthcare jargon is not easy to interpret either; it requires a lot of context. Let's see what we have:
Image by Author: Text Data as Input
Things we immediately notice:
This is plain text with no markup. If it did have markup, we could have used libraries such as Beautiful Soup
The lines are artificially wrapped with new lines (wherever you see a single \n)
No typos... woohoo! But there are a lot of acronyms and capital letters
There’s punctuation like commas, apostrophes, quotes, question marks, and hyphenated descriptions like "FOLLOW-UP"
There is a lot of numbered, sequential content, hence the appearance of '1.', '2.', and so on. But notice how there is actually a single line break even before these numbers (as in item 2)
See how the de-identified names are all replaced with the likes of 'Last Name', 'First Name3', or 'Hospital Ward Name'. The good thing is these are all in square brackets and easy to identify. They are also always followed by a parenthesis. Yay! Something we can remove right at the beginning (handled in the regex pass below)!
Notice that dates might need special handling (if they are not all in the same format)
Notice some formatting artifacts, such as \n\n***\n\n or \n??????\t. We will need to handle all of these
1. Preprocessing
a. Regex: We will clean our text using regex patterns as pass 1. Yes, we do need multiple iterations of cleaning!
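Here is what pass 1 might look like. This is a minimal sketch: the helper name clean_chart_note and the exact patterns are my assumptions, based on the artifacts we listed above.

import re

def clean_chart_note(text):
    # De-identified placeholders in square brackets, optionally followed
    # by a parenthesized tag, e.g. "[First Name3] (NamePattern1)"
    text = re.sub(r"\[[^\]]*\]\s*(?:\([^)]*\))?", " ", text)
    # Collapse separator runs such as \n\n***\n\n
    text = re.sub(r"\n+\s*\*+\s*\n+", "\n", text)
    # Unwrap artificially wrapped lines: a single \n becomes a space,
    # while blank lines (paragraph breaks) are kept
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Squeeze repeated spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()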
In this section, we identify sub-topics within the chart by extracting CAPITAL words as "potential topics". Experts then manually tag each of these phrases as "T" or "F" for acceptance. Once we have that, we change the case of all other words. Wait, it's not that easy: remember, we need to lemmatize.
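As a sketch of the extraction step (the function name and the exact pattern are assumptions on my part), a simple regex does the job:

import re

def extract_potential_topics(text):
    # Runs of 2+ capital letters, allowing hyphens ("FOLLOW-UP")
    # and multi-word phrases ("CHIEF COMPLAINT")
    pattern = r"\b[A-Z][A-Z-]+(?:\s+[A-Z][A-Z-]+)*\b"
    return sorted(set(re.findall(pattern, text)))

# Experts then tag each candidate "T" (accept) or "F" (reject);
# the accepted candidates become our `topics` list.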
Image by Author: 124 subtopics identified
Now, we will change the case of all CAPITAL words to lower case, unless they are among the topics identified above.
Find the lemma for each of the above topics by running it through a lemmatizer. We also remove stop words within topics, so that we catch more matches in the capitalized list of words.
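One way to set this up, using NLTK's WordNet lemmatizer (a sketch; the library choice is my assumption, and the variable names mirror the printing loop below):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
nltk.download("stopwords", quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

# `topics` is the expert-accepted list from the previous step
lemmatized_topics_index = []   # where each topic sits in the original list
lemmatized_topics_values = []  # its lemmatized, stop-word-free form
for i, topic in enumerate(topics):
    words = [w for w in topic.lower().split() if w not in stop_words]
    lemmatized_topics_index.append(i)
    lemmatized_topics_values.append(" ".join(lemmatizer.lemmatize(w) for w in words))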
So, what changed here?
for index in range(len(lemmatized_topics_index)):
    print("Original Topic: %s\nNew Topic Index: %s\nNew Topic value: %s" % (
        topics[index],
        lemmatized_topics_index[index],
        lemmatized_topics_values[index],
    ))
    # you can remove this break to see all pairs
    if index > 2:
        break
print("\n\n")
Image by Author: The difference between lemmatization and stemming
Compare the stem of every topic and change it to lower case if it does not match the stem of a lemmatized topic.
Why lemmatize first and then match only on the stem? Because lemmatization depends on which part of speech the word falls under, whereas stemming just tries to find the root of a word by stripping out common suffixes (plural endings and the like).
So, we compare on the stemmed form but keep the lemmatized word of the identified topic as the final topic. This also gives us standardized topics across different chart notes.
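A sketch of that comparison with NLTK's PorterStemmer (the helper names here are mine):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_phrase(phrase):
    # stem each word of a (possibly multi-word) topic
    return " ".join(stemmer.stem(w) for w in phrase.split())

# map each lemmatized topic's stem back to the lemmatized topic itself
topic_stems = {stem_phrase(t): t for t in lemmatized_topics_values}

def normalize(phrase):
    # on a stem match, keep the lemmatized topic as the standardized form;
    # otherwise just lower-case the phrase
    return topic_stems.get(stem_phrase(phrase.lower()), phrase.lower())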
Image by Author: Cleaned Topics
Here is our final list of topics, from which analysts can parse and extract only the required information by using spans in spaCy. Isn't that wonderful!
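For example, one way to pull out the section under each topic uses spaCy's PhraseMatcher (my choice here, not necessarily the only one; `cleaned_note` is the preprocessed text and is an assumed variable):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(cleaned_note)  # the preprocessed chart text from earlier

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("TOPIC", [nlp.make_doc(t) for t in lemmatized_topics_values])

# Each match is a Span; the tokens between one topic and the next
# are the content that belongs to the earlier topic.
matches = sorted(matcher(doc), key=lambda m: m[1])
for (_, start, end), nxt in zip(matches, matches[1:] + [(None, len(doc), len(doc))]):
    topic, section = doc[start:end], doc[end:nxt[1]]
    print(topic.text, "->", section.text[:60])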