Applied Text Mining in Python | University of Michigan

A summary of the knowledge this course covers, along with its pros and cons.

Pytrick L.
Towards Data Science

--

This is not only a review but also a learning summary after finishing this course myself.

What you can take home after earning a certificate in this course

Introduction to Text Mining

  • Tech skills covered: Python lists, string.split(), string.lower(), s.startswith(t), s.endswith(t), t in s, s.isupper(), s.islower(), s.istitle(), s.isalpha(), s.isdigit(), s.isalnum()
  • Other string operations, such as s.splitlines(), s.join(), s.strip(), s.rstrip(), s.find(), s.rfind(), s.replace(u,v)
  • Handling text sentences
  • Splitting sentences into words, words into characters
  • Finding unique words
  • Handling text from documents
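As a quick illustration, the basic string operations above can be combined like this (the sentence is a made-up example, not taken from the course):

```python
# A minimal sketch of the string operations listed above on a toy sentence.
text = "Ethics are built right into the ideals and objectives of the UN"

words = text.split()                                # split on whitespace into words
long_words = [w for w in words if len(w) > 3]       # words longer than 3 characters
capitalized = [w for w in words if w.istitle()]     # title-case words, e.g. "Ethics"
unique_lowercased = set(w.lower() for w in words)   # unique words, case-folded

print(len(words))            # 12
print("UN".isupper())        # True
print("the".islower())       # True
print("-".join(["a", "b"]))  # a-b
print("  padded  ".strip())  # padded
```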

Regular Expressions

  • What regular expressions are and how to write them in Python
  • Regular expression meta-characters: character matches, character symbols, repetitions
  • Building a regular expression to identify dates
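For instance, a simple date pattern with Python's `re` module might look like the sketch below (this pattern is my own illustration; the course builds up a more robust version step by step):

```python
import re

# A hypothetical date-matching pattern in the spirit of the lecture's example:
# matches dd/mm/yyyy or dd-mm-yyyy style dates in free text.
date_pattern = r"\b(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})\b"

text = "The meeting moved from 23/10/2002 to 3-11-2002."
dates = re.findall(date_pattern, text)
print(dates)  # [('23', '10', '2002'), ('3', '11', '2002')]
```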

English and ASCII

  • Diversity in Text
  • ASCII and other character encodings
  • Handling text in UTF-8
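A small sketch of the difference between ASCII and UTF-8 (the example string is my own, not from the course):

```python
# The same string as characters vs. UTF-8 bytes vs. ASCII.
s = "résumé"

utf8_bytes = s.encode("utf-8")   # each é becomes two bytes in UTF-8
print(len(s), len(utf8_bytes))   # 6 8

# ASCII cannot represent é; encoding fails unless we choose an error policy.
try:
    s.encode("ascii")
except UnicodeEncodeError:
    print("not representable in ASCII")

print(s.encode("ascii", errors="ignore").decode("ascii"))  # rsum
print(utf8_bytes.decode("utf-8") == s)                     # True
```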

Basic Natural Language Processing

  • Basic usage of NLTK: removing stopwords and exploring its functions with NLTK's built-in text data
  • Normalization and Stemming
  • Lemmatization
  • Tokenization
  • Sentence Splitting
  • Part-of-speech (POS) Tagging provides insights into the word classes/types in a sentence
  • Parsing the grammatical structures helps derive meaning
  • Both tasks are difficult; linguistic ambiguity increases the difficulty even more
  • Better models could be learned with supervised training
  • NLTK provides access to tools and data for training
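As a taste of NLTK, here is a stemming sketch (it assumes `nltk` is installed; the Porter stemmer needs no extra corpus downloads, unlike POS tagging or stopword removal):

```python
# Normalization (lowercasing) followed by Porter stemming with NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["Universal", "Universe", "universities", "running", "ponies"]
stems = [stemmer.stem(w.lower()) for w in words]
print(stems)  # related word forms collapse toward a shared stem
```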

Text Classification

  • Examples of Text Classification: Topic identification, Spam Detection, Sentiment analysis, Spelling correction
  • Supervised Classification: Binary classification, Multi-class classification (when there are more than two possible classes), Multi-label classification (when data instances can have two or more labels)
  • Using Sklearn’s naive Bayes classifier
  • Using Sklearn’s SVM classifier
  • Model Selection in Scikit-learn
  • Supervised Text Classification in NLTK
  • Using NLTK’s NaiveBayesClassifier
  • Using NLTK’s SklearnClassifier
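A toy version of the scikit-learn Naive Bayes workflow might look like this (the tiny spam/ham dataset is invented purely for illustration):

```python
# Supervised text classification: bag-of-words counts + multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["free money now", "win a prize today",
               "meeting at noon", "lunch with the team"]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["free prize"]))   # ['spam'] on this toy data
```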

Naive Bayes and Support Vector Machine (with case study)

  • Naive Bayes is a probabilistic model; it is called naive because it assumes features are independent of each other given the class label
  • Support Vector Machines tend to be among the most accurate classifiers, especially on high-dimensional data
  • Strong theoretical foundation
  • Handles only numeric features: categorical features must be converted to numeric ones, and normalization helps
  • The separating hyperplane is hard to interpret
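Since SVMs need numeric input, text is first vectorized; here is a hedged sketch using tf-idf features (which normalizes by default) and `LinearSVC`, again on an invented toy dataset:

```python
# Linear SVM on tf-idf features: text -> numeric vectors -> classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_texts = ["free money now", "win a prize today",
               "meeting at noon", "lunch with the team"]
train_labels = ["spam", "spam", "ham", "ham"]

svm = make_pipeline(TfidfVectorizer(), LinearSVC())
svm.fit(train_texts, train_labels)
print(svm.predict(["team lunch at noon"]))  # ['ham'] on this toy data
```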

Identifying Features from Text

  • Types of textual features: words, characteristics of words, parts of speech of words in a sentence, bigrams, trigrams, n-grams
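Extracting n-gram features needs no library at all; a minimal sketch:

```python
# Word n-grams as overlapping tuples over a token sequence.
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 2))  # bigrams, e.g. ('to', 'be'), ('be', 'or'), ...
print(ngrams(tokens, 3))  # trigrams
```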

Topic Modeling and LDA

  • What is topic modeling
  • Working with Latent Dirichlet Allocation (LDA) in Python
  • LDA is a generative model used extensively for modeling large text corpora
  • LDA can also be used as a feature selection technique for text classification and other tasks
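A toy LDA run with scikit-learn might look like this (a real corpus would be far larger, and gensim is another common choice for LDA in Python):

```python
# LDA: learn per-document topic mixtures from word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats and dogs are pets", "dogs chase cats",
        "stocks and bonds are investments", "bonds yield interest"]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # document-topic distributions

print(doc_topics.shape)  # (4, 2): one topic mixture per document
```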

Semantic Text Similarity

  • Applications of semantic similarity: grouping similar words into semantic concepts, textual entailment, paraphrasing
  • Semantic similarity using WordNet
  • Path Similarity, Lin similarity, Distributional Similarity: Context
  • Finding similarity between words and text is non-trivial
  • WordNet is a useful resource for semantic relationships between words
  • Many similarity functions exist
  • NLTK is a useful package for many such tasks
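WordNet-based measures (path similarity, Lin similarity) require downloading NLTK's wordnet corpus, so as a self-contained stand-in here is one of the many simpler similarity functions: Jaccard word overlap between two texts (my own toy example, not the course's method):

```python
# Jaccard similarity: size of the word-set overlap over the word-set union.
def jaccard_similarity(a, b):
    """Overlap of word sets: |A & B| / |A | B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

print(jaccard_similarity("the cat sat", "the cat ran"))  # 0.5
```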

Information Extraction

  • Information Extraction is important for natural language understanding and making sense of textual data
  • Named Entity Recognition is a key building block to address many advanced NLP tasks
  • Named Entity Recognition systems extensively deploy supervised machine learning and text mining techniques discussed in this course
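Full NER needs trained models, but a toy rule-based extraction with regular expressions gives the flavor of information extraction (the text and patterns below are illustrative only, not the course's system):

```python
import re

# Rule-based extraction: pull email addresses and four-digit years from text.
text = "Contact jane.doe@example.com about the 2019 report."

emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
years = re.findall(r"\b(?:19|20)\d{2}\b", text)

print(emails)  # ['jane.doe@example.com']
print(years)   # ['2019']
```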

The Pros of this course:

I would say this course covers almost all the essential knowledge and applications of text mining for someone who wants to use it in the real world; just as the course's name implies, it is for people who want to apply text mining.

It also includes pseudocode in the lecture slides and good assignments. Learning by doing is always the best way for me to learn and remember; without practicing, I forget what I have learned within days.

The cons of this course:

If I must criticize this course, I would say it is too basic and doesn't cover any advanced NLP knowledge or skills, not even a brief introduction to neural-network approaches to NLP. In my view, the instructor could introduce transfer learning for NLP tasks, or other advanced techniques, and give references for students who want to explore more.

Another issue is that the course teaches the usage of multiple classifiers but only covers Naive Bayes in detail. In my experience, the Naive Bayes classifier usually performs poorly, while the random forest classifier consistently performs well, yet the lectures don't even mention it. That is a real gap for a course called Applied Text Mining; perhaps the course was designed years ago and needs an update.

Overall, I would give it a score between 4 and 4.5; 4.25 seems appropriate.

But if you are brand new to text mining, this course is worth 5 stars to you.

You can find the git repo for this course on my GitHub if you want to go through the materials before taking it on Coursera.
