Applied Text Mining in Python | University of Michigan

A summary of the knowledge this course covers, along with its pros and cons.

Pytrick L.
Towards Data Science

--

This is not only a review but also a learning summary after finishing this course myself.

What you can take home after earning a certificate in this course

Introduction to Text Mining

  • Tech skills covered: Python lists, string.split(), string.lower(), s.startswith(t), s.endswith(t), t in s, s.isupper(), s.islower(), s.istitle(), s.isalpha(), s.isdigit(), s.isalnum()
  • Other string operations, such as s.splitlines(), s.join(), s.strip(), s.rstrip(), s.find(), s.rfind(), s.replace(u,v)
  • Handling text sentences
  • Splitting sentences into words, words into characters
  • Finding unique words
  • Handling text from documents
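As a quick illustration, the basic string operations above can be combined like this (the sentence is a made-up example, not taken from the course):

```python
# A minimal sketch of the string operations listed above on a toy sentence.
text = "Ethics are built right into the ideals and objectives of the UN"

words = text.split()                                # split on whitespace into words
long_words = [w for w in words if len(w) > 3]       # words longer than 3 characters
capitalized = [w for w in words if w.istitle()]     # title-case words, e.g. "Ethics"
unique_lowercased = set(w.lower() for w in words)   # unique words, case-folded

print(len(words))            # 12
print("UN".isupper())        # True
print("the".islower())       # True
print("-".join(["a", "b"]))  # a-b
print("  padded  ".strip())  # padded
```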

Regular Expressions

  • What regular expressions are and how to write them in Python
  • Regular expression meta-characters: character matches, character symbols, repetitions
  • Building a regular expression to identify dates
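For instance, a simple date pattern with Python's `re` module might look like the sketch below (this pattern is my own illustration; the course builds up a more robust version step by step):

```python
import re

# A hypothetical date-matching pattern in the spirit of the lecture's example:
# matches dd/mm/yyyy or dd-mm-yyyy style dates in free text.
date_pattern = r"\b(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})\b"

text = "The meeting moved from 23/10/2002 to 3-11-2002."
dates = re.findall(date_pattern, text)
print(dates)  # [('23', '10', '2002'), ('3', '11', '2002')]
```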

English and ASCII

  • Diversity in Text
  • ASCII and other character encodings
  • Handling text in UTF-8
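A small sketch of the difference between ASCII and UTF-8 (the example string is my own, not from the course):

```python
# The same string as characters vs. UTF-8 bytes vs. ASCII.
s = "résumé"

utf8_bytes = s.encode("utf-8")   # each é becomes two bytes in UTF-8
print(len(s), len(utf8_bytes))   # 6 8

# ASCII cannot represent é; encoding fails unless we choose an error policy.
try:
    s.encode("ascii")
except UnicodeEncodeError:
    print("not representable in ASCII")

print(s.encode("ascii", errors="ignore").decode("ascii"))  # rsum
print(utf8_bytes.decode("utf-8") == s)                     # True
```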

Basic Natural Language Processing

  • Basic usage of NLTK: removing stopwords and exploring its functions with NLTK's built-in text data
  • Normalization and Stemming
  • Lemmatization
  • Tokenization
  • Sentence Splitting
  • Part-of-speech (POS) Tagging provides insights into the word classes/types in a sentence
  • Parsing the grammatical structures helps derive meaning
  • Both tasks are difficult; linguistic ambiguity increases the difficulty even more
  • Better models could be learned with supervised training
  • NLTK provides access to tools and data for training
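As a taste of NLTK, here is a stemming sketch (it assumes `nltk` is installed; the Porter stemmer needs no extra corpus downloads, unlike POS tagging or stopword removal):

```python
# Normalization (lowercasing) followed by Porter stemming with NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["Universal", "Universe", "universities", "running", "ponies"]
stems = [stemmer.stem(w.lower()) for w in words]
print(stems)  # related word forms collapse toward a shared stem
```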

Text Classification

  • Examples of Text Classification: Topic identification, Spam Detection, Sentiment analysis, Spelling correction
  • Supervised Classification: Binary classification, Multi-class classification (when there are more than two possible classes), Multi-label classification (when data instances can have two or more labels)
  • Using Sklearn’s naive Bayes classifier
  • Using Sklearn’s SVM classifier
  • Model Selection in Scikit-learn
  • Supervised Text Classification in NLTK
  • Using NLTK’s NaiveBayesClassifier
  • Using NLTK’s SklearnClassifier
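A toy version of the scikit-learn Naive Bayes workflow might look like this (the tiny spam/ham dataset is invented purely for illustration):

```python
# Supervised text classification: bag-of-words counts + multinomial Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["free money now", "win a prize today",
               "meeting at noon", "lunch with the team"]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["free prize"]))   # ['spam'] on this toy data
```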

Naive Bayes and Support Vector Machine (with case study)

  • Naive Bayes is a probabilistic model; it is called naive because it assumes features are independent of each other given the class label
  • Support Vector Machines tend to be among the most accurate classifiers, especially on high-dimensional data
  • Strong theoretical foundation
  • Handles only numeric features: categorical features must be converted to numeric ones, and normalization helps
  • The separating hyperplane is hard to interpret
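Since SVMs need numeric input, text is first vectorized; here is a hedged sketch using tf-idf features (which normalizes by default) and `LinearSVC`, again on an invented toy dataset:

```python
# Linear SVM on tf-idf features: text -> numeric vectors -> classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_texts = ["free money now", "win a prize today",
               "meeting at noon", "lunch with the team"]
train_labels = ["spam", "spam", "ham", "ham"]

svm = make_pipeline(TfidfVectorizer(), LinearSVC())
svm.fit(train_texts, train_labels)
print(svm.predict(["team lunch at noon"]))  # ['ham'] on this toy data
```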

Identifying Features from Text

  • Types of textual features: words, characteristics of words, parts of speech of words in a sentence, bigrams, trigrams, n-grams
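Extracting n-gram features needs no library at all; a minimal sketch:

```python
# Word n-grams as overlapping tuples over a token sequence.
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 2))  # bigrams, e.g. ('to', 'be'), ('be', 'or'), ...
print(ngrams(tokens, 3))  # trigrams
```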

Topic Modeling and LDA

  • What is topic modeling
  • Working with Latent Dirichlet Allocation (LDA) in Python
  • LDA is a generative model used extensively for modeling large text corpora
  • LDA can also be used as a feature selection technique for text classification and other tasks
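A toy LDA run with scikit-learn might look like this (a real corpus would be far larger, and gensim is another common choice for LDA in Python):

```python
# LDA: learn per-document topic mixtures from word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats and dogs are pets", "dogs chase cats",
        "stocks and bonds are investments", "bonds yield interest"]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # document-topic distributions

print(doc_topics.shape)  # (4, 2): one topic mixture per document
```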

Semantic Text Similarity

  • Applications of semantic similarity: grouping similar words into semantic concepts, textual entailment, paraphrasing
  • Semantic similarity using WordNet
  • Path Similarity, Lin similarity, Distributional Similarity: Context
  • Finding similarity between words and text is non-trivial
  • WordNet is a useful resource for semantic relationships between words
  • Many similarity functions exist
  • NLTK is a useful package for many such tasks
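WordNet-based measures (path similarity, Lin similarity) require downloading NLTK's wordnet corpus, so as a self-contained stand-in here is one of the many simpler similarity functions: Jaccard word overlap between two texts (my own toy example, not the course's method):

```python
# Jaccard similarity: size of the word-set overlap over the word-set union.
def jaccard_similarity(a, b):
    """Overlap of word sets: |A & B| / |A | B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

print(jaccard_similarity("the cat sat", "the cat ran"))  # 0.5
```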

Information Extraction

  • Information Extraction is important for natural language understanding and making sense of textual data
  • Named Entity Recognition is a key building block to address many advanced NLP tasks
  • Named Entity Recognition systems extensively deploy supervised machine learning and text mining techniques discussed in this course
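Full NER needs trained models, but a toy rule-based extraction with regular expressions gives the flavor of information extraction (the text and patterns below are illustrative only, not the course's system):

```python
import re

# Rule-based extraction: pull email addresses and four-digit years from text.
text = "Contact jane.doe@example.com about the 2019 report."

emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
years = re.findall(r"\b(?:19|20)\d{2}\b", text)

print(emails)  # ['jane.doe@example.com']
print(years)   # ['2019']
```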

The Pros of this course:

I would say this course covers almost all the essential knowledge and applications of text mining for someone who wants to use it in the real world; just as the course's name implies, it is for people who want to apply text mining.

It also includes pseudocode in the lecture slides and good assignments. Learning by doing is always the best way for me to learn and remember; without practicing, I forget what I have learned within days.

The cons of this course:

If I must criticize this course, I would say it is too basic and doesn't cover any advanced NLP knowledge or skills, not even a brief introduction to neural-network approaches to NLP. In my view, the instructor could introduce transfer learning for NLP tasks, or other advanced techniques, and give references for students who want to explore more.

Another issue is that the course teaches the usage of multiple classifiers but only covers Naive Bayes in detail. In my experience, the Naive Bayes classifier usually performs poorly, while the random forest classifier consistently performs well, yet the lectures don't even mention it. That is a real gap for a course called Applied Text Mining; perhaps the course was designed years ago and needs an update.

Overall, I would give it a score between 4 and 4.5; 4.25 seems appropriate.

But if you are brand new to text mining, this course is worth 5 stars to you.

You can find the git repo for this course on my GitHub if you want to go through the materials before taking it on Coursera.
