Many natural language processing tasks can be solved with keyword-based approaches. For those that can’t, you’ll need more advanced NLP approaches. In this article, learn several ways to go beyond keywords, and discover which are best suited for various tasks.
Keywords Are Simple and Powerful
Keyword-based approaches are excellent baselines for various NLP tasks. Let’s explore some examples.
Phrase Mapping
The absolute simplest way to solve tasks such as document classification (either multiclass or multilabel) is what I call "phrase mapping." For each class, you simply define a set of keywords which, if present, will cause your model to assign the class. A straightforward extension that can reduce false positives is to add a set of negative keywords.
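Here’s a minimal sketch of what phrase mapping can look like in code (the classes and keywords below are invented for illustration):

```python
# A minimal phrase-mapping classifier. The classes and keywords are illustrative.
def phrase_map(text, rules):
    text = text.lower()
    labels = set()
    for label, rule in rules.items():
        has_positive = any(kw in text for kw in rule["positive"])
        has_negative = any(kw in text for kw in rule.get("negative", []))
        if has_positive and not has_negative:
            labels.add(label)
    return labels

rules = {
    "billing": {"positive": ["invoice", "refund", "charge"], "negative": ["charger"]},
    "shipping": {"positive": ["delivery", "shipment", "tracking"]},
}

print(phrase_map("Please refund the duplicate charge on my invoice.", rules))
# {'billing'}
```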
Depending on your task, you might get pretty good recall from this approach. And if you’re lucky and careful, you might get good precision too from the negative keywords. But I generally use this approach not as a baseline model but as a way to generate candidate documents to send to manual annotation.
One drawback of this approach is that it’s time-consuming: you have to manually curate positive and negative keywords for each class.
Bag of Words
Bag of words is the colloquial name for the process of converting your documents into a term-document matrix. A term-document matrix has documents as rows and terms as columns. The matrix is very wide because each term in your vocabulary gets a column. The values in the matrix can be either raw counts (the number of times a term occurs in a document) or normalized counts (such as TF-IDF).
Bag of words is similar to phrase mapping in that the order of words is ignored. And when paired with logistic regression (don’t pair bag of words with a random forest), each term also gets a positive or negative coefficient. Of course, you need labeled data to train the logistic regression model, but assuming you have such data, bag of words is a far more automated way to accomplish basically the same thing that phrase mapping does.
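A sketch of that pairing with scikit-learn might look like this (the toy documents and labels stand in for your own labeled data):

```python
# Bag-of-words baseline: TF-IDF features paired with logistic regression.
# The documents and labels are toy placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "the refund was processed quickly",
    "my package never arrived",
    "charged twice for one order",
    "the package was left at the wrong address",
]
labels = ["billing", "shipping", "billing", "shipping"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)

print(model.predict(["the package still has not arrived"]))
```

The fitted coefficients end up playing the same role as hand-curated positive and negative keywords, without the manual curation.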
Extensions: N-grams and Dimensionality Reduction
Bag of words can be extended to improve accuracy.
- N-grams – Add phrases (n-grams) to your model.
- Dimensionality Reduction – The regular term-document matrix has extremely high dimensionality (one dimension per term in your vocab), and this balloons with n-grams. Most implementations (such as TfidfVectorizer in scikit-learn) support sparse matrices, so you shouldn’t hit any memory or computational limits. But the term-document matrix treats each synonym, misspelling, abbreviation, and verb conjugation as a separate term. Applying dimensionality reduction techniques such as Latent Dirichlet Allocation or Latent Semantic Analysis can solve this problem. Both extensions are sketched below.
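Here’s a rough sketch of both extensions together: word and bigram features, followed by Latent Semantic Analysis (truncated SVD) to collapse the sparse matrix into a handful of dense dimensions. The corpus and component count are placeholders.

```python
# Bag of words with n-grams, then LSA (truncated SVD) for dimensionality reduction.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

docs = [
    "the claim was denied after review",
    "claim approved and payment sent",
    "payment delayed pending further review",
    "denied due to missing paperwork",
]

lsa = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams
    TruncatedSVD(n_components=3),         # dense, lower-dimensional features
)
features = lsa.fit_transform(docs)
print(features.shape)  # (4, 3)
```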
Neural Network Options for NLP
The basic flaw in all of the keyword-based NLP approaches described above is that they all ignore the order of words. Now, there are plenty of tasks in which the order of words isn’t important, and such tasks are well suited for keyword-based approaches.
But when order matters, there are two neural network-based NLP approaches that may be right for your task. One nice thing about these approaches is that pre-trained models are publicly available, though only for some tasks. Even if pre-trained models exist for your task, you might need transfer learning to fine-tune them for your data set. Either way, you can still use the pre-trained model components to transform your documents into vectors that can be used as features for your model.
Word Embeddings and Recurrent Neural Networks
Word embeddings provide a vector for each term. Vectors for terms with similar meanings are close in vector space. The most popular word embeddings are word2vec and GloVe.
Word embeddings can be paired with a recurrent neural network (RNN): each word in your document gets a vector, and those vectors are fed into the RNN in order.
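A rough sketch of that architecture in PyTorch (one framework option among several; the sizes below are arbitrary, and in practice you’d initialize the embedding layer from pre-trained word2vec or GloVe vectors):

```python
# Sketch of an embedding + RNN text classifier (illustrative sizes).
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=100, hidden_dim=64, num_classes=2):
        super().__init__()
        # In practice, load pre-trained word2vec/GloVe weights into this layer.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        vectors = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        _, (last_hidden, _) = self.rnn(vectors)  # final hidden state summarizes the sequence
        return self.classifier(last_hidden[-1])  # (batch, num_classes)

model = RNNClassifier()
dummy_batch = torch.randint(0, 10_000, (8, 20))  # 8 documents, 20 tokens each
print(model(dummy_batch).shape)                  # torch.Size([8, 2])
```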
It’s worth noting that word embeddings are context-independent. The word "science" will always get the same vector regardless of how it’s used in your document. Paired with an RNN, your model will see the context. But if you want to extract feature vectors for your documents that are context-dependent, you might consider the next approach.
BERT and Transformers
BERT and its relatives have largely supplanted the word embedding / RNN approach for many NLP tasks. The pre-trained BERT model can provide a vector for each sentence rather than each word. These language models use the Transformer neural network architecture and are trained on large corpora.
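For example, with the sentence-transformers library (one of several ways to do this), extracting sentence vectors takes a couple of lines:

```python
# Sentence-level vectors from a pre-trained Transformer checkpoint.
# The model name is one common choice; any compatible checkpoint works.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The claimant reported a back injury.",
    "A back injury was reported by the claimant.",
    "Payment was issued last Tuesday.",
]
vectors = model.encode(sentences)
print(vectors.shape)  # one dense vector per sentence, e.g. (3, 384)
```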
Gain Control with Linguistics
The approaches described above, both keyword-based and neural network-based, completely ignore linguistic features. I got by for several years in my data science and NLP career without ever considering part-of-speech tagging, dependency parsing, or named entity recognition. But I recently met a problem that finally exceeded the limits of non-linguistic NLP: insurance claims.
Insurance claim notes are full of jargon and semi-structured information (think phone numbers, injury descriptions, vehicle info, etc.). And the insurance claims process is incredibly complex, causing the salient information to be a needle in a haystack. Our team tried almost every approach described above to extract features from the notes. All of them failed.
Then we tried spaCy.
spaCy Makes Linguistic Features Accessible to All
Thanks to spaCy (and NLTK before it), you don’t need a degree in linguistics to solve the hardest NLP problems. spaCy provides pre-trained models that do the linguistic heavy lifting for you, including part-of-speech tagging, dependency parsing, and named entity recognition.
Let’s take a look at how to apply a spaCy pipeline to a text document.
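Assuming you’ve downloaded a pre-trained English model (python -m spacy download en_core_web_sm), it takes just a few lines; the example text is a placeholder.

```python
import spacy

# Load a pre-trained English pipeline (tagger, parser, named entity recognizer).
nlp = spacy.load("en_core_web_sm")

# Run the full pipeline on a document (placeholder text).
doc = nlp("The claimant called on Monday to report a minor back injury.")
```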
Doesn’t get much easier than that.
The resulting doc object is an instance of the spaCy Doc class, which is full of linguistic goodness. Let’s take a look at the parsed dependency tree using displaCy.
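In a Jupyter notebook that’s a one-liner (use displacy.serve from a script):

```python
from spacy import displacy

# Visualize the dependency parse of the document.
displacy.render(doc, style="dep")
```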

Don’t worry if this is Greek to you (it is to me, too). I’ll show you how to use it for feature extraction in a bit.
Let’s take a look at the named entity recognition output for a more complex sentence.
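Continuing with the same pipeline (the sentence below is just an illustration):

```python
# Named entities detected by the pre-trained model (illustrative sentence).
doc2 = nlp("Jane Smith filed a claim with Acme Insurance in Chicago on March 3, 2020 for $4,500.")
for ent in doc2.ents:
    print(ent.text, ent.label_)
# e.g. Jane Smith PERSON, Acme Insurance ORG, Chicago GPE, March 3, 2020 DATE, $4,500 MONEY
```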

Lastly, let’s organize some of the token metadata to make it easy for us to write grammar rules later. (Check out all the attributes available in spaCy Tokens here).
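A pandas DataFrame is one convenient way to lay the attributes out; the particular columns chosen here are just one reasonable selection.

```python
import pandas as pd

# Collect a few useful Token attributes into a table (one row per token).
token_table = pd.DataFrame(
    [
        {
            "text": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "dep": token.dep_,
            "head": token.head.text,
        }
        for token in doc
    ]
)
print(token_table)
```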

Writing Grammar Rules with spaCy
So what’s the use of all these linguistic features? Well, they allow you to write deterministic rules based on grammatical structure. This is especially important when terms are negated, as in the example above.
Suppose you work for Beyond Meat, and you want to understand consumers’ relationship with your products. If you analyzed the sentence below using a bag of words / keyword approach, you wouldn’t get very far.
He does not eat meat, but he loves Beyond Burgers.
Let me demonstrate by randomizing the order of the words (remember, bag of words doesn’t care about order).
Beyond does loves not he meat, He but Burgers. eat
Almost poetic, and certainly hilarious. But it shows the limits of bag of words.
Now let’s use spaCy to write a nice grammar rule that will tell us the verb associated with Beyond Burgers and its polarity.
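Here’s a sketch of one way to write such a rule (not necessarily the exact implementation from the original project): find the mention of the product, climb the dependency tree to its governing verb, and check whether that verb has a negation child.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("He does not eat meat, but he loves Beyond Burgers.")

for token in doc:
    if token.text == "Burgers":
        # Climb the dependency tree until we reach the governing verb.
        verb = token.head
        while verb.pos_ != "VERB" and verb.head is not verb:
            verb = verb.head
        # Polarity flips to -1 if the verb has a negation ("neg") child.
        polarity = -1 if any(child.dep_ == "neg" for child in verb.children) else 1
        print(f"({verb.text}, {polarity})")
```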
(loves, 1)
We’ve extracted that someone indeed "loves" Beyond Burgers, even though the sentence has the word "not" early on.
Detecting Custom Entities with spaCy
You can also write rules to detect custom entities, such as Beyond Burgers. Here’s a simple example.
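Here’s a sketch using spaCy’s EntityRuler (v3 API; the entity label is an arbitrary choice):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Add a rule-based entity matcher ahead of the statistical NER component.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "PRODUCT", "pattern": "Beyond Burgers"}])

doc = nlp("He does not eat meat, but he loves Beyond Burgers.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Beyond Burgers', 'PRODUCT')]
```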

The EntityRuler class is convenient for simple rules, but I find that I almost always need greater flexibility. I also like to add extensions to spans and tokens. This tutorial has an example of a more customized pipeline step.
Annotation and Fine-Tuning with spaCy
As a final note, the pre-trained NER models in spaCy can be fine-tuned or extended by annotating your own data. Explosion AI, the makers of spaCy, also provide an annotation tool called Prodigy for this purpose.
spaCy Results for Insurance Claims
Let’s return quickly to the insurance claim use case mentioned previously. After trying and failing with every other approach we could think of, writing grammar rules and custom NER in spaCy finally allowed us to solve our task with near-perfect accuracy. On nearly 250 annotated documents, our rules were correct 100% of the time. That’s an in-sample / training data metric, but it shows just how controllable and flexible spaCy is.
Recap
To recap, keyword-based approaches to NLP are a great place to start. But when you need to go beyond keywords, word embeddings and Transformers may be helpful if your task already has pre-trained models. Otherwise, utilizing the linguistic features provided by spaCy will give you high flexibility and control, enabling you to achieve extremely high accuracy for your NLP problem.