Getting Started with NLTK in Python
Exploring some of the most common functions and techniques we can use to develop basic NLP pipelines.

NLTK (Natural Language Toolkit) is one of the first implementations of Natural Language Processing techniques in Python. Although it may seem a bit dated and faces some competition from other libraries (spaCy, for instance), I still find NLTK a really gentle introduction to text methods in Python.
At first, using NLTK may seem a bit strange due to its multitude of methods, particularly for Python beginners. But it is actually one of the more convenient libraries for getting started with simple NLP tasks, as it has simple one-liner methods you can call to perform some cool text transformations.
Of course, you shouldn’t expect to train state-of-the-art models using NLTK. Nevertheless, the library gives you a lot of tools for pet projects and for developing small-scale NLP work. Additionally, it’s also one of the best libraries for getting your first contact with NLP techniques.
This post will give you some brief explanations and examples of NLTK that you can use right away. We’ll go through code examples of the NLTK functions and discuss some alternatives along the way.
To make this code easily replicable, we’ll use the first couple of paragraphs from Python’s Wikipedia page:
python_wiki = '''
Python is a high-level, interpreted, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.
Python is dynamically-typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.
Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0.[33] Python 2.0 was released in 2000 and introduced new features such as list comprehensions, cycle-detecting garbage collection, reference counting, and Unicode support. Python 3.0, released in 2008, was a major revision that is not completely backward-compatible with earlier versions. Python 2 was discontinued with version 2.7.18 in 2020.
Python consistently ranks as one of the most popular programming languages.
'''
Word Tokenizers
Naturally, computers can’t really make sense of huge chunks of text without some type of transformation. Tokenization is the first transformation that comes to mind when working on NLP pipelines.
"Tokens" is a really common expression in the NLP industry. Token is a subset of some particular text or a breakdown of the whole text. For instance, isolated words!
As there are multiple ways to chunk words out of text nltk
has some cool different implementations regarding tokenizers, namely:
- Whitespace Tokenizer
- TreeBank Tokenizer
Tokenizers split our sentences into tokens. These tokens can then be fed into multiple word representation algorithms such as tf-idf, binary or count vectorizers. Let’s start with the simplest one, the whitespace tokenizer, which splits the text based on the blank spaces between words:
from nltk import tokenize
ws_tok = tokenize.WhitespaceTokenizer()
token_list = ws_tok.tokenize(python_wiki)
print(token_list[0:10])
Breaking down our code above:
- `from nltk import tokenize` – we start by importing the general `tokenize` module, which contains different implementations of tokenizers.
- We define an instance of `WhitespaceTokenizer` inside `ws_tok`.
- We use the `ws_tok` instance to tokenize our `python_wiki` text.
The `print` statement yields the following:
['Python', 'is', 'a', 'high-level,', 'interpreted,', 'general-purpose', 'programming', 'language.', 'Its', 'design']
One of the main issues with `WhitespaceTokenizer` is that, by slicing the text at every blank space, it ends up forming tokens that mix words and punctuation.
For instance, `interpreted,` is considered a single token – notice the comma attached to that word. If another `interpreted` appeared without the comma, it would be considered a completely different token. This is a significant setback, as words don’t really change their meaning just because they sit next to commas, full stops or other punctuation.
Luckily, we have other tokenizers available, such as the Treebank tokenizer. Taken from the official documentation, this tokenizer performs the following steps:
- split standard contractions, e.g. `don't` -> `do n't` and `they'll` -> `they 'll` (see the quick check right after this list)
- treat most punctuation characters as separate tokens – split off commas and single quotes, when followed by whitespace
- separate periods that appear at the end of a line
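Before applying it to our text, here’s a quick sketch of the contraction rule in action, reusing the `tokenize` module imported earlier (the `tb_demo` name is just illustrative):
tb_demo = tokenize.treebank.TreebankWordTokenizer()
# per the documentation above, "don't" should be split into "do" and "n't"
print(tb_demo.tokenize("They don't need it"))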
Let’s do the same test with our `python_wiki` text:
tb_tokenizer = tokenize.treebank.TreebankWordTokenizer()
token_list = tb_tokenizer.tokenize(python_wiki)
print(token_list[0:10])
Let’s check the print:
['Python', 'is', 'a', 'high-level', ',', 'interpreted', ',', 'general-purpose', 'programming', 'language.']
Cool! Some of the issues were solved. Notice that `language.` is still a token that combines word and punctuation. Luckily, this is solved with the default `word_tokenize` function, which combines the `TreebankWordTokenizer` and the `PunktSentenceTokenizer` (one of the tokenizers we didn’t test).
from nltk import word_tokenize
# note: word_tokenize relies on the 'punkt' tokenizer models – run nltk.download('punkt') if needed
token_list = word_tokenize(python_wiki)
print(token_list[0:10])
Let’s see our first ten tokens:
['Python', 'is', 'a', 'high-level', ',', 'interpreted', ',', 'general-purpose', 'programming', 'language']
Cool! NLTK’s default `word_tokenize` is the one that seems to isolate tokens really well, as it doesn’t attach any punctuation to our words in a single token.
As a summary, these tokenizers are great ways to split our text into tokens that we can later feed into other applications, such as a simple sentiment analysis or word vectors. Our experiments were useful to show that there isn’t a single "tokenizer". Although one can argue that `word_tokenize` produces the best results, there are other alternatives and implementations you can try inside `nltk`.
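As a tiny illustration of feeding tokens into a downstream representation, here’s a minimal sketch of a count-based (bag-of-words) representation using Python’s built-in `collections.Counter` rather than a dedicated vectorizer:
from collections import Counter
from nltk import word_tokenize

# simple bag-of-words: count how often each token appears in the text
token_counts = Counter(word_tokenize(python_wiki))
print(token_counts.most_common(5))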
Stemming
Reducing the variance of your text may be one of the first experiments you do when trying to work with text data.
Text data is, inherently, high dimensional – there are literally hundreds of thousands of words in the English language, and the dimensionality is similar for other languages.
Stemming may be a good choice to reduce the size of your text – stemming acts by cutting some suffixes from words. Stemming is a bit of a "brute-force" technique and is considered quite aggressive when cutting characters from your text. Like tokenizers, stemmers come in different flavors – let’s see some implementations in NLTK, starting with `PorterStemmer`:
from nltk.stem import PorterStemmer
porter = PorterStemmer()
porter_tokens = [porter.stem(token) for token in token_list]
Breaking down each instruction of our code:
- We start by importing `PorterStemmer` from the `nltk.stem` module.
- We define an instance of `PorterStemmer` in the `porter` variable.
- We stem each token using a list comprehension, passing each `token` to the `porter.stem` function.
Let’s print our first 10 stemmed tokens:
['python', 'is', 'a', 'high-level', ',', 'interpret', ',', 'general-purpos', 'program', 'languag']
Our stems are different from our original words in some cases – for instance, the original word `interpreted` is turned into `interpret`. Stemming means that words sharing the same stem will be reduced to the same token, for instance:
- `interpretation` becomes `interpret`
- `interpreted` becomes `interpret`
- `interpret` stays `interpret`
Some words are not stemmed, by definition. `python` stays `python` when we pass it through the stemming algorithm.
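To see these reductions on isolated words, here’s a quick sketch using the `porter` instance defined above:
# stemming a few isolated words with the Porter stemmer
for word in ['interpretation', 'interpreted', 'interpret', 'python']:
    print(word, '->', porter.stem(word))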
A more aggressive stemmer is the `LancasterStemmer` – we can also use it in `nltk`:
from nltk.stem import LancasterStemmer
lanc = LancasterStemmer()
lanc_tokens = [lanc.stem(token) for token in token_list]
Checking our `lanc_tokens`:
['python', 'is', 'a', 'high-level', ',', 'interpret', ',', 'general-purpose', 'program', 'langu']
`LancasterStemmer` will, generally, remove more characters from the text than `PorterStemmer`, but that doesn’t mean that, in certain instances, Porter can’t stem words further than Lancaster. From the example above:
- `language` becomes `langu` in Lancaster and `languag` in Porter.
- In reverse, `general-purpose` is stemmed by Porter but not by Lancaster.
If you check the full text, you will find that Lancaster reduces more of your text, normalizing it further.
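A rough way to check this is to compare how many distinct tokens each stemmer leaves behind – a quick sketch using the `porter_tokens` and `lanc_tokens` lists from the snippets above:
# fewer distinct stems means the text was normalized more aggressively
print('Distinct Porter stems:', len(set(porter_tokens)))
print('Distinct Lancaster stems:', len(set(lanc_tokens)))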
Text normalization is a cool technique to reduce the variance of your input and to normalize similar words that may (or may not) convey the same meaning. Nevertheless, be mindful that every time you apply a stemmer, you are removing information from your corpus.
There is an ongoing discussion about whether stemmers or lemmatizers (more about them next) should, nowadays, be part of most data pipelines, particularly with new neural network techniques being applied to text. One can argue that they will probably become obsolete in the future but, nevertheless, they are still valid when:
- You lack the computational power to train huge numbers of parameters;
- You need to apply simpler models due to explainability or some other underlying cause.
If you want to know more about stemmers, you can check my blog post on the matter here!
Lemmatization
Similar to stemming, lemmatization is another normalization technique with the goal of reducing the variance of your text.
The main difference is that instead of just cutting suffixes off the word, it tries to get to the root of the word, commonly called the lemma.
NLTK also contains an off-the-shelf implementation of a WordNet lemmatizer – WordNet is a very famous lexical database that contains data regarding the relationships between words. Let’s check the `WordNetLemmatizer` module:
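Here’s a minimal sketch of that step, assuming we lemmatize the same `token_list` from the tokenization section (the `lemma_tokens` name is just illustrative):
from nltk.stem import WordNetLemmatizer

# the WordNet data may need to be downloaded first: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
lemma_tokens = [lemmatizer.lemmatize(token) for token in token_list]
print(lemma_tokens[0:10])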
['Python', 'is', 'a', 'high-level', ',', 'interpreted', ',', 'general-purpose', 'programming', 'language']
That’s interesting… it seems our tokens are exactly the same as the originals! The explanation is simple – we need to feed the lemmatizer the part-of-speech tag of the word we are trying to reduce. Let’s pick up the word `programming` from our tokens:
- `lemmatizer.lemmatize('programming')` yields `programming`
- `lemmatizer.lemmatize('programming', pos = 'v')` yields `program`
When we give the `pos` argument, the lemmatizer is able to understand that the word programming is a verb related to the root "program". NLTK’s WordNet lemmatizer accepts 5 `pos` tags:
- ‘n’ for nouns – this is the default value. In practice, if we just call `lemmatizer.lemmatize` without a `pos` argument, we will be considering all words as nouns;
- ‘v’ for verbs;
- ‘a’ for adjectives;
- ‘r’ for adverbs;
- ‘s’ for satellite adjectives – not used very often;
And, of course, it would be extremely inefficient to feed these `pos` tags manually. Luckily, `nltk` has a super cool implementation to retrieve POS tags automatically!
POS-Tagging
As we’ve seen in the lemmatization section, part-of-speech tagging consists of cataloging a word with its function in a sentence.
A specific word has a certain function (commonly called part of speech) in a sentence – as an example, the sentence "I like learning Python" contains 4 words with the following parts of speech:
- The pronoun `I`
- The verbs `like` and `learning`
- The noun `Python`
The tricky bit is that certain words may be written in exactly the same way but have a different function in different sentences:
- I am washing the sink.
- I am going to sink in the water.
The word `sink` is written in exactly the same way in both, but has a different part-of-speech tag. In the first sentence, `sink` is a noun. In the second, it’s a verb.
In NLTK, we have an off-the-shelf POS tagger that we can use and that, luckily, avoids some of these issues:
import nltk
# note: the default tagger may require nltk.download('averaged_perceptron_tagger')
pos_tags = nltk.pos_tag(token_list)
print(pos_tags[0:10])
Our first 10 `pos_tags`:
[('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('high-level', 'JJ'), (',', ','), ('interpreted', 'JJ'), (',', ','), ('general-purpose', 'JJ'), ('programming', 'NN'), ('language', 'NN')]
Contrary to some of the last examples, `nltk.pos_tag` returns a list of tuples. Each tuple contains the word and its corresponding POS tag. Let’s see a couple of examples:
- `Python` is categorized as an `NNP` – a singular proper noun;
- `is` is a `VBZ` – a verb in the third-person singular present tense.
From the preview, we see a lot of tags! Do we have to know them by heart? Of course not! We have two ways to understand what each tag means:
- Look it up in the Penn Treebank POS table.
- Run `nltk.help.upenn_tagset()` with the tag you want to check. For instance, `nltk.help.upenn_tagset('NN')` returns a complete description of the `NN` tag.
Let’s see if this pre-trained pos tagger is able to deal with our "sinking" problem – starting with the first sentence:
print(nltk.pos_tag(['I','am','washing','the','sink']))
This yields:
[('I', 'PRP'), ('am', 'VBP'), ('washing', 'VBG'), ('the', 'DT'), ('sink', 'NN')]
Cool, `sink` is an `NN` here! Let’s see the other sentence:
print(nltk.pos_tag(['I','am','going','to','sink','in','the','water']))
This yields:
[('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('sink', 'VB'), ('in', 'IN'), ('the', 'DT'), ('water', 'NN')]
Awesome! `sink` is a `VB` – a verb!
This is because this pre-trained POS tagger takes the context of the word into account. As `sink` comes right after `to`, the tagger understands that it should be tagged as a verb, as there is hardly a scenario where these two words appear together and `sink` is a noun.
You can also learn more about training your own pos taggers in my NLP Fundamentals Course on Udemy.
…
But wait… these tags don’t match the ones we had to feed to our lemmatization process! What’s going on here?
Luckily, we can convert them really easily with the following function:
def get_lemma_tag(pos_tag):
    # map a Penn Treebank tag to the single-letter WordNet POS tag
    if pos_tag.startswith('J'):
        return 'a'
    elif pos_tag.startswith('V'):
        return 'v'
    elif pos_tag.startswith('N'):
        return 'n'
    elif pos_tag.startswith('R'):
        return 'r'
    else:
        return ''
You just have to check the starting letter of the POS tag and convert it to the single-letter version fed into the lemmatizer – let’s test:
lemmatizer.lemmatize('programming', pos = get_lemma_tag('VBZ'))
This outputs `program`. It worked!
For simplicity, we are ignoring "satellite adjectives" in our function – they are not used that often and would require more complex rules.
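Putting the pieces together, here’s a minimal sketch that lemmatizes the whole `token_list` using the automatically retrieved POS tags – tags that our function maps to an empty string simply fall back to the default (noun) behaviour:
# lemmatize every token using its Penn Treebank tag converted to a WordNet tag
lemmas = []
for token, tag in nltk.pos_tag(token_list):
    wordnet_tag = get_lemma_tag(tag)
    if wordnet_tag:
        lemmas.append(lemmatizer.lemmatize(token, pos=wordnet_tag))
    else:
        lemmas.append(lemmatizer.lemmatize(token))
print(lemmas[0:10])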
POS tags are a really important NLP concept. You’ll probably end up using your own trained taggers or more advanced ones such as spaCy’s implementation. Nevertheless, NLTK’s version is still widely used and can achieve performance that is sufficient for some NLP tasks.
N-Grams
So far, we’ve only considered our tokens as isolated words. In many NLP applications, it’s important to consider words coupled together as a single "token".
Negation, for instance, is a staple example when explaining bi-grams. "Bi-gram" is an NLP way of saying "two consecutive words". Let’s consider the sentence:
- I did not like the theater.
In typical uni-gram fashion, our tokens would be:
- I, did, not, like, the, theater
In this case, it would be relevant to consider "not like" as a single token. Imagine a sentiment analysis application that would check relevant tokens to understand the overall "sentiment" of our text. In this case, it is clearly a negative sentence as the "agent" did not like the theater.
For this use case, it would be useful to understand how many "not like" expressions our text contains. If we just consider an isolated token "like", it would be hard for our algorithms to pick up this "negative sentiment", as the word "not" would also be isolated. Additionally, an isolated "not" does not necessarily represent a negative sentiment in itself – for instance in the sentence "I did not think the theater was bad."
In more advanced models, such as neural networks, the algorithms are able to pick up these bi-grams and tri-grams (three consecutive tokens) by themselves – and that is actually one of their strengths. But for simpler models (naive bayes, regressions, tree-based models) one has to explicitly provide the n-grams as features.
NLTK has a quick implementation of bi-grams and tri-grams that we can pick up:
print(list(nltk.bigrams(token_list)))
Super easy! We just feed our original `token_list` to `nltk.bigrams` and…
[('Python', 'is'),
('is', 'a'),
('a', 'high-level'),
('high-level', ','),
(',', 'interpreted'),
('interpreted', ','),
(',', 'general-purpose'),
('general-purpose', 'programming'),
('programming', 'language'),
('language', '.'),
('.', 'Its'),
This is a preview of the first bi-grams of our text. We now have a list of tuples with every group of two consecutive words from our `python_wiki` text.
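Coming back to the negation example, here’s a quick sketch checking whether the ('not', 'like') bi-gram shows up in a toy sentence:
import nltk
from nltk import word_tokenize

# does the ('not', 'like') bi-gram appear in this sentence?
toy_tokens = word_tokenize('I did not like the theater')
print(('not', 'like') in list(nltk.bigrams(toy_tokens)))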
What if we want three words? That’s easy with `nltk`:
list(nltk.trigrams(token_list))
We have a function called `trigrams`! Let’s see the first 10:
[('Python', 'is', 'a'),
('is', 'a', 'high-level'),
('a', 'high-level', ','),
('high-level', ',', 'interpreted'),
(',', 'interpreted', ','),
('interpreted', ',', 'general-purpose'),
(',', 'general-purpose', 'programming'),
('general-purpose', 'programming', 'language'),
('programming', 'language', '.'),
('language', '.', 'Its'),
Cool! Now we have a list of tuples where each tuple contains a group of three consecutive words from our text. Can we go beyond tri-grams? Yep! With the generalizable `ngrams` function that takes an extra parameter:
list(nltk.ngrams(token_list, 4))
This produces tetra-grams:
[('Python', 'is', 'a', 'high-level'),
('is', 'a', 'high-level', ','),
('a', 'high-level', ',', 'interpreted'),
('high-level', ',', 'interpreted', ','),
(',', 'interpreted', ',', 'general-purpose'),
('interpreted', ',', 'general-purpose', 'programming'),
(',', 'general-purpose', 'programming', 'language'),
('general-purpose', 'programming', 'language', '.'),
('programming', 'language', '.', 'Its'),
('language', '.', 'Its', 'design'),
N-grams are a pretty important concept for your NLP pipelines, particularly when you are dealing with simpler models that may require some clever feature engineering.
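For instance, if you wanted to use the most frequent bi-grams as features, a quick sketch with nltk’s `FreqDist` class (a counter-like helper we haven’t covered above) could look like this:
# count bi-gram occurrences and preview the most frequent ones
bigram_counts = nltk.FreqDist(nltk.bigrams(token_list))
print(bigram_counts.most_common(5))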
And we’re done! NLTK is a really cool library to explore if you want to get started with Natural Language Processing pipelines. Although you shouldn’t expect to build state-of-the-art models using NLTK, getting familiar with the library is an excellent way to introduce yourself to cool NLP concepts.
_I’ve set up a course on learning the fundamentals of Natural Language Processing on Udemy where I introduce students to nltk, word vectors, and neural networks! The course is tailored for beginners and I would love to have you around._

Here’s a small gist with the code of this post: