
When you decide to learn a new skill, there are often multiple challenges you need to overcome before you master it. You need a solid idea of what you have to do and learn; you need to know what resources you can use; and you need to be able to distinguish good resources from bad ones so you don’t waste your time on the wrong ones.
Perhaps the toughest step you’ll need to take is learning the skill’s language. If we talk specifically about natural language processing (NLP) as the new skill you’re aiming to learn, then you need to learn the language of the field before you even start going through tutorials and videos.
When developers and data scientists write online tutorials or create videos about a specific topic, there’s always the assumption that if you’re here – at the tutorial or video – then you have an idea of what’s going on. So they will use technical terms assuming that the reader or watcher knows what they mean.
If you don’t know what these terms mean, though, reading or watching these tutorials can be quite a hassle, because you will need to stop, Google the term, and then get back to the tutorial. As you may imagine, or may have experienced, that is not ideal.
My goal in writing this article is to offer a one-stop reference where anyone curious about natural language processing can learn the meaning of the field’s most commonly used terms. Once you do, you will be able to read any article or watch any video about natural language processing with minimal confusion.
Let’s dig in…
№1: Corpus
Natural language processing is a unique field that combines computer science, data science, and linguistics to enable computers to understand and use human languages. From that perspective, corpus – Latin for body – is a term used to refer to a body of text. The plural form of the word is corpora.
This text can contain one or more languages and can be either written or transcribed spoken language. Corpora can have a specific theme or can be generalized text. Either way, corpora are used for statistical linguistic analysis and linguistic computing.
If you’re using Python to build your projects, the Gensim package can help you construct corpora, for example from Wikipedia articles.
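As a quick illustration, here is a minimal sketch of building a small bag-of-words corpus with Gensim; the two example documents are made up for illustration:

```python
# A minimal sketch: build a bag-of-words corpus with Gensim
# from two made-up example documents.
from gensim import corpora

documents = [
    "Natural language processing combines computer science and linguistics",
    "A corpus is a body of text used for linguistic analysis",
]

# Tokenize each document with a simple whitespace split.
texts = [doc.lower().split() for doc in documents]

# Map every unique token to an integer id.
dictionary = corpora.Dictionary(texts)

# Represent each document as a list of (token_id, count) pairs.
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)
```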
№2: Stemming
In natural language processing, stemming is a technique used to reduce a word to its root form by removing its affixes, such as prefixes and suffixes. The main purpose of stemming is to make it easier for an algorithm to look for and extract useful information from a huge source of text, like the internet or big data.
Various algorithms are used to perform stemming, including:
- Lookup tables. A table that lists all possible variations of every word (similar to a dictionary).
- Stripping suffixes. Removing suffixes from a word to recover its root form.
- Stochastic modeling. An algorithm that learns the grammatical rules governing suffixes and uses them to extract the roots of words it hasn’t seen before.
You can perform stemming in Python by using predefined methods in the NLTK package.
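For example, here is a quick sketch using NLTK’s Porter stemmer (the sample words are arbitrary, and the exact output can vary between stemmers):

```python
# A quick sketch of stemming with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "flies", "studies", "connection"]:
    print(word, "->", stemmer.stem(word))
# e.g. running -> run, studies -> studi, connection -> connect
```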
№3: Lemmatization
Although stemming is a good approach for extracting word roots, sometimes removing affixes is not enough to obtain a word’s correct root. For example, if I use a stemmer to get the root of paid, it may give me pai, which is incorrect.
The downsides of stemmers often appear when dealing with irregular words that don’t follow standard grammar rules. This is where lemmatization comes in.
Lemmatization refers to extracting a word’s dictionary form, also known as the lemma, while taking its context into account. So, in our previous example, a lemmatizer will return pay or paid based on how the word is used in the sentence.
The NLTK package also offers methods that can be used to extract the lemma of a word.
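Here is a minimal sketch using NLTK’s WordNet lemmatizer; note that it needs the WordNet data downloaded, and the pos argument tells it how the word is used:

```python
# A minimal sketch of lemmatization with NLTK's WordNet lemmatizer.
# Requires the WordNet data: nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("paid", pos="v"))     # pay
print(lemmatizer.lemmatize("studies", pos="n"))  # study
```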

№4: Tokenization
In natural language processing, tokenization is the process of chopping a sentence up into individual words, or tokens. In the process of forming tokens, punctuation and special characters are often removed entirely.
Tokens are constructed from a specific body of text to be used for statistical analysis and processing. It’s worth mentioning that a token doesn’t necessarily need to be one word; for example, "rock ‘n’ roll," "3-D printer" are tokens, and they are constructed from multiple words.
To put it simply, tokenization is a technique used to simplify a corpus to prepare it for the next stages of processing.
In Python, the NLTK package offers methods to perform tokenization, such as sent_tokenize and word_tokenize. Moreover, NLTK offers tokenizers for other languages besides English.
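For example, a short sketch of both tokenizers on a made-up sentence:

```python
# A short sketch of sentence and word tokenization with NLTK.
# Requires the Punkt tokenizer data: nltk.download("punkt")
from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLP is fun. It combines linguistics and computer science!"
print(sent_tokenize(text))  # ['NLP is fun.', 'It combines linguistics and computer science!']
print(word_tokenize(text))  # ['NLP', 'is', 'fun', '.', 'It', ...]
```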
№5: Lexicons
When presented with a natural language processing task, we need to consider more than just the language. We need to consider how some terms can be used in a specific context to mean something in particular. For example, "line of scrimmage," "kicker," and "blitz" are terms used to describe different aspects of American football.
In linguistics and NLP, a lexicon is the part of a language’s grammar that includes all of its lexical entries. A lexical entry captures the meanings a word can take when it’s used in different situations and contexts.
Lexicons are essential for getting more accurate results out of your natural language processing models. For example, when conducting sentiment analysis on tweets, knowing the topic of the tweets and the colloquial ways people describe things can make a big difference in the results.
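To make this concrete, here is a hedged example of a lexicon in action: NLTK ships with the VADER sentiment lexicon, which assigns polarity scores to words, and the example tweet below is made up:

```python
# A small example of using a sentiment lexicon (VADER) through NLTK.
# Requires: nltk.download("vader_lexicon")
from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
print(analyzer.polarity_scores("What a great game, the kicker nailed it!"))
# {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```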
№6: Word Embeddings
Computers don’t understand words, so if we want them to analyze and use our languages properly, we need to represent words in a form they can work with. Analyzing raw text is challenging, while numbers are much easier for algorithms and computers to handle.
In natural language processing, word embedding is a technique used to map words to vectors of real numbers for analysis purposes. Once these vectors are formed, they can be used to train models, build neural networks, and apply deep learning techniques.
Various algorithms can be used to implement word embedding, mainly:
- Embedding layer. A layer placed at the front of a neural network that learns word embeddings as the network trains. The corpus must be cleaned and prepared before being fed into this layer.
- Word2vec. A statistical technique that learns word embeddings efficiently from a corpus (a short sketch follows this list).
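Here is a minimal sketch of training word embeddings with Gensim’s Word2vec on a tiny toy corpus; the parameter names follow Gensim 4.x (older versions use size instead of vector_size):

```python
# A minimal sketch of training word embeddings with Gensim's Word2Vec
# on a tiny, made-up corpus.
from gensim.models import Word2Vec

sentences = [
    ["natural", "language", "processing", "is", "fun"],
    ["word", "embeddings", "map", "words", "to", "vectors"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)

# Each word is now represented by a 50-dimensional vector.
print(model.wv["language"].shape)  # (50,)
```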
№7: N-grams
In text analysis tasks, n-grams refer to dividing the corpus into chunks of n words. These chunks are constructed by sliding a window forward one word at a time. When n = 1, we use the term unigrams instead of 1-grams. When n = 2, we call them bigrams, and when n = 3, trigrams.
You can calculate how many n-grams a sentence will have using a simple equation: number of n-grams = x - n + 1, where x is the number of words in the sentence and n is the desired gram size.
In Python, it is relatively easy to write a function that constructs the n-grams of a sentence. However, if you don’t feel like implementing it yourself, the NLTK and TextBlob packages offer methods that can generate n-grams for you.
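For instance, here is a quick sketch of generating bigrams with NLTK; note that the five-word sentence yields 5 - 2 + 1 = 4 bigrams, matching the equation above:

```python
# A quick sketch of generating n-grams with NLTK.
from nltk import ngrams
from nltk.tokenize import word_tokenize

sentence = "I love natural language processing"
tokens = word_tokenize(sentence)  # or simply sentence.split()

bigrams = list(ngrams(tokens, 2))
print(bigrams)
# [('I', 'love'), ('love', 'natural'), ('natural', 'language'), ('language', 'processing')]
```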

№8: Normalization
When we want to analyze text for any purpose, the analysis can be much more accurate if the text we are using is in a standard format. Putting text in a standard format is what’s called normalization. For example, if we search within a text, it is better if the entire text is in either upper or lower case.
Normalization is often conducted after tokenizing a text and a query. We may have two phrases that are similar but not 100% identical, such as USA and U.S.A., and we want the model to match these two terms despite the small differences.
Normalizing a text can have both good and not-so-good effects on your natural language processing model. On one hand, normalizing leads to better matching in search tasks. On the other hand, converting everything to lowercase or uppercase can throw away information (for example, the difference between US and us) and hurt the overall application’s reliability.
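As a simple sketch, a normalization step could be as small as lowercasing tokens and stripping periods so that variants like USA and U.S.A. map to the same form (the helper below is hypothetical):

```python
# A simple, hypothetical normalization helper: lowercase and strip periods.
def normalize(token: str) -> str:
    return token.lower().replace(".", "")

print(normalize("U.S.A."))  # usa
print(normalize("USA"))     # usa
```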
№9: Named Entity Recognition (NER)
In any natural language processing task, we are often asked to read, clean, and analyze a huge corpus. That’s why most of the terms in this list are techniques that can make the analysis easier and more efficient.
Named-entity recognition is another natural language processing technique that extracts more information from a text by labeling words with predefined categories such as person, place, time, email, etc.

Performing NER can make further analysis of the text more accurate. You can perform NER in Python using packages such as spaCy and NLTK.
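Here is a hedged sketch of NER with spaCy, assuming the small English model is installed (python -m spacy download en_core_web_sm); the sentence is just an example:

```python
# A sketch of named entity recognition with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple was founded by Steve Jobs in California in 1976.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Apple ORG, Steve Jobs PERSON, California GPE, 1976 DATE
```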
№10: Parts-of-speech (POS) Tagging
Another useful analysis technique is identifying the different parts of speech within a specific text or sentence. POS tagging results in a list of tuples, where each tuple contains a word and its tag. The tag describes the word’s part of speech: whether it is a verb, noun, adjective, and so on.
In most applications, we initially use a default tagger to get basic POS tagging that we can then enhance. The NLTK package offers a default tagger that can give you the basic tagging of any text.
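For example, a short sketch of NLTK’s default tagger on a made-up sentence (it needs the punkt and averaged_perceptron_tagger data downloaded):

```python
# A short sketch of POS tagging with NLTK's default tagger.
# Requires: nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")
from nltk import pos_tag, word_tokenize

print(pos_tag(word_tokenize("The quick brown fox jumps over the lazy dog")))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', ...), ('fox', 'NN'), ...]
```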

Takeaways
Every field has its own set of terminology that people in the field use to describe specific processes and steps, making it easier to communicate with each other and explain their work efficiently.
Sometimes these terms are words you may have come across before and whose general meaning you know, but that meaning may not be 100% accurate for this specific field. Other times, they are a set of words put together to point at something new.
Regardless of their origin, understanding these terminologies is an essential step towards understanding the field, reading any resources discussing the field, and eventually mastering that field.
This article presented you with the basic terminology of natural language processing that you will find in most articles and videos describing any aspect of the field. Hopefully, knowing the meaning of these terms will make it easier for you to engage with the resources, build new projects, advance in your learning journey, and land your dream career.