Natural Language Processing is one of the hottest fields in AI today. Although most of the concepts seem hard to grasp, these fundamentals should help you get started in the field and become prepared for more advanced topics.

Natural Language Processing (NLP) is one of the hottest fields in Artificial Intelligence nowadays. Technological innovations and huge breakthroughs in the field enable computers to understand human language to a level we did not think was possible a couple of decades ago.
Some of these technological breakthroughs enable you to write a book in less than 24 hours or build your own chatbot.
For beginners, the field might seem confusing and overwhelming. Where should you start if you want a career in NLP? What should you prioritize when building your study plan? Should you jump straight into Neural Networks? Should you learn text normalization first? What fundamentals should you study before moving on to more advanced topics?
Don’t you worry! This post will guide you through some of the fundamentals of NLP and will show you six fundamental skills to help you kick-start your Natural Language Processing career.
Let’s start!
Basic Text Processing
One of the first things to learn is how your programming language of choice handles text. Dealing with strings should be second nature to you – manipulating text back and forth, using regular expressions, and slicing strings are some of the skills you should master before working in Natural Language Processing.
Why is that? The probability that your corpus (the piece of text you will use to build your NLP application) arrives clean and ready to analyze is really low. Real-life data is pretty messy, and you will probably have to perform cleaning tasks such as lower-casing words or removing white space.
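To make this concrete, here is a minimal sketch of those clean-up steps – the sample text and the `clean_text` helper are made up for illustration, and real pipelines will need more (and different) rules:

```python
# Toy example of common string clean-up steps applied to a messy snippet.
import re

def clean_text(text):
    text = text.lower()                    # lower-case everything
    text = re.sub(r"<[^>]+>", " ", text)   # strip leftover HTML tags
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()

raw = "  <p>Natural   Language Processing</p>\nis FUN!  "
print(clean_text(raw))  # natural language processing is fun!
```

Three lines of regular expressions already cover lower-casing, markup removal, and whitespace normalization – which is why string handling pays off so quickly.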
Where should you start? The Python string library and string methods are great resources to start working with strings. Also, Python is considered one of the most important languages in NLP, so getting some proficiency in it will definitely help you.
Bottom line: understanding these basic string manipulation techniques will save you a huge amount of headaches and blunders in your NLP journey.
NLTK Library
Yes, I know – NLTK, short for Natural Language Toolkit, is one of the oldest Natural Language Processing libraries out there. But, believe me, the library, which had its first release 20 years ago, is still one of the best resources for learning the NLP-specific fundamentals. Some of the resources neatly implemented in the library are:
- Stemmers that range from simple to more advanced ones.
- Tokenizers that let you split your corpus into sentences or words.
- Part-Of-Speech taggers – off-the-shelf model versions and customized frequency taggers.
- Word lemmatizers.
- N-gram utilities.
These concepts are fundamental to understand text normalization and text processing in most NLP applications. Understanding the NLTK library will let you develop the necessary skills to build an NLP pipeline from scratch. Even if you don’t end up using these techniques in your NLP pipelines, it is always a great asset to add these tools to your tool belt.
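As a small taste, here is what stemming and n-grams look like in NLTK (assuming `pip install nltk`; tokenizers such as `word_tokenize` additionally need `nltk.download("punkt")`, so this sketch sticks to components that work out of the box):

```python
# Stemming and bigram extraction with NLTK's out-of-the-box components.
from nltk.stem import PorterStemmer
from nltk.util import ngrams

stemmer = PorterStemmer()
words = "the cats were running quickly".split()

# Reduce each word to its stem
print([stemmer.stem(w) for w in words])
# ['the', 'cat', 'were', 'run', 'quickli']

# Slide a window of size 2 over the tokens to get bigrams
print(list(ngrams(words, 2)))
# [('the', 'cats'), ('cats', 'were'), ('were', 'running'), ('running', 'quickly')]
```

Notice that the Porter stemmer produces stems like `quickli` that are not real words – stems are meant to group related word forms, not to be readable, which is exactly the kind of detail you pick up by playing with the library.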
Alternatively, spaCy is a good substitute for NLTK – it is more recent and adds some great functionality on top of what NLTK offers – nevertheless, to learn the fundamentals, you will be fine with either.
Reading Text Data
The amount of text data flowing on the web has increased exponentially in the last decade. Apart from obtaining data from the web, NLP practitioners (like most data scientists) have to deal with files in many different formats.
Knowing how to read text data from multiple sources is an important skill for anyone who wants to work in NLP – for example, CSV and JSON files are common formats for text corpora that need to be ingested into your workspace before you can proceed with your NLP application.
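Both formats are covered by Python's standard library. A minimal sketch, with inline strings standing in for real files:

```python
# Ingesting text corpora from CSV and JSON using only the standard library.
import csv
import io
import json

# CSV: each row becomes a dict keyed by the header line
csv_data = io.StringIO("id,review\n1,Great movie!\n2,Terrible plot.")
rows = list(csv.DictReader(csv_data))
print(rows[0]["review"])  # Great movie!

# JSON: parse the document and pull out the list of texts
json_data = '{"documents": ["Great movie!", "Terrible plot."]}'
docs = json.loads(json_data)["documents"]
print(len(docs))  # 2
```

With real files you would swap the `io.StringIO` object for `open("corpus.csv", encoding="utf-8")` – encoding issues are another classic source of pain when reading text data.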
Another ultra-important skill is mastering web scraping – and, with it, understanding HTML structure. By doing that, you get immediate access to the millions and millions of words scattered around the web – a lot of the corpora used in NLP tasks are obtained through a combination of legitimate requests to websites and parsing of the returned HTML. There are a ton of libraries that will help you achieve this, but Requests and BeautifulSoup should be your first stop in Python.
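Here is a small sketch of the BeautifulSoup half of that workflow (assuming `pip install beautifulsoup4`). An inline snippet stands in for a live page; in practice you would fetch the page first, e.g. `html = requests.get(url).text`:

```python
# Extracting paragraph text from an HTML document with BeautifulSoup.
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Corpus sample</h1>
  <p>First paragraph of text.</p>
  <p>Second paragraph of text.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)  # ['First paragraph of text.', 'Second paragraph of text.']
```

Knowing which tags hold the content you want (`<p>`, `<h1>`, `<div>` with a given class, and so on) is exactly why understanding HTML structure matters so much for scraping.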
Word Vectors (and Neural Networks)
Word vectors are one of the most important techniques used in NLP today – they are also really useful for understanding how Artificial Neural Networks are applied in the context of NLP.
It was soon understood that representing words as one-hot vectors leads to many troubles and limitations – the research breakthrough (made famous by the word2vec paper) enables practitioners to build word representations that bring context and meaning into numerical form.
Think of it this way: with word vectors, computers can store vectors that represent a word in a specific context (you can drop by my article on the intuition behind word vectors here). They are an important step in bringing the logic of human language to computers – enabling them to relate words to their meaning and even perform mathematical operations on vectors that convey analogies or other linguistic relationships.
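A toy illustration of the idea – the 3-dimensional vectors below are hand-crafted for this example (real models learn hundreds of dimensions from data), but they are enough to show the famous `king - man + woman ≈ queen` analogy under cosine similarity:

```python
# Hand-crafted toy word vectors and the classic vector-arithmetic analogy.
import math

vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.5, 0.9, 0.0],
    "woman": [0.5, 0.1, 0.9],
    "apple": [0.0, 0.1, 0.1],
}

def cosine(u, v):
    # cosine similarity: dot product divided by the product of norms
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# king - man + woman should land near queen
analogy = [k - m + w for k, m, w in
           zip(vectors["king"], vectors["man"], vectors["woman"])]
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(analogy, vectors[w]))
print(best)  # queen
```

The same arithmetic on one-hot vectors would be meaningless, since every pair of distinct one-hot words has cosine similarity zero – which is precisely the limitation dense word vectors solve.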
Studying and understanding word vectors is important not only for NLP but also for Machine Learning in general. By learning them, you will be exposed to the inner workings of Neural Networks, one of the most important model families in ML today. You will come across concepts such as backpropagation, weight optimization, activation functions, and gradient descent – this should give you a good head start on building and running Neural Network models of your own.
Recurrent Neural Networks
Text generation is another sub-field of Natural Language Processing that has achieved massive breakthroughs with the application of Neural Networks.
The architecture of Neural Networks used in text generation comes with a different flavor than the one used for Word Vectors or Text Classification – called Recurrent Neural Networks, these NNs contain mechanisms to store and update state, which suits chained data such as sentences.
By studying text generation, you will be able to understand several concepts:
- The problem of the vanishing gradient and how different architectures of Neural Networks solve that problem.
- The state of the art in terms of text generation.
- Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) as ways to improve the performance of text generation models.
Learning about artificial and recurrent neural networks will give you a good grasp of these types of models. You will be able to understand the need for multiple NN architectures and why there isn't a one-size-fits-all solution when it comes to NLP.
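The recurrence itself fits in a few lines. This is a deliberately stripped-down sketch with made-up scalar weights (a real RNN learns weight matrices via backpropagation through time), just to show how a hidden state is carried along a sequence and updated at every step:

```python
# Minimal sketch of the recurrent idea: one hidden state, updated per step.
import math

def rnn_step(h_prev, x, w_h=0.5, w_x=1.0, b=0.0):
    # The new hidden state mixes the previous state with the current input,
    # squashed through tanh so it stays in (-1, 1).
    return math.tanh(w_h * h_prev + w_x * x + b)

h = 0.0                       # initial hidden state
for x in [1.0, 0.5, -0.3]:    # a toy input sequence
    h = rnn_step(h, x)
print(h)
```

Because the same `tanh(w_h * h + ...)` update is applied over and over, gradients flowing backwards through long sequences get multiplied by small factors repeatedly – which is exactly the vanishing-gradient problem that LSTM and GRU cells were designed to mitigate.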
Text Classification
Finally, one has to build those fancy sentiment classifiers, right? 🙂
Text classification lets you sort text into classes using predictive models. Tree-based models, Naive Bayes classifiers, and Neural Networks are some of the most commonly used models to classify text into specific buckets. Some common uses of text classification are:
- Sentiment analysis;
- Spam detection;
- Text categorization.
By learning text classification techniques, you will also be applying several of the fundamentals we've discussed before, such as text normalization, n-grams, or stemming.
A good place to start is building a simple sentiment classifier for tweets – it should give you a good head start on understanding how to build classification models.
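To show how little machinery a first classifier needs, here is a toy Naive Bayes sentiment classifier built from scratch – the five training "tweets" are made up, and a real system would train on thousands of labelled, properly normalized examples:

```python
# Toy Naive Bayes sentiment classifier with Laplace smoothing.
import math
from collections import Counter

train = [
    ("i love this movie", "pos"),
    ("what a great film", "pos"),
    ("absolutely wonderful acting", "pos"),
    ("i hate this movie", "neg"),
    ("what a terrible film", "neg"),
]

word_counts = {"pos": Counter(), "neg": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    scores = {}
    for label in class_counts:
        # log prior of the class...
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            # ...plus the smoothed log likelihood of each word (add-one)
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("i love this film"))          # pos
print(predict("terrible terrible acting"))  # neg
```

The model simply counts words per class and multiplies (smoothed) probabilities – yet this "naive" independence assumption is strong enough to make it a standard baseline for sentiment analysis and spam detection.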
And that’s it! These fundamentals should help you understand most conversations around NLP and give you a good base for tackling more advanced, state-of-the-art topics. The field will continue to progress, but these fundamentals should be around for quite some time, as they are still being applied and developed in NLP pipelines around the world.
Do you think something is missing? Write it down in the comments below – I would love to hear your opinion.
_I’ve set up these fundamentals in a Udemy course — the course is suitable for beginners and I would love to have you around. The course also contains more than 50 coding exercises that enable you to practice as you learn these new concepts._