Deep Learning
Beautifully Illustrated: NLP Models from RNN to Transformer
Explaining their complex mathematical formulas with working diagrams
Table of Contents
· Recurrent Neural Networks (RNN)
∘ Vanilla RNN
∘ Long Short-term Memory (LSTM)
∘ Gated Recurrent Unit (GRU)
· RNN Architectures
· Attention
∘ Seq2seq with Attention
∘ Self-attention
∘ Multi-head Attention
· Transformer
∘ Step 1. Adding Positional Encoding to Word Embeddings
∘ Step 2. Encoder: Multi-head Attention and Feed Forward
∘ Step 3. Decoder: (Masked) Multi-head Attention and Feed Forward
∘ Step 4. Classifier
· Wrapping Up
Natural Language Processing (NLP) is a challenging problem in deep learning since computers can't make sense of raw words. To harness their computing power, we need to convert words into vectors before feeding them into a model. The resulting vectors are called word embeddings.
Those embeddings can be used to solve the desired task, such as sentiment classification, text generation, named entity recognition, or machine translation. They are processed in a clever way such that the performance of the model on some tasks becomes on par with that of…
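As a minimal sketch of that conversion (assuming PyTorch and a hypothetical toy vocabulary), raw words are first mapped to integer indices and then looked up in a trainable embedding table:

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary: maps each word to an integer index.
vocab = {"<pad>": 0, "i": 1, "love": 2, "nlp": 3}

# Embedding layer: one trainable vector (here 8-dimensional) per vocabulary entry.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

# Convert a raw sentence into indices, then into word embeddings.
tokens = ["i", "love", "nlp"]
indices = torch.tensor([vocab[t] for t in tokens])  # shape: (3,)
vectors = embedding(indices)                         # shape: (3, 8)

print(vectors.shape)  # torch.Size([3, 8]) — one 8-dim vector per word
```

In practice the vocabulary is much larger and the embedding vectors are learned jointly with the rest of the model (or initialized from pretrained embeddings), but the idea is the same: every word becomes a dense vector the network can operate on.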