The Ultimate Guide to Training BERT from Scratch: The Tokenizer

From Text to Tokens: Your Step-by-Step Guide to BERT Tokenization

Dimitris Poulopoulos
Towards Data Science
13 min read · Sep 6, 2023

Photo by Glen Carrie on Unsplash

Part I, Part III, and Part IV of this story are now live.

Did you know that the way you tokenize text can make or break your language model? Have you ever wanted to tokenize documents in a rare language or a specialized…

Machine Learning Engineer. I talk about AI, MLOps, and Python programming. More about me: www.dimpo.me