Pre-processing a Wikipedia dump for NLP model training — a write-up

Downloading, extracting, cleaning and pre-processing a Wikipedia dump for NLP model (e.g. transformers like BERT, RoBERTa, etc.) training

Steven van de Graaf
Towards Data Science


[Image: Wikipedia entry for Bert (from Sesame Street)]

Wikipedia dumps are used frequently in modern NLP research for model training, especially with transformers like BERT, RoBERTa, XLNet, XLM, etc. As such, for any aspiring NLP researcher intent on getting to grips with models like these themselves, this write-up presents a complete picture (and code) of everything involved in downloading, extracting, cleaning and pre-processing a Wikipedia dump.

📥 Downloading a Wikipedia dump

Wikipedia dumps are freely available in multiple formats in many languages. For the English-language Wikipedia, a full list of all available formats of the latest dump can be found at https://dumps.wikimedia.org/enwiki/latest/.

As we’re primarily interested in text data, for the purposes of this write-up, we’ll download such a dump (containing only pages and articles) in compressed XML format, using the code below:

Simple bash script to download the latest Wikipedia dump in the chosen language
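A minimal sketch of such a script, assuming the standard https://dumps.wikimedia.org/<lang>wiki/latest/ URL layout and wget (the variable names, flags and messages here are illustrative, not necessarily identical to the script referenced above):

```bash
#!/bin/bash
# Hypothetical sketch of download_wiki_dump.sh: fetch the latest
# "pages-articles" dump for the language code given as the first argument.
set -euo pipefail

LANG_CODE=$1  # e.g. "en" for English, "nl" for Dutch

WIKI_DUMP_NAME=${LANG_CODE}wiki-latest-pages-articles.xml.bz2
WIKI_DUMP_URL=https://dumps.wikimedia.org/${LANG_CODE}wiki/latest/${WIKI_DUMP_NAME}

echo "Downloading the latest ${LANG_CODE}-language Wikipedia dump from ${WIKI_DUMP_URL}..."
# -c resumes a partially completed download if the script is re-run
wget -c "${WIKI_DUMP_URL}"
echo "Saved the dump to ${WIKI_DUMP_NAME}"
```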

To download the latest Wikipedia dump for the English language, for example, simply run the following command in your terminal: ./download_wiki_dump.sh en

🗜️ Extracting and cleaning a Wikipedia dump

The Wikipedia dump we’ve just downloaded is not ready to be pre-processed (sentence-tokenized and written one sentence per line) just yet. First, we need to extract and clean the dump, which can easily be accomplished with WikiExtractor, using the code below:

Simple bash script to extract and clean a Wikipedia dump
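A minimal sketch of such a script, assuming WikiExtractor is installed as the wikiextractor package (pip install wikiextractor) and used with its option to write plain text to stdout; the exact flags and the output filename convention are assumptions of this sketch:

```bash
#!/bin/bash
# Hypothetical sketch of extract_and_clean_wiki_dump.sh: turn a compressed
# XML dump into a single plain-text file, one article after another.
# Assumes WikiExtractor is installed, e.g.: pip install wikiextractor
set -euo pipefail

WIKI_DUMP_FILE_IN=$1                              # e.g. enwiki-latest-pages-articles.xml.bz2
WIKI_DUMP_FILE_OUT=${WIKI_DUMP_FILE_IN%%.*}.txt   # e.g. enwiki-latest-pages-articles.txt

echo "Extracting and cleaning ${WIKI_DUMP_FILE_IN} to ${WIKI_DUMP_FILE_OUT}..."

# WikiExtractor strips the wiki markup and templates; "-o -" writes the
# result to stdout. We then drop the <doc ...> / </doc> wrapper tags and
# blank lines before writing everything to a single text file.
python3 -m wikiextractor.WikiExtractor "${WIKI_DUMP_FILE_IN}" --processes 8 -q -o - \
    | sed '/^[[:space:]]*$/d' \
    | grep -v '^<doc id=' \
    | grep -v '</doc>$' \
    > "${WIKI_DUMP_FILE_OUT}"

echo "Wrote the extracted and cleaned text to ${WIKI_DUMP_FILE_OUT}"
```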

To extract and clean the Wikipedia dump we’ve just downloaded, for example, simply run the following command in your terminal: ./extract_and_clean_wiki_dump.sh enwiki-latest-pages-articles.xml.bz2

⚙️ Pre-processing a Wikipedia dump

Now that we have successfully downloaded, extracted and cleaned a Wikipedia dump, we can begin to pre-process it. Practically, this means sentence-tokenizing the articles and writing them one sentence per line to a single text file. This can be accomplished with Microsoft’s blazingly fast BlingFire tokenizer, using the code below:
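A minimal sketch of such a script, assuming BlingFire is installed (pip install blingfire) and relying on its text_to_sentences function; the output filename is an assumption of this sketch:

```python
"""Hypothetical sketch of preprocess_wiki_dump.py.

Reads the extracted and cleaned Wikipedia text file, sentence-tokenizes
every line with BlingFire and writes the result one sentence per line.
"""
import os
import sys

from blingfire import text_to_sentences


def main():
    wiki_dump_file_in = sys.argv[1]  # e.g. enwiki-latest-pages-articles.txt
    base, ext = os.path.splitext(wiki_dump_file_in)
    wiki_dump_file_out = base + '_sentences' + ext

    print(f'Pre-processing {wiki_dump_file_in} to {wiki_dump_file_out}...')
    with open(wiki_dump_file_in, encoding='utf-8') as in_f, \
            open(wiki_dump_file_out, 'w', encoding='utf-8') as out_f:
        for line in in_f:
            line = line.strip()
            if not line:
                continue
            # text_to_sentences returns the sentences of a paragraph
            # separated by newline characters.
            out_f.write(text_to_sentences(line) + '\n')
    print(f'Successfully pre-processed {wiki_dump_file_in} to {wiki_dump_file_out}')


if __name__ == '__main__':
    main()
```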

To pre-process the Wikipedia dump we’ve just extracted and cleaned, for example, simply run the following command in your terminal: python3 preprocess_wiki_dump.py enwiki-latest-pages-articles.txt

And that’s it, you’re done! 🙌 You can now start experimenting with the latest and greatest in NLP yourself, using your freshly created Wikipedia corpus. 🤗
