Pre-processing a Wikipedia dump for NLP model training — a write-up
Downloading, extracting, cleaning and pre-processing a Wikipedia dump for training NLP models (e.g. transformers like BERT and RoBERTa)
Wikipedia dumps are used frequently in modern NLP research for model training, especially with transformers like BERT, RoBERTa, XLNet and XLM. For any aspiring NLP researcher who wants to work with models like these, this write-up presents a complete picture (and code) of everything involved in downloading, extracting, cleaning and pre-processing a Wikipedia dump.
📥 Downloading a Wikipedia dump
Wikipedia dumps are freely available in multiple formats and many languages. For the English-language Wikipedia, a full list of all available formats of the latest dump can be found at https://dumps.wikimedia.org/enwiki/latest/.
As we’re primarily interested in text data, for the purposes of this write-up we’ll download a dump that contains only pages and articles, in compressed XML format, using the code below:
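A minimal sketch of what download_wiki_dump.sh might look like, assuming wget is installed; the URL pattern follows the standard dumps.wikimedia.org layout:

```shell
#!/usr/bin/env bash
# download_wiki_dump.sh -- minimal sketch; assumes wget is installed.
set -euo pipefail

# Build the dump URL for a given language code (e.g. "en").
dump_url() {
  local lang="$1"
  echo "https://dumps.wikimedia.org/${lang}wiki/latest/${lang}wiki-latest-pages-articles.xml.bz2"
}

# Download the dump for the language given as the first argument;
# -c resumes a partially completed download.
if [[ $# -ge 1 ]]; then
  wget -c "$(dump_url "$1")"
fi
```

Note that the full English dump is large (tens of gigabytes compressed), which is why resumable downloading (wget -c) is worth having.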
To download the latest Wikipedia dump for the English language, for example, simply run the following command in your terminal: ./download_wiki_dump.sh en
🗜️ Extracting and cleaning a Wikipedia dump
The Wikipedia dump we’ve just downloaded is not yet ready to be pre-processed (sentence-tokenized and written one sentence per line). First, we need to extract and clean the dump, which can easily be accomplished with WikiExtractor, using the code below:
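A sketch of what extract_and_clean_wiki_dump.sh might look like, assuming WikiExtractor is installed (pip install wikiextractor). WikiExtractor's -o - option streams the extracted plain text to stdout, which we redirect into a single text file; check the flags against the version you install:

```shell
#!/usr/bin/env bash
# extract_and_clean_wiki_dump.sh -- minimal sketch; assumes
# WikiExtractor is installed (pip install wikiextractor).
set -euo pipefail

# Derive the output filename from the dump filename, e.g.
# enwiki-latest-pages-articles.xml.bz2 -> enwiki-latest-pages-articles.txt
output_name() {
  local dump="$1"
  echo "${dump%.xml.bz2}.txt"
}

if [[ $# -ge 1 ]]; then
  DUMP_PATH="$1"
  # -q suppresses progress logging; -o - writes the extracted text to stdout.
  python3 -m wikiextractor.WikiExtractor "${DUMP_PATH}" -q -o - > "$(output_name "${DUMP_PATH}")"
fi
```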
To extract and clean the Wikipedia dump we’ve just downloaded, for example, simply run the following command in your terminal: ./extract_and_clean_wiki_dump.sh enwiki-latest-pages-articles.xml.bz2
⚙️ Pre-processing a Wikipedia dump
Now that we have successfully downloaded, extracted and cleaned a Wikipedia dump, we can begin to pre-process it. In practice, this means sentence-tokenizing the articles and writing them, one sentence per line, to a single text file, which can be accomplished with Microsoft’s blazingly fast BlingFire tokenizer, using the code below:
To pre-process the Wikipedia dump we’ve just extracted and cleaned, for example, simply run the following command in your terminal: python3 preprocess_wiki_dump.py enwiki-latest-pages-articles.txt
And that’s it, you’re done! 🙌 You can now start experimenting with the latest and greatest in NLP yourself, using your freshly created Wikipedia corpus. 🤗