
Easy Access to the World’s Largest Data Source

Wikipedia API for Python

Photo by Clay Banks on Unsplash

In data science, the importance of data comes before building state-of-the-art algorithms. Without a proper and vast amount of data, we cannot train models well enough to get satisfying results.

Wikipedia, the largest encyclopedia in the world, can serve as a great data source for many projects. There are many web scraping tools and frameworks for getting data from Wikipedia, but the Wikipedia API for Python might be the simplest one to use.

In this post, we will see how to use the Wikipedia API to:

  • Access the content of a particular page
  • Search for pages

You can easily install and import it. I will be using Google Colab, so here is how it is done in a Colab notebook:

!pip install wikipedia
import wikipedia

The content of a page can be extracted with the page function, which takes the title of the page as an argument. The following code returns the Support Vector Machine page as a WikipediaPage object.

page_svm = wikipedia.page("Support vector machine")
type(page_svm)
wikipedia.wikipedia.WikipediaPage

This object holds the URL of the page, which can be accessed through the url attribute.

page_svm.url
https://en.wikipedia.org/wiki/Support_vector_machine

We can access the content of the page with the content attribute.

svm_content = page_svm.content
type(svm_content)
str

The content is returned as a string. Here is how to preview the first 1000 characters of the svm_content string.
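A quick way to do this with slicing (the exact output will vary as the article is edited):

print(svm_content[:1000])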

The returned content is a raw string, which is not the optimal format for analysis, so we need to process it to infer meaningful results. There are efficient natural language processing (NLP) libraries for working with textual data, such as NLTK, BERT, and so on.

We will not go into detail on NLP tasks, but let's do a simple operation. We can convert the content string into a Python list that contains the words as separate elements, and then count the number of occurrences of a specific word.

content_lst = svm_content.split(" ")  # split the raw text on spaces
len(content_lst)
57779
content_lst.count("supervised")  # count occurrences of the word "supervised"
4

The title attribute is used to access the title of a page.
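A minimal example, assuming the page_svm object created above:

page_svm.title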

The references attribute returns a list of the references used on the page. For instance, the first 5 references can be retrieved as shown below.
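A minimal sketch (the exact reference list changes as the article is edited):

page_svm.references[:5]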

Similarly, we can also extract the links to the images on the page through the images attribute:
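For example, the first five image URLs (the exact links vary over time):

page_svm.images[:5]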


We do not always have an exact title in mind. Suppose we are looking for pages that contain the word "psychology" in the title. This can be done with the search function.
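A minimal example (the returned list depends on the current state of Wikipedia):

wikipedia.search("psychology")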

In some cases, the returned list is too long, so we may want to limit the number of items. Just pass the desired number to the results parameter.
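For instance, to get only the top 3 matches:

wikipedia.search("psychology", results=3)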


Note: All images created by author unless stated otherwise.

Edit: Thanks to Nick Webb for pointing this out. The creators of the Wikipedia API explicitly state that "this library was designed for ease of use and simplicity, not for advanced use. If you plan on doing serious scraping or automated requests, please use Pywikipediabot (or one of the other more advanced Python MediaWiki API wrappers), which has a larger API, rate limiting, and other features so we can be considerate of the MediaWiki infrastructure."

Conclusion

Wikipedia is a highly valuable data source. It provides access to structured information on numerous topics and serves as a data source for machine learning and deep learning tasks. For instance, the data can be used to train complex NLP models.

The Wikipedia API makes it simple to access and use this enormous source of data.

Thank you for reading. Please let me know if you have any feedback.
