
10 Things You Might Not Know About Wikipedia Library In Python

Fetching data is much simpler with a line of code using Wikipedia API!

Photo by Alfons Morales on Unsplash

Introduction

In my previous post 5 Interesting Python Libraries That You Might Have Missed, I talked about 5 underrated Python libraries that I had rarely come across, one of which is the Wikipedia API. As I read more about this library, I realized it is way cooler than I had expected. In this article, I will share with you some examples of using this simple, convenient and useful library.

What to expect

As the Wikipedia API can retrieve almost all the content of a Wikipedia page, we do not need to rely heavily on web-scraping techniques in this case. Fetching data takes just a line of code with the Wikipedia API.

_However, one important note to remember is that this library is not intended for advanced use. Therefore, as the documentation suggests, if you intend to do any scraping projects or automated requests, consider alternatives such as Pywikipediabot or the MediaWiki API, which offer more advanced features._

Alright, first, install this cool library and let’s see what this package can bring us.

!pip install wikipedia
import wikipedia

How it works

1. Getting the summary of a specific keyword

If you wish to get a particular number of summary sentences for anything, just pass that number as an argument to the summary() function. For example, I’m trying to figure out what Covid-19 is in 4 sentences.
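In code, it can look something like this:

# Get a 4-sentence summary of the "Covid-19" article
print(wikipedia.summary("Covid-19", sentences=4))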

Output:

Figure 1: Summary sentences - Image by Author

2. Searching article titles

The search() function helps us find all article titles that contain a specific keyword. For instance, if I want to get all article titles relating to "KFC", I pass "KFC" to the search function.
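Something like this:

# Search Wikipedia for article titles related to "KFC"
print(wikipedia.search("KFC"))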

As a result, a list of all articles in Wikipedia that include information about KFC is retrieved as you can see below.

Output:

Figure 3: List of titles - Image by Author

You can also specify how many titles you want returned.
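For example, the results parameter limits the number of titles that come back:

# Return only the first three matching titles
print(wikipedia.search("KFC", results=3))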

3. Searching keywords

In case you have something in mind to look up but cannot remember exactly what it is called, consider the suggest() method. The function returns a related suggestion.

Suppose I want to find the exact name of the German Chancellor, but do not remember how her name is spelled. I can write what I remember, which is "Angela Markel" and let suggest() do the rest for me.
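Something along these lines:

# Let the library suggest the correct spelling
print(wikipedia.suggest("Angela Markel"))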

As you can see, the function returns the correct spelling, which is "Angela Merkel".

Output:

Figure 5: Correct spelling - Image by Author

5. Extracting content

If you wish to extract all the content of a Wikipedia page in plain-text format, try the content attribute of the page object.

In the example below, I demonstrate how to get the "History of KFC" article. The result doesn’t include pictures or tables, just plain text.
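A minimal sketch of that extraction:

# Fetch the full plain-text content of the "History of KFC" page
kfc_history = wikipedia.page("History of KFC")
print(kfc_history.content)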

Here is what the result looks like. Simple, right?

Figure 7: History of KFC - Image by Author

You can even create a loop to fetch the content of several articles related to your chosen topic by combining search() and page().content. Let’s try to combine several articles about Oprah Winfrey.
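One way such a loop could look (only a sketch; skipping ambiguous titles is my own addition to keep the loop from crashing):

# Fetch the content of every article returned by the search
for title in wikipedia.search("Oprah Winfrey"):
    try:
        print(wikipedia.page(title).content)
    except wikipedia.exceptions.DisambiguationError:
        # Skip titles that resolve to a disambiguation page
        continue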

6. Extracting URL of a page

You can easily extract the URL of any Wikipedia page with the url attribute of the page object.
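For example:

# Print the URL of the "History of KFC" page
print(wikipedia.page("History of KFC").url)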

Output:

Figure 10: URL of the page - Image by Author

7. Extracting reference URLs

You can even extract all the reference URLs on a Wikipedia page with the page object and a different attribute this time, which is references.
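Something like this:

# List all external reference URLs cited on the page
print(wikipedia.page("History of KFC").references)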

Output:

Well, a list of external references is extracted as follows:

Figure 12: Reference links - Image by Author

8. Getting page category

What if I want to figure out how my article is categorized by Wikipedia? This time another property of the page object is used: categories. I will try to find out all the categories of the above article, "History of KFC".
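A minimal sketch:

# List the categories Wikipedia assigns to the article
print(wikipedia.page("History of KFC").categories)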

Output:

Figure 14: Categories - Image by Author

9. Extracting page images

Images can also be retrieved with a single line of code. Using page().images, you will get the links to the images. Continuing with my example, I will try to get the second picture from the "History of KFC" page.
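Roughly like this (note that the library does not guarantee the order of the image links, so treating index 1 as the "second picture" is an assumption):

# Get the link to one of the images on the page
print(wikipedia.page("History of KFC").images[1])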

Look what I got here:

Output:

https://upload.wikimedia.org/wikipedia/commons/b/b1/Col_Sanders_Restaurant.png 

The link gets you to Sanders’ Restaurant!

Image by Acdixon on Wikipedia

10. Changing language output

The output can be changed to any language, as long as the page exists in that language. The set_lang() method is used for this. It’s a little off topic, but I think this is a great way to learn new languages: you can read the same paragraph in different languages, with the translations right there on your screen.
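A rough sketch of how this could look; picking Vietnamese ("vi") as the alternative language is my own choice for illustration:

# Switch the library to Vietnamese, then back to English
wikipedia.set_lang("vi")
print(wikipedia.summary("Vietnam", sentences=2))  # Vietnamese summary

wikipedia.set_lang("en")
print(wikipedia.summary("Vietnam", sentences=2))  # English summary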

Above is how I got the summary about "Vietnam" in English.

Output:

Figure 17: English summary about Vietnam - Image by Author

Last words

This is quite interesting, right? Wikipedia is one of the largest sources of information on the Internet and a natural place for data gathering. With the various features of the Wikipedia API, that task becomes much easier.

If you know any other interesting libraries, please do not hesitate to share them with me.

In order to receive updates regarding my upcoming posts, kindly subscribe as a member using the provided Medium Link.

Reference

https://pypi.org/project/wikipedia/

