Dataset creation and cleaning: Web Scraping using Python — Part 1

Karan Bhanot
Towards Data Science
9 min read · Oct 1, 2018


“world map poster near book and easel” by Nicola Nuttall on Unsplash

In my last article, I discussed generating a dataset using an Application Programming Interface (API) and Python libraries. APIs allow us to draw very useful information from a website in an easy manner. However, not all websites have APIs, and this makes it difficult to gather relevant data. In such a case, we can use web scraping to access a website’s content and create our dataset.

Web Scraping is a technique employed to extract large amounts of data from websites whereby the data is extracted and saved to a local file in your computer or to a database in table (spreadsheet) format. — WebHarvy

Generally, web scraping involves accessing numerous websites and collecting data from them. However, we can limit ourselves to collecting a large amount of information from a single source and using it as a dataset. In this particular example, we’ll explore Wikipedia. I’ll also explain the HTML basics we’ll need. The complete project is available as a Notebook in the GitHub repository Web Scraping using Python.

This example is just for demonstration purposes. We must always check a website’s guidelines before scraping it or using its data for any commercial purpose.

This is a two-part article. In this first part, we’ll explore how to get the data from the website using BeautifulSoup, and in the second part, we’ll clean the collected dataset.

Determine the content

“man drawing on dry-erase board” by Kaleidico on Unsplash

We’ll access the List of countries and dependencies by population Wikipedia webpage. The webpage includes a table with the names of countries, their population, date of data collection, percentage of world population and source. And if we go to any country’s page, all information about it is written on the page with a standard box on the right. This box includes a lot of information such as total area, water percentage, GDP etc.

Here, we will combine the data from these two webpages into one dataset.

  1. List of Countries: On accessing the first page, we’ll extract the list of countries, their population and percentage of world population.
  2. Country: We’ll then access each country’s page, and get information including total area, percentage water, and GDP (nominal).

Thus, our final dataset will include information about each country.

HTML Basics

Each webpage that you view in your browser is structured in HyperText Markup Language (HTML). It has two parts: the head, which includes the title and any imports for styling and JavaScript, and the body, which includes the content that gets displayed as the webpage. We’re interested in the body of the webpage.

HTML is composed of tags. A tag name is enclosed between an opening < and a closing > angle bracket; a closing tag has a forward slash / right after the opening bracket. For example, <div></div> and <p>Some text</p>.

Homepage.html as an example
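The original figure is an image, so here is a hypothetical stand-in for Homepage.html (only the id base and class data values come from the example), together with how BeautifulSoup would access them:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Homepage.html example
homepage_html = """
<html>
  <head><title>Homepage</title></head>
  <body>
    <div id="base">
      <p>Some text</p>
      <table>
        <tr>
          <td class="data">Cell 1</td>
          <td class="data">Cell 2</td>
        </tr>
      </table>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(homepage_html, 'html.parser')

# An id is unique, so it pinpoints a single element
base_div = soup.find(id='base')

# A class can be shared, so it may match several elements
data_cells = soup.find_all('td', class_='data')

print(base_div.name)                             # div
print([cell.get_text() for cell in data_cells])  # ['Cell 1', 'Cell 2']
```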

There are two direct ways to access any element (tag) on a webpage. We can use an id, which is unique, or we can use a class, which can be shared by multiple elements. Here, we can see that the <div> has the id attribute base, which acts as a unique reference to this element, while all the table cells marked by <td> share the same class, data.

Generally useful tags include:

  1. <div>: A container that groups content into a single entity. It can act as the parent of many different elements, so any style changes applied to it also reflect in its child elements.
  2. <a>: Defines a link; the webpage that loads when the link is clicked is specified in its href attribute.
  3. <p>: Whenever some information is to be displayed on the webpage as a block of text, this tag is used. Each such tag appears as its own paragraph.
  4. <span>: When information is to be displayed inline, we use this tag. When two such tags are placed side by side, they’ll appear in the same line unlike the paragraph tag.
  5. <table>: Tables are displayed in HTML with the help of this tag, where data is displayed in cells formed by intersection of rows and columns.

Import Libraries

We first begin by importing necessary libraries, namely, numpy, pandas, urllib and BeautifulSoup.

  1. numpy: A very popular library that makes array operations very simple and fast.
  2. pandas: It helps us convert the data into a tabular structure so we can manipulate it with the numerous functions that have been efficiently developed.
  3. urllib: We use this library to open the URL from which we would like to extract the data.
  4. BeautifulSoup: This library helps us to get the HTML structure of the page that we want to work with. We can then, use its functions to access specific elements and extract relevant information.
Import all libraries
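In Python 3, these imports might look along these lines (urlopen lives in urllib.request):

```python
import numpy as np                     # fast array operations
import pandas as pd                    # tabular data manipulation
from urllib.request import urlopen     # opening URLs
from bs4 import BeautifulSoup          # parsing HTML
```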

Understand the data

Initially, we define just the basic function of reading a URL and extracting the HTML from it. We’ll introduce new functions as and when they are needed.

Function to get HTML of a webpage

In the getHTMLContent() function, we pass in the URL. Here, we first open the URL using the urlopen method. This enables us to apply the BeautifulSoup library to get the HTML using a parser. While there are many parsers available, in this example we use html.parser, which parses HTML files. Then, we simply return the output, which we can then use to extract our data.
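A sketch of getHTMLContent() along the lines described above:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getHTMLContent(link):
    # Open the URL and parse the response with the built-in html.parser
    html = urlopen(link)
    soup = BeautifulSoup(html, 'html.parser')
    return soup
```

Calling it with the page’s URL, e.g. getHTMLContent('https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'), returns the parsed page.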

We use this function to get the HTML content for the Wikipedia page of List of countries. We see that the countries are present in a table. So, we use the find_all() method to find all tables on the page. The parameter that we supply inside this function determines the element that it returns. As we require tables, we pass the argument as table and then iterate over all tables to identify the one we need.

We print each table with the prettify() function. This function makes the output more readable. Now, we need to analyse the output and see which table has the data we are searching for. After some inspection, we can see that the table with the class wikitable sortable has the data we need. Thus, our next step is to access this table and its data. For this, we use the function find(), which allows us to specify not only the element we are looking for but also its properties, such as the class name.
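The same steps, shown here on a trimmed sample instead of the live page so the pattern is clear:

```python
from bs4 import BeautifulSoup

# Trimmed sample standing in for the Wikipedia page content
sample_page = """
<table class="wikitable sortable"><tr><th>Country</th></tr></table>
<table class="navbox"><tr><td>navigation</td></tr></table>
"""
content = BeautifulSoup(sample_page, 'html.parser')

# find_all() returns every matching element; prettify() indents it nicely
for table in content.find_all('table'):
    print(table.prettify())

# find() with a class narrows the search to the one table we need
table = content.find('table', {'class': 'wikitable sortable'})
print(table.th.get_text())  # Country
```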

Print all country links

A table in HTML is composed of rows denoted by the tags <tr></tr>. Each row has cells, which can either be headings defined using <th></th> or data defined using <td></td>. Thus, to access each country’s webpage, we can get its link from the cells in the country column of the table (the second column). So, we iterate over all the rows in the table and read the second column’s data into the variable country_link. For each row, we extract the cells and get the element a in the second column (indexing in Python starts at 0, so the second column means cells[1]). Finally, we print all the links.

The links do not include the base address, so whenever we access any of these links, we’ll append https://en.wikipedia.org as the prefix.
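Putting those pieces together, again on a trimmed sample table with hypothetical rows:

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical version of the countries table
sample_table = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Country</th><th>Population</th></tr>
  <tr><td>1</td><td><a href="/wiki/China">China</a></td><td>1,400,000,000</td></tr>
  <tr><td>2</td><td><a href="/wiki/India">India</a></td><td>1,380,000,000</td></tr>
</table>
"""
table = BeautifulSoup(sample_table, 'html.parser').find('table', {'class': 'wikitable sortable'})

links = []
for row in table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) > 1:                 # the heading row has <th>, not <td>
        country_link = cells[1].find('a')
        # The href is relative, so prepend the base address
        links.append('https://en.wikipedia.org' + country_link.get('href'))

print(links)
```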

While the function I developed to extract the data from each country’s webpage might appear small, there have been many iterations for it before I finalised the function. Let’s explore it step by step.

Each country’s page includes an information box on the right, which contains the motto, name, GDP, area and other important features. So, first we identified this box by the same steps as before: it is a table with the class infobox geography vcard. Next, we define the variable additional_details to collect all the information we get from this page in an array, which we can then append to the list-of-countries dataset.

When we enter the inspect mode of the Chrome browser (right-click anywhere and select Inspect) on a country page, we can look at the classes for each heading in the table. We are interested in four fields: Area — Total area and Water (%), and GDP (nominal) — Total and Per capita.

Area — Total Area and Water (%)
GDP (nominal) — Total and Per capita

We can easily infer that the headings Area and GDP (nominal) have the class mergedtoprow, while the data we want to extract has the class mergedrow or mergedbottomrow. But we cannot jump directly to any data element, as the order in which the rows occur changes from country to country. For some countries a particular field may be missing, so the mergedrow for total area might be the 6th row for one country and the 7th for another. Thus, we need to first check the text of each mergedtoprow, and if it matches either Area or GDP (nominal), we should read and collect the data that follows.

Seems simple enough, but when I tried it, I instantly saw a problem: the water (%) for some countries was more than 100. This is not possible, so we must be missing something. This is when I realised what the problem was. If we read two values after the Area heading and the water value is missing, we’ll incorrectly read the first value of Population instead, giving us wrong data. As a result, we need to ensure that when the Population heading is identified, we stop reading values.

Now, we can simply put all our knowledge into the function getAdditionalDetails(). We define the variable read_content to flag whether or not we should read the next value. Apart from those described before, we use three functions here:

  1. get(): This retrieves the value of a given attribute of an element, such as its class or href.
  2. get_text(): This gets the value that is present within the opening and closing tags of an element.
  3. strip(): This removes any leading and trailing whitespace that might be present in the text. We can also specify a particular character we wish to be removed, for example, in our case, the newline character \n.

Iterating over all rows in the table, the function checks whether the present row is a heading matching Area or GDP (nominal) and, if so, starts reading. In reading mode, it checks whether the new element is Total area or Total and, if yes, reads it and keeps reading so as to pick up Water (%) or per capita GDP, respectively, on the next pass.

We use try and except to ensure that even if we miss certain values, our whole process does not end. This is another lesson I learnt when I was iterating through the complete list of countries. Some countries do not have all the information that we need, or maybe we cannot find the table by the required name. In such a case, our process might throw an error which we must catch to return an empty array and allow the process to continue for other countries.
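A reconstruction of getAdditionalDetails() along the lines described above — the author’s exact gist may differ, and the sample infobox below is heavily simplified:

```python
from bs4 import BeautifulSoup

def getAdditionalDetails(country_soup):
    # Returns [total area, water (%), total GDP, per capita GDP], or [] on any failure
    try:
        infobox = country_soup.find('table', {'class': 'infobox geography vcard'})
        additional_details = []
        read_content = False
        for tr in infobox.find_all('tr'):
            row_class = tr.get('class') or []
            row_text = tr.get_text()
            if 'mergedtoprow' in row_class:
                # A new section heading: read under Area / GDP (nominal),
                # and stop at Population so its values are never misread
                read_content = (('Area' in row_text or 'GDP' in row_text)
                                and 'Population' not in row_text)
            elif read_content and ('mergedrow' in row_class or 'mergedbottomrow' in row_class):
                th, td = tr.find('th'), tr.find('td')
                if th is not None and td is not None:
                    label = th.get_text()
                    if 'Total' in label or 'Water' in label or 'Per capita' in label:
                        additional_details.append(td.get_text().strip('\n'))
        return additional_details
    except Exception:
        # Missing infobox or missing fields: return an empty array
        # so the overall process continues with the next country
        return []

# Heavily simplified sample infobox for demonstration
sample_infobox = """
<table class="infobox geography vcard">
  <tr class="mergedtoprow"><th>Area</th></tr>
  <tr class="mergedrow"><th>Total</th><td>9,596,961 km2</td></tr>
  <tr class="mergedbottomrow"><th>Water (%)</th><td>2.8</td></tr>
  <tr class="mergedtoprow"><th>Population</th></tr>
  <tr class="mergedrow"><th>Total</th><td>1,400,000,000</td></tr>
  <tr class="mergedtoprow"><th>GDP (nominal)</th></tr>
  <tr class="mergedrow"><th>Total</th><td>$14 trillion</td></tr>
  <tr class="mergedbottomrow"><th>Per capita</th><td>$10,000</td></tr>
</table>
"""
print(getAdditionalDetails(BeautifulSoup(sample_infobox, 'html.parser')))
```

Note how the Population section is skipped entirely: its Total row is never appended because read_content is switched off at its heading.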

Create the dataset

Finally, we now know what information we need to collect as well as how to collect it.

We first read each row of the table from the list of countries and gather each country’s name, population and percentage of world population. We then use the link to get all the additional details, including total area, water (%), total GDP and per capita GDP. If, however, fewer than 4 additional values are returned, the information for that country is incomplete and we do not use that data. Otherwise, we add all that information to the data_content array, which is compiled into the dataset dataframe.

Next, we read the table’s headings and append the headings for the 4 extra columns. These act as the headings for our dataset, which we export into a .csv file.
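The assembly step can be sketched like this, with two hand-made rows standing in for the scraped values (in the real run, each row combines the list-page columns with the four values returned by getAdditionalDetails(); the numbers and column names here are illustrative):

```python
import pandas as pd

# Hand-made rows standing in for scraped values; the figures are illustrative
data_content = [
    ['1', 'China', '1,400,000,000', '17.9%', '9,596,961 km2', '2.8', '$14 trillion', '$10,000'],
    ['2', 'India', '1,380,000,000', '17.7%', '3,287,263 km2', '9.6', '$2.9 trillion', '$2,100'],
]

# Headings read from the source table, plus the 4 extra columns
headers = ['Rank', 'Country', 'Population', '% of world population',
           'Total Area', 'Percentage Water', 'Total Nominal GDP', 'Per Capita GDP']

dataset = pd.DataFrame(data_content, columns=headers)
dataset.to_csv('dataset.csv', index=False)
print(dataset.shape)  # (2, 8)
```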

Even though our dataset is now ready, it’s not in the most useful format. We need to clean this dataset by applying proper formatting, unifying metrics and removing unnecessary signs and values. We will cover this in the second part of this article.

Hope you liked my article. Please feel free to reach out and share your thoughts.
