Dataset creation and cleaning: Web Scraping using Python — Part 2

Karan Bhanot
Towards Data Science
6 min read · Oct 5, 2018


“open book lot” by Patrick Tomasso on Unsplash

In the first part of this two-part series, we covered an approach to extracting data from Wikipedia webpages using web scraping with BeautifulSoup and Python. Taking a closer look at the result, we can clearly see that the collected data contains noise that must be removed. Data gathered through web scraping is usually noisy in exactly this way, so cleaning it is a required step to create a uniform, usable dataset before any analysis can be performed.

Here, we’ll tackle the Wikipedia dataset we created in the last part and address each column one by one until we have a fairly useful dataset.

Import libraries and read CSV file

We import three libraries: re, numpy and pandas. We’re already familiar with numpy and pandas. The third library, re, is used to match and transform strings based on regular expressions (commonly referred to as regex), which define search patterns. We’ll explore its use throughout this article; the patterns follow the official documentation.

To begin with, we use pandas function read_csv() to read the dataset file and store it in the dataset variable.
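The setup can be sketched as follows. The article reads the Dataset.csv file created in Part 1; since that file isn’t reproduced here, this sketch uses a small in-memory stand-in with two illustrative rows (the column headings match those reported later in the article):

```python
import io
import re          # regex-based cleaning, used in later steps
import numpy as np
import pandas as pd

# The article reads the CSV created in Part 1:
#     dataset = pd.read_csv('Dataset.csv')
# Here an in-memory stand-in with two illustrative rows is used instead.
csv_text = (
    'Country(or dependent territory),Population,% of worldpopulation\n'
    'China,"1,403,500,365",18.5%\n'
    'India,"1,324,171,354",17.2%\n'
)
dataset = pd.read_csv(io.StringIO(csv_text))
print(dataset.shape)
```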

Replace headings

On close inspection, some headings can be made more descriptive, so we will replace them where necessary. We will change Country(or dependent territory) to Country, % of worldpopulation to Percentage of World Population, and Total Area to Total Area (km2).
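One way to sketch the renaming is with pandas’ rename(). The old headings are as reported in the article; the exact scraped strings on your machine may differ slightly (e.g. contain non-breaking spaces):

```python
import pandas as pd

# Illustrative frame carrying the scraped headings from Part 1
dataset = pd.DataFrame(columns=['Country(or dependent territory)',
                                '% of worldpopulation',
                                'Total Area'])

# Map each old heading to its more descriptive replacement
dataset = dataset.rename(columns={
    'Country(or dependent territory)': 'Country',
    '% of worldpopulation': 'Percentage of World Population',
    'Total Area': 'Total Area (km2)',
})
print(list(dataset.columns))
```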

Finalised column headings

Now we will use the newly finalised column names to access the data and refine the contents of each column.

Analysing and cleaning the dataset

Dataset.csv

Let’s first take a look at the dataset in its present state. Almost all the columns have information inside square brackets and parentheses. Text inside square brackets usually marks reference links on the Wikipedia page; since we won’t be following any links, these provide no useful information and can be removed. Data inside parentheses usually gives additional context for the real value. For example, in the Total Area column, the area is repeated in other units inside the parentheses. We don’t need this data either, so we can remove all parentheses too.

Further, we remove the ‘%’ sign from all values in the Percentage of World Population and Percentage Water columns, since the unit is already evident from the column heading.
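These removals can be sketched with vectorised str.replace() calls and two small regexes. The sample cells below are illustrative, not the real scraped values:

```python
import pandas as pd

# Illustrative cells showing the three kinds of noise described above
df = pd.DataFrame({
    'Population': ['1,403,500,365[2]'],
    'Total Area (km2)': ['9,596,961 (3,705,407 sq mi)[b]'],
    'Percentage of World Population': ['18.5%'],
})

# Remove bracketed reference links like [2] and parenthesised extras
for col in df.columns:
    df[col] = df[col].str.replace(r'\[.*?\]', '', regex=True)
    df[col] = df[col].str.replace(r'\(.*?\)', '', regex=True).str.strip()

# Drop the '%' sign — the heading already says the value is a percentage
df['Percentage of World Population'] = (
    df['Percentage of World Population'].str.replace('%', ''))
print(df.iloc[0].tolist())
```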

Resultant dataframe

The data looks much better now but there is still a lot more we can do.

Handle one column at a time

Country helps us identify the context and need not be modified for now. Percentage of World Population is already in the correct format after the previous steps, so we leave it unchanged too.

Population and Total Area columns

The Population column has numbers separated by commas. We will remove the commas using the replace() function, which takes as its first argument the characters we want to replace and as its second argument the characters to replace them with. Here, we replace , with nothing, hence replace(',', '').

To access each cell, we iterate over the column with a for loop, and then use the pandas indexer iloc[] to read and update specific cells in that column.
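A minimal sketch of this step, with illustrative values; the loop-plus-iloc pattern follows the description above:

```python
import pandas as pd

# Illustrative Population values with thousands separators
dataset = pd.DataFrame({'Population': ['1,403,500,365', '1,324,171,354']})

# Iterate over the rows and update each cell in place via iloc,
# replacing ',' with nothing
col = dataset.columns.get_loc('Population')
for i in range(len(dataset)):
    dataset.iloc[i, col] = dataset.iloc[i, col].replace(',', '')

print(dataset['Population'].tolist())
```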

Next, we move to the Total Area (km2) column. We begin by removing commas from this column as well. Then, since some cells give the area in sq mi while others use km2, we convert all sq mi values to km2 by multiplying them by 2.58999.

To implement this, we check whether the cell contains sq mi; if it does, we remove all non-numeric characters except ., multiply the value by 2.58999, and finally convert it back to an integer. I used the repr() function to get the printable representation of the values in this column, and saw that sq mi is actually stored as sq\xa0mi (with a non-breaking space), so the matching must be done against that string; this is reflected in the code. If the units are already km2, we simply strip all non-numeric characters except . and convert the value to an integer. To remove such characters we use the regex [^0-9.]+, which matches any run of one or more characters (marked by +) other than the digits 0 to 9 and the . character.

However, on trying this, I found that some values are given as ranges, which we need to handle first. For those countries, we read the range, split it with the split('-') function, and take the first value as the value for that cell.
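The whole conversion can be sketched like this. The sample cells are illustrative; the range is split on either a hyphen or an en dash (Wikipedia often uses the latter, which is an assumption on my part beyond the article's split('-')), and sq mi is matched via the non-breaking space the article found:

```python
import re
import pandas as pd

# Illustrative cells: a km2 value, a sq mi value (with the non-breaking
# space \xa0 the article found via repr()), and a range
areas = pd.DataFrame({'Total Area (km2)': [
    '9,596,961',
    '3,796,742\xa0sq\xa0mi',
    '30,528\u201330,688',          # range joined by an en dash
]})

cleaned = []
for value in areas['Total Area (km2)']:
    value = value.replace(',', '')
    # For ranges, keep only the first value (hyphen or en dash)
    value = re.split(r'[-\u2013]', value)[0]
    is_sq_mi = 'sq\xa0mi' in value or 'sq mi' in value
    # Strip everything except digits and '.', per the regex [^0-9.]+
    number = float(re.sub(r'[^0-9.]+', '', value))
    if is_sq_mi:
        number *= 2.58999          # convert sq mi to km2
    cleaned.append(int(number))

areas['Total Area (km2)'] = cleaned
print(areas['Total Area (km2)'].tolist())
```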

Modified ‘Population’ and ‘Total Area’ columns

Percentage Water column

This column has a few outlier values. First, for countries such as Algeria and Afghanistan, whose water content is very low, the value is recorded as Negligible or negligible. Further, Chile’s value has an extra b at the end.

Thus, we need to replace Negligible and negligible with 0.0, and strip any characters other than digits and . from these cells. We use the replace() function, both with and without the regex defined above, to achieve this.

In some cases where values were missing, incorrect values were recorded, and these must be removed. Fortunately, this is easy to handle: since a percentage cannot exceed 100, we can simply drop all rows with values above 100. We first convert each string in the column to a float using astype(float), and then keep only the rows whose value is less than or equal to 100.
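A sketch of the whole Percentage Water cleanup, combining the plain and regex replacements with the astype(float) filter. The sample values are illustrative; 636.44 stands in for a wrongly recorded cell:

```python
import pandas as pd

# Illustrative cells: textual outliers, a stray footnote 'b', and one
# impossible value standing in for a wrongly recorded cell
df = pd.DataFrame({'Percentage Water': [
    'Negligible', 'negligible', '1.07b', '0.86', '636.44']})

# Replace the textual outliers with 0.0, then keep digits and '.' only
df['Percentage Water'] = (df['Percentage Water']
                          .str.replace('Negligible', '0.0')
                          .str.replace('negligible', '0.0')
                          .str.replace(r'[^0-9.]+', '', regex=True))

# Convert to float and drop rows with impossible values (> 100)
df['Percentage Water'] = df['Percentage Water'].astype(float)
df = df[df['Percentage Water'] <= 100]
print(df['Percentage Water'].tolist())
```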

Modified ‘Percentage Water’ column

Total Nominal GDP and Per Capita GDP columns

For Total Nominal GDP, we first remove the dollar sign. Next, we observe that the amounts in this column are written with the words trillion, billion and million. Thus, we read each cell, identify the denomination, remove all characters except digits and ., and multiply the amount by the corresponding numeric factor.

For Per Capita GDP, we keep the digits 0–9 and . while removing all other characters, including commas and the dollar sign.
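Both GDP columns can be sketched together. The sample values and the denomination-to-factor mapping are illustrative; round() (my choice, not necessarily the article's) is used when scaling so floating-point representation doesn't truncate the last digit:

```python
import re
import pandas as pd

# Illustrative GDP strings; denominations as described in the article
gdp = pd.DataFrame({
    'Total Nominal GDP': ['$19.39 trillion', '$455 billion', '$12.02 million'],
    'Per Capita GDP': ['$59,501', '$1,983', '$9,541'],
})

# Numeric factor for each denomination word
scale = {'trillion': 10**12, 'billion': 10**9, 'million': 10**6}

totals = []
for value in gdp['Total Nominal GDP']:
    value = value.replace('$', '')                 # drop the dollar sign
    factor = 1
    for word, f in scale.items():
        if word in value:
            factor = f
            break
    # Keep digits and '.' only, then scale; round() avoids float truncation
    totals.append(round(float(re.sub(r'[^0-9.]+', '', value)) * factor))
gdp['Total Nominal GDP'] = totals

# Per Capita GDP: keep digits and '.', dropping '$' and commas
gdp['Per Capita GDP'] = (gdp['Per Capita GDP']
                         .str.replace(r'[^0-9.]+', '', regex=True)
                         .astype(float))
print(gdp['Total Nominal GDP'].tolist())
```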

Final Dataset

Export the final dataset

Finally, our dataset is clean and ready to be exported to an external CSV file. We use the pandas function to_csv() to export the dataframe to the file Final_dataset.csv.
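A sketch of the export, using a small stand-in for the cleaned dataframe (index=False is my addition to keep the row index out of the file; the article may have used the default):

```python
import pandas as pd

# A small stand-in for the cleaned dataframe
dataset = pd.DataFrame({'Country': ['China', 'India'],
                        'Population': [1403500365, 1324171354]})

# Write to CSV; index=False keeps the row index out of the file
dataset.to_csv('Final_dataset.csv', index=False)
```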

Conclusion

In this article, we saw that while the first step of web scraping, i.e. data collection, is important, cleaning the collected data is equally important. This step brings challenges of its own, which must be tackled before a proper dataset is ready for useful analysis. The final dataset is also available in the GitHub repository.

Hope you like my work. Please feel free to reach out, comment, share and express your opinions.
