The easy way to get data from the web

Lusting After Tables On Websites
Let’s say you are reading a Wikipedia page and a big table of data is on the page. Inspiration strikes and it’s exactly something you could use for a project in Python. But how do you actually get it into a Pandas dataframe so you can manipulate it? Thankfully, you can import tables directly from the web with the Pandas read_html() function!
As an example in this article, let’s use the Wikipedia page "List of best-selling music artists". This page contains some nice tables that could potentially be used for analysis in a project.
Table Elements On Websites
The read_html() function in Pandas works by looking for table elements in the html of the website. Let’s see what that means by looking at the code of the first table on the Wikipedia page for our example. If we right-click on the first cell of the first table and click the "Inspect" option, the code for the site will pop up in a box. In this code we will see that this table is coded as a <table> element.

Looking at the expanded code for the table, you can see various other elements, like the table header (<thead>) coded as a row with each of the headings (<th>) formatted accordingly. Then you can see where the table body begins (<tbody>), with each row (<tr>) entered and the data added for each cell of the table (<td>). Pandas uses these html elements to figure out how the data should fit in a dataframe.
I encourage you to use the gray arrows to expand the code further when looking into your own site of interest. It is useful to get more familiar with the html to understand how it gets parsed for conversion into a dataframe by Pandas.
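To see how those elements map onto a dataframe, here is a minimal sketch that feeds read_html() a small hand-written html table instead of a URL. The artists and numbers are made up for illustration, and the example assumes you have an html parser such as lxml installed (read_html() needs one):

```python
from io import StringIO

import pandas as pd

# A tiny html table using the same elements discussed above:
# <table>, <thead> with <th> headings, <tbody> with <tr> rows and <td> cells.
html = """
<table>
  <thead>
    <tr><th>Artist</th><th>Sales</th></tr>
  </thead>
  <tbody>
    <tr><td>The Beatles</td><td>600</td></tr>
    <tr><td>Elvis Presley</td><td>500</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of dataframes, one per <table> element it finds
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df)
```

Notice that the <th> cells become the column headers and the <td> cells become the rows, which is exactly what happens with the tables on the Wikipedia page.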
Bringing the Data Into Pandas
Now that we know a little better how it works, let’s actually use the read_html() function. It is really simple, as the URL is just passed as a string into the function.
import pandas as pd
url = 'https://en.wikipedia.org/wiki/List_of_best-selling_music_artists'
pd.read_html(url)
In our example, what is returned is a list of dataframes. It returns a list because there are several table elements on this Wikipedia page. By indexing the results, we can inspect each dataframe in the list. The first dataframe is the first table on the page which is for artists that have sold 250 million or more records. We can assign this to a variable and begin working with it like any other dataframe. Here is what that code would look like:
url = 'https://en.wikipedia.org/wiki/List_of_best-selling_music_artists'
plus_250 = pd.read_html(url)[0]
(Note that a valid Python variable name cannot begin with a digit, so we use plus_250 rather than 250_plus.) And here is what the resulting dataframe looks like:

In our example, Pandas was even able to parse the column headers using the table header elements of the html code from above. If we want to save the file to our computer, we would use the code plus_250.to_csv('250_plus.csv')
and voilà! We now have a saved csv file of the table data from the website.
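One small thing to watch when saving: by default, to_csv() also writes the dataframe’s row index as an extra first column. A quick sketch with a stand-in dataframe (the data here is made up for illustration) shows how to keep the file to just the table’s own columns:

```python
import pandas as pd

# Stand-in for the dataframe parsed from the website
df = pd.DataFrame({'Artist': ['The Beatles'], 'Claimed sales': [600]})

# to_csv writes the row index by default; index=False drops it
csv_text = df.to_csv(index=False)
print(csv_text)
```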
Finishing Up

Expect a certain amount of cleaning to be required with this method. For instance, in our example, it would make sense to combine all the tables into one. That would mean that we need to go through the list of dataframes created from the website, create a separate variable for each of the dataframes that we want, and manually combine them.
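As a sketch of that combining step, pd.concat() can stack the dataframes row-wise, provided their columns line up. The two small frames below are stand-ins for the tables you would pull out of the read_html() list, with made-up values:

```python
import pandas as pd

# Stand-ins for two of the sales-tier tables; with the real page these
# would come from the list returned by pd.read_html(url)
plus_250 = pd.DataFrame({'Artist': ['The Beatles'], 'Claimed sales': [600]})
plus_200 = pd.DataFrame({'Artist': ['Taylor Swift'], 'Claimed sales': [200]})

# pd.concat stacks the frames top to bottom;
# ignore_index=True renumbers the combined rows from 0
combined = pd.concat([plus_250, plus_200], ignore_index=True)
print(combined)
```

If the tables’ column names differ slightly between tiers, you would need to rename them to match before concatenating.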
Many times you may also find that the data does not parse so cleanly and there are issues with the header, null values, or the like. If the task proves to be difficult or impossible with the read_html() function, then you can use a web scraper instead. By looking at the html elements you can usually predict how well the data will transfer, if at all. But it’s always worth a try to see if the read_html() function works before having to build a web scraping function.
Don’t forget to take a look at the Pandas documentation for more parameters or help when using this function!
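One parameter worth knowing about is match, which keeps only the tables whose text matches a regular expression, so you can skip navigation boxes and other stray tables on a busy page. A small sketch with hand-written html (the table contents are made up for illustration):

```python
from io import StringIO

import pandas as pd

html = """
<table><tr><th>Artist</th></tr><tr><td>ABBA</td></tr></table>
<table><tr><th>Album</th></tr><tr><td>Arrival</td></tr></table>
"""

# match= filters to tables containing text that matches the regex,
# so only the first table is returned here
tables = pd.read_html(StringIO(html), match='Artist')
print(len(tables))
```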