
"Getting information off the internet is like taking a drink from a firehose." – _Mitchell Kapor_
The internet is flooded with Data. However, getting that data in the right format is the real deal. It takes some amount of cleaning and filtering to bring the data into a format that is ideal for analysis. Last week I came across some pretty good tabular data sets on Wikipedia. As always, I copied the data and pasted it on the excel sheet. For most parts, it worked well, but there were few instances when the typical copy-paste methodology failed miserably:
- When the table was typically long and spread across a complete webpage
- When the data wasn’t static, i.e., it updated after regular intervals. So every time, there was a change in the original dataset, I had to fetch the dataset again.
For situations like these, simply copy-pasting wasn’t enough. Scraping was another option, but I was looking for something quick and easy and which didn’t involve a ton of code. It was then that I came across a convenient function in Google Sheets called `IMPORTHTML,` which is ideal for importing data from a table or list within an HTML page. In this article, I’ll describe the end to the end process of fetching tables( and lists) into google sheets.
Prior to extracting or importing data that hasn’t been curated by you make sure you have the necessary permissions. Also, this method works only if the data is publicly available, which means there shouldn’t be requirements for authorisation etc.
This article is part of a complete series on finding good datasets. Here are all the articles included in the series:
Part 1: Getting Datasets for Data Analysis tasks – Advanced Google Search
Part 2: Useful sites for finding datasets for Data Analysis tasks
Part 3: Creating custom image datasets for Deep Learning projects
Part 4: Import HTML tables into Google Sheets effortlessly
Part 5: Extracting tabular data from PDFs made easy with Camelot.
Part 6: Extracting information from XML files into a Pandas dataframe
Part 7: 5 Real-World datasets for honing your Exploratory Data Analysis skills
Syntax

Before using the function, let’s quickly understand the syntax of the IMPORTHTML
function. The IMPORTHTML
function takes in three distinct parameters and is written as follows:
IMPORTHTML(URL, query, index)
Parameters
- URL: Refers to the URL of the page where the table is. Do not forget to add the protocol (e.g. http://).) and make sure the URL is between quotes.
- Query: query could be either a table or a list of items. For example, if you want to import a table, mention table within quotes, else mention a list.
- Index: Refers to the position of the table on the web page, which starts at 1.
Usage: Importing a Table
Let’s now see how we can quickly and seamlessly import an HTML table referenced in a Wikipedia article into a Google sheet. The Page is titled: List of [Netflix](https://en.wikipedia.org/wiki/Netflix) original films and contains information about various films and programs distributed by Netflix, an American global on-demand Internet streaming media provider.
Step 1:Copy the URL

Step 2: Select the table to be imported
Let’s say we wanted to import all the Feature Films, which is the first table on the page.

Step 3: Add the formula in the Google Sheet
Create a new Google Sheet, or create a new tab on an existing sheet. Enter the following formula in a cell specifying the URL, query, and index number.
=IMPORTHTML("https://en.wikipedia.org/wiki/List_of_Netflix_original_films_(2012%E2%80%932019)","table",1)
Voila! The entire table is extracted and populated in the spreadsheet. Here’s what you get in the google sheet:

Let’s combine all the three steps above and look at an end to end demo below:

You can import a list in the same way. The only change would be to replace the word table with the list in the function parameter.
Customizing the output: Use Query Function
There are a bunch of things that you can do to customize the imported data. This can be achieved by using the `Queryfunction in conjunction with
IMPORTHTML.The **
Query`** function runs a Google Visualisation API Query across data. Let’s see some of the ways to do this:
Limiting the number of imported columns
Let’s say you do not want to import all the columns, rather a select few. You can pass the index of the desired columns. For instance, if we only want the first three columns, our function would be as follows:
=query(IMPORTHTML("https://en.wikipedia.org/wiki/List_of_Netflix_original_films","table",1),"Select Col1,Col2,Col3")

Look how we have enclosed the IMPORTHTML
function with the query
function and specified the column indices.
Filtering data based on a column
Another scenario would be when we want some specific data, i.e. data about only the Comedy genre. This can be done very easily as follows:
=query(IMPORTHTML("https://en.wikipedia.org/wiki/List_of_Netflix_original_films","table",1),"Select * where Col2='Comedy'")

Check out the official documentation of the QUERY
function for other ways to use it.
Conclusion
In this article, we saw the power of the IMPORTHTML
function and how it could be used to import tables and lists from the website, blogs, and other HTML sources directly into a google spreadsheet. From there, you can perform several analytics tasks, on the dataset and derive meaningful information from it. You can also extract the data and share it with others so that they can also make use of it.