No matter what level of data science/analytics skills we have, we cannot do much without datasets.
Indeed, there are many open data platforms such as Kaggle and Data.world. However, their datasets are great for exercises and learning purposes, but may not cover the specific data we need.
Data scientists/analysts usually have at least some web scraping skills, which makes it much easier to grab a dataset whenever you see one on a website. After scraping the content, a series of transforming, extracting and cleansing steps gives us a clean dataset for the next stage. This is one of the typical use cases of Python, because there are excellent web scraping libraries available such as Scrapy and Beautiful Soup.
However, learning these web scraping libraries is not that easy. For those without a web development background, which is probably not a common skill among data scientists/analysts, it can be even harder to nail concepts such as the HTML DOM and XPath.
In this article, I will introduce a much easier way of downloading datasets from websites. You will find that you can even use Pandas to do all the work!
COVID-19 New Cases Dataset

The first example is the COVID-19 new cases dataset. There are many publicly available sources for this data, such as government websites, but I want to use it here because it is a very good and typical example.
Today, while scanning the news on "The Age" (a popular Australian media outlet) website, I found a bar chart showing daily new confirmed cases in Australia (see the screenshot below).

What if I want to get their data? Let’s find its API endpoint!
Hunt for API Endpoint
Most popular web browsers now provide a developer console. We will use the console for this job. Here are the keyboard shortcuts to open it in different browsers:
- Google Chrome: Ctrl + Shift + I
- Firefox: Ctrl + Shift + K
- Microsoft Edge: F12
Here I will use Google Chrome for demonstration. But don’t worry, most browsers have very similar developer consoles, and I believe you will be able to find the corresponding features.

Once the console appears, go to the "Network" tab. We are looking for an API endpoint, which would be captured as an "XHR" request if one exists. So, let’s select the XHR filter.
XMLHttpRequest (XHR) is an API in the form of an object whose methods transfer data between a web browser and a web server. The object is provided by the browser’s JavaScript environment. Particularly, retrieval of data from XHR for the purpose of continually modifying a loaded web page is the underlying concept of Ajax design. Despite the name, XHR can be used with protocols other than HTTP and data can be in the form of not only XML, but also JSON, HTML or plain text. [1]
Sometimes, the webpage may have JavaScript running in the background to do scheduled jobs, and the browser will catch those requests too. If you want a cleaner list of captured requests, the best approach is to:
- Make sure the "recording" button is enabled
- Refresh the webpage
- Stop "recording" once the data-related content has been fully rendered on the page
Now, you will have a list of requests in the developer console.

OK. This one is very obvious. We can see a request whose name is "covid-19-new-cases-json.json…". That must be it!
Go to the "Headers" tab, you will see the detail of this request. The most important thing is the URL.

Now, let’s open that URL in another browser tab to see what happens.

Cool! This is the API endpoint we’re looking for.
Read API Endpoint using Pandas
How do we consume it? Extremely easily!
I believe you already use Pandas DataFrames if you are a data scientist or data analyst. With just one line of Python code, you can load everything directly into a Pandas DataFrame.
import pandas as pd

df = pd.read_json('https://www.theage.com.au/interactive/2020/coronavirus/data-feeder/covid-19-new-cases-json.json?v=3')
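From here you can poke at the data straight away. The snippet below is just a quick sketch that continues from the df above; the column names date and newCases are placeholders I am assuming for illustration, so check df.columns for the real ones before plotting (the bar chart also requires matplotlib).

# Inspect what the feed actually contains
print(df.columns)
print(df.head())

# Recreate a rough version of the bar chart.
# Replace 'date' and 'newCases' with the real column names from df.columns.
df.plot.bar(x='date', y='newCases', figsize=(12, 4))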

IKEA Furniture List

Not all websites use REST API endpoints, so I wouldn’t say this is a universal approach. However, you’ll find that a considerable number of websites do. Let’s take one more example, the IKEA website.
Let’s say you want to get all the beds from IKEA, together with the product details, to perform some analysis. Here is the request that my browser caught from the URL (https://www.ikea.com/au/en/cat/beds-bm003/).

This one is more interesting. Note that the "product count" is 220, but the "product window" only gives 24 of them. If you pay attention to the page, it turns out only 24 products are listed at a time, with a "load more" button if you want to keep browsing. This is a very common pattern in web development to save bandwidth and server resources.
But does that mean we have to give up? Absolutely not. Let’s have a look at the request URL:

Have you seen that there is a parameter called "size"? It is exactly 24, which is the page size. Let’s try changing it to size=220 and send the request using the Python library requests.
import requests

# Request all 220 products at once by bumping the 'size' parameter
response = requests.get(
    'https://sik.search.blue.cdtapps.com/au/en/product-list-page?category=bm003&sort=RELEVANCE&size=220&c=plp&v=20200430'
)
products = response.json()['productListPage']['productWindow']
The reason we cannot use Pandas directly this time is that the product list is not at the root level of the JSON response. Instead, it sits at root.productListPage.productWindow.
The .json() call parses the response into Python objects, leaving us with a list of product dictionaries that Pandas can now read.
df = pd.DataFrame.from_dict(products)
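As a side note, Pandas also ships with json_normalize, which flattens nested JSON records into columns. Here is a sketch of the same request done that way, reusing the URL captured above; the exact columns you get back depend on what the endpoint actually returns.

import requests
import pandas as pd

url = ('https://sik.search.blue.cdtapps.com/au/en/product-list-page'
       '?category=bm003&sort=RELEVANCE&size=220&c=plp&v=20200430')

# Fetch the payload and flatten the nested product records into a DataFrame
payload = requests.get(url).json()
df = pd.json_normalize(payload['productListPage']['productWindow'])

print(df.shape)    # expect roughly 220 rows, one per product
print(df.columns)  # check which product attributes the endpoint returns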

Summary

Isn’t that quick and easy? You don’t have to learn those web scraping libraries. Of course, this approach won’t work for every website, which is why the web scraping libraries are still necessary. However, when a website exposes an API endpoint, why not just find it and use it directly?
If you find my articles helpful, please consider joining Medium Membership to support me and thousands of other writers!
Reference
[1] Wikipedia. XMLHttpRequest. https://en.wikipedia.org/wiki/XMLHttpRequest