
Three years ago, I was working as a student assistant in the Institutional Statistics Unit at NTU Singapore.
I was required to obtain the Best Global Universities ranking by manually copying data from the website and pasting it into an Excel sheet. It was frustrating work, and my eyes grew tired from staring at the screen for hours on end, so I started wondering whether there was a better way to do it.
At that time, I googled for automation and found the answer: web scraping. Since then, I have created 100+ web crawlers, and here is the first web scraper I ever built, which I would like to share.
Previously, I used requests plus BeautifulSoup to finish the task. However, looking back at the same website three years later, I found that there is a way to get the data as JSON instead, which works much faster.
If you are thinking of automating your boring and repetitive tasks, please promise me you’ll read to the end. You will learn how to create a web crawler so that you can focus on more value-added tasks.
In this article, I would like to share how I built a simple crawler to scrape universities’ rankings from usnews.com.
Inspect website

The first thing to do when you want to scrape a website is to inspect the web elements. Why do we need to do that?
This is to find out whether there is a more robust way to get the data, or a way to obtain cleaner data. For the former, I did not dig deep enough to uncover the API this time. However, I did find a way to extract cleaner data, which reduces the data-cleansing time.
If you do not know how to inspect the web elements, navigate to any position on the webpage, right-click, click Inspect, and then open the Network tab. After that, refresh the page and you should see a list of network activities appearing one by one. Let’s look at the specific activity I have selected with my cursor in the screenshot above (i.e. the "search?region=africa&…" request).
Next, refer to the purple box in the screenshot above, which highlights the URL the browser sends the request to in order to fetch the data that is presented to you.
Well, we can imitate the browser’s behaviour by sending a request to that URL and getting the data we need, right? But before that, why do I choose to call the request URL instead of the original website URL?

Click on the Preview tab and you will notice that all the information we need, including university ranking, address, country, and so on, is inside the results field, highlighted in the blue box.
This is why we scrape through this URL: the data it returns comes in a very convenient format, JSON.
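To make this concrete, here is a minimal sketch, assuming an endpoint of the shape shown in the screenshot and a top-level results field; the URL below is only a placeholder, so replace it with the request URL you copied from the Network tab.

```python
import httpx

# Assumed endpoint: replace with the request URL copied from the Network tab
# (the "search?region=africa&..." activity shown above).
REQUEST_URL = "https://www.usnews.com/education/best-global-universities/api/search"

# Some sites reject requests that do not carry a browser-like User-Agent.
HEADERS = {"User-Agent": "Mozilla/5.0"}

params = {"region": "africa", "subjects": "agricultural-sciences"}

response = httpx.get(REQUEST_URL, params=params, headers=HEADERS)
response.raise_for_status()

data = response.json()
# The results field (the blue box above) holds one entry per university:
# ranking, name, address, country, and so on.
for university in data["results"]:
    print(university)
```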

The above screenshot shows a comparison between the code from three years ago and the code I write today. Three years ago, when I was a newbie in web scraping, I just used requests, BeautifulSoup, tons of XPath expressions, and heavy data-cleaning processes to get the data I needed. If you compare that with the code I have written today, I only need httpx to get the data, and no data cleaning is required.
For your information, httpx provides an API very similar to requests, but it is a separate library that adds features such as an async API and support for sending HTTP/2 requests. For a more complete comparison, you may refer to this article.
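As a small illustration of those extra features, here is a hedged sketch using httpx’s async client with HTTP/2 enabled; the URL, headers, and parameters are the same assumed values as in the previous snippet, and HTTP/2 support needs the optional httpx[http2] extra installed.

```python
import asyncio

import httpx

# Same assumed values as in the previous sketch.
REQUEST_URL = "https://www.usnews.com/education/best-global-universities/api/search"
HEADERS = {"User-Agent": "Mozilla/5.0"}
PARAMS = {"region": "africa", "subjects": "agricultural-sciences"}

async def fetch_json(url: str, params: dict) -> dict:
    # AsyncClient is httpx's async API; http2=True enables HTTP/2
    # (install the extra with: pip install "httpx[http2]").
    async with httpx.AsyncClient(http2=True, headers=HEADERS) as client:
        response = await client.get(url, params=params)
        response.raise_for_status()
        return response.json()

data = asyncio.run(fetch_json(REQUEST_URL, PARAMS))
print(len(data["results"]))
```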
Request URL:
So now, let’s pay attention to the link we are going to use, as shown above. You will notice that you can change the values for region and subjects, since they are parameters of the URL. (For more information on URL parameters, here is a good read.) However, do note that their values are limited to the regions and subjects provided by the website.
For instance, you can change region=africa to region=asia, or subjects=agricultural-sciences to subjects=chemistry. If you are interested in which regions and subjects are supported, you can visit my repo to check them out.
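If you want to collect several combinations in one go, a short sketch like the following works; the region and subject lists below are just sample values, and the URL is the same assumed placeholder as before.

```python
from itertools import product

import httpx

# Assumed endpoint: replace with the request URL copied from the Network tab.
REQUEST_URL = "https://www.usnews.com/education/best-global-universities/api/search"
HEADERS = {"User-Agent": "Mozilla/5.0"}

# Sample values only; the website defines which regions and subjects are valid.
regions = ["africa", "asia"]
subjects = ["agricultural-sciences", "chemistry"]

for region, subject in product(regions, subjects):
    params = {"region": region, "subjects": subject}
    data = httpx.get(REQUEST_URL, params=params, headers=HEADERS).json()
    print(region, subject, len(data["results"]))
```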
Once you know how to query this URL to obtain the data you need, the remaining question is how many pages you need to query for a particular combination of region and subject.
So, let’s take this URL as an example: copy and paste it into your browser, press Enter, then use Command+F to search for the keyword "last_page", and you will see something similar to the screenshot below.

*Do note that I have installed a Chrome extension that prettifies the raw response into readable JSON. That is why the data shown in my browser is nicely formatted.
Congratulations, you have managed to find the last_page variable as indicated above. Now, the only remaining step is to go to the next page and get the data whenever last_page is larger than 1.
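Instead of searching for the keyword in the browser, you can also read the same value straight from the JSON response. A small sketch, assuming last_page sits at the top level of the response (adjust the key path if your own inspection shows it nested under another field):

```python
import httpx

# Assumed endpoint: replace with the request URL copied from the Network tab.
REQUEST_URL = "https://www.usnews.com/education/best-global-universities/api/search"
HEADERS = {"User-Agent": "Mozilla/5.0"}
params = {"region": "africa", "subjects": "agricultural-sciences"}

data = httpx.get(REQUEST_URL, params=params, headers=HEADERS).json()

# Assumption: last_page is a top-level key; if the prettified output shows it
# nested somewhere else, follow that path instead.
last_page = int(data["last_page"])
print(last_page)
```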
Here is how I figured out the way to navigate to page 2. Take this link as an example.

First, click on page number 2, then look at the right panel. Pay attention to the purple box: you will notice that page=2 has been added to the Request URL. This means you just need to append &page={page_number} to the original request URL in order to navigate through the different pages.
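Putting the pieces together, here is a sketch that walks every page for one region/subject combination by adding the page parameter; as before, the endpoint, headers, and field names are assumptions carried over from the earlier snippets.

```python
import httpx

# Assumed endpoint: replace with the request URL copied from the Network tab.
REQUEST_URL = "https://www.usnews.com/education/best-global-universities/api/search"
HEADERS = {"User-Agent": "Mozilla/5.0"}
params = {"region": "africa", "subjects": "agricultural-sciences"}

with httpx.Client(headers=HEADERS) as client:
    first = client.get(REQUEST_URL, params=params).json()
    results = list(first["results"])
    last_page = int(first["last_page"])  # assumed top-level key, as above

    # Equivalent to appending &page={page_number} to the request URL.
    for page in range(2, last_page + 1):
        data = client.get(REQUEST_URL, params={**params, "page": page}).json()
        results.extend(data["results"])

print(f"Collected {len(results)} universities.")
```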
Now you have the whole picture of how to create a web scraper to obtain the data from the website.
If you would like to have a look at the full Python code, feel free to visit here.
Final Thought

Thank you so much for reading until the end.
Here is what I want you to take away from this article.
- Know that there are many different ways to scrape data from a website, for instance by finding the link that returns the data in JSON format.
- Spend some time inspecting the website; if you manage to find the API that serves the data, it can save you a lot of time.
The reason I am comparing my code from three years ago with the code I have written today is to give you an idea of how you can improve your web crawling and coding skills through continuous practice.
Work hard, the result will definitely come. – Low Wei Hong
If you have any questions or ideas to ask or add, feel free to comment below!
About the Author
Low Wei Hong is a Data Scientist at Shopee. His experience involves crawling websites, building data pipelines, and implementing machine learning models to solve business problems.
He provides crawling services that deliver the accurate, clean data you need. You can visit this website to view his portfolio and to contact him for crawling services.
You can connect with him on LinkedIn and Medium.