
What do you think about scraping or crawling websites?
Many people view this skill as a mere automation tool, or dismiss it as low-end work.
For me, it is more like a war, one that happens to be fought on the internet. When you crawl a website routinely, you may win the first few rounds and celebrate how easy the site is to scrape, not realizing that on the other side there are people tracking your bot's suspicious activity and working to block it from crawling the website.
People often overlook the importance of retrieving good data sources and put their emphasis instead on building a more accurate machine learning model. But there is a well-known saying in machine learning: GIGO, short for "garbage in, garbage out". It is important to realize that if you can retrieve good data sources, and you are already equipped with solid EDA and modeling skills, you will be able to build a better machine learning model.
Back to the topic: I am going to share my journey from scraping to crawling websites. To give you some background, I graduated from Nanyang Technological University in Singapore with a degree in Mathematics and Economics. Not coming from a technical background makes it harder to learn technical skills, but if you work hard, you will eventually get good at them.

First Experience in Scraping for Automation
It started with a part-time job at the Institute of Statistics at Nanyang Technological University while I was still studying. I had to visit a website page by page and copy and paste the contents into an Excel file. Basically, I was required to collect the world rank and score of every university from this website.
After each shift my eyes felt exhausted, and that is what pushed me to start scraping websites instead. Nothing fancy: I used the Python library Requests to fetch the pages and BeautifulSoup to parse the HTML contents.
The result was great, and I was really grateful for it. The scraper not only saved my eyes from the strain but also improved my work efficiency. For a first attempt at scraping, I found it surprisingly easy with these two Python packages.
If you are interested in this scraper, you can visit my GitHub repo for more information.
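To show how little code this kind of scraper takes, here is a minimal sketch of the BeautifulSoup side of the approach. The inline HTML snippet stands in for a page that would normally be fetched with `requests.get(url).text`; the table id and column layout are made up for illustration, not the actual ranking site's markup:

```python
from bs4 import BeautifulSoup

# Inline sample standing in for one downloaded page;
# in practice this string would come from requests.get(url).text.
html = """
<table id="rankings">
  <tr><th>Rank</th><th>University</th><th>Score</th></tr>
  <tr><td>1</td><td>University A</td><td>98.5</td></tr>
  <tr><td>2</td><td>University B</td><td>97.1</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.select("#rankings tr")[1:]:  # skip the header row
    rank, name, score = (td.get_text(strip=True) for td in tr.find_all("td"))
    rows.append({"rank": int(rank), "university": name, "score": float(score)})

print(rows)
```

From here, writing `rows` out to a CSV or Excel file replaces the whole copy-and-paste routine.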
Experience in Crawling for a Data Science Project

Machine learning is my passion. It has sparked my interest since my first internship at Dentsu Aegis Network Global Data Innovation Centre. Given the chance to witness machine learning projects in digital marketing, I was truly impressed by their power. So, aiming to become a data scientist, I told myself to do more projects involving machine learning.
I decided to do a project on predicting rental prices based on certain factors, for example the distance between the rental unit and the nearest MRT station, the size of the room, and the number of bathrooms in the unit. So I decided to crawl PropertyGuru, one of the most popular websites for finding a rental unit in Singapore.
It is a dynamic website, which required me to build an interactive bot, so I chose the Python packages Selenium and BeautifulSoup to crawl it. At first it seemed I would be able to build the scraper and retrieve the data quite easily, but then the website presented a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA).

That is where I realized crawling is not easy. It requires a deep understanding of the particular website before you can retrieve its data. After putting a lot of effort into understanding the possible reasons my bot was being blocked, I came up with a way to mimic human behavior, and finally it worked like a charm.
Long story short, I was then able to apply machine learning models, and after using EDA to create multiple features, the predictions turned out pretty well. None of this would have been possible without accurate, clean data for my machine learning model.
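The exact tricks depend on the site, but the general idea of mimicking human behavior can be sketched like this. The helper names, user-agent strings, and delay bounds are all illustrative assumptions; the same rotating headers and pacing can be plugged into a Selenium or Requests workflow:

```python
import random
import time

# A small pool of browser-like identities; in a real crawler this list
# would be larger and kept up to date.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def random_headers():
    """Rotate the User-Agent so successive requests look less uniform."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }

def polite_delay(low=2.0, high=6.0):
    """Sleep a random, human-scale interval between page loads."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay
```

A bot that fetches pages at perfectly regular intervals with an identical fingerprint is easy to flag; adding randomness to both timing and identity is usually the first step toward looking human.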
Experience in Crawling During Work
After I graduated, I started my first job in a Business Intelligence position at Shopee, where I was responsible for crawling around 120k items daily for competitor analysis. This is where my crawling skills improved the most. My bot once again got blocked by a CAPTCHA, and that is why I learned a new Python package for crawling: Scrapy. It is definitely a great package. For a detailed comparison between Scrapy and Selenium, please feel free to visit this website: https://hackernoon.com/scrapy-or-selenium-c3efa9df2c06
Yeah, I managed to solve it, but this time I think what I learned most is the following:
- Maintaining a database of past data so that it can be used for analysis.
- Building a dashboard to monitor the performance of several crawlers, so that I can fix the code as quickly as possible when problems occur.
- Techniques for bypassing anti-scraping measures and building more efficient crawlers; for more information you can visit this website: https://towardsdatascience.com/https-towardsdatascience-com-5-tips-to-create-a-more-reliable-web-crawler-3efb6878f8db
- Techniques for retrieving sensitive data, which may require you to use the POST method.
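As a minimal sketch of the first point, keeping historical snapshots for later analysis, here is how an item-per-day table might look with Python's built-in sqlite3. The schema and field names are assumptions for illustration; at Shopee scale you would use a proper database server rather than SQLite:

```python
import sqlite3
from datetime import date

# One row per item per crawl date, so price movements can be analysed later.
conn = sqlite3.connect(":memory:")  # use a file path or a real DB in production
conn.execute("""
    CREATE TABLE IF NOT EXISTS item_snapshots (
        item_id    TEXT NOT NULL,
        crawl_date TEXT NOT NULL,
        price      REAL,
        PRIMARY KEY (item_id, crawl_date)
    )
""")

def save_snapshot(item_id, price, day=None):
    """Upsert today's observation; re-running a crawl overwrites the same date."""
    day = day or date.today().isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO item_snapshots VALUES (?, ?, ?)",
        (item_id, day, price),
    )

save_snapshot("SKU-1", 19.9, "2020-01-01")
save_snapshot("SKU-1", 17.5, "2020-01-02")
history = conn.execute(
    "SELECT crawl_date, price FROM item_snapshots"
    " WHERE item_id = ? ORDER BY crawl_date",
    ("SKU-1",),
).fetchall()
print(history)
```

The composite primary key makes daily crawls idempotent: crawling the same item twice on the same day simply replaces the row instead of duplicating it.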
Final Thoughts
I am currently working as a Data Scientist, and what I can tell you is that crawling is still very important. I really hope this article helps and inspires you when you face difficulties in web crawling.
Thank you for reading this post. Feel free to leave comments below on topics you would like to know more about. I will be publishing more posts in the future about my experiences and projects.
About the Author
Low Wei Hong is a Data Scientist at Shopee. His experience involves mostly crawling websites, creating data pipelines, and implementing machine learning models to solve business problems.
He provides crawling services that deliver the accurate, clean data you need. You can visit this website to view his portfolio and also to contact him about crawling services.