
Web Scraping: A Brief Overview of Scrapy and Selenium, Part I

Thoughts on a scraper design that could save you time

Anastasia Reusova
Towards Data Science
7 min read · Dec 4, 2018


In this post, I am sharing my first experience with web scraping and the tools I used (Scrapy and Selenium). I hope this piece will be helpful to anyone seeking general guidance, as I cover the lessons I found valuable and the things I wish I had known when the idea of scraping first crossed my mind. Specifically, I want to highlight the peculiarities of using the two tools together and when to use which, since many of the explanations I found online focused on either one or the other. I will not get into the code details here (that will come in a separate post), but will rather go through the conceptual understanding I developed over time, with examples from Airbnb, which I find to be a good role model for the subject. As a side note, I am going to use the terms web scraping, scraping and web crawling as synonyms.

First of all, it’s not rocket science.

Arguably, the best approach to kick off this kind of project is learning by doing, and for certain websites you can build a working scraper in a couple of days with a basic knowledge of Python and a decent tutorial at hand. I started learning from this [really helpful] Scrapy course, which costs about $10 when on sale. For the most part, the course covers the use of Scrapy for web crawling, but it also touches upon Selenium. The two can be used separately or merged together in one scraper. Merging them might require some additional research if you are a Python / JavaScript beginner, but it’s totally worth it. Personally, I find Corey’s YouTube channel to be of great help when it comes to brushing up on Python basics, as he has a great way of breaking down the concepts.

Different websites — different tools

While Scrapy is a Python framework specifically designed for web crawling, it is most suitable for properly rendered XML and HTML pages, and may not work for JavaScript-driven pages, which use frameworks like React and Angular. In practice, this means you can pass a valid element selector to Scrapy but get an empty output. Examples of this are different kinds of timers and interactive elements. Another peculiarity of Scrapy is that it goes through pages by accessing their URLs; however, you will find that some buttons have no URL linked to them when you inspect the element or get the source code (through xpath or css). For example, this guided tour has an href (URL) attached to it, so you can get redirected to the tour info.

[Screenshot: airbnb.ae]

On the other hand, this Airbnb “Become a host” button has no href (=URL) when you inspect the source code.

[Screenshot: airbnb.ae]
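To make this concrete, here is a minimal Scrapy sketch of URL-driven crawling. The spider name, URLs and CSS selectors are hypothetical placeholders rather than Airbnb's real markup: anchors that carry an href can be followed, while a JavaScript-rendered element or a button without an href simply comes back as an empty selection.

```python
import scrapy


class ListingsSpider(scrapy.Spider):
    """Minimal sketch of URL-driven crawling with Scrapy."""
    name = "listings"
    start_urls = ["https://www.airbnb.ae/"]  # illustrative starting point

    def parse(self, response):
        # Anchors that carry an href can be followed to the next page.
        # These selectors are placeholders; inspect the page for the real ones.
        for href in response.css("a.tour-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_detail)

        # A JavaScript-rendered element, or a button with no href, simply
        # comes back as an empty list here; Scrapy raises no exception.
        self.logger.info(response.css("button.become-a-host::attr(href)").getall())

    def parse_detail(self, response):
        yield {"title": response.css("h1::text").get()}
```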

Other examples of the latter scenario are infinite-load pages and, in some cases, “load more” or “next” buttons, like this “show all” button:

[Screenshot: airbnb.ae]

In these cases, if you want to stick with Python, you will turn to other tools, like Selenium, which I found to be a fairly beginner-friendly but less optimised scraping tool. Specifically, Selenium makes it easy to interact with the website, or simply click through pages, to get to the element you are interested in.
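As a rough sketch of what that interaction looks like, the snippet below clicks a “show all” style button and then reads the newly rendered content. The URL and CSS selectors are placeholders you would replace after inspecting the page, and it assumes chromedriver is installed on your machine.

```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is available on your PATH
driver.get("https://www.airbnb.ae/")  # illustrative URL

# Click a "show all" / "load more" style button that has no href to follow.
# The selectors are placeholders; inspect the page for the real class names.
driver.find_element(By.CSS_SELECTOR, "button.show-all").click()
time.sleep(2)  # crude pause so the new content can render

# Once the content is in the live DOM, it can be read like any other element.
cards = driver.find_elements(By.CSS_SELECTOR, "div.listing-card")
print(len(cards))

driver.quit()
```

A WebDriverWait with an expected condition is the more robust replacement for the fixed sleep(), but the idea is the same: drive the browser to the state you need, then read the rendered elements.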

At the same time, Selenium is clumsy at handling certain situations that Scrapy handles gracefully. One such example is a missing element, which raises a NoSuchElementException: where Scrapy simply returns an empty list, Selenium can fail to return the remaining elements for that page. For example, consider the review count for homes on Airbnb: if a property has reviews, the counter is displayed, and you can see it in the class="_1lykgvlh", inside the span.

[Screenshot: airbnb.ae]

The property below, however, has no reviews: the counter is not present as an element in the source code, and there’s nothing to “inspect” in the same class="_1lykgvlh":

[Screenshot: airbnb.ae]

So if you are looping through all these classes to get all the elements from them, such as the “new” tag, the review count and the “free cancellation” tag, Selenium will return all of these for the first property but drop the whole set for the second one, because even a single missing element triggers the NoSuchElementException. For this reason, handling this and other exceptions in Selenium is important, so your scraper stays robust and functional.
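One simple pattern for this is to wrap each lookup in a try/except so a missing element yields None instead of aborting the whole card. In the sketch below, only the outer _1lykgvlh class comes from the example above; the inner selectors and the catalog URL are hypothetical.

```python
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By


def safe_text(parent, css):
    """Return the element's text, or None if it is missing from this card."""
    try:
        return parent.find_element(By.CSS_SELECTOR, css).text
    except NoSuchElementException:
        return None


driver = webdriver.Chrome()
driver.get("https://www.airbnb.ae/s/homes")  # illustrative catalog URL

# Only the outer class comes from the Airbnb example above;
# the inner selectors are hypothetical placeholders.
for card in driver.find_elements(By.CSS_SELECTOR, "div._1lykgvlh"):
    record = {
        "reviews": safe_text(card, "span.review-count"),
        "new_tag": safe_text(card, "span.new-tag"),
        "free_cancellation": safe_text(card, "span.free-cancellation"),
    }
    print(record)  # cards without reviews show None instead of crashing

driver.quit()
```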

One of the peculiarities of Selenium is that it has to open a browser for each request to get the URL. This means that Selenium is a memory-intensive tool, and you may run into memory utilisation issues. For this reason, I chose to use Selenium only when necessary and not overuse it. In the Airbnb example, given the choice, I would scrape all the property details from the catalog page rather than going into each property profile, scraping details from there and returning to the catalog.
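If browser overhead does become a problem, one common mitigation (an aside of mine, not part of the original workflow) is to reuse a single headless Chrome session for the catalog page and release it as soon as the scrape is done. The URL below is illustrative.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Reuse a single headless Chrome session for the catalog page,
# and release it as soon as the scrape is done.
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://www.airbnb.ae/s/homes")  # illustrative catalog URL
    # ... extract the per-property details from the listing cards here ...
finally:
    driver.quit()  # always close the browser process to free its memory
```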

Scrapers are not universal

Needless to say, different websites require different scrapers unless they share identical source code. And a scraper written for a specific website may stop working once the website changes, so you will need to adjust the script. For example, developers might change a class name or an id of an element, which will leave you with either exceptions or empty results. For this reason, it is worth monitoring the scraping process in a browser or terminal, or simply keeping an eye on the output file.

Be nice. And avoid getting blocked

In general, be nice and approach the server gently. If the website has an API, use it. If not, and you really need the data, be gentle so you don’t crash their server, and avoid getting your IP blocked: get the hang of DOWNLOAD_DELAY, sleep(), limits on concurrent requests and other ways to pace your scraper. It is also a good idea to avoid launching the scraper from your main workstation, at least in the beginning while you are getting familiar with how it behaves, because if the IP is blocked or flagged as suspicious, it can be painful not only for you but also for your colleagues, family and anyone else on the same network. This means being especially mindful of strategically important websites such as google.com (which doesn’t block you outright, but invites you to CAPTCHA sessions).
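In Scrapy, pacing is mostly a matter of a few settings. A minimal settings.py fragment might look like the sketch below; the values are illustrative examples, not recommendations for any particular site.

```python
# settings.py (Scrapy): illustrative values, tune them per website
DOWNLOAD_DELAY = 2                  # seconds to wait between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time per domain
ROBOTSTXT_OBEY = True               # let Scrapy respect the site's robots.txt
AUTOTHROTTLE_ENABLED = True         # back off automatically if the server slows down
```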

I like to familiarise myself with the site’s robots policy: I read the robots.txt to get a general understanding of what the site allows and disallows for robots, and search the website for any other indication of whether robots are welcome.

[Screenshot: robots.txt for airbnb.ae]

A mere look at the list of restrictions will give you an impression of how strict they are with crawlers. Websites can treat robots differently, and scrapers are often blocked automatically. You can recognise this behaviour if you see 500-range request statuses in your logs, along with messages like Request denied or HTTP status code is not handled or not allowed, etc.
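If you prefer to check a path programmatically rather than reading the file by eye, Python’s standard-library robotparser can answer whether a given user agent may fetch it; the paths below are purely illustrative.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.airbnb.ae/robots.txt")
rp.read()

# '*' stands for the generic user agent; both paths are only illustrative.
print(rp.can_fetch("*", "https://www.airbnb.ae/s/homes"))
print(rp.can_fetch("*", "https://www.airbnb.ae/account-settings"))
```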

In case there is no API and you keep getting 500s after setting delays, you can set a USER_AGENT for your scraper. This changes its request header from pythonX.X, or any other default name that is easily identified and filtered by the server, to the name of the agent you’ve specified, so the server sees your bot as a browser. One of the easiest ways to do this in Scrapy is through the project settings. Keep in mind, though, that you want to keep the user agent in line with your machine’s OS and browser name. For example, USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36' will work for a Mac, but will not work for Ubuntu.
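For reference, setting it through the project settings is a one-liner; this sketch simply places the user-agent string from the example above into settings.py.

```python
# settings.py (Scrapy)
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/61.0.3163.100 Safari/537.36"
)
```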

There are plenty of tools out there

Scrapy and Selenium are not the only options for web crawling. There are many Python libraries (e.g., BeautifulSoup, urllib, lxml, Requests) and other tools like Google’s Puppeteer (Node.js), which can deliver similar results. The difference lies in which kinds of pages they can handle and at what cost, so your objective is to get familiar with their capabilities and use them in the most efficient way.
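For instance, a bare-bones Requests + BeautifulSoup version of the same idea fits in a few lines. Like Scrapy, it only sees the static HTML, not JavaScript-rendered content, and the URL here is illustrative.

```python
import requests
from bs4 import BeautifulSoup

# Static-HTML scraping with Requests + BeautifulSoup; like Scrapy,
# it will not see content that is rendered by JavaScript.
response = requests.get("https://www.airbnb.ae/", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")
print([a.get("href") for a in soup.find_all("a", href=True)][:10])
```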

These are the main things I learnt from my first couple of scraping projects. As with most things, the hardest part was to start and not get lost. So if you are a beginner, I do recommend taking an online course, like this Udemy course, which I found really helpful, and building up your understanding gradually.

This was Part I of this post. I am following up with Part II, where I will share Python code with you and explain what it does, so you can replicate it.

Comment below if you have questions and connect with me on LinkedIn if you want to network.

LinkedIn: https://www.linkedin.com/in/areusova/
GitHub: https://github.com/khunreus
