A Minimalist End-to-End Scrapy Tutorial (Part I)

Systematic Web Scraping for Beginners

Harry Wang
Towards Data Science


Photo by Paweł Czerwiński on Unsplash

Part I, Part II, Part III, Part IV, Part V

Web scraping is an important skill for data scientists. I have developed a number of ad hoc web scraping projects using Python, BeautifulSoup, and Scrapy over the past few years and have read a few books and tons of online tutorials along the way. However, I have not found a simple, beginner-level tutorial that is end-to-end in the sense that it covers all the basic steps and concepts of a typical Scrapy web scraping project (hence "Minimalist" in the title). That's why I am writing this, and I hope the code repo can serve as a template to help jumpstart your web scraping projects.

Many people ask: should I use BeautifulSoup or Scrapy? They are different things: BeautifulSoup is a library for parsing HTML and XML, while Scrapy is a web scraping framework. You can use BeautifulSoup instead of Scrapy's built-in selectors if you want, but comparing BeautifulSoup to Scrapy is like comparing a Mac keyboard to an iMac, or, as the official documentation puts it, "like comparing jinja2 to Django" if you know what those are :) In short, you should learn Scrapy if you want to do serious and systematic web scraping.

TL;DR, show me the code:

In this tutorial series, I am going to cover the following steps:

  1. (This tutorial) Start a Scrapy project from scratch and develop a simple spider. One important thing is the use of Scrapy Shell for analyzing pages and debugging, which is one of the main reasons you should use Scrapy over BeautifulSoup.
  2. (Part II) Introduce Item and ItemLoader and explain why you want to use them (although they make your code seem more complicated at first).
  3. (Part III) Store the data in a database with an ORM (SQLAlchemy) via Pipelines and show how to set up the most common One-to-Many and Many-to-Many relationships.
  4. (Part IV) Deploy the project to Scrapinghub (you have to pay for services such as scheduled crawling jobs) or set up your own server completely free of charge by using the great open source project ScrapydWeb together with Heroku.
  5. (Part V) I created a separate repo (Scrapy + Selenium) to show how to crawl dynamic web pages (such as a page that loads additional content via scrolling) and how to use proxy networks (ProxyMesh) to avoid getting banned.

Some prerequisites:

Let’s get started!

First, create a new folder, set up a Python 3 virtual environment inside it, and install Scrapy. To make this step easy, I created a starter repo, which you can fork and clone (see the Python 3 virtual environment documentation if needed):

$ git clone https://github.com/yourusername/scrapy-tutorial-starter.git
$ cd scrapy-tutorial-starter
$ python3.6 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Your folder should look like the following, and I assume we always work inside the virtual environment. Note that we only have one package in requirements.txt so far.

Run scrapy startproject tutorial to create an empty Scrapy project, and your folder will look like this:
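The scaffold that scrapy startproject generates typically looks roughly like this (note the nested "tutorial" folders):

scrapy-tutorial-starter/
├── venv/
├── requirements.txt
└── tutorial/
    ├── scrapy.cfg
    └── tutorial/
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders/
            └── __init__.py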

Two nested folders, both named "tutorial", were created. We don't need the top-level "tutorial" folder: delete it and move the second-level "tutorial" folder, with its contents, one level up. I know this is confusing, but that's all you have to do with the folder structure. Now your folder should look like:
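After the move, the layout should look roughly like this (the file names are the standard Scrapy scaffold):

scrapy-tutorial-starter/
├── venv/
├── requirements.txt
├── scrapy.cfg
└── tutorial/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── __init__.py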

Don't worry about the auto-generated files for now; we will come back to them later. This tutorial is based on the official Scrapy tutorial, so the website we are going to crawl is http://quotes.toscrape.com, which is quite simple: there are pages of quotes with authors and tags:

When you click an author, you go to the author detail page with the name, birthday, and bio.

Now, create a new file named "quotes-spider.py" in the "spiders" folder with the following content:
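A minimal sketch of such a spider, consistent with the description below (the logged string is just a placeholder to confirm the spider runs):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # Nothing is extracted yet; just log a string to confirm the spider ran.
        self.logger.info('hello this is my first spider')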

You just created a spider named "quotes", which sends a request to http://quotes.toscrape.com and gets the response from the server. However, the spider does not do anything yet when parsing the response; it simply outputs a string to the console. Let's run this spider with scrapy crawl quotes; you should see output like:

Next, let’s analyze the response, i.e., the HTML page at http://quotes.toscrape.com using Scrapy Shell by running:

$ scrapy shell http://quotes.toscrape.com/
...
2019-08-21 20:10:40 [scrapy.core.engine] INFO: Spider opened
2019-08-21 20:10:41 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-08-21 20:10:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x105d01dd8>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/>
[s]   response   <200 http://quotes.toscrape.com/>
[s]   settings   <scrapy.settings.Settings object at 0x106ae34e0>
[s]   spider     <DefaultSpider 'default' at 0x106f13780>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>

You can select elements using either XPath or CSS selectors, and Chrome DevTools is often used to analyze the page (we won't cover the selector details here; please read the documentation to learn how to use them):

For example, you can test a selector and see the results in Scrapy Shell. Assume we want to get the quote block shown above:

You can use either XPath, response.xpath("//div[@class='quote']").get() (.get() returns the first selected element; use .getall() to return all of them), or CSS, response.css("div .quote").get(). The quote text, author, and tags are the parts we want to extract from this quote block:

>>> response.xpath("//div[@class='quote']").get()
'<div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world"> \n            \n            <a class="tag" href="/tag/change/page/1/">change</a>\n            \n            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>\n            \n            <a class="tag" href="/tag/thinking/page/1/">thinking</a>\n            \n            <a class="tag" href="/tag/world/page/1/">world</a>\n            \n        </div>\n    </div>'

We can proceed in the shell to get the data as follows:

  • get all quote blocks into “quotes”
  • use the first quote in “quotes”: quotes[0]
  • try the CSS selectors
>>> quotes = response.xpath("//div[@class='quote']")
>>> quotes[0].css(".text::text").getall()
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
>>> quotes[0].css(".author::text").getall()
['Albert Einstein']
>>> quotes[0].css(".tag::text").getall()
['change', 'deep-thoughts', 'thinking', 'world']

It seems that the selectors shown above get what we need. Note that I am mixing XPath and CSS selectors here for demonstration purposes; there is no need to use both in this tutorial.
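For reference, the two styles are interchangeable here; both of the following select the same quote blocks on this page:

# XPath and CSS versions of the same selection:
response.xpath("//div[@class='quote']")
response.css("div.quote")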

Now, let's revise the spider file and use the yield keyword to output the selected data to the console (note that each page has many quotes, so we use a loop to go over all of them):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        self.logger.info('hello this is my first spider')
        quotes = response.css('div.quote')
        for quote in quotes:
            yield {
                'text': quote.css('.text::text').get(),
                'author': quote.css('.author::text').get(),
                'tags': quote.css('.tag::text').getall(),
            }

Run the spider again: scrapy crawl quotes and you can see the extracted data in the log:

You can save the data in a JSON file by running: scrapy crawl quotes -o quotes.json
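Each yielded dictionary becomes one JSON object in quotes.json; the first entry should look roughly like this (depending on your feed settings, non-ASCII characters such as the curly quotes may be escaped):

[
  {
    "text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”",
    "author": "Albert Einstein",
    "tags": ["change", "deep-thoughts", "thinking", "world"]
  },
  ...
]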

So far, we get all the quote information from the first page; our next task is to crawl all pages. You should notice a "Next" button at the bottom of the front page for page navigation. The logic is: click the Next button to go to the next page, get the quotes, and click Next again, until the last page, which has no Next button.

Via Chrome DevTools, we can get the URL of the next page:

Let’s test it out in Scrapy Shell by running scrapy shell http://quotes.toscrape.com/ again:

$ scrapy shell http://quotes.toscrape.com/
...
>>> response.css('li.next a::attr(href)').get()
'/page/2/'

Now we can write the following code for the spider to go over all pages to get all quotes:
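A sketch of the revised parse method (the quote extraction is unchanged; only the pagination handling at the end is new):

def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('.text::text').get(),
            'author': quote.css('.author::text').get(),
            'tags': quote.css('.tag::text').getall(),
        }

    # Follow the "Next" link until the last page, which does not have one.
    next_page = response.css('li.next a::attr(href)').get()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)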

next_page = response.urljoin(next_page) builds the full URL, and yield scrapy.Request(next_page, callback=self.parse) sends a new request to fetch the next page, using a callback to run the same parse function on the new page and extract its quotes.

Shortcuts can be used to further simplify the code above: see this section. Essentially, response.follow supports relative URLs (no need to call urljoin) and automatically uses the href attribute of an <a> element. So the code can be shortened further:

for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)

Now run the spider again with scrapy crawl quotes, and you should see that quotes from all 10 pages have been extracted. Hang in there; we are almost done with this first part. The next task is to crawl the individual author pages.

As shown above, when we process each quote, we can go to the individual author’s page by following the highlighted link — let’s use Scrapy Shell to get the link:

$ scrapy shell http://quotes.toscrape.com/
...
>>> response.css('.author + a::attr(href)').get()
'/author/Albert-Einstein'

So, in the loop that extracts each quote, we issue another request to the corresponding author's page and add a parse_author function that extracts the author's name, birthday, born location, and bio and outputs them to the console. The updated spider looks like the following:
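A sketch of the updated spider; the CSS class names used in parse_author (author-title, author-born-date, author-born-location, author-description) are assumptions based on the markup of the author detail pages on quotes.toscrape.com:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('.text::text').get(),
                'author': quote.css('.author::text').get(),
                'tags': quote.css('.tag::text').getall(),
            }

            # Follow the "(about)" link next to the author's name.
            author_link = quote.css('.author + a::attr(href)').get()
            if author_link is not None:
                yield response.follow(author_link, callback=self.parse_author)

        # Pagination: follow the "Next" link if there is one.
        for a in response.css('li.next a'):
            yield response.follow(a, callback=self.parse)

    def parse_author(self, response):
        # Class names below are assumed from the author page markup.
        yield {
            'name': response.css('.author-title::text').get(),
            'birthday': response.css('.author-born-date::text').get(),
            'born_location': response.css('.author-born-location::text').get(),
            'bio': response.css('.author-description::text').get(),
        }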

Run the spider again with scrapy crawl quotes and double-check that everything you need to extract is output to the console correctly. Note that Scrapy is based on Twisted, a popular event-driven networking framework for Python, and is therefore asynchronous. This means that the individual author pages may not be processed in sync with the corresponding quotes, e.g., the order of the author page results may not match the quote order on the page. We will discuss how to link each quote with its corresponding author page in a later part.

Congratulations, you have finished Part I of this tutorial.

Learn more about Item and ItemLoader in Part II.

Part I, Part II, Part III, Part IV, Part V
