A Minimalist End-to-End Scrapy Tutorial (Part II)

Systematic Web Scraping for Beginners

Harry Wang
Towards Data Science


Photo by Igor Son on Unsplash

Part I, Part II, Part III, Part IV, Part V

In Part I, you learned how to set up a Scrapy project and write a basic spider that extracts web pages by following page navigation links. However, the extracted data were merely printed to the console. In Part II, I will introduce the concepts of Item and ItemLoader and explain why you should use them to store the extracted data.

Let’s first look at the Scrapy architecture (the data flow diagram with numbered steps from the official Scrapy documentation):

As you can see in steps 7 and 8, Scrapy is designed around the concept of Item, i.e., the spider parses the extracted data into Items, which then go through the Item Pipelines for further processing. I summarize some key reasons to use Item:

  1. Scrapy is designed around Item and expects Items as outputs from the spider. You will see in Part IV that when you deploy the project to ScrapingHub or a similar service, there are default UIs for you to browse Items and related statistics.
  2. Items clearly define the common output data format in a separate file, which enables you to quickly check what structured data you are collecting and raises exceptions when you mistakenly create inconsistent data, such as by misspelling a field name in your code (this happens more often than you think :).
  3. You can add pre/post processing to each Item field (via ItemLoader), such as trimming spaces, removing special characters, etc., and separate this processing code from the main spider logic to keep your code structured and clean.
  4. In Part III, you will learn how to add different item pipelines to do things like detecting duplicate items and saving items to the database.

At the end of Part I, our spider yields the following data:

yield {
    'text': quote.css('.text::text').get(),
    'author': quote.css('.author::text').get(),
    'tags': quote.css('.tag::text').getall(),
}

and

yield {
    'author_name': response.css('.author-title::text').get(),
    'author_birthday': response.css('.author-born-date::text').get(),
    'author_bornlocation': response.css('.author-born-location::text').get(),
    'author_bio': response.css('.author-description::text').get(),
}

You may notice that author and author_name are the same thing (one is from the quote page, the other is from the corresponding individual author page). So we actually extract six pieces of data: quote text, tags, author name, birthday, born location, and bio. Now, let’s define an Item to hold the data.

Open the auto-generated items.py file and update its content as follows:
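A minimal sketch, with one field for each of the six pieces of data:

# items.py
import scrapy

class QuoteItem(scrapy.Item):
    quote_content = scrapy.Field()
    tags = scrapy.Field()
    author_name = scrapy.Field()
    author_birthday = scrapy.Field()
    author_bornlocation = scrapy.Field()
    author_bio = scrapy.Field()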

We just defined one Scrapy item named QuoteItem with six fields to store the extracted data. If you have designed relational databases before, you may ask: shouldn’t I have two items, QuoteItem and AuthorItem, to represent the data more logically? The answer is yes, you could, but it is not recommended in this case: items are returned by Scrapy in an asynchronous way, so you would have to add extra logic to match each quote item with its corresponding author item. It’s much easier to put the related quote and author in one item here.

Now you can put the extracted data into the Item in the spider file, like so:

from tutorial.items import QuoteItem
...
for quote in quotes:
    quote_item = QuoteItem()
    quote_item['quote_content'] = quote.css('.text::text').get()
    quote_item['tags'] = quote.css('.tag::text').getall()

A preferred way, however, is to use ItemLoader, as follows:

from scrapy.loader import ItemLoader
from tutorial.items import QuoteItem
...

for quote in quotes:
    loader = ItemLoader(item=QuoteItem(), selector=quote)
    loader.add_css('quote_content', '.text::text')
    loader.add_css('tags', '.tag::text')
    quote_item = loader.load_item()

Hmm, the code with ItemLoader seems more complicated, so why would you want to use it? The quick answer: the raw data you get from the CSS selectors may need further parsing. For example, the extracted quote_content has Unicode quotation marks that need to be removed.

The birthday is extracted as a string and needs to be parsed into a Python date:
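A small helper can do this; a sketch, assuming the site formats dates like 'March 14, 1879':

from datetime import datetime

def convert_date(text):
    # e.g. 'March 14, 1879' -> datetime(1879, 3, 14, 0, 0)
    return datetime.strptime(text, '%B %d, %Y')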

The born location string starts with “in”, which needs to be removed:
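Another small helper; a sketch, assuming the raw string always starts with 'in ':

def parse_location(text):
    # e.g. 'in Ulm, Germany' -> 'Ulm, Germany'
    return text[3:]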

ItemLoader enables pre/post processing functions to be specified neatly away from the spider code, and each field of the Item can have a different set of pre/post processing functions for better code reuse.

For example, we can create a function to remove the aforementioned Unicode quotation marks as follows:
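A sketch (the quote text on the site is wrapped in U+201C/U+201D, the Unicode curly double quotation marks):

def remove_quotes(text):
    # strip the leading/trailing Unicode curly quotes
    return text.strip(u'\u201c\u201d')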

MapCompose enables us to apply multiple processing functions to a field (we have only one in this example). ItemLoader returns a list for each field, such as ['death', 'life'] for tags. For the author name, a list is returned even though there is only one value, such as ['Jimi Hendrix']; the TakeFirst processor takes the first value of that list. After adding the additional processors, items.py looks like:
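A sketch of the updated items.py (note: MapCompose and TakeFirst are imported from scrapy.loader.processors in older Scrapy versions; in newer versions they live in itemloaders.processors):

# items.py
from datetime import datetime

import scrapy
from scrapy.loader.processors import MapCompose, TakeFirst

def remove_quotes(text):
    # strip the leading/trailing Unicode curly quotes
    return text.strip(u'\u201c\u201d')

def convert_date(text):
    # e.g. 'March 14, 1879' -> datetime(1879, 3, 14, 0, 0)
    return datetime.strptime(text, '%B %d, %Y')

def parse_location(text):
    # e.g. 'in Ulm, Germany' -> 'Ulm, Germany'
    return text[3:]

class QuoteItem(scrapy.Item):
    quote_content = scrapy.Field(
        input_processor=MapCompose(remove_quotes),
        output_processor=TakeFirst()
    )
    tags = scrapy.Field()  # keep the full list of tags
    author_name = scrapy.Field(
        input_processor=MapCompose(str.strip),
        output_processor=TakeFirst()
    )
    author_birthday = scrapy.Field(
        input_processor=MapCompose(convert_date),
        output_processor=TakeFirst()
    )
    author_bornlocation = scrapy.Field(
        input_processor=MapCompose(parse_location),
        output_processor=TakeFirst()
    )
    author_bio = scrapy.Field(
        input_processor=MapCompose(str.strip),
        output_processor=TakeFirst()
    )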

The key issue right now is that we load the two fields quote_content and tags from the quote page and then issue another request for the corresponding author page to load author_name, author_birthday, author_bornlocation, and author_bio. To do this, we need to pass the item quote_item from one page to the other as metadata, as follows:

yield response.follow(author_url, self.parse_author, meta={'quote_item': quote_item})

And in the author parsing function, you can get the item:

def parse_author(self, response):
    quote_item = response.meta['quote_item']

Now, after adding the Item and ItemLoader, our spider file looks like this:
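A sketch of the complete spider, reusing the selectors from Part I (the author-link selector '.author + a::attr(href)' is my assumption based on the quotes.toscrape.com markup):

# spiders/quotes_spider.py
import scrapy
from scrapy.loader import ItemLoader
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # load the quote fields from the quote page
            loader = ItemLoader(item=QuoteItem(), selector=quote)
            loader.add_css('quote_content', '.text::text')
            loader.add_css('tags', '.tag::text')
            quote_item = loader.load_item()
            # follow the link to the author page,
            # carrying the partially filled item along as metadata
            author_url = quote.css('.author + a::attr(href)').get()
            yield response.follow(author_url, self.parse_author,
                                  meta={'quote_item': quote_item})
        # follow the pagination link, as in Part I
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

    def parse_author(self, response):
        # retrieve the partial item and fill in the author fields
        quote_item = response.meta['quote_item']
        loader = ItemLoader(item=quote_item, response=response)
        loader.add_css('author_name', '.author-title::text')
        loader.add_css('author_birthday', '.author-born-date::text')
        loader.add_css('author_bornlocation', '.author-born-location::text')
        loader.add_css('author_bio', '.author-description::text')
        yield loader.load_item()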

Run the spider with scrapy crawl quotes; in the Scrapy stats dumped to the console at the end of the crawl, you can find the total number of items extracted:

2019-09-11 09:49:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
...
'downloader/request_count': 111,
...
'item_scraped_count': 50,

Congrats! You have finished Part II of this tutorial.

Part I, Part II, Part III, Part IV, Part V
