A Minimalist End-to-End Scrapy Tutorial (Part III)

Systematic Web Scraping for Beginners

Harry Wang
Towards Data Science


Part I, Part II, Part III, Part IV, Part V

In Part II, you extracted all the required data from the website and stored it in Items. In Part III, I will introduce Item Pipelines to save the extracted data into a database using an ORM (SQLAlchemy) and to handle the duplicate-data issue.

Each item returned by the spider is sent through the Item Pipelines (if any) sequentially for additional processing, such as saving items to the database, validating data, and removing duplicates. Item pipelines are defined as classes in the pipelines.py file. Open this autogenerated file and you will see one empty pipeline named "TutorialPipeline":
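For reference, the autogenerated file looks roughly like the sketch below (the exact comments depend on your Scrapy version, but the empty pipeline simply returns the item unchanged):

# pipelines.py (autogenerated by Scrapy)
class TutorialPipeline(object):
    def process_item(self, item, spider):
        # the default pipeline does nothing: it just passes the item through
        return item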

You need to specify which pipelines are enabled and their order in the settings.py file (by default, no pipeline is enabled). To enable the empty pipeline above, uncomment the following part in settings.py:
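The uncommented block should look roughly like this sketch (I assume the project module is named tutorial, as created earlier in the series; adjust the dotted path to match your project):

# settings.py
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}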

The integer values (normally ranging from 0 to 1000), such as the 300 shown above, determine the execution order of the pipelines (pipelines with lower values run first).

Next, let's develop a pipeline to save the items to a database. Here, I use Object-Relational Mapping (ORM) to query and manipulate data from the database using the object-oriented paradigm; in particular, I use SQLAlchemy. I won't cover the details of ORM here; please refer to this article for some pros and cons.

First, let's design the database schema. Note that there are six fields in the item, i.e., quote_content, tags, author_name, author_birthday, author_bornlocation, and bio. I am going to use three tables to store these data, i.e., quote, tag, and author. There is a many-to-many relationship between quote and tag (one quote can have one or more tags and one tag can be associated with one or more quotes) and a one-to-many relationship between author and quote (one author can have one or more quotes, but one quote belongs to only one author).

To define this schema using ORM via SQLAlchemy, you need to:

  • add SQLAlchemy>=1.3.6 to requirements.txt and install the package by running pip install -r requirements.txt in the virtual environment
  • create a models.py with the following content:
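Here is a sketch of what models.py could look like under the schema described above (the column types and lengths are my assumptions; birthday and bornlocation are stored as plain strings since they are scraped as text):

# models.py: the three tables plus the quote/tag association table
from sqlalchemy import create_engine, Column, Table, ForeignKey
from sqlalchemy import Integer, String, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship
from scrapy.utils.project import get_project_settings

Base = declarative_base()


def db_connect():
    # read CONNECTION_STRING from settings.py and create the engine
    return create_engine(get_project_settings().get("CONNECTION_STRING"))


def create_table(engine):
    # create all tables defined below if they do not exist yet
    Base.metadata.create_all(engine)


# association table for the many-to-many relationship between quote and tag
quote_tag = Table(
    "quote_tag",
    Base.metadata,
    Column("quote_id", Integer, ForeignKey("quote.id")),
    Column("tag_id", Integer, ForeignKey("tag.id")),
)


class Quote(Base):
    __tablename__ = "quote"
    id = Column(Integer, primary_key=True)
    quote_content = Column(Text)
    author_id = Column(Integer, ForeignKey("author.id"))  # many quotes, one author
    author = relationship("Author", backref="quotes")
    tags = relationship("Tag", secondary=quote_tag, backref="quotes")


class Author(Base):
    __tablename__ = "author"
    id = Column(Integer, primary_key=True)
    name = Column(String(50))
    birthday = Column(String(50))
    bornlocation = Column(String(150))
    bio = Column(Text)


class Tag(Base):
    __tablename__ = "tag"
    id = Column(Integer, primary_key=True)
    name = Column(String(30))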

The db_connect() function uses create_engine(get_project_settings().get("CONNECTION_STRING")) to connect to a database. CONNECTION_STRING is specified in the settings.py file. You can change the connection string to connect to different database systems, such as SQLite, MySQL, and Postgres, without changing your code. In this tutorial, I use SQLite, which is essentially a local file named scrapy_quotes.db created in the root folder the first time the spider runs.

CONNECTION_STRING = 'sqlite:///scrapy_quotes.db'

I also provide an example to connect to MySQL (commented out):

# MySQL
CONNECTION_STRING = "{drivername}://{user}:{passwd}@{host}:{port}/{db_name}?charset=utf8".format(
    drivername="mysql",
    user="harrywang",
    passwd="tutorial",
    host="localhost",
    port="3306",
    db_name="scrapy_quotes",
)

Now, let's create the pipeline to save items to the database. Open the pipelines.py file and add the following class (pipeline), which I will walk through piece by piece below:

Make sure you also import the required packages and functions:
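A sketch of the imports, assuming the project module is named tutorial and models.py is the sketch shown above:

from sqlalchemy.orm import sessionmaker
from tutorial.models import Quote, Author, Tag, db_connect, create_table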

The init function below uses the functions in models.py to connect to the database (db_connect) and create the tables (create_table) if they do not exist yet (otherwise, this step is skipped).
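A sketch of what this could look like (the class name SaveQuotesPipeline is my assumption; use whatever name you register in settings.py):

class SaveQuotesPipeline(object):
    def __init__(self):
        # connect to the database and create the tables if they do not exist yet
        engine = db_connect()
        create_table(engine)
        # sessionmaker gives us a factory for database sessions
        self.Session = sessionmaker(bind=engine)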

In the process_item function, I first create a database session and instances of the three tables. Then, I assign the author info and quote text values to the corresponding table columns.
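Roughly like this (the item field names are the ones defined in Part II):

    def process_item(self, item, spider):
        session = self.Session()
        quote = Quote()
        author = Author()
        tag = Tag()

        # copy the author info and quote text from the item into the ORM objects
        author.name = item["author_name"]
        author.birthday = item["author_birthday"]
        author.bornlocation = item["author_bornlocation"]
        author.bio = item["bio"]
        quote.quote_content = item["quote_content"]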

Next, we need to check whether the author and tags of the current item already exist in the database, and create a new author/tag only if they don't exist yet:
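A sketch of those existence checks, continuing process_item:

        # reuse the existing author row if this author is already in the database
        exist_author = session.query(Author).filter_by(name=author.name).first()
        quote.author = exist_author if exist_author is not None else author

        # likewise, reuse existing tag rows and only create the missing ones
        for tag_name in item["tags"]:
            tag = Tag(name=tag_name)
            exist_tag = session.query(Tag).filter_by(name=tag_name).first()
            if exist_tag is not None:
                tag = exist_tag
            quote.tags.append(tag)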

In the end, I add the quote to the database:
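For example, wrapped in a try/except so the session is rolled back on failure and always closed:

        try:
            session.add(quote)
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()

        return item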

Note that you don’t need to add author and tag explicitly due to the relationships you specified in ORM (quote.author and quote.tags) — the new author/tags (if any) will be created and inserted automatically by SQLAlchemy.

Now, run the spider with scrapy crawl quotes and you should see a SQLite file named scrapy_quotes.db created. You can open the file and view the extracted content using the SQLite command line:

$ sqlite3 scrapy_quotes.db
...
sqlite> .tables
author quote quote_tag tag
sqlite> select * from quote limit 3;
1|The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.|1
2|Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.|2
3|The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.|3
sqlite> .quit

or browse it with DB Browser for SQLite.

Note that we have 50 quotes extracted. Assume the website may add new quotes over time and you would like to run the spider once a week to collect them. So, let's run scrapy crawl quotes again, and you may notice a problem: we now have 100 quotes in the database, because the same 50 quotes have been extracted and stored again!

Next, let's add another pipeline that checks whether an item is a duplicate and, if so, drops it so that it doesn't go through the rest of the pipelines.

Open the pipelines.py file and add the following class (pipeline):
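A sketch of such a duplicates pipeline (again, the class name DuplicatesPipeline and the tutorial module path are my assumptions):

from scrapy.exceptions import DropItem
from sqlalchemy.orm import sessionmaker
from tutorial.models import Quote, db_connect, create_table


class DuplicatesPipeline(object):
    def __init__(self):
        # set up the database connection so this pipeline can query existing quotes
        engine = db_connect()
        create_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        session = self.Session()
        # look for an existing quote with the same text
        exist_quote = session.query(Quote).filter_by(
            quote_content=item["quote_content"]
        ).first()
        session.close()
        if exist_quote is not None:
            raise DropItem("Duplicate item found: %s" % item["quote_content"])
        return item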

Make sure to import the DropItem exception: from scrapy.exceptions import DropItem. The logic is simple: do a database query to check whether the current item's quote text already exists, and if it does, drop the item. Now, you need to enable this pipeline in settings.py and make sure the duplicates pipeline is executed before the save-to-database pipeline:
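The setting could look like the following sketch; the exact numbers don't matter as long as the duplicates pipeline has the lower value so it runs first:

# settings.py
ITEM_PIPELINES = {
    'tutorial.pipelines.DuplicatesPipeline': 100,
    'tutorial.pipelines.SaveQuotesPipeline': 200,
}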

You can delete the SQLite file first and run the spider a few times; you will see that the database is populated with 50 quotes only the first time. After that, you can see warning messages indicating that the duplicate items have been dropped.

2019-09-12 11:16:04 [scrapy.core.scraper] WARNING: Dropped: Duplicate item found...
2019-09-12 11:16:04 [scrapy.core.engine] INFO: Closing spider (finished)
2019-09-12 11:16:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
...
 'item_dropped_count': 50,
 'item_dropped_reasons_count/DropItem': 50,
...

You have finished Part III!! Cheers. In Part IV, I will show you how to deploy the spider to do periodic crawling and monitoring, e.g., run the spider automatically every 10 minutes.

Part I, Part II, Part III, Part IV, Part V
