Web Scraping with Beautiful Soup, Selenium, or Scrapy?

Find the best scraping tool for your Python project

The PyCoach
Towards Data Science


The internet is full of data you can use to start a data science project. Obtaining that data can be as simple as copying and pasting it, but for large amounts of data, web scraping is usually the better solution. However, if you google “how to web scrape with Python,” you’ll find many tutorials that use different Python libraries and frameworks.

In this guide, we’ll analyze the three most popular web scraping tools in Python so you can choose the one that best suits your project.

Beautiful Soup

Beautiful Soup is a Python library that pulls data out of HTML and XML files. It’s also the easiest of the three options to learn.

However, Beautiful Soup doesn’t work alone. It can’t download pages, so it needs a library such as requests to fetch them, and it relies on a parser to read the markup: either Python’s built-in html.parser or an external one such as lxml. These dependencies can make it harder to transfer code between projects.

Let’s see a simple example for extracting data with Beautiful Soup:
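Here is a minimal sketch of such an example, assuming the requests and beautifulsoup4 packages are installed. The function names, the sample markup, and the h2 "title" class are illustrative choices, not taken from a particular website:

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page and return its HTML (requests handles the HTTP part)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_titles(html: str) -> list[str]:
    """Return the text of every <h2 class="title"> element in the HTML."""
    # "html.parser" is the parser shipped with Python's standard library.
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all("h2", class_="title")]


# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h2 class="title">First post</h2>
  <h2 class="title">Second post</h2>
  <h2 class="other">Sidebar</h2>
</body></html>
"""

if __name__ == "__main__":
    print(extract_titles(SAMPLE_HTML))
```

For a live page, the two functions chain together as extract_titles(fetch_html(url)), with the selector adjusted to the markup of the site you are scraping.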

As we can see, only a few lines of code are needed to extract data with Beautiful Soup, but we also need the requests library to access the website and a parser such as html.parser to parse the content.

Selenium

Selenium wasn’t originally designed for web scraping. It’s a browser automation tool that drives a real web browser through WebDriver, and it was built for automated testing of web applications.

This makes Selenium great for web scraping because many websites rely on JavaScript to create dynamic content on the page, and Selenium sees the page after the browser has run those scripts. Tools like Beautiful Soup don’t execute JavaScript, which limits the data they can extract from such websites.

Selenium isn’t as easy to learn as Beautiful Soup. However, it’s still a friendly tool because it lets your code mimic human behavior, such as clicking a button, selecting from dropdown menus, or maximizing windows.
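To make the limitation concrete, here is a small sketch using a made-up page whose content is filled in by a script. The markup and the "prices" id are hypothetical. Beautiful Soup only parses the HTML text it is given, so the element that JavaScript would populate in a browser stays empty:

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <div> is empty in the raw HTML and would only be
# filled in once a browser runs the script.
HTML = """
<div id="prices"></div>
<script>
  document.getElementById("prices").textContent = "19.99";
</script>
"""

soup = BeautifulSoup(HTML, "html.parser")
prices_div = soup.find(id="prices")

# The parser never executes the script, so the div has no text.
prices_text = prices_div.get_text(strip=True)
print(repr(prices_text))  # ''
```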

Let’s see an example of extracting data with Selenium:

One of the disadvantages of Selenium is speed. Scraping with Selenium is slower than sending HTTP requests directly to the web server because a full browser has to load the page and execute all the scripts on it. However, if speed isn’t a top priority, Selenium is a good option.

Scrapy

Scrapy is a framework built specifically for web scraping and written in Python. It’s built on top of Twisted, an asynchronous networking framework, which lets an application handle many network connections at once without relying on traditional threading models.

One of the biggest advantages of Scrapy is speed. Because it’s asynchronous, a Scrapy spider doesn’t have to wait for one request to finish before sending the next; it can keep many requests in flight concurrently. This makes Scrapy more efficient in memory and CPU usage than the two tools analyzed above.


One drawback of Scrapy is that it doesn’t handle JavaScript by default; it relies on a tool such as Splash to render the page. In addition, Scrapy’s learning curve is steeper than that of tools like Beautiful Soup, and installation and setup can be a bit complicated.


Which one is the best scraping tool?

After analyzing each scraping tool's pros and cons, let’s see which one excels in different scenarios.

Ideal Use Case

When it comes to large-scale projects, Scrapy is the best option because of its architecture and features. Its standard project structure also makes it easier to migrate a project, which benefits large projects.

Beautiful Soup is a better fit for small and simple projects, while Selenium sits between the two: it can extract data from websites that use JavaScript, but its browser-based approach isn’t efficient.

Performance

Scrapy is the fastest of the three because it’s asynchronous and built specifically for web scraping. By contrast, Beautiful Soup and Selenium are inefficient when scraping large amounts of data: a typical Beautiful Soup script fetches pages one at a time, and Selenium has to drive a full browser.
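The effect of not waiting on each request can be sketched with a toy benchmark. This is not Scrapy’s engine: it uses asyncio from the standard library rather than Twisted, and fake_fetch only simulates network latency with a sleep. All names and the 0.1 second latency are assumptions made for illustration:

```python
import asyncio
import time


async def fake_fetch(url: str, latency: float = 0.1) -> str:
    """Pretend to download a page; the sleep stands in for network latency."""
    await asyncio.sleep(latency)
    return f"<html>{url}</html>"


async def fetch_sequentially(urls):
    """Wait for each page before requesting the next one."""
    return [await fake_fetch(url) for url in urls]


async def fetch_concurrently(urls):
    """Start all requests at once and wait for them together."""
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def timed(fetcher, urls):
    """Run a fetcher and return (pages, elapsed seconds)."""
    start = time.perf_counter()
    pages = asyncio.run(fetcher(urls))
    return list(pages), time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    _, sequential_time = timed(fetch_sequentially, urls)
    _, concurrent_time = timed(fetch_concurrently, urls)
    print(f"sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

With five simulated pages, the sequential version takes roughly five times the latency, while the concurrent version takes roughly one latency period, because the waits overlap.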

Ease of use

Beautiful Soup is the easiest option for web scraping. Its simplicity and straightforward approach help beginners learn web scraping fast. Selenium and Scrapy aren’t as simple as Beautiful Soup, but they’re not difficult to learn either.

Summary

Below you’ll find a table with a summary of the functionalities of each web scraping tool we mentioned in this article.

In a nutshell, Scrapy is best for large projects where efficiency and speed are top priorities. Selenium excels at JavaScript-heavy web applications and is a good fit for projects where speed isn’t a priority. Finally, Beautiful Soup is best suited to beginners who want to start with simple web scraping projects.

Image by author
