
Web Scraping using Selenium and BeautifulSoup

How to use Selenium to navigate between pages and scrape HTML that is loaded with JavaScript.

Figure 1: Photo by Maik Jonietz on Unsplash

Selenium is a browser automation tool that can not only be used for testing, but also for many other purposes. In this article, we will use Selenium to navigate between webpages, so we can scrape the data off these pages.

We will scrape the code blocks from my Keras tutorial series, which is available on my website. For this we will navigate to each page, scrape the code blocks and then go back to the main page so we can repeat the process.


Installation

Selenium can be installed by typing:

pip install selenium
or
conda install selenium

A WebDriver for your favorite web browser must also be installed. The Firefox WebDriver (GeckoDriver) can be installed by going to this page and downloading the appropriate file for your operating system. After the download has finished, the file has to be extracted.

Now the file can either be added to PATH or copied into the working directory. I chose to copy it to my working directory because I'm not using it that often.
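
If you keep the driver in the working directory, you can point Selenium at the executable explicitly when creating the session. This is a minimal sketch assuming the extracted file is called geckodriver and sits in the working directory (Selenium 3-style API, matching the rest of this article):

from selenium import webdriver
# Assumed location: geckodriver extracted into the working directory
driver = webdriver.Firefox(executable_path="./geckodriver")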


Inspect the website

Before we can start navigating through webpages and scraping the data, we need to know how we can target the buttons used for navigation as well as how to target the code blocks.

For this we will use the developer tools, which are built into almost every browser.

Figure 2: Inspecting the website

First, we will navigate to the playlist URL to look at what attributes the "Watch" buttons have. As can be seen in the picture on the left, the buttons have two classes: btn and btn-primary.

The code blocks are div tags with the class code-toolbar, and they contain a pre and code tag.

Now that we know what we are searching for we can start getting and scraping the data.


Trying to get HTML using Urllib

For those of you who wonder why we can't just use something like urllib to get the data: some websites (like mine) load the HTML using JavaScript after the document.onload event, and therefore the data isn't available at the point where urllib requests the page.

Nevertheless, we will try it so we can see this clearly:

from bs4 import BeautifulSoup
import urllib.request
videos_url = "https://programmingwithgilbert.firebaseapp.com/videos/keras-tutorials"
page = urllib.request.urlopen(videos_url)
soup = BeautifulSoup(page, 'html.parser')
print(soup)

This outputs:

<!DOCTYPE doctype html>
<html lang="en"><head><meta charset="utf-8"/><title>Gilbert Tanner</title><base href="/"/><meta content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=1" name="viewport"><link href="favicon.ico" rel="icon" type="image/x-icon"/><link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.15.0/themes/prism-okaidia.css" rel="stylesheet"/><link crossorigin="anonymous" href="https://use.fontawesome.com/releases/v5.0.13/css/all.css" integrity="sha384-DNOHZ68U8hZfKXOrtjWvjxusGo9WQnrNx2sqG0tfsghAvtVlRW3tvkXWZh58N9jp" rel="stylesheet"/>...</meta></head><body><app-root></app-root><script src="inline.318b50c57b4eba3d437b.bundle.js" type="text/javascript"></script><script src="polyfills.fa62713060e7012f88ea.bundle.js" type="text/javascript"></script><script src="main.5cfcfbc297850bba435f.bundle.js" type="text/javascript"></script></body></html>

We only get the head tag with all the external links, as well as the body tag, which contains nothing but a few script tags.
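
To confirm that the code blocks really aren't part of this response, we can search the soup for them directly. Since the JavaScript never ran, the search should come back empty:

# No code-toolbar divs exist in the server-rendered HTML
print(soup.find_all('div', attrs={'class': 'code-toolbar'}))  # []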


Scrape data using Selenium

Selenium controls an actual browser, so we can make it wait until the page has finished loading before we get the data.

First, we will import the libraries needed for scraping and processing the web data. We will also define the URL of the website we want to scrape the data from.

# imports
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import re
import os
# website urls
base_url = "https://programmingwithgilbert.firebaseapp.com/"
videos_url = "https://programmingwithgilbert.firebaseapp.com/videos/keras-tutorials"

Next, we will use Selenium's webdriver to open a new browser window.

# Firefox session
driver = webdriver.Firefox()
driver.get(videos_url)
driver.implicitly_wait(100)
Figure 3: Browser Window

Now that we have the browser window, we need to get the buttons so we can navigate to each individual page. Selenium provides several methods for locating elements, including find_elements_by_link_text, find_elements_by_class_name and find_element_by_id.
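
As a small sketch, here is what these locators look like with the driver opened above and the elements we inspected earlier:

# All links whose visible text is "Watch"
watch_links = driver.find_elements_by_link_text('Watch')
# All elements that have the class btn-primary
watch_buttons = driver.find_elements_by_class_name('btn-primary')
# find_element_by_id returns a single element, e.g. the iframe container
# that only exists on the individual tutorial pages:
# iframe = driver.find_element_by_id('iframe_container')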

First of all, we want to know how many individual pages there are. This can be found out by counting the occurrences of the "Watch" buttons.

num_links = len(driver.find_elements_by_link_text('Watch'))

To navigate to the i-th page, the i-th button can be clicked.

for i in range(num_links):
    # navigate to link
    button = driver.find_elements_by_class_name("btn-primary")[i]
    button.click()

To ensure the page has finished loading before we start scraping, we will use the WebDriverWait method and wait until the iframe, embedded in every single page, has loaded.

for i in range(num_links):
    # navigate to link
    button = driver.find_elements_by_class_name("btn-primary")[i]
    button.click()
    element = WebDriverWait(driver, 10).until(lambda x: x.find_element_by_id('iframe_container'))
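
The same wait can also be written with Selenium's built-in expected_conditions helpers instead of a lambda; a minimal alternative sketch:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
# Wait up to 10 seconds until the iframe container is present in the DOM
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'iframe_container'))
)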

We can get the HTML by calling driver.page_source, and then use find_all to find all divs with the class code-toolbar.

code_blocks = []
for i in range(num_links):
    # navigate to link
    button = driver.find_elements_by_class_name("btn-primary")[i]
    button.click()
    # get soup
    element = WebDriverWait(driver, 10).until(lambda x: x.find_element_by_id('iframe_container'))
    tutorial_soup = BeautifulSoup(driver.page_source, 'html.parser')
    tutorial_code_soup = tutorial_soup.find_all('div', attrs={'class': 'code-toolbar'})

The last step in getting the data is to extract the inner text of each code block and add it to an array. We also need to make sure we go back to the main page after we are finished with each page.

num_links = len(driver.find_elements_by_link_text('Watch'))
code_blocks = []
for i in range(num_links):
    # navigate to link
    button = driver.find_elements_by_class_name("btn-primary")[i]
    button.click()
    # get soup
    element = WebDriverWait(driver, 10).until(lambda x: x.find_element_by_id('iframe_container'))
    tutorial_soup = BeautifulSoup(driver.page_source, 'html.parser')
    tutorial_code_soup = tutorial_soup.find_all('div', attrs={'class': 'code-toolbar'})
    tutorial_code = [i.getText() for i in tutorial_code_soup]
    code_blocks.append(tutorial_code)
    # go back to initial page
    driver.execute_script("window.history.go(-1)")
print(code_blocks)
Figure 4: Scraping data
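
As a side note, instead of executing window.history.go(-1) via JavaScript, the same navigation can be done with Selenium's built-in back method:

# Equivalent to executing window.history.go(-1)
driver.back()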

This outputs an array of arrays containing all the code of my Keras tutorials.

[['import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nfrom keras.datasets import mnist\nfrom keras.utils import to_categorical ',
  'def getData(): Copy',
  'def getData():\n    (X_train, y_train), (X_test, y_test) = mnist.load_data()\n    img_rows, img_cols = 28, 28     ',
  '    y_train = to_categorical(y_train, num_classes=10)\n    y_test = to_categorical(y_test, num_classes=10) Copy',
  '    X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)\n    X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1) ',...

Lastly, you should always close the browser instance.

driver.quit()
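
To make sure the browser is closed even if something goes wrong mid-scrape, the whole session can be wrapped in a try/finally block; a minimal sketch of that pattern:

driver = webdriver.Firefox()
try:
    driver.get(videos_url)
    driver.implicitly_wait(100)
    # ... scraping loop from above ...
finally:
    # The browser is closed no matter what happened above
    driver.quit()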

Save data

Now that we have the data stored in an array, we can save it to disk. We will save the code from each tutorial in a separate .txt file.

for i, tutorial_code in enumerate(code_blocks):
    with open('code_blocks{}.txt'.format(i), 'w') as f:
        for code_block in tutorial_code:
            f.write(code_block + "\n")

Conclusion

Selenium is a browser automation tool which can be used for many purposes, including testing and web scraping.

It can be used on its own, or in combination with another scraping library like BeautifulSoup.

If you liked this article, consider subscribing to my YouTube channel and following me on social media.

The code covered in this article is available as a GitHub repository.

If you have any questions, recommendations or critiques, I can be reached via Twitter or the comment section.

