
Selenium is a browser automation tool that can not only be used for testing, but also for many other purposes. In this article, we will use Selenium to navigate between webpages so we can scrape the data off those pages.
We will scrape the code blocks from my Keras tutorial series, which is available on my website. For this we will navigate to each page, scrape its code blocks and then go back to the main page so we can repeat the process.
Installation
Selenium can be installed by typing:
pip install selenium
or
conda install selenium
A WebDriver for your favorite web browser must also be installed. The Firefox WebDriver (GeckoDriver) can be installed by going to this page and downloading the appropriate file for your operating system. After the download has finished, the file has to be extracted.
Now the file can either be added to PATH or copied into the working directory. I chose to copy it into my working directory because I'm not using it that often.
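If you copy GeckoDriver into the working directory like I did, you can point Selenium at it explicitly. This is just a small sketch for Selenium 3.x, which still accepts the executable_path argument:
from selenium import webdriver

# use the GeckoDriver binary from the current working directory
# (on Windows the file is called geckodriver.exe)
driver = webdriver.Firefox(executable_path="./geckodriver")
driver.quit()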
Inspect the website
Before we can start navigating through webpages and scraping the data, we need to know how we can target the buttons used for navigation as well as how to target the code blocks.
For this we will use the developer tools, which are built-in in almost every browser.


First we will navigate to the playlist URL to look at what attributes the "Watch" buttons have. As can be seen in the picture above, the buttons have two classes: btn and btn-primary.
The code blocks are div tags with the class code-toolbar, and they contain a pre and a code tag.
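As a small illustration of that structure (the markup below is a simplified, assumed sketch, not the site's exact HTML), this is what we will later extract with BeautifulSoup:
from bs4 import BeautifulSoup

# assumed, simplified markup mirroring the structure described above
html = '<div class="code-toolbar"><pre><code>print("Hello Keras")</code></pre></div>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('div', attrs={'class': 'code-toolbar'}).getText())  # -> print("Hello Keras")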
Now that we know what we are searching for we can start getting and scraping the data.
Trying to get HTML using Urllib
For those of you that wonder why we can't just use something like urllib to get the data: some websites (like mine) render their HTML with JavaScript after the initial document has loaded, and therefore the data isn't available at the point urllib requests the page.
Nevertheless we will try it so we can see it clearly:
from bs4 import BeautifulSoup
import urllib.request

videos_url = "https://programmingwithgilbert.firebaseapp.com/videos/keras-tutorials"

# request the page and parse the returned HTML
page = urllib.request.urlopen(videos_url)
soup = BeautifulSoup(page, 'html.parser')
print(soup)
This outputs:
<!DOCTYPE doctype html>
<html lang="en"><head><meta charset="utf-8"/><title>Gilbert Tanner</title><base href="/"/><meta content="width=device-width,initial-scale=1,minimum-scale=1,maximum-scale=1" name="viewport"><link href="favicon.ico" rel="icon" type="image/x-icon"/><link href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.15.0/themes/prism-okaidia.css" rel="stylesheet"/><link crossorigin="anonymous" href="https://use.fontawesome.com/releases/v5.0.13/css/all.css" integrity="sha384-DNOHZ68U8hZfKXOrtjWvjxusGo9WQnrNx2sqG0tfsghAvtVlRW3tvkXWZh58N9jp" rel="stylesheet"/>...</meta></head><body><app-root></app-root><script src="inline.318b50c57b4eba3d437b.bundle.js" type="text/javascript"></script><script src="polyfills.fa62713060e7012f88ea.bundle.js" type="text/javascript"></script><script src="main.5cfcfbc297850bba435f.bundle.js" type="text/javascript"></script></body></html>
We only get the head tag with all the external links, as well as the body tag with only a few script tags in it.
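Searching this statically fetched soup for the code blocks confirms the problem; since they are rendered by JavaScript, the lookup comes back empty:
# the JavaScript-rendered code blocks are missing from the static HTML
print(soup.find_all('div', attrs={'class': 'code-toolbar'}))  # -> []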
Scrape data using Selenium
Selenium is able to simulate the browser, so we can make it wait until the page has finished loading before we get the data.
First we will import the libraries needed for scraping and processing the web data. We will also define the URLs of the pages we want to scrape the data from.
# imports
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import re
import os
# website urls
base_url = "https://programmingwithgilbert.firebaseapp.com/"
videos_url = "https://programmingwithgilbert.firebaseapp.com/videos/keras-tutorials"
Next we will use Selenium's webdriver to open a new browser window.
# Firefox session
driver = webdriver.Firefox()
driver.get(videos_url)
driver.implicitly_wait(100)  # wait up to 100 seconds when locating elements
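Optionally, Firefox can also be started headless so no visible window pops up while scraping. A small sketch using FirefoxOptions:
# optional: run Firefox without a visible window
options = webdriver.FirefoxOptions()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
driver.get(videos_url)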

Now that we have the browser window we need to get the buttons, so we can navigate to each individual page. Selenium offers several methods for locating elements, including find_elements_by_link_text, find_elements_by_class_name as well as find_element_by_id.
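As a quick illustration of the difference (a sketch using the class we found above): the plural find_elements_* methods return a list of all matches, while the singular find_element_* methods return only the first match and raise an exception if nothing is found.
from selenium.common.exceptions import NoSuchElementException

# plural: list of all matching elements (may be empty)
primary_buttons = driver.find_elements_by_class_name("btn-primary")

# singular: first match only; raises NoSuchElementException if nothing matches
try:
    first_button = driver.find_element_by_class_name("btn-primary")
except NoSuchElementException:
    first_button = None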
First off, we want to know how many individual pages there are. This can be found out by counting the occurrences of the "Watch" buttons.
num_links = len(driver.find_elements_by_link_text('Watch'))
To navigate to the i-th page, the i-th button can be clicked.
for i in range(num_links):
    # navigate to link
    button = driver.find_elements_by_class_name("btn-primary")[i]
    button.click()
To ensure the page has finished loading before we start the scraping process, we will use the WebDriverWait method and wait until the iframe that is embedded into every single page has loaded.
for i in range(num_links):
    # navigate to link
    button = driver.find_elements_by_class_name("btn-primary")[i]
    button.click()
    element = WebDriverWait(driver, 10).until(lambda x: x.find_element_by_id('iframe_container'))
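As a side note, the same wait can also be written with Selenium's built-in expected_conditions helpers instead of a lambda:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# equivalent wait: block until the iframe container is present in the DOM
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "iframe_container"))
)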
We can get the HTML by calling driver.page_source, and then use find_all to find all divs with the class code-toolbar.
code_blocks = []
for i in range(num_links):
    # navigate to link
    button = driver.find_elements_by_class_name("btn-primary")[i]
    button.click()
    # get soup
    element = WebDriverWait(driver, 10).until(lambda x: x.find_element_by_id('iframe_container'))
    tutorial_soup = BeautifulSoup(driver.page_source, 'html.parser')
    tutorial_code_soup = tutorial_soup.find_all('div', attrs={'class': 'code-toolbar'})
The last step in getting the data is to get the inner text of each code block and add it to an array. Then we need to make sure we go back to the main page after we are finished with a page.
num_links = len(driver.find_elements_by_link_text('Watch'))
code_blocks = []
for i in range(num_links):
    # navigate to link
    button = driver.find_elements_by_class_name("btn-primary")[i]
    button.click()
    # get soup
    element = WebDriverWait(driver, 10).until(lambda x: x.find_element_by_id('iframe_container'))
    tutorial_soup = BeautifulSoup(driver.page_source, 'html.parser')
    tutorial_code_soup = tutorial_soup.find_all('div', attrs={'class': 'code-toolbar'})
    tutorial_code = [block.getText() for block in tutorial_code_soup]
    code_blocks.append(tutorial_code)
    # go back to initial page
    driver.execute_script("window.history.go(-1)")
print(code_blocks)

This outputs an array of arrays containing all the code of my Keras tutorials.
[['import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nfrom keras.datasets import mnist\nfrom keras.utils import to_categorical ',
'def getData(): Copy',
'def getData():\n (X_train, y_train), (X_test, y_test) = mnist.load_data()\n img_rows, img_cols = 28, 28 ',
' y_train = to_categorical(y_train, num_classes=10)\n y_test = to_categorical(y_test, num_classes=10) Copy',
' X_train = X_train.reshape(X_train.shape[0], img_rows, img_cols, 1)\n X_test = X_test.reshape(X_test.shape[0], img_rows, img_cols, 1) ',...
Lastly, you should always close the browser instance.
driver.quit()
Save data
Now that we have the data stored in an array, we can save it to disk. We will save the code from each tutorial in a separate .txt file.
for i, tutorial_code in enumerate(code_blocks):
    with open('code_blocks{}.txt'.format(i), 'w') as f:
        for code_block in tutorial_code:
            f.write(code_block + "\n")
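To verify that everything was written correctly, the files can simply be read back in:
# quick sanity check: read the saved files back in
for i in range(len(code_blocks)):
    with open('code_blocks{}.txt'.format(i)) as f:
        print(f.read())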
Conclusion
Selenium is a browser automation tool, which can be used for many purposes including testing and web scraping.
It can be used on its own, or in combination with another scraping library like BeautifulSoup.
If you liked this article, consider subscribing to my YouTube channel and following me on social media.
The code covered in this article is available as a GitHub Repository.
If you have any questions, recommendations or critiques, I can be reached via Twitter or the comment section.