What Makes a Song Great

What Makes a Song Great? Part 1

Web scraping dynamically generated content in Python with Selenium and BeautifulSoup

Bernardino Sassoli
Towards Data Science
7 min read · Jul 23, 2020


[This is the first in a series of three articles]

Web scraping, visualization, list comprehensions, regular expressions, pandas! This project has it all — including pickles!

[UPDATE Sept. 1st, 2020: since the first publication of this article, Rolling Stone has changed the source code to its page. As a result I have updated the code to allow the retrieval of dynamically generated contents.]

The standard advice once you have learned some coding and basic data science skills is to do projects, a lot of them. Unfortunately, many of us have a hard time finding just such projects.

A few days ago, I was going through Rolling Stone’s list of The 500 Greatest Songs of All Time. I began to ask myself: ‘Who has the most songs in the list?’, or: ‘Will the list be skewed towards some decade far in the past, given that the critics were probably not exactly in their 20s?’. I had long been looking for a project that would let me combine some web scraping techniques with some Exploratory Data Analysis (EDA): the day after, I started this one. I had loads of fun doing it, and I hope that by sharing it you too will learn something and have fun.

What I will showcase here goes through many libraries, tools, and skill-sets: it’s an end-to-end project that starts with data retrieval and ends with visualizations, touching upon parsing, cleaning, and analyzing data. Some of the things we will be touching on:

  • web scraping (using BeautifulSoup and Selenium)
  • regular expressions (using Python’s re module)
  • APIs (namely Spotify’s) (using spotipy)
  • data analysis and visualization (with pandas and matplotlib).

I would consider this project suitable for advanced beginners to intermediate-level coders. Nothing in it is complex per se, but it does involve many diverse areas. Note that some very basic knowledge of HTML and CSS might be useful for the first part.

Web scraping: getting the data and cleaning it

First, let’s import the libraries we need.

# webscraping libraries
import urllib # to retrieve web pages
from bs4 import BeautifulSoup # to parse web pages
from selenium import webdriver # to retrieve dynamically generated content
import time # allows us to wait before scraping or interacting with web pages
# data, cleaning, analysis and visualization
import pandas as pd # the go-to python library for data cleaning and analysis
import matplotlib.pyplot as plt # the go-to python library for data visualization
import seaborn as sns # data visualization (but nicer!)
import re # Python's library for regular expressions (see more below)
# to interact with Spotify's API
import spotipy # to query Spotify's API
from spotipy.oauth2 import SpotifyClientCredentials # for API login
# display charts in jupyter notebook
%matplotlib inline

Step 1: Scraping dynamically generated content with Selenium and Beautiful Soup

Open your browser and navigate to our list. First of all, note that the list is ‘batched’ in groups of 50. This tells us we will probably need to iterate over different addresses to get our data once we start scraping — more on that later.

Scrolling down we find our songs. There are five categories of data we want to retrieve:

  • Artist
  • Song title
  • Writer(s)
  • Producer(s)
  • Release date

If you ctrl-click (on a Mac) on an element of the page and select Inspect from the menu that pops up, you will see the corresponding HTML highlighted. In this case, to get the artist and the song's title we'll need to look for an element with tag <h2> belonging to the class c-gallery-vertical-album__title.

Inspecting the web elements

Ordinarily, we'd retrieve the page using urllib and pass the results to BeautifulSoup with the parameter html.parser: BeautifulSoup would then parse the HTML we retrieved and allow us to find the elements we identified with the find_all method.
However, it turns out that the page is generated dynamically (if you look at the source code you won't find any of those elements). So first we will use Selenium to simulate a browser opening the page, and only then retrieve the contents.
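To see the problem concretely, here is a minimal sketch of the static approach (whether the site answers a plain urllib request at all is an assumption; some sites refuse requests without a browser-like User-Agent header):

import urllib.request
from bs4 import BeautifulSoup

url = "https://www.rollingstone.com/music/music-lists/500-greatest-songs-of-all-time-151127/smokey-robinson-and-the-miracles-shop-around-71184/"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
# The songs are injected by JavaScript after the page loads,
# so the static HTML contains none of the elements we want:
print(soup.find_all('h2', {'class': 'c-gallery-vertical-album__title'})) # -> []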

def get_page_source(url):
    """
    Input: a str (the target url)
    Returns: a Beautiful Soup object (the parsed page source)
    -----------------------------------------------------------------------
    Retrieves the target page's contents and passes them to Beautiful Soup.
    -----------------------------------------------------------------------
    """
    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    time.sleep(10) # sleep for 10s to allow the page to load
    page_source = driver.page_source # get the contents
    soup = BeautifulSoup(page_source, 'html.parser')
    driver.quit() # close the driver
    return soup

target_url = "https://www.rollingstone.com/music/music-lists/500-greatest-songs-of-all-time-151127/smokey-robinson-and-the-miracles-shop-around-71184/"
soup = get_page_source(target_url) # pass the HTML contents to Beautiful Soup for parsing

Now using Beautiful Soup’s find_all method, and passing the appropriate CSS class, we can retrieve all the items we targeted.

song_all = soup.find_all('h2', {'class':'c-gallery-vertical-album__title'})
song_all[0]
<h2 class="c-gallery-vertical-album__title">Smokey Robinson and the Miracles, 'Shop Around'</h2>

Ok, we are onto something. Let’s dig further.

Step 2: Cleaning with Regex

The list’s items contain not just the data we are looking for, but also the HTML tags. To only get the data we use get_text() as follows:

song_all[0].get_text()
"Smokey Robinson and the Miracles, 'Shop Around'"

Darn. A lot of formatting, whitespace, and extra quotes. I chose to use regular expressions (regex) via the re module to clean the data.

Regex is a powerful mini-language that proves invaluable for finding patterns of characters in strings. It has a pretty steep learning curve, so I try to use it as often as possible: as they say, practice, practice, practice!
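As a tiny warm-up (a toy example, not part of the scraper), re.sub replaces every match of a pattern with a replacement string:

import re

re.sub("[’‘]", "", "‘Shop Around’") # remove curly quotes -> 'Shop Around'
re.sub(r"\s+", " ", "Shop   Around") # collapse runs of whitespace -> 'Shop Around'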

def strip_extra_chars(line):
    """
    Strips non-alphanumeric, whitespace and newline characters away from a string
    """
    line = line.get_text()
    line = re.sub(r"\A\S[^a-zA-Z0-9]", "", line) # remove a leading non-whitespace, non-alphanumeric pair from the start of the line
    line = re.sub("[’‘]", "", line).strip() # get rid of extra (curly) quotes and remove whitespace with .strip()
    line = re.sub("\n", "", line) # get rid of newlines
    return line.strip()

The function strip_extra_chars will take a line from our data and get rid of all the garbage, including those pesky quotes. See the comments in the function for more details; an online regex playground is a great way to test patterns like these and learn regex.
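A quick check on the first element we retrieved earlier (the output shown is what the function's logic should produce; note that it leaves the straight quotes around the title alone, which is why we slice them off in Step 3):

strip_extra_chars(song_all[0])
# "Smokey Robinson and the Miracles, 'Shop Around'"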

Step 3: Iterate over the list

We are almost done. Remember how in the beginning we noticed that our web page only contains 50 songs, and hence we need to iterate over the contents? If we inspect the page again by ctrl-clicking on the navigation menu at the top we find that it points to the URLs we'll need. Here we define a get_links function that will store the URLs in a list.

Then we can easily iterate over the list and retrieve all the data.

def get_links(url):
    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    options.add_argument('--headless')
    options.add_argument('--start-maximized')
    driver = webdriver.Chrome(options=options)
    # note that in order to get all the urls we need to set the browser's window to max size
    driver.set_window_size(1024, 768)
    driver.get(url)
    time.sleep(10) # sleep for 10s to allow the page to load
    header = driver.find_element_by_id('pmc-gallery-list-nav-bar-render')
    links = header.find_elements_by_tag_name('a')
    urls = [link.get_attribute("href") for link in links]
    driver.quit()
    return urls

urls = get_links(target_url)
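A quick sanity check: with 500 songs in batches of 50, we expect ten links (whether the nav bar also lists the current page is worth verifying by eyeballing the output):

print(len(urls)) # expect 10, one per batch of 50 songs
print(urls[0]) # inspect the first link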

Now that we have our URLs, the next step is storing our data in lists. For each URL we will launch our get_page_source function, extract the relevant data, and store it. We do that by storing each page's data in two lists: song_all, which contains the artist and the song's title, and other_all, with data on the writers, producers, and release date. Iterating over them, we extract the relevant text, strip it of extra characters and formatting by calling our strip_extra_chars function, and append the result to three empty lists we initialize first: one for the artists, one for the titles, and one for the other information (which we will parse right next).

songs = []
artists = []
other_data = []
for url in urls:
    print(f"retrieving data from {url}")
    soup = get_page_source(url)
    song_all = soup.find_all('h2', {'class': 'c-gallery-vertical-album__title'})
    other_all = soup.find_all(attrs={'class': 'c-gallery-vertical-album__description'})
    for song in song_all:
        song = strip_extra_chars(song)
        title = song.split(',')[1]
        title_inner = title[2:len(title)-1] # drop the leading space and surrounding quotes
        songs.append(title_inner)
        artists.append(song.split(',')[0])
    for other in other_all:
        other = strip_extra_chars(other)
        other_data.append(other)
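One caveat worth flagging: song.split(',')[1] truncates any title that itself contains a comma (think of a song like ‘Hello, Goodbye’). Splitting at most once keeps everything after the first comma together (a more defensive variant, not the code above):

artist, title = song.split(',', 1) # split only on the first comma
title_inner = title.strip()[1:-1] # drop the leading space and the surrounding quotes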

Resorting to regex and a little string slicing will clean and split the data contained in other_data. We do this in a split_others function, which we then call in a loop to yield three lists of writers, producers, and release dates.

def split_others(line):
    x = "(Writers:|Writer:)"
    y = "(Producer:|Producers:)"
    z = "Released:"
    a = re.split(x, line)
    a = re.split(y, a[2])
    writers = a[0].strip()
    b = re.split(z, a[2])
    producers = b[0]
    released = b[1].split(',')[0]
    return writers, producers, released

writers = []
producer = []
release_date = []
for item in other_data:
    w, p, r = split_others(item)
    writers.append(w.strip())
    producer.append(p.strip())
    release_date.append(r[-2:].strip()) # keep just the two-digit year
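To make the parsing concrete, here is split_others run on a made-up line that follows the page's format (the credits are for illustration only):

sample = "Writers: Berry Gordy, Smokey Robinson Producer: Berry Gordy Released: Sept. 1960, Tamla"
w, p, r = split_others(sample)
# w -> 'Berry Gordy, Smokey Robinson'
# p -> ' Berry Gordy ' (stripped in the loop above)
# r -> ' Sept. 1960' (the loop keeps only the last two characters: '60')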

Step 4: Putting it all together!

That’s it! We have retrieved our data. We can now use a dictionary and pass our data to pandas in order to store it in a dataframe.

d = {'Artist':artists, 'Song title': songs, 'Writers': writers, 'Producer': producer, 'Year': release_date} 
df = pd.DataFrame(data=d)
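Two quick sanity checks before moving on:

df.shape # should be (500, 5): one row per song, five columns
df.head() # eyeball the first few rows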

(OPTIONAL: because scraping is quite time-consuming, I am going to store the data we retrieved in a pickle. To load it back, you need to uncomment the section in triple quotes.)


import pickle

filename = 'ROLLING_STONE_DATA'
outfile = open(filename,'wb')
pickle.dump(d,outfile)
outfile.close()
'''
filename = 'ROLLING_STONE_DATA'
infile = open(filename,'rb')
d = pickle.load(infile)
infile.close()
'''
df = pd.DataFrame(data=d)
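As an aside, pandas can also pickle the dataframe directly, skipping the dictionary round-trip (an alternative, not what the code above does):

df.to_pickle('ROLLING_STONE_DATA.pkl') # save the dataframe
df = pd.read_pickle('ROLLING_STONE_DATA.pkl') # load it back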

Since we are all curious animals, let’s have a little taste and see which artists have the most songs, plotting the results with matplotlib.

top_10 = df['Artist'].value_counts()[:10]
plt.barh(top_10.index, top_10)

[Bar chart: the ten artists with the most songs in the list]
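For a slightly more readable chart, we can sort the bars and label the axes (purely cosmetic choices on top of the same data):

top_10 = df['Artist'].value_counts()[:10]
fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(top_10.index[::-1], top_10[::-1]) # reverse so the biggest bar sits on top
ax.set_xlabel('Number of songs in the list')
ax.set_title('Top 10 artists by number of songs')
plt.tight_layout()
plt.show()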
