Learn To Create Your Own Datasets — Web Scraping in R

Deepal Dsilva
Tech & Career Nuggets
3 min read · May 4, 2018

An rvest tutorial

I was recently looking for a dataset to perform sentiment analysis on popular pop song lyrics. I went through a lot of sites providing free datasets but didn’t find any that met my need. This prompted me to create my own dataset.

If you can’t find a way, create one.

Photo by Farzad Nazifi on Unsplash

What is Web Scraping?

Websites hold a lot of data, but they don’t always give you a way to download it. Web scraping is the process of extracting unstructured data from websites and converting it into a structured format so that you can perform further analysis on it.

What is the rvest package?

rvest is an R package created by Hadley Wickham to scrape information from web pages. It is mainly inspired by the popular Python library Beautiful Soup.

In this example I followed a two-part process to get the lyrics of the most popular songs from the top 10 artists:

  1. I used rvest to extract the Top 10 Pop Artists of All Time from billboard.com.
  2. Then I used these artists to extract their popular songs and lyrics from genius.com.

Follow along for a step-by-step walk-through

First, of course, you will need to install (if you haven’t already) and load the following packages.

library(tidyverse)
library(rvest)

PART 1 : Extracting the Top 10 Pop Artists of All Time — Source: www.billboard.com

  • Identify the URL from which you want to extract data. Then use the read_html() function to create an HTML document from the URL.
  • Next you need to identify the CSS selector that points to the data you want to extract. It helps to have a little knowledge of HTML and CSS. Otherwise, you can use the handy Chrome extension SelectorGadget to find the CSS selector. The easiest way I found is to right-click on any page element in Chrome and select Inspect.
  • You can then use the html_nodes() function with the CSS selector to extract the data you want.
  • Then all you need is to save your results into a data frame. I’ve used tibbles here to store the data because they are a little easier to work with than data frames. You can read more about the difference between data frames and tibbles here.
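
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code used for the article: the HTML snippet, the class names, and the selector ".chart-row .artist-name" are made up so the example runs without a network connection. On the live page you would pass the page URL to read_html() and use whatever selector SelectorGadget or Inspect shows you.

```r
library(rvest)
library(tibble)

# Hypothetical markup standing in for the chart page. The real page's
# structure and class names will differ.
chart_html <- '
<div class="chart">
  <div class="chart-row"><span class="artist-name"> Artist One </span></div>
  <div class="chart-row"><span class="artist-name"> Artist Two </span></div>
  <div class="chart-row"><span class="artist-name"> Artist Three </span></div>
</div>'

# Step 1: create an HTML document. For a live site, pass the URL string
# to read_html() instead of the inline HTML used here.
page <- read_html(chart_html)

# Steps 2 and 3: select the nodes with a CSS selector (assumed here)
# and pull out their text, trimming surrounding whitespace.
artists <- page %>%
  html_nodes(".chart-row .artist-name") %>%
  html_text(trim = TRUE)

# Step 4: store the result in a tibble, one row per artist.
top_artists <- tibble(rank = seq_along(artists), artist = artists)
```

Printing top_artists should show one row per artist with its position in the list. If html_nodes() returns nothing on a real page, the selector is usually the first thing to re-check.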

PART 2 : Extracting Popular Songs and Lyrics of the top 10 Artists — Source: www.genius.com

Now that you have the Top 10 Pop Artists, you can use the genius.com website to identify the most popular songs and extract their lyrics.

  • First identify the URL of the artist’s webpage. A little bit of research on the website showed that all the artist pages follow this format: https://genius.com/artists/<artistname>
  • Then use a nested for loop to extract the songs and their lyrics. Here again, I used SelectorGadget to identify the right CSS selector for the job.
  • Next store the results into a tibble.
  • And finally, don’t forget to include a random sleep interval between iterations of the loop so that you don’t get booted from the website.
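
Here is one way the loop described above might look. Treat it as a sketch under a few assumptions of mine: the helper names artist_url() and scrape_lyrics() are hypothetical, spaces in an artist name are assumed to become hyphens in the URL, song links are assumed to be absolute URLs, and both CSS selectors are placeholders that you would replace with the ones you find using SelectorGadget. The read_page argument exists only so the function can be tried with a stub instead of hitting the website.

```r
library(rvest)
library(tibble)
library(dplyr)

# Build an artist page URL in the format https://genius.com/artists/<artistname>.
# Assumption: spaces in the name become hyphens. Verify against real URLs.
artist_url <- function(artist) {
  paste0("https://genius.com/artists/", gsub(" ", "-", artist, fixed = TRUE))
}

# Outer loop over artists, inner loop over each artist's song links.
# song_selector and lyrics_selector are placeholders for the real selectors.
scrape_lyrics <- function(artists, song_selector, lyrics_selector,
                          read_page = read_html, max_sleep = 3) {
  results <- list()
  for (artist in artists) {
    artist_page <- read_page(artist_url(artist))
    song_nodes  <- html_nodes(artist_page, song_selector)
    song_titles <- html_text(song_nodes, trim = TRUE)
    song_links  <- html_attr(song_nodes, "href")  # assumed to be absolute URLs

    for (i in seq_along(song_links)) {
      lyrics <- read_page(song_links[i]) %>%
        html_nodes(lyrics_selector) %>%
        html_text(trim = TRUE) %>%
        paste(collapse = "\n")

      # One row per song
      results[[length(results) + 1]] <- tibble(
        artist = artist, song = song_titles[i], lyrics = lyrics
      )

      # Random pause between requests so the site is not hammered
      Sys.sleep(runif(1, min = 0, max = max_sleep))
    }
  }
  # Combine the per-song rows into a single tibble
  bind_rows(results)
}
```

A call would then look something like artist_lyrics <- scrape_lyrics(c("Artist One", "Artist Two"), song_selector = "...", lyrics_selector = "..."), with the selectors filled in for the real pages. Collecting rows in a list and binding them once at the end avoids growing a tibble inside the loop.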

And here’s a snapshot of what your dataset will look like.

artist_lyrics tibble

That’s all you need to know to create your own dataset, which gives you endless possibilities to experiment with the data you want.

Hope you enjoyed this tutorial and are now inspired to create your very own dataset.

Thanks for reading!

Follow me on Instagram for my weekly learning progress and the study resources I use.

Demo Engineer at Salesforce | Data Analyst | Always learning!