Learn To Create Your Own Datasets — Web Scraping in R

Published in Tech & Career Nuggets

3 min read · May 4, 2018

An rvest tutorial

I was recently looking for a dataset to perform sentiment analysis on popular pop song lyrics. I went through a lot of sites providing free datasets but didn’t find any that met my need. This prompted me to create my own dataset.

If you can’t find a way, create one.

What is Web Scraping?

Websites hold a lot of data, but they do not always offer a way to download it. Web scraping is the process of extracting unstructured data from websites into a structured format so that you can perform further analysis on it.

What is the rvest package?

rvest is an R package created by Hadley Wickham to scrape information from web pages. It is mainly inspired by the popular Python library Beautiful Soup.

In this example I followed a two-part process to get the lyrics of the most popular songs from the top 10 artists:

I used rvest to extract the Top 10 Pop Artists of All Time from billboard.com.
Then I used these artists to extract their popular songs and lyrics from genius.com.

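Follow me on a step-by-step walk-through

First, of course, you will need to install and load the following packages.

# install.packages(c("tidyverse", "rvest"))  # run once if the packages are not installed yet
library(tidyverse)  # provides tibbles and the pipe operator
library(rvest)      # provides read_html() and html_nodes()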

PART 1 : Extracting the Top 10 Pop Artists of All Time — Source: www.billboard.com

Identify the URL from which you want to extract data. Then use the read_html() function to create an HTML document from the URL.
Next, you need to identify the CSS selector that points to the data you want to extract. It helps to have a little knowledge of HTML and CSS. Otherwise, you can use the handy Chrome extension SelectorGadget to find the CSS selector. The easiest way I found is to right-click on any page element in Chrome and select Inspect.
You can then use the html_nodes() function with the CSS selector to extract the data you want.
Then all you need is to save your results into a data frame. I’ve used tibbles here to store the data, as they are a little easier to work with than data frames. You can read more about the difference between data frames and tibbles here.
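The four steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code used for billboard.com: the inline HTML stands in for the downloaded page so the example runs offline, and the `.artist-name` selector and the variable names are made up for the example. On a live page you would pass the page URL to read_html() and use whatever selector SelectorGadget or the inspector shows you.

```r
library(rvest)
library(tibble)

# Step 1: create an HTML document. With a live site this would be
# read_html("<page URL>"); here a small made-up page is parsed instead.
page <- read_html('
  <html><body>
    <ol>
      <li><span class="artist-name"> Artist A </span></li>
      <li><span class="artist-name">Artist B</span></li>
      <li><span class="artist-name">Artist C</span></li>
    </ol>
  </body></html>')

# Steps 2 and 3: select the nodes matched by the (hypothetical) CSS selector
# and pull out their text, trimming surrounding whitespace.
artist_names <- page %>%
  html_nodes(".artist-name") %>%
  html_text(trim = TRUE)

# Step 4: store the result in a tibble, using page order as the rank.
top_artists <- tibble(
  rank   = seq_along(artist_names),
  artist = artist_names
)

print(top_artists)
```

If html_nodes() returns nothing, the selector most likely does not match the page as downloaded, so it is worth checking the selector before moving on.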

PART 2 : Extracting Popular Songs and Lyrics of the top 10 Artists — Source: www.genius.com

Now that you have the Top 10 Pop Artists, you can use the genius.com website to identify the most popular songs and extract their lyrics.

First, identify the URL of the artist’s webpage. A little bit of research on the website showed that all the artist pages follow this format: https://genius.com/artists/<artistname>
Then use a nested for loop to extract the songs and their lyrics. Here again, I used SelectorGadget to identify the right CSS selector for the job.
Next, store the results in a tibble.
And finally, don’t forget to include a random sleep interval between each loop iteration to prevent you from getting booted from the website.
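Here is one way the nested loop could look. Treat it as a sketch under assumptions rather than the code behind this post: the `a.song-link` and `.lyrics` selectors are placeholders, the rule that spaces in an artist name become hyphens in the URL is a guess you should check against the site, and the helper names `artist_url()` and `scrape_lyrics()` are invented for the example. The `fetch` argument defaults to read_html() but can be swapped for a function that returns local HTML, which lets you test the loop without hitting the website.

```r
library(rvest)
library(tibble)

# Build an artist page URL following the pattern described above.
# Assumption: spaces in the artist name are replaced by hyphens.
artist_url <- function(artist) {
  paste0("https://genius.com/artists/", gsub(" ", "-", artist, fixed = TRUE))
}

# Outer loop: one pass per artist. Inner loop: one pass per song link.
# Both CSS selectors below are hypothetical placeholders.
scrape_lyrics <- function(artists, fetch = read_html, max_pause = 5) {
  artist_col <- character()
  song_col   <- character()
  lyrics_col <- character()

  for (artist in artists) {
    artist_page <- fetch(artist_url(artist))
    song_links  <- html_nodes(artist_page, "a.song-link")
    song_titles <- html_text(song_links, trim = TRUE)
    song_urls   <- html_attr(song_links, "href")

    for (i in seq_along(song_urls)) {
      song_page <- fetch(song_urls[i])
      lyrics <- song_page %>%
        html_nodes(".lyrics") %>%
        html_text(trim = TRUE)

      artist_col <- c(artist_col, artist)
      song_col   <- c(song_col, song_titles[i])
      lyrics_col <- c(lyrics_col, paste(lyrics, collapse = "\n"))

      # Random pause between requests so the site is not hammered
      Sys.sleep(runif(1, min = 0, max = max_pause))
    }
  }

  # Store the results in a tibble, one row per song
  tibble(artist = artist_col, song = song_col, lyrics = lyrics_col)
}
```

A call such as `scrape_lyrics(c("Artist A", "Artist B"))` would then return one row per song. The random pause is drawn from a uniform distribution between 0 and `max_pause` seconds; a longer upper bound is gentler on the website at the cost of a slower run.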

And here’s a snapshot of what your dataset will look like.

That’s all you need to know to create your own dataset, which gives you endless possibilities to experiment with the data you want.

Hope you enjoyed this tutorial and are now inspired to create your very own dataset.

Thanks for reading!

Follow me on Instagram for my weekly learning progress and the study resources I use.

Written by Deepal Dsilva