Scrape Data and Build a Webapp in R Using Rvest and Shiny

Brad Lindblad
Towards Data Science
11 min read · Apr 29, 2018


Photo by Patrick Fore on Unsplash

Sharing your analyses and data science findings via traditional methods is cool and all, but what if you want to reach a larger audience? What if you want to share an analysis of a data set in near real time? With a few dozen lines of code, you can build such a tool in an afternoon.

I’m going to share how I built an almost-real-time webapp using the R computing language and a few key packages. You will learn how to scrape a website, parse that data, and create a webapp that anyone with a browser can visit.

Problem Statement

I need to display data from https://coinmarketcap.com/gainers-losers/ in a way that lets me easily see which coins are on a real heater that day.

Tools

We will use the rvest, shiny, and shinydashboard packages, along with various tidyverse tools, all in the RStudio IDE.

While RSelenium is a popular and viable tool for web scraping (collecting data from websites by parsing HTML), the rvest package is arguably a neater and cleaner tool for the job.

The basic workflow of the script is as follows:

  1. Create a new RStudio project
  2. Create a single “app.R” shiny file
  3. Call libraries
  4. Build functions
  5. Build UI
  6. Build Server
  7. Deploy

So let’s get started.

Call libraries

We will call the libraries discussed in the “Tools” section above:

library(shiny)
library(tidyverse)
library(shinydashboard)
library(rvest)

Build functions

Always cognizant of the DRY (don’t repeat yourself) maxim of coding, we will build simple functions to scrape and wrangle our data, so we don’t have to duplicate code.

#####################
####### F N S #######
#####################
get.data <- function(x){
myurl <- read_html("https://coinmarketcap.com/gainers-losers/") # read our webpage as html
myurl <- html_table(myurl) # convert to a list of html tables for ease of use

to.parse <- myurl[[1]] # pull the first item in the list
to.parse$`% 1h` <- gsub("%","",to.parse$`% 1h`) # cleanup - remove the percent sign
to.parse$`% 1h`<- as.numeric(to.parse$`% 1h`) # cleanup - convert percentages column to numeric so we can sort
to.parse$Symbol <- as.factor(to.parse$Symbol) # cleanup - convert coin symbol to factor

to.parse$Symbol <- factor(to.parse$Symbol,
levels = to.parse$Symbol[order(to.parse$`% 1h`)]) # order the factor levels by gain value
to.parse # return the finished data.frame
}

The get.data function scrapes our website and returns a data frame that is useful to us.

We use the read_html and html_table functions from the rvest package to read in the web page and convert its tables into data frames for easy wrangling. Next, we pull the first of the tables on that page and clean it up with basic R functions. Finally, we reorder the levels of the Symbol factor by the percent-gain value, which controls the order of the bars in the chart we draw later.
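To see the parse-and-clean steps without hitting the live site, here is a minimal sketch that runs the same rvest calls on an inline HTML string. The table contents (Coin A, Coin B and their values) are made up for illustration, and the real page may be laid out differently.

```r
library(rvest)

# Hypothetical stand-in for the scraped page: one small HTML table.
page <- read_html('<table>
  <tr><th>Name</th><th>Symbol</th><th>% 1h</th></tr>
  <tr><td>Coin A</td><td>AAA</td><td>12.50%</td></tr>
  <tr><td>Coin B</td><td>BBB</td><td>3.20%</td></tr>
</table>')

tables  <- html_table(page)   # a list with one element per <table> found
gainers <- tables[[1]]        # keep the first table

# Strip the percent sign, then convert to numeric so the column can be ordered.
gainers$`% 1h` <- as.numeric(gsub("%", "", gainers$`% 1h`, fixed = TRUE))

gainers
```

Because html_table returns a list, the `[[1]]` step is needed even when the page holds a single table.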

If you were to call that function in the R console with

get.data()

you would get back a data frame that looks something like this:

get.data() returns this data.frame

As you can see, this shows us which coins have had the biggest gains in the past hour, sorted by percent gain. This data is the basis for the entire dashboard.

The next function we will build is called get.infobox.val. It simply pulls the first value of the % 1h column from the data frame above, which is the highest gain because the table arrives sorted by percent gain.

get.infobox.val <- function(x){

df1 <- get.data() # run the scraping function above and assign that data.frame to a variable
df1 <- df1$`% 1h`[1] # assign the first value of the % gain column to same variable
df1 # return value

}

The last function is called get.infobox.coin, which returns the name of the top coin.

get.infobox.coin <- function(x){

df <- get.data() # run the scraping function above and assign that data.frame to a variable
df <- df$Name[1] # assign the first value of the name column to same variable
df # return value

}
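The helpers above each call get.data() and rely on the first row being the top gainer. As a hedged alternative, the sketch below passes the data frame in as an argument and finds the top row with which.max, so it works regardless of row order. The names top.gain.val and top.gain.coin and the sample values are hypothetical and not part of the original script.

```r
# Hypothetical stand-in for the data frame that get.data() returns.
gainers <- data.frame(
  Name   = c("Coin B", "Coin A", "Coin C"),
  Symbol = c("BBB", "AAA", "CCC"),
  `% 1h` = c(3.2, 12.5, 7.1),
  check.names = FALSE,        # keep the "% 1h" column name as-is
  stringsAsFactors = FALSE
)

# Each helper locates the row with the largest gain itself.
top.gain.val  <- function(df) df$`% 1h`[which.max(df$`% 1h`)]
top.gain.coin <- function(df) df$Name[which.max(df$`% 1h`)]

top.gain.val(gainers)
top.gain.coin(gainers)
```

Taking the data frame as an argument also means the page only has to be scraped once per refresh.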

Now that we have our functions built, time to build the shiny dashboard that will display our data to the world.

Enter Shiny Webapps

A shiny webapp lets us build an interactive dashboard, which we will let RStudio host for us on their servers. They have a free plan, so anyone can easily get started with it. From the RStudio documentation:

Shiny applications have two components, a user interface object and a server function, that are passed as arguments to the shinyApp function that creates a Shiny app object from this UI/server pair.

At this point I highly recommend skimming through the official documentation referenced above to get yourself familiar with basic Shiny concepts.
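As a rough illustration of that UI/server pairing, here is about the smallest app that has both parts. The output ID "greeting" and the text are placeholders chosen for this sketch.

```r
library(shiny)

# User interface object: describes what the page contains.
ui <- fluidPage(
  titlePanel("Minimal example"),
  textOutput("greeting")      # placeholder that the server fills in
)

# Server function: computes the values the UI displays.
server <- function(input, output, session) {
  output$greeting <- renderText("Hello from the server")
}

# shinyApp() pairs the two into an app object. Printing that object
# (or passing it to runApp()) launches the app.
app <- shinyApp(ui = ui, server = server)
```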

Without getting too much in the weeds, let’s begin by building the UI object.

UI

In the UI object we will lay out our dashboard. When I was building this I went off a pencil wireframe sketch. To help you visualize the end result here is a screenshot of the finished product:

Our finished dashboard for reference

The shinydashboard structure gives us a sidebar and a dashboard body where we can add boxes in rows and columns. The black area above is the sidebar, and everything to the right of that is the body. Now for the UI code.
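Before the full listing, it may help to see the bare layout on its own. This is a stripped-down sketch with placeholder titles and text, not the dashboard built in this post.

```r
library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  dashboardHeader(title = "Layout sketch"),
  dashboardSidebar(h5("Sidebar text goes here")),
  dashboardBody(
    # Widths are in Bootstrap grid units; a full row is 12 units wide.
    fluidRow(
      box(title = "Left box",  width = 6, "First column"),
      box(title = "Right box", width = 6, "Second column")
    )
  )
)
```

Each fluidRow lays its boxes out side by side until their widths add up to 12.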

ui <- dashboardPage(


# H E A D E R

dashboardHeader(title = "Alt Coin Gainers"),

Going down the line, we assign the dashboardPage function to the UI and then add the parts that we need.

# S I D E B A R

dashboardSidebar(

h5("A slightly interactive dashboard that pulls the top gainers from the last hour from
coinmarketcap.com. Refreshes every 60 seconds."),

br(),
br(),
br(),
br(),
br(),
br(),
br(),
br(),
br(),
br(),

h6("Built by Brad Lindblad in the R computing language
[ R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. URL https://www.R-project.org/]"),
br(),
h6("R version 3.4.4 (2018-03-15) 'Someone to Lean On'"),
br(),
a("bradley.lindblad@gmail.com", href="mailto:bradley.lindblad@gmail.com")

),

The sidebar is a good place to put documentation or filters (which we don’t cover in this tutorial). You’ll notice that you can pass certain HTML tags and classes to your text. h5 is just a tag that marks the text as a fifth-level heading, which is usually rendered as the fifth-largest heading size.
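To check what these tag helpers actually produce, you can convert them to character strings in the console. The link text and address below are placeholders.

```r
library(shiny)

heading <- h5("Refreshes every 60 seconds.")
link    <- a("Email the author", href = "mailto:someone@example.com")

# as.character() shows the HTML that each helper generates.
as.character(heading)
as.character(link)
```

Note the mailto: prefix in href. Without it, the browser treats the address as a relative URL rather than an email link.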

# B O D Y
dashboardBody(

fluidRow(

# InfoBox
infoBoxOutput("top.coin",
width = 3),

# InfoBox
infoBoxOutput("top.name",
width = 3)

),

fluidRow(
column(
# Datatable
box(
status = "primary",
headerPanel("Data Table"),
solidHeader = T,
br(),
DT::dataTableOutput("table", height = "350px"),
width = 6,
height = "560px"
),

# Chart
box(
status = "primary",
headerPanel("Chart"),
solidHeader = T,
br(),
plotOutput("plot", height = "400px"),
width = 6,
height = "500px"
),
width = 12
)

)
)

)

In the body section we build out the main items in the dashboard. There are a few key parts of the body section that actually output the data.

Infobox

# InfoBox
infoBoxOutput("top.coin",
width = 3),

# InfoBox
infoBoxOutput("top.name",
width = 3)

These two code chunks output the two purple boxes at the top of our dashboard. The two output IDs, “top.coin” and “top.name”, reference data that is produced in the server function that we will go over later.

Data Table and Plot

# Datatable
box(
status = "primary",
headerPanel("Data Table"),
solidHeader = T,
br(),
DT::dataTableOutput("table", height = "350px"),
width = 6,
height = "560px"
),
# Chart
box(
status = "primary",
headerPanel("Chart"),
solidHeader = T,
br(),
plotOutput("plot", height = "400px"),
width = 6,
height = "500px"
),

The same goes for the data table and plot. “table” and “plot” are output IDs that tie to the matching outputs in the server function below.

SERVER

Next on the agenda is defining our server function. The server function is where values are computed; the UI object then reads those values and plots or displays them.

#####################
#### S E R V E R ####
#####################
server <- function(input, output) {

# R E A C T I V E
liveish_data <- reactive({
invalidateLater(60000) # refresh the report every 60k milliseconds (60 seconds)
get.data() # call our function from above
})


live.infobox.val <- reactive({
invalidateLater(60000) # refresh the report every 60k milliseconds (60 seconds)
get.infobox.val() # call our function from above
})


live.infobox.coin <- reactive({
invalidateLater(60000) # refresh the report every 60k milliseconds (60 seconds)
get.infobox.coin() # call our function from above
})

Remember the functions we defined at the beginning? We’re going to use them now. We’re also going to add another concept: reactive expressions. These allow our shiny dashboard to update at regular intervals or based on user input. For this dashboard, we are asking the program to run our functions every 60 seconds, which updates the dashboard with the most recent values from the website we are scraping.
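One thing to be aware of: get.infobox.val and get.infobox.coin each call get.data internally, so the three reactive expressions above trigger three separate scrapes per refresh. A possible variation, sketched here under the assumption that one scrape per minute is enough, is to scrape once and derive the infobox values from that single reactive. fetch.stub is a hypothetical stand-in for get.data so the sketch runs offline.

```r
library(shiny)

# Hypothetical stand-in for get.data(); returns a fixed data frame.
fetch.stub <- function() {
  data.frame(Name = c("Coin A", "Coin B"),
             `% 1h` = c(12.5, 3.2),
             check.names = FALSE, stringsAsFactors = FALSE)
}

server <- function(input, output, session) {
  # The only reactive that fetches data; it invalidates itself every 60 seconds.
  live.data <- reactive({
    invalidateLater(60000, session)
    fetch.stub()
  })

  # Derived reactives reuse the cached result instead of fetching again.
  top.val  <- reactive(live.data()$`% 1h`[1])
  top.name <- reactive(live.data()$Name[1])

  output$summary <- renderText(paste0(top.name(), ": ", top.val(), "%"))
}
```

When live.data invalidates, the reactives that depend on it are invalidated too, so everything still refreshes together.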

Data Table

# D A T A   T A B L E   O U T P U T
output$table <- DT::renderDataTable(DT::datatable({
data <- liveish_data()}))

Remember the output ID table that we defined above? We reference it in the code above. Notice that we used the liveish_data reactive expression and not our original function from the very beginning.

Plot

# P L O T   O U T P U T
output$plot <- renderPlot({ (ggplot(data=liveish_data(), aes(x=Symbol, y=`% 1h`)) +
geom_bar(stat="identity", fill = "springgreen3") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle("Gainers from the Last Hour"))
})

The plot output above is a simple bar plot built with ggplot2.
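The bar order in that chart comes from the factor levels set in get.data, not from the row order. Here is a small offline sketch of that idea with made-up values. It uses geom_col, which is shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Hypothetical sample data in arbitrary row order.
gainers <- data.frame(
  Symbol = c("AAA", "BBB", "CCC"),
  `% 1h` = c(12.5, 3.2, 7.1),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Order the factor levels by gain; ggplot2 draws discrete x values in level order.
gainers$Symbol <- factor(gainers$Symbol,
                         levels = gainers$Symbol[order(gainers$`% 1h`)])

p <- ggplot(gainers, aes(x = Symbol, y = `% 1h`)) +
  geom_col(fill = "springgreen3") +
  ggtitle("Gainers from the Last Hour")
```

Printing p draws the bars from the smallest gain to the largest.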

Infoboxes

# I N F O B O X   O U T P U T - V A L
output$top.coin <- renderInfoBox({
infoBox(
"Gain in Last Hour",
paste0(live.infobox.val(), "%"),
icon = icon("signal"),
color = "purple",
fill = TRUE)
})


# I N F O B O X O U T P U T - N A M E
output$top.name <- renderInfoBox({
infoBox(
"Coin Name",
live.infobox.coin(),
icon = icon("bitcoin"),
color = "purple",
fill = TRUE)
})

}

Finally, we render our two infoboxes using the reactive expressions defined at the beginning of the server function.

DEPLOY

The very last part of the script combines our UI with our server into a Shiny app object and launches the app.

#####################
#### D E P L O Y ####
#####################
# Return a Shiny app object
shinyApp(ui = ui, server = server)

At this point, you should see the running app appear in a window in RStudio.

Hit the Open in Browser button to see an un-jumbled version of the app, which is the version you will see when it’s deployed.

FINISHING UP

The heavy lifting is now done: we built a scraper that downloads and parses HTML from a website, then we built a shiny app using the shinydashboard framework to display our analysis for the world to see.

All that’s left is to deploy the app, which Shiny makes extremely easy for us. Simply click the deploy button that appears in the RStudio preview and follow the instructions. All in all, it should take less than 10 minutes to deploy from that point.

I hope this little tutorial helps you in your R journey, and gives you some ideas as to what cool things you can build with this powerful open-source language.

If you have any questions or just want to chat about R, email me at bradley.lindblad@gmail.com.

Appendix

Here is the full script, which you can also clone from my Github page for this project.

library(shiny)
library(tidyverse)
library(shinydashboard)
library(rvest)
#####################
####### F N S #######
#####################
get.data <- function(x){
myurl <- read_html("https://coinmarketcap.com/gainers-losers/") # read our webpage as html
myurl <- html_table(myurl) # convert to an html table for ease of use


to.parse <- myurl[[1]] # pull the first item in the list
to.parse$`% 1h` <- gsub("%","",to.parse$`% 1h`) # cleanup - remove non-characters
to.parse$`% 1h`<- as.numeric(to.parse$`% 1h`) #cleanup - convert percentages column to numeric
to.parse$Symbol <- as.factor(to.parse$Symbol) # cleanup - convert coin symbol to factor

to.parse$Symbol <- factor(to.parse$Symbol,
levels = to.parse$Symbol[order(to.parse$'% 1h')]) # sort by gain value
to.parse # return the finished data.frame
}
get.infobox.val <- function(x){

df1 <- get.data() # run the scraping function above and assign that data.frame to a variable
df1 <- df1$`% 1h`[1] # assign the first value of the % gain column to same variable
df1 # return value

}
get.infobox.coin <- function(x){

df <- get.data() # run the scraping function above and assign that data.frame to a variable
df <- df$Name[1] # assign the first value of the name column to same variable
df # return value

}
#####################
####### U I #########
#####################
ui <- dashboardPage(


# H E A D E R

dashboardHeader(title = "Alt Coin Gainers"),

# S I D E B A R

dashboardSidebar(

h5("A slightly interactive dashboard that pulls the top gainers from the last hour from
coinmarketcap.com. Refreshes every 60 seconds."),

br(),
br(),
br(),
br(),
br(),
br(),
br(),
br(),
br(),
br(),

h6("Built by Brad Lindblad in the R computing language
[ R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. URL https://www.R-project.org/]"),
br(),
h6("R version 3.4.4 (2018-03-15) 'Someone to Lean On'"),
br(),
a("bradley.lindblad@gmail.com", href="mailto:bradley.lindblad@gmail.com")

),

# B O D Y
dashboardBody(

fluidRow(

# InfoBox
infoBoxOutput("top.coin",
width = 3),

# InfoBox
infoBoxOutput("top.name",
width = 3)

),

fluidRow(
column(
# Datatable
box(
status = "primary",
headerPanel("Data Table"),
solidHeader = T,
br(),
DT::dataTableOutput("table", height = "350px"),
width = 6,
height = "560px"
),

# Chart
box(
status = "primary",
headerPanel("Chart"),
solidHeader = T,
br(),
plotOutput("plot", height = "400px"),
width = 6,
height = "500px"
),
width = 12
)

)
)

)
#####################
#### S E R V E R ####
#####################
server <- function(input, output) {

# R E A C T I V E
liveish_data <- reactive({
invalidateLater(60000) # refresh the report every 60k milliseconds (60 seconds)
get.data() # call our function from above
})


live.infobox.val <- reactive({
invalidateLater(60000) # refresh the report every 60k milliseconds (60 seconds)
get.infobox.val() # call our function from above
})


live.infobox.coin <- reactive({
invalidateLater(60000) # refresh the report every 60k milliseconds (60 seconds)
get.infobox.coin() # call our function from above
})

# D A T A T A B L E O U T P U T
output$table <- DT::renderDataTable(DT::datatable({
data <- liveish_data()}))


# P L O T O U T P U T
output$plot <- renderPlot({ (ggplot(data=liveish_data(), aes(x=Symbol, y=`% 1h`)) +
geom_bar(stat="identity", fill = "springgreen3") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle("Gainers from the Last Hour"))
})



# I N F O B O X O U T P U T - V A L
output$top.coin <- renderInfoBox({
infoBox(
"Gain in Last Hour",
paste0(live.infobox.val(), "%"),
icon = icon("signal"),
color = "purple",
fill = TRUE)
})


# I N F O B O X O U T P U T - N A M E
output$top.name <- renderInfoBox({
infoBox(
"Coin Name",
live.infobox.coin(),
icon = icon("bitcoin"),
color = "purple",
fill = TRUE)
})

}
#####################
#### D E P L O Y ####
#####################
# Return a Shiny app object
shinyApp(ui = ui, server = server)
