
In this tutorial, I will be creating an automated, interactive dashboard of Texas COVID-19 case counts by county using python with the help of selenium, pandas, dash, and plotly. I am assuming the reader has some familiarity with python, pandas, and selenium. That's all you'll need to create amazing interactive dashboards after reading this tutorial!
A general overview of the steps and what you will learn is as follows:
- Downloading daily updated data from the web using selenium
- Updating data directories using shutil, glob, and os python libraries
- Simple cleaning of excel files with pandas
- Formatting time series data frames to be input into plotly graphs
- Creating a local web page for your dashboard using dash
Before we get started, you will need to download the python libraries that will be used in this tutorial. This can be done with the following command on your terminal:
pip install <package name>
# package names: selenium, pandas, webdriver_manager, plotly, and dash
# (shutil, glob, and os ship with python's standard library, so they do not need to be installed)
One thing to note is that I used Jupyter Notebooks for steps 1–4 and then Spyder for step 5. When I am explaining what a code block does, it is usually the one right above the text. All of my code for this tutorial can be found on my GitHub in the Automate collecting of data notebooks folder.
Ok, let's get started!
Step 1: Downloading daily updated data from the web using selenium
# import packages
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
import shutil
import glob
import os
import time
# these preferences let chrome download files without prompting
options = Options()
options.add_experimental_option("prefs", {
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "safebrowsing.enabled": True
})
# open a chrome browser using selenium, passing in the download options
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
# go to the web page where the excel file links are located
driver.get("https://www.dshs.texas.gov/coronavirus/additionaldata/")
This first code block is pretty self-explanatory, but a few things to note. I am using webdriver_manager to install a selenium chrome driver, so I don't have to re-download a new chrome driver every time chrome gets updated. This driver is used to download the excel file I want online. It will open up a new google chrome window on your computer and go to the web page that provides the daily updated Texas COVID-19 case counts for each county. Out of the box, chrome may prompt before saving files from the web, so the download preferences need to be built first and passed into webdriver.Chrome() to allow downloads without asking.
# initialize an object to the location on the web page and click on it to download
link = driver.find_element_by_xpath('/html/body/form/div[4]/div/div[3]/div[2]/div/div/ul[1]/li[1]/a')
link.click()
# Wait for 15 seconds to allow chrome to download file
time.sleep(15)
This block of code clicks on the link to the excel file using its full XPath. To find it, right-click the link on the page, choose Inspect, then right-click the highlighted element in the developer tools and copy the full XPath. The driver then clicks on the link and waits 15 seconds for the file to download, which is crucial for the next code block not to throw an error.
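A fixed time.sleep(15) works, but a more robust option is to poll the download folder until the file actually appears. This is a minimal sketch under my own assumptions (the wait_for_download helper and its parameters are not part of the original tutorial):

```python
import glob
import os
import time

def wait_for_download(pattern, timeout=60, poll=1):
    """Poll a glob pattern until a matching file appears or `timeout` seconds pass.
    Returns the newest matching path, or None on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        # skip chrome's in-progress partial downloads
        matches = [p for p in glob.glob(pattern) if not p.endswith(".crdownload")]
        if matches:
            return max(matches, key=os.path.getmtime)
        time.sleep(poll)
    return None

# usage with the tutorial's download directory:
# latest_file = wait_for_download('/Users/tsbloxsom/Downloads/*.xlsx')
```

This returns as soon as the file lands instead of always paying the full 15-second wait.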
Step 2: Updating data directories using shutil, glob, and os python libraries
Before reading onward: if you are following along at home, you will need to change the download directory path that I use, '/Users/tsbloxsom/Downloads/*.xlsx', to your own, e.g. '/Users/<your-username>/Downloads/*.xlsx'.
You will also need to change the directory where you want to store your data and python notebooks from the one I use, '/Users/tsbloxsom/Documents/GitHub/Texas-census-county-data-project/Automate collecting of data notebooks/*.xlsx', to your own, e.g. '/Users/<your-username>/bla/bla/*.xlsx'.
# locating most recent .xlsx downloaded file
list_of_files = glob.glob('/Users/tsbloxsom/Downloads/*.xlsx')
latest_file = max(list_of_files, key=os.path.getmtime)
# replace "" with "/" so file path can be located by python
latest_file = latest_file.replace("", "/")
latest_file
# we need to locate the old .xlsx file(s) in the dir we want to store the new xlsx file in
list_of_files = glob.glob('/Users/tsbloxsom/Documents/GitHub/Texas-census-county-data-project/Automate collecting of data notebooks/*.xlsx')
# need to delete old xlsx file(s) so if we download new xlsx file with same name we do not get an error while moving it
for file in list_of_files:
    print("deleting old xlsx file:", file)
    os.remove(file)
# Move the new file from the download dir to the github dir
shutil.move(latest_file,'/Users/tsbloxsom/Documents/GitHub/Texas-census-county-data-project/Automate collecting of data notebooks/')
This code block is where a lot of the automation happens. It finds the excel file I just downloaded using the glob and os.path.getmtime functions, deletes any old excel file(s) (maybe from the previous day) in the directory I want to store the new excel file in, then moves the new excel file into the GitHub directory. The os.path.getmtime() function returns the time of last modification of the path, so with the max() function, you can find the most recently downloaded excel file.
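The pattern of combining glob.glob with max(..., key=os.path.getmtime) is worth internalizing. Here is a tiny self-contained demo of it (using a temporary scratch directory and made-up file names rather than the tutorial's real paths):

```python
import glob
import os
import tempfile

# make a scratch directory with two files whose modification times differ
tmp = tempfile.mkdtemp()
old = os.path.join(tmp, "old.xlsx")
new = os.path.join(tmp, "new.xlsx")
open(old, "w").close()
open(new, "w").close()
os.utime(old, (0, 0))                           # force an ancient modification time
os.utime(new, (2_000_000_000, 2_000_000_000))   # force a much newer one

# same pattern as above: the most recently modified file wins
latest = max(glob.glob(os.path.join(tmp, "*.xlsx")), key=os.path.getmtime)
print(latest)  # the path ending in new.xlsx
```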
Step 3: Simple cleaning of excel files with pandas
Now that we have the up-to-date COVID-19 data in the directory we want, there is some cleaning we need to do before it can be plotted.
import pandas as pd
import re
pd.set_option('display.max_rows', 500)
pd.options.display.max_colwidth = 150
# again we need to locate the .xlsx file
list_of_files = glob.glob('/Users/tsbloxsom/Documents/GitHub/Texas-census-county-data-project/Automate collecting of data notebooks/*.xlsx')
latest_file = max(list_of_files, key=os.path.getctime)
print(latest_file.split("")[-1])
df = pd.read_excel("{}".format(latest_file),header=None)
df.head()

So we read the excel file and convert it to a pandas data frame (df)… and we've got some cleaning to do. Let's deal with the first and last few rows of the file.
# print out latest COVID data datetime and notes
date = re.findall("- [0-9]+/[0-9]+/[0-9]+ .+", df.iloc[0, 0])
print("COVID cases latest update:", date[0][2:])
print(df.iloc[1, 0])
print(str(df.iloc[262:266, 0]).lstrip().rstrip())
#drop non-data rows
df2 = df.drop([0, 1, 258, 260, 261, 262, 263, 264, 265, 266, 267])
First, I want to print out relevant information about the data which is in the first two rows and last few rows, then I want to drop these rows from the df.
# clean column names
df2.iloc[0,:] = df2.iloc[0,:].apply(lambda x: x.replace("\r", ""))
df2.iloc[0,:] = df2.iloc[0,:].apply(lambda x: x.replace("\n", ""))
df2.columns = df2.iloc[0]
clean_df = df2.drop(df2.index[0])
clean_df = clean_df.set_index("County Name")
# convert clean_df to a .csv file
clean_df.to_csv("Texas county COVID cases data clean.csv")

In the above code block, I clean the header row for each date, since many of the cells contained newline (\n) and carriage return (\r) characters. I then make the date row the column names and drop the original date row. We finally have a clean df where each row is a county in Texas and each column holds the number of cases for a date. Finally, I convert the df to a .csv file for the next step.
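To see what that header cleanup is doing, here is the same pattern on a toy frame (the county rows and dates are made up, but they mimic the raw excel layout):

```python
import pandas as pd

# toy frame mimicking the raw excel: header cells polluted with \r\n
raw = pd.DataFrame([["County Name", "Cases\r\n03-24", "Cases\r\n03-25"],
                    ["Anderson", 1, 2],
                    ["Andrews", 0, 1]])

# strip carriage returns and newlines from the header row
raw.iloc[0, :] = raw.iloc[0, :].apply(lambda x: str(x).replace("\r", "").replace("\n", ""))
raw.columns = raw.iloc[0]                 # promote the cleaned row to column names
clean = raw.drop(raw.index[0]).set_index("County Name")
print(list(clean.columns))  # ['Cases03-24', 'Cases03-25']
```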
Step 4: Formatting time series data frames to be input into plotly graphs
import plotly.express as px
list_of_files = glob.glob('/Users/tsbloxsom/Documents/GitHub/Texas-census-county-data-project/Automate collecting of data notebooks/*.csv')
latest_file = max(list_of_files, key=os.path.getmtime)
latest_file.split("")[-1]
df = pd.read_csv(latest_file.split("")[-1])
Again we are using the glob and os.path.getmtime functions to find the most recent .csv file which is our "Texas county COVID cases data clean.csv".
# convert df into time series where rows are each date and clean up
df_t = df.T
df_t.columns = df_t.iloc[0]
df_t = df_t.iloc[1:]
df_t = df_t.iloc[:,:-2]
# next lets convert the index to a date time, must clean up dates first
def clean_index(s):
    s = s.replace("*", "")
    s = s[-5:]
    s = s + "-2020"
    #print(s)
    return s
df_t.index = df_t.index.map(clean_index)
df_t.index = pd.to_datetime(df_t.index)

In the above code block, I convert the data frame into a time series df where the rows are now the dates and the columns are counties. I also drop a couple of columns that hold the total cases and the number of counties reporting cases. Finally, I use the clean_index function to convert the index values into a format that pandas can parse into date_time objects. But we are not done yet.
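As a quick sanity check, clean_index turns the raw headers into parseable date strings. The example inputs below are made up, but they mimic the real ones (a label ending in MM-DD, sometimes with a footnote asterisk):

```python
import pandas as pd

def clean_index(s):
    """Strip footnote asterisks, keep the trailing MM-DD, and append the year."""
    s = s.replace("*", "")
    s = s[-5:]
    return s + "-2020"

print(clean_index("Cases 03-24*"))                 # '03-24-2020'
print(pd.to_datetime(clean_index("Cases 04-01")))  # Timestamp('2020-04-01 00:00:00')
```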
# initialize df with three columns: Date, Case Count, and County
anderson = df_t.T.iloc[0,:]
ts = anderson.to_frame().reset_index()
ts["County"] = "Anderson"
ts = ts.rename(columns = {"Anderson": "Case Count", "index": "Date"})
Our plotly function will take 3 columns as input: Date, Case Count, and County. So we essentially need to convert our wide df into a long df. To do this, I first initialize a long df with three columns: a date column, the case count for Anderson County, and a county column holding the county's name.

# This while loop adds all counties to the above ts so we can input it into plotly
x = 1
while x < 254:
    new_ts = df_t.T.iloc[x, :]
    new_ts = new_ts.to_frame().reset_index()
    new_ts["County"] = new_ts.columns[1]
    new_ts = new_ts.rename(columns = {new_ts.columns[1]: "Case Count", "index": "Date"})
    ts = pd.concat([ts, new_ts])
    x += 1
#save long form df for dash app
ts.to_csv("time_series_plotly.csv")
The comment on the above code block says it all: we loop through the entire wide df and create a new_ts long df for each county that looks like the one for Anderson. Then we concat each long df together and voilà, we have the long df that can be input into plotly and, later, our dash app!
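For reference, pandas can do this whole wide-to-long reshape in a single call with melt. This is a sketch under the same assumptions as the loop above (dates down the index, one column per county; the toy data here is made up):

```python
import pandas as pd

# toy wide frame: dates down the index, one column per county
df_t = pd.DataFrame(
    {"Anderson": [1, 2], "Andrews": [0, 1]},
    index=pd.to_datetime(["2020-03-24", "2020-03-25"]),
)

# melt the wide frame into the Date / Case Count / County long form
ts = (df_t.reset_index()
          .rename(columns={"index": "Date"})
          .melt(id_vars="Date", var_name="County", value_name="Case Count"))
print(ts.shape)  # (4, 3): one row per (date, county) pair
```

The result has the same three columns the loop builds, without tracking a counter by hand.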

fig = px.scatter(ts, x='Date', y='Case Count', color='County')
fig.update_traces(mode='markers+lines')
#fig.update_traces(mode='lines')
fig.show()

We can finally plot our data as a plotly scatter plot using just three simple lines of code! I used the markers+lines option which I think looks more pleasing and is easier to interact with than just the lines. I have fallen in love with plotly because unlike seaborn or matplotlib, the graphs are interactive and soooooo simple to code. Plotly also has outstanding documentation which can be found here.
Step 5: Creating a local web page for your dashboard using dash
This is the step that might get tricky for some people if they do not have any software development experience. I didn't have any about a month ago, so I will go through this step slowly and keep it as simple as possible. First we want to create a virtual environment in the directory where we have our notebooks/data. To do this, go to your command line or Anaconda prompt and cd to that directory. Something like this:
C:\Users\tsbloxsom>cd C:\Users\tsbloxsom\Documents\GitHub\Texas-census-county-data-project\Automate collecting of data notebooks
Then we create the virtual environment:
python3 -m venv venv
Then we activate the virtual environment:
venv\Scripts\activate
Let's get to the dash program now. Here is a link for anyone new to dash. Dash is made by the people behind plotly, so you get great documentation and great products. I learned how to make this app for the time series dashboard in under 2 hours, not to brag, it's just that simple. Below is the code for the app. I used a combination of dash's basic app templates found in this tutorial on their website. So open up your favorite code editor, like Spyder, and save the code below as app.py in the same directory as your new venv folder and .csv file(s).
import dash
import dash_core_components as dcc
import dash_html_components as html
import plotly.express as px
import pandas as pd
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
colors = {
    'background': '#F0F8FF',
    'text': '#00008B'
}
# assume you have a "long-form" data frame
# see https://plotly.com/python/px-arguments/ for more options
df = pd.read_csv("time_series_plotly.csv")
fig = px.scatter(df, x='Date', y='Case Count', color='County')
fig.update_layout(
    plot_bgcolor=colors['background'],
    paper_bgcolor=colors['background'],
    font_color=colors['text']
)
markdown_text = '''
### Texas COVID-19 Dashboard
Creator: Truett Bloxsom, [LinkedIn](https://www.linkedin.com/in/truett-bloxsom/), [github](https://github.com/tsbloxsom)
This is my first interactive dashboard using Dash! Hope you like it!
This first plot is Texas COVID-19 accumulated cases by county over time
Source for data: [dshs.texas.gov](https://www.dshs.texas.gov/coronavirus/additionaldata/)
'''
app.layout = html.Div([
    dcc.Markdown(children=markdown_text,
                 style={
                     'backgroundColor': colors['background'],
                     'textAlign': 'center',
                     'color': colors['text']
                 }),
    dcc.Graph(
        id='example-graph',
        figure=fig
    )
])
if __name__ == '__main__':
    app.run_server(debug=True)
Now go back to the activated environment and install dash, plotly, and pandas:
pip install dash plotly pandas
Then you can run your app by running this command:
python3 app.py
It should start up and print the local address where the app is running (by default http://127.0.0.1:8050/).

Now just copy the http link and paste it into chrome and boom! Your interactive dashboard should appear.

I went a step further and followed this tutorial by Elsa Scola on how to deploy your dash app for free. If you just want to play around with my interactive dashboard, you can do so here!
Thanks for following my tutorial, and feel free to connect with me on LinkedIn. I am always trying to get feedback and strengthen my data science skills, so if there is anything I could improve coding-wise, or anything you would like explained in greater detail, please let me know in the comments.