This post describes how to populate the anomaly database built in Part 1.
Motivation
Continuing the theme of end-to-end reproducible workflows, I want to be able to recreate my raw database programmatically as well.
At the end of this activity, I’m able to quickly load and manage ~6 GB of data for my personal use.

The entire code-base for populating the database is in my GitHub repo.
Who should read this?
If you’re comfortable using `download.file()`, system commands, arff file formats, and `{DBI}`, you won’t learn much here. Read on if you’re curious about my approach.
Steps
Only three files here:

- Initial Setup ([00-execute.R](https://github.com/rsangole/anomaly_db/blob/master/00-execute.R))
- Download data from the web ([01-download-data.R](https://github.com/rsangole/anomaly_db/blob/master/01-download-data.R))
- Load data into the `anomaly` database in PostgreSQL ([02-load-data-to-Postgres.R](https://github.com/rsangole/anomaly_db/blob/master/02-load-data-to-postgres.R))
Initial Setup
A list helps me keep track of the data sources, and helps me turn any downloads off to save on space/time (the Monash one is a ~2 GB download, for example).
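Here’s a sketch of what that list can look like; the names and URLs below are placeholders rather than the repo’s actual values:

```r
# Each entry records where the data lives and whether to download it.
# Turning `download` off for large sources saves space and time.
data_sources <- list(
  ucr = list(
    url      = "https://example.com/UCR_Archive2018.zip",  # placeholder URL
    download = TRUE
  ),
  monash = list(
    url      = "https://example.com/monash_archive.zip",   # placeholder URL
    download = FALSE                                        # ~2 GB; skip unless needed
  )
)
```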
Some simple housekeeping ensures the directories are set up correctly. Furthermore, if the folder is git-controlled, the directory which will house the datasets, `large_data`, needs to be in `.gitignore`. I check for this.
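A minimal sketch of that housekeeping, assuming the project root is the working directory:

```r
# Create the data directory if it doesn't exist yet
if (!dir.exists("large_data")) {
  dir.create("large_data")
}

# If the project is git-controlled, make sure large_data/ is ignored
gitignore <- if (file.exists(".gitignore")) readLines(".gitignore") else character()
if (!any(grepl("large_data", gitignore, fixed = TRUE))) {
  stop("Add 'large_data/' to .gitignore before downloading datasets")
}
```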
Download Data
Now, for the selected datasets in the list above, simply download the data using `download.file()` and move/unzip them to the `large_data` folder. I also check whether the folder already exists, since I’d like to overwrite it.
Here’s an example for the UCR dataset. The code for the rest of the datasets is pretty similar.
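Something along these lines works; the URL and folder names here are placeholders, not the repo’s actual values:

```r
# Download the UCR archive, unzip it into large_data/, and clean up the zip
ucr_url  <- "https://example.com/UCR_Archive2018.zip"     # placeholder URL
zip_path <- file.path("large_data", "UCR_Archive2018.zip")

# Overwrite any previous copy of the extracted folder
if (dir.exists("large_data/UCR_Archive2018")) {
  unlink("large_data/UCR_Archive2018", recursive = TRUE)
}

download.file(ucr_url, destfile = zip_path, mode = "wb")
unzip(zip_path, exdir = "large_data")
file.remove(zip_path)
```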
Load Data
Now, it’s as easy as:

- Connect to the PostgreSQL database using `DBI::dbConnect`
- Read a dataset from `large_data/`
- Simple cleanup (`janitor::clean_names`, all timestamp cols are called `time`, etc.)
- Use `DBI::dbWriteTable` to load the data into Postgres
Here’s an example code-base:
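The sketch below assumes a local PostgreSQL instance, credentials in environment variables, and a placeholder CSV with a `timestamp` column; it isn’t the exact code from the repo:

```r
library(DBI)
library(RPostgres)
library(janitor)
library(dplyr)

# Connect to the local `anomaly` database (host/credentials are assumptions)
con <- dbConnect(
  RPostgres::Postgres(),
  dbname   = "anomaly",
  host     = "localhost",
  port     = 5432,
  user     = Sys.getenv("POSTGRES_USER"),
  password = Sys.getenv("POSTGRES_PASSWORD")
)

# Read, clean names, and standardize the timestamp column to `time`
dat <- read.csv("large_data/some_dataset.csv") |>   # placeholder file
  clean_names() |>
  rename(time = timestamp)

# Write the table into Postgres
dbWriteTable(con, "some_dataset", dat, overwrite = TRUE)
dbDisconnect(con)
```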
For the Monash dataset, you do need to use [`foreign::read.arff()`](https://stat.ethz.ch/R-manual/R-devel/library/foreign/html/read.arff.html).
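A sketch of that step, with a placeholder path and a connection `con` open as in the earlier sketch:

```r
library(foreign)

# read.arff() returns a data.frame, so it drops into the same workflow
monash_dat <- read.arff("large_data/monash/some_series.arff")  # placeholder path
DBI::dbWriteTable(con, "monash_some_series", monash_dat, overwrite = TRUE)
```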
Tips
Large file downloads will time out within R’s default timeout window of 1 minute. Handle this before calling `download.file()`.
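Bumping the option up front is enough; the 20-minute value below is an arbitrary choice:

```r
# Raise R's download timeout (default is 60 s) before large downloads
options(timeout = max(1200, getOption("timeout")))
```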
Data dictionaries can be stored directly in the DB too. I store the contents of each README.md in the `UCR_Archive2018/*` folder in a data dictionary table called `ucr_00_meta`. This allows me to programmatically call the dictionary in downstream development.
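A sketch of how that table can be built, again with placeholder paths and a connection `con` open as before:

```r
# Collect every README.md under the UCR archive folder
readme_files <- list.files(
  "large_data/UCR_Archive2018",
  pattern    = "README\\.md$",
  recursive  = TRUE,
  full.names = TRUE
)

# One row per dataset: folder name + full README text
ucr_meta <- data.frame(
  dataset     = basename(dirname(readme_files)),
  description = vapply(
    readme_files,
    function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
    character(1)
  )
)

DBI::dbWriteTable(con, "ucr_00_meta", ucr_meta, overwrite = TRUE)
```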

Shiny can be effectively used for quick exploration. Here’s an example of something I’m building for myself. The dashboard pulls data from PostgreSQL directly. UCR metadata is also pulled from the db and rendered at the top of each page, making it quick to browse through the datasets. As I add more datasets, I keep expanding this dashboard.
