
Making the Anomaly Database

This is part two of a two-part post on Docker, Postgres databases, and anomaly datasets. Read Part 1, which teaches you how to…

This post describes how you populate the anomaly database built in Part 1.

Motivation

Continuing the theme of end-to-end reproducible workflows, I want to be able to recreate my raw database programmatically as well.

At the end of this activity, I’m able to quickly load and manage ~6 GB of data for my personal use.

Lots of data, quickly explore-able using VSCode. Source: author.

The entire code-base for populating the database is in my GitHub repo.

Who should read this?

If you’re comfortable using download.file(), system commands, arff file formats, and {DBI}, you won’t learn much here. Read on if you’re curious about my approach.

Steps

Only three files here:

  1. Initial Setup ([00-execute.R](https://github.com/rsangole/anomaly_db/blob/master/00-execute.R))
  2. Download data from the web ([01-download-data.R](https://github.com/rsangole/anomaly_db/blob/master/01-download-data.R))
  3. Load data into anomaly database in PostgreSQL ([02-load-data-to-Postgres.R](https://github.com/rsangole/anomaly_db/blob/master/02-load-data-to-postgres.R))

Initial Setup

A list helps me keep track of the data sources, and lets me turn any download off to save on space/time (the Monash one is a ~2 GB download, for example).
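A minimal sketch of what that list could look like; the names and flags here are placeholders, not the exact variables used in the repo:

```r
# Hypothetical illustration: a named list of flags controls which
# data sources get downloaded on a given run.
data_sources <- list(
  ucr    = TRUE,   # UCR anomaly archive
  monash = FALSE,  # ~2 GB download; switch off to save time/space
  nab    = TRUE    # another example source
)
```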

Some simple housekeeping ensures directories are set up correctly. Furthermore, if the folder is git-controlled, the directory that will house the datasets, large_data, needs to be in .gitignore. I check for this.
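A rough sketch of that housekeeping step, assuming the paths described above:

```r
# Create the data directory if it doesn't exist yet
if (!dir.exists("large_data")) {
  dir.create("large_data")
}

# If the project is git-controlled, make sure large_data/ is ignored
if (dir.exists(".git")) {
  gitignore <- readLines(".gitignore", warn = FALSE)
  if (!any(grepl("large_data", gitignore))) {
    stop("Add 'large_data/' to .gitignore before downloading datasets")
  }
}
```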

Download Data

Now, for the datasets selected in the list above, simply download the data using download.file() and move/unzip it into the large_data folder. I also check whether the folder already exists and whether I’d like to overwrite it.

Here’s an example for the UCR dataset. The code for the rest of the datasets is pretty similar.
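A hedged sketch of that step; the URL, archive name, and overwrite logic are illustrative placeholders, and the actual values live in 01-download-data.R:

```r
# Illustrative sketch of the UCR download; not the repo's exact code
ucr_url   <- "https://example.com/UCR_Archive2018.zip"  # placeholder URL
zip_path  <- file.path("large_data", "ucr.zip")
overwrite <- FALSE                                      # skip if the folder already exists

if (overwrite || !dir.exists("large_data/UCR_Archive2018")) {
  download.file(ucr_url, destfile = zip_path, mode = "wb")
  unzip(zip_path, exdir = "large_data")
  file.remove(zip_path)
}
```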

Load Data

Now, it’s as easy as:

  1. Connect to the PostgreSQL database using DBI::dbConnect
  2. Read a dataset from large_data/
  3. Simple cleanup (janitor::clean_names(), making sure all timestamp columns are called time, etc.)
  4. Use DBI::dbWriteTable to load the data into postgres

Here’s an example code-base:
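A minimal sketch of the pattern, assuming a local Postgres instance; the connection details, file name, and table name below are assumptions rather than the repo’s actual values:

```r
library(DBI)
library(RPostgres)
library(readr)
library(janitor)
library(dplyr)

# Connect to the PostgreSQL database (credentials here are placeholders)
con <- dbConnect(
  RPostgres::Postgres(),
  host     = "localhost",
  port     = 5432,
  dbname   = "anomaly",
  user     = "postgres",
  password = Sys.getenv("POSTGRES_PASSWORD")
)

# Read one dataset from large_data/ (file name is illustrative)
dat <- read_csv("large_data/UCR_Archive2018/some_dataset.csv") |>
  clean_names() |>
  rename(time = timestamp)   # keep timestamp columns consistently named 'time'

# Load the cleaned data into Postgres
dbWriteTable(con, "ucr_some_dataset", dat, overwrite = TRUE)

dbDisconnect(con)
```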

For the Monash dataset, you do need to use [foreign::read.arff()](https://stat.ethz.ch/R-manual/R-devel/library/foreign/html/read.arff.html).
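For instance (the path is illustrative):

```r
library(foreign)

# .arff files from the Monash repository need read.arff() instead of read_csv()
monash_df <- foreign::read.arff("large_data/monash/some_series.arff")
```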

Tips

Large file downloads will time out within R’s default timeout window of 1 minute. Handle this before calling download.file().
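One way to handle that; the 10-minute value is an arbitrary choice, and the URL/destination are placeholders:

```r
# R's default download timeout is 60 seconds; raise it before large downloads
options(timeout = max(600, getOption("timeout")))
download.file(big_url, destfile = big_dest, mode = "wb")  # placeholders
```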

Data dictionaries can be stored directly in the DB too. I store the contents of each README.md in the UCR_Archive2018/* folder in a data dictionary table called ucr_00_meta. This allows me to programmatically call the dictionary in downstream development.
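A rough sketch of how that could look; the table and folder names follow the post, but the column names and parsing are simplified assumptions:

```r
# Collect every README.md under UCR_Archive2018/ into one data frame
readme_files <- list.files(
  "large_data/UCR_Archive2018",
  pattern = "README.md", recursive = TRUE, full.names = TRUE
)

ucr_meta <- data.frame(
  dataset     = basename(dirname(readme_files)),
  description = vapply(
    readme_files,
    function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
    character(1)
  )
)

# Write the data dictionary alongside the datasets
DBI::dbWriteTable(con, "ucr_00_meta", ucr_meta, overwrite = TRUE)
```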

Data dictionary for UCR dataset. Source: author.

Shiny can be used effectively for quick exploration. Here’s an example of something I’m building for myself. The dashboard pulls data from PostgreSQL directly. UCR metadata is also pulled from the db and rendered at the top of each page, making it quick to browse through the datasets. As I add more datasets, I keep expanding this dashboard.

Shiny dashboard to quickly visualize data. Source: author.
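A stripped-down sketch of the idea, reusing the `con` object and the ucr_00_meta table from earlier; the per-dataset table naming is an assumption, and this is not the actual dashboard code:

```r
library(shiny)
library(DBI)

ui <- fluidPage(
  selectInput("dataset", "UCR dataset", choices = NULL),
  verbatimTextOutput("meta"),   # README contents pulled from ucr_00_meta
  plotOutput("series")
)

server <- function(input, output, session) {
  # Pull the data dictionary once and populate the dropdown
  meta <- dbReadTable(con, "ucr_00_meta")
  updateSelectInput(session, "dataset", choices = meta$dataset)

  output$meta <- renderText({
    req(input$dataset)
    meta$description[meta$dataset == input$dataset]
  })

  output$series <- renderPlot({
    req(input$dataset)
    # Table naming convention here is an assumption
    dat <- dbReadTable(con, paste0("ucr_", tolower(input$dataset)))
    plot(dat$time, dat$value, type = "l", xlab = "time", ylab = "value")
  })
}

shinyApp(ui, server)
```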
