Background
In recent LinkedIn posts ( original and Rami‘s repost) and tweet, I asked the internet for their favorite datasets for anomaly detection problems, particularly in the time-series domain. I got lots of responses, and now have a massive amount of data to play with, thank you folks who responded.
To play with these data, I wanted a better solution than keeping data as CSV files, qs
objects, or in R Packages. I wanted a database to store the raw input data, my processed data and algorithm results. While I’ve setup databases in the past the traditional way, this time I wanted the entire codebase to be portable & reproducible.
As a result, I setup everything using Docker and it works seemlessly. If you’d like to learn how to do so, follow along my next posts:
- Part I – (this post) Will teach you how to setup a simple reproducible Docker based workflow for a personal PostgreSQL Database + RStudio for code development
- Part II – (next post) Will be the ETL pipelines for anomaly data
Why should you read this?
At the end of this tutorial, you’ll be able to rapidly setup a Docker based personal¹ PostgreSQL database. You will learn how to quickly deploy PostgreSQL & RStudio using docker-compose
. You will be able to access the database in R and begin development immediately. And most importantly, the whole process will be fully reproducible as you inherit the benefits of setup scripts and Docker.
This tutorial assumes you’re familiar with Docker and RStudio in a Docker environment. If you’re not, I recommend reading Reproducible Work in R first.
Overview
You will be launching two Docker images:
- A PostgreSQL image. I choose
[postgres:13.3](https://hub.docker.com/_/postgres/)
- An Rstudio image. I choose
[hatmatrix/blog:base](https://hub.docker.com/r/hatmatrix/blog/tags?page=1&ordering=last_updated)
²
To permanently store your data beyond the life of the containers, you will mount two volumes, one for each container. I chose:
- For PostgreSQL:
$HOME/docker/volumes/postgres
- For R Projects:
$HOME/github

None of these paths, except for the in-container PostgreSQL are special; you can customize the others to your liking. By default, postgres:13.3
expects the database to be at /var/lib/postgresql/data
. If you choose another database, modify this accordingly.
I use docker-compose
to launch both PostgreSQL and RStudio services together. It’s convenient while also ensuring the PostgreSQL service runs first followed by RStudio. It’s easy to start or stop all the services using just a few commands.
First Time Setup
You need to run these steps the first time you’re setting up the PostgreSQL database. I’ve stored these steps in 00-postgres-init.sh
.
1 – Directory Setup
You need a local directory to store the PostgreSQL database in. Lines 3–10 take care of this for you.
2 – PostgreSQL Setup
Now it’s time to setup the database. You need two steps at a minimum to get started:
- A new ‘role’ (akin to a login) with rights to create new databases.
- At least one database to work in. In my script, I’m making two:
work
andanomaly
.
To manipulate the database, you need a PostgreSQL server running to process the psql
commands. You’ll launch one using docker run
. You need the correct volume mounted using -v
. Next, we create the role and databases by piping psql
commands into docker exec
. Then, we stop the container
In summary, now I have a PostgresSQL database:
- stored at
$HOME/docker/volumes/postgres
- with a new role
rahul
and passwordpass
- with 2 databases:
work
andanomaly
Daily Workflow
tldr: How do you get going?
- Store
[docker-compose.yml](https://github.com/rsangole/postgres/blob/master/docker-compose.yml)
in a local directory - Modify it if you’ve changed my chosen images/directories
- In shell, run
docker-compose up -d
protip: to launch a browser (firefox for me) directly into RStudio as well, run this command in the directory where you have docker-compose.yml
:
docker-compose up -d; firefox localhost:8787
pro-protip: save an alias and generalize the command. The -f
arg instructs docker-compose
which file you’d like to use. Now can be run from anywhere in the system.
The breakdown
What’s in [docker-compose.yml](https://github.com/rsangole/postgres/blob/master/docker-compose.yml)
? We’re creating two services, one called db
and the other rstudio
.⁴
Let’s look at db
first. Most of the arguments will look familiar if you’re familiar with docker run
args. What’s new here is the restart: unless-stopped
arg which tells Docker to only start PostgreSQL if it’s currently stopped.
The 2nd service is rstudio
. Apart from the typical args, the interesting arg here is depends_on
which tells Docker to only run this image after the database is up and running. Fantastic!
Connecting via R
Use to test your connection. Run your DBI::
commands you would normally, except for one key difference.
While making the connection, make sure the name of the host
is the name of the database service you’ve chosen in docker-compose.yml
. (Outside docker, you would have typically used localhost
to connect to a local PostgreSQL server).
That’s it! You’re off to the races now. Use the DB as you normally would using
To Stop Services
You have two options here:
docker-compose stop
will stop the services, which you can restart usingdocker-compose start
.docker-compose down
will and remove containers as well. Rundocker-compose up
to get going once again.
Aug 8 edit: I’m just learning that footnotes from my original blogpost didn’t carry over to medium. Here they are:
- I harp on ‘personal’ given I’m not setting up appropriate roles, auth etc needed in a work environment. But, it’s good enough for my personal use.
- This is my own image based off of
[rocker/rstudio](https://hub.docker.com/r/rocker/rstudio)
-
_Attributions:_DB icon by Pixel perfect Folder icon by Icongeek26 RAM icon by Freepik Storage icon by Smashicons
- These are just labels, you can call them what you’d like
Originally published at https://rsangole.netlify.app.