The world’s leading publication for data science, AI, and ML professionals.

Docker based RStudio & PostgreSQL

This is part one of the two part post related to Docker, PostgreSQL databases and Anomaly data-sets.

Background

In recent LinkedIn posts ( original and Rami‘s repost) and tweet, I asked the internet for their favorite datasets for anomaly detection problems, particularly in the time-series domain. I got lots of responses, and now have a massive amount of data to play with, thank you folks who responded.

To play with these data, I wanted a better solution than keeping data as CSV files, qs objects, or in R Packages. I wanted a database to store the raw input data, my processed data and algorithm results. While I’ve setup databases in the past the traditional way, this time I wanted the entire codebase to be portable & reproducible.

As a result, I setup everything using Docker and it works seemlessly. If you’d like to learn how to do so, follow along my next posts:

  • Part I – (this post) Will teach you how to setup a simple reproducible Docker based workflow for a personal PostgreSQL Database + RStudio for code development
  • Part II – (next post) Will be the ETL pipelines for anomaly data

Why should you read this?

At the end of this tutorial, you’ll be able to rapidly setup a Docker based personal¹ PostgreSQL database. You will learn how to quickly deploy PostgreSQL & RStudio using docker-compose. You will be able to access the database in R and begin development immediately. And most importantly, the whole process will be fully reproducible as you inherit the benefits of setup scripts and Docker.

This tutorial assumes you’re familiar with Docker and RStudio in a Docker environment. If you’re not, I recommend reading Reproducible Work in R first.

Overview

You will be launching two Docker images:

  1. A PostgreSQL image. I choose [postgres:13.3](https://hub.docker.com/_/postgres/)
  2. An Rstudio image. I choose [hatmatrix/blog:base](https://hub.docker.com/r/hatmatrix/blog/tags?page=1&ordering=last_updated) ²

To permanently store your data beyond the life of the containers, you will mount two volumes, one for each container. I chose:

  1. For PostgreSQL: $HOME/docker/volumes/postgres
  2. For R Projects: $HOME/github
Overview of the setup, created by author. Icons obtained from flaticon.com³.
Overview of the setup, created by author. Icons obtained from flaticon.com³.

None of these paths, except for the in-container PostgreSQL are special; you can customize the others to your liking. By default, postgres:13.3 expects the database to be at /var/lib/postgresql/data. If you choose another database, modify this accordingly.

I use docker-compose to launch both PostgreSQL and RStudio services together. It’s convenient while also ensuring the PostgreSQL service runs first followed by RStudio. It’s easy to start or stop all the services using just a few commands.

First Time Setup

You need to run these steps the first time you’re setting up the PostgreSQL database. I’ve stored these steps in 00-postgres-init.sh.

1 – Directory Setup

You need a local directory to store the PostgreSQL database in. Lines 3–10 take care of this for you.

2 – PostgreSQL Setup

Now it’s time to setup the database. You need two steps at a minimum to get started:

  1. A new ‘role’ (akin to a login) with rights to create new databases.
  2. At least one database to work in. In my script, I’m making two: work and anomaly.

To manipulate the database, you need a PostgreSQL server running to process the psql commands. You’ll launch one using docker run. You need the correct volume mounted using -v. Next, we create the role and databases by piping psql commands into docker exec. Then, we stop the container

In summary, now I have a PostgresSQL database:

  • stored at $HOME/docker/volumes/postgres
  • with a new role rahul and password pass
  • with 2 databases: work and anomaly

Daily Workflow

tldr: How do you get going?

  1. Store [docker-compose.yml](https://github.com/rsangole/postgres/blob/master/docker-compose.yml) in a local directory
  2. Modify it if you’ve changed my chosen images/directories
  3. In shell, run docker-compose up -d

protip: to launch a browser (firefox for me) directly into RStudio as well, run this command in the directory where you have docker-compose.yml:

docker-compose up -d; firefox localhost:8787

pro-protip: save an alias and generalize the command. The -f arg instructs docker-compose which file you’d like to use. Now can be run from anywhere in the system.

The breakdown

What’s in [docker-compose.yml](https://github.com/rsangole/postgres/blob/master/docker-compose.yml)? We’re creating two services, one called db and the other rstudio.⁴

Let’s look at db first. Most of the arguments will look familiar if you’re familiar with docker run args. What’s new here is the restart: unless-stopped arg which tells Docker to only start PostgreSQL if it’s currently stopped.

The 2nd service is rstudio. Apart from the typical args, the interesting arg here is depends_on which tells Docker to only run this image after the database is up and running. Fantastic!

Connecting via R

Use to test your connection. Run your DBI:: commands you would normally, except for one key difference.

While making the connection, make sure the name of the host is the name of the database service you’ve chosen in docker-compose.yml. (Outside docker, you would have typically used localhost to connect to a local PostgreSQL server).

That’s it! You’re off to the races now. Use the DB as you normally would using

To Stop Services

You have two options here:

  1. docker-compose stop will stop the services, which you can restart using docker-compose start.
  2. docker-compose down will and remove containers as well. Run docker-compose up to get going once again.

Aug 8 edit: I’m just learning that footnotes from my original blogpost didn’t carry over to medium. Here they are:

  1. I harp on ‘personal’ given I’m not setting up appropriate roles, auth etc needed in a work environment. But, it’s good enough for my personal use.
  2. This is my own image based off of [rocker/rstudio](https://hub.docker.com/r/rocker/rstudio)
  3. _Attributions:_DB icon by Pixel perfect Folder icon by Icongeek26 RAM icon by Freepik Storage icon by Smashicons

  4. These are just labels, you can call them what you’d like

Originally published at https://rsangole.netlify.app.


Related Articles