Creating an environment with Airflow and DBT on AWS (part 2)

Installing DBT and some settings to make the work easier

Arthur Chaves
Towards Data Science



In part 1 of this article, we launched an EC2 instance, installed some OS requirements and then installed Airflow. Now we are going to install DBT. But first, let’s adjust a few settings to make things simpler.

1. Connecting to the EC2 instance with the airflow user

Until now, we were connecting to the EC2 instance with the default user, ec2-user, and then switching to the airflow user, the owner of our installations. We can simplify the connection step by allowing a direct connection with the airflow user. To do this, first go to VSCode, connect to the EC2 instance and switch to the airflow user. Then we need to create a directory to store the SSH authorized keys, copy the file with them (this file is what allows the connection using the .pem file generated when we launched the instance) and adjust some permissions. Here are the commands:

$ mkdir ~/.ssh
$ sudo cp /home/ec2-user/.ssh/authorized_keys ~/.ssh/authorized_keys
$ sudo chown airflow:airflow ~/.ssh/authorized_keys
$ sudo chmod 700 ~/.ssh
$ sudo chmod 600 ~/.ssh/authorized_keys

In VSCode, press F1, type SSH and choose “Add New SSH Host”. Now, use the same command we used in part 1, but changing the user:

ssh -i airflow_server.pem airflow@myinstancehost

Follow the same instructions from part 1 to adjust the path to the .pem file in the SSH configuration file, choose a name for the new connection and then save the file.

The SSH configuration file with the new connection
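For reference, the new entry in the SSH configuration file should look something like the one below. The host alias and the .pem path are just examples; use the name you chose and the path where you keep your key.

Host airflow-server
    HostName myinstancehost
    User airflow
    IdentityFile /path/to/airflow_server.pem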

After that, press F1 again, choose “Connect to Host” and select the new connection. Now you’re connected to the instance with the airflow user. Whenever you stop the instance and connect again, you can just use this connection, with this user.

2. Starting Airflow services automatically

To start Airflow as we did in part 1, we could just run the airflow commands (webserver and scheduler). But we would have to open two terminals and leave them open in VSCode. We could use nohup to detach the commands from the terminal, but an even better option is to run these services as daemons. This makes it easier not only to start/stop the services, but also to start them automatically along with the instance.
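Just for reference, the nohup approach would look something like the commands below, run with the virtualenv from part 1 activated (we won’t use this approach here; the log file names are arbitrary):

$ nohup airflow webserver -p 8080 > ~/airflow-webserver.log 2>&1 &
$ nohup airflow scheduler > ~/airflow-scheduler.log 2>&1 &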

To set them up as daemons, we will use the files available in the Airflow GitHub repository, with some adjustments.

First, create a file in /etc/sysconfig, named airflow:

$ sudo vi /etc/sysconfig/airflow

And the content:

AIRFLOW_CONFIG=/home/airflow/airflow/airflow.cfg
LD_LIBRARY_PATH=/usr/local/lib

This defines two environment variables that will be used when running the Airflow commands: the path to the Airflow configuration file and the path to the directory with the correct version of SQLite.

Now, create a file in /usr/lib/systemd/system named airflow-webserver.service:

$ sudo vi /usr/lib/systemd/system/airflow-webserver.service

And the content as below. Notice that I am passing the port, even though I am using the default, 8080. If you intend to have another application running on the same server using port 8080, you can change it to a different one:

[Unit]
Description=Airflow webserver daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service
[Service]
PIDFile=/run/airflow/webserver.pid
EnvironmentFile=/etc/sysconfig/airflow
User=airflow
Group=airflow
Type=simple
ExecStart=/usr/bin/bash -c 'source /home/airflow/.venv/airflow/bin/activate; airflow webserver -p 8080 --pid /run/airflow/webserver.pid'
ExecReload=/bin/kill -s HUP $MAINPID
ExecStop=/bin/kill -s TERM $MAINPID
Restart=on-failure
RestartSec=5s
PrivateTmp=true
[Install]
WantedBy=multi-user.target

Create another file in the same directory, named airflow-scheduler.service:

$ sudo vi /usr/lib/systemd/system/airflow-scheduler.service

And the content:

[Unit]
Description=Airflow scheduler daemon
After=network.target postgresql.service mysql.service redis.service rabbitmq-server.service
Wants=postgresql.service mysql.service redis.service rabbitmq-server.service
[Service]
PIDFile=/run/airflow/scheduler.pid
EnvironmentFile=/etc/sysconfig/airflow
User=airflow
Group=airflow
Type=simple
ExecStart=/usr/bin/bash -c 'source /home/airflow/.venv/airflow/bin/activate; airflow scheduler --pid /run/airflow/scheduler.pid'
KillMode=process
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target

Finally, create a file /usr/lib/tmpfiles.d/airflow.conf, with:

$ sudo vi /usr/lib/tmpfiles.d/airflow.conf

And the content below:

D /run/airflow 0755 airflow airflow

Now create one directory, owned by airflow:

$ sudo mkdir /run/airflow
$ sudo chown -R airflow:airflow /run/airflow
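If systemd does not pick up the new unit files right away, reload its configuration before enabling the services:

$ sudo systemctl daemon-reload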

OK, now the two services, the webserver and the scheduler, are able to run as daemons. To make them start automatically when the instance boots, just run these commands:

$ sudo systemctl enable airflow-webserver.service
$ sudo systemctl enable airflow-scheduler.service

If you haven’t started Airflow since starting your instance, you can now do it just by running:

$ sudo systemctl start airflow-webserver.service
$ sudo systemctl start airflow-scheduler.service
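Once the services are up, a quick sanity check from the instance itself is to hit the webserver’s /health endpoint (assuming you kept the default port 8080):

$ curl http://localhost:8080/health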

To check the status of the services, you can run the commands

$ sudo systemctl status airflow-webserver.service
$ sudo systemctl status airflow-scheduler.service

The green message, showing the service is running

The result we expect is the status “active (running)” in green. If you get something different, showing the service is not running, you can check the output of the services with the commands below (the 100 is the number of lines to show; you can change it):

$ journalctl -u airflow-webserver -n 100
$ journalctl -u airflow-scheduler -n 100
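If you prefer to follow the logs live while debugging, the -f flag streams new entries as they are written:

$ journalctl -u airflow-webserver -f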

3. Installing DBT

Now, it’s time to install DBT. Since we are going to use Airflow to orchestrate DBT jobs and want to keep the setup simple and cheap, we will install it on the same server.

If you followed part 1 of this article, you have already installed all the OS requirements. Now we just need to install the Python packages. First, activate the Python environment we created:

$ source ~/.venv/airflow/bin/activate

Now, install these two packages required to install DBT:

$ pip3 install pyicu-binary pyicu

Finally, we can now install dbt, with the command:

$ pip3 install dbt

After a few seconds, you will see the success message. If, during the installation of dbt, you get a failure message about .h files not being found, it’s probably because gcc is looking only in the /usr/include folder. If this happens, run the following command to create symbolic links to the headers in the folder where gcc is trying to find them:

$ sudo ln -sv /usr/include/python2.7/* /usr/include/

You can then check that everything worked by running:

$ dbt --version

The command will show the version of dbt and its plugins.

The output when checking the dbt version

4. Launching a database on RDS

DBT is a tool meant to run on a Data Warehouse. Although it is compatible with Redshift, it also works with Postgres. To avoid unexpected billing with Redshift (an expired free tier period, or a cluster configured with resources/time above the free tier), which could be really expensive, we are going to use Postgres on RDS. You can choose to use Redshift instead; just pay attention to the configurations we are going to make and you’ll probably manage to switch when needed. Also remember that Redshift uses port 5439 instead of 5432.

First, go to the AWS Management Console and access the RDS page. Select ‘Create database’. On the configuration page, select ‘Standard create’, choose ‘PostgreSQL’, and you can keep the 12.5-R1 version.

Creating the database

Choose the Free tier template, to make it easier to configure this database. Choose any name for your instance, a master username and a password.

In the “Public access” option, choose ‘Yes’, so you can connect from your network.

You don’t need to change any of the other options on this page, so just click on ‘Create database’.

You will be redirected to the list of databases. Select the one you just created and wait a few minutes until its status is ‘Available’.

Click on the name of the database to open its configuration page. In the ‘Connectivity and security’ tab, click on the link under ‘VPC security groups’.

The link to open the security group rules

This opens the security group rules for the database. At the bottom of the page, there are some tabs. Select ‘Inbound rules’, and then ‘Edit inbound rules’.

Adding new rules to the security group

Click on ‘Add rule’. In the type, choose ‘PostgreSQL’, and in the source, use the private IP of your EC2 instance. Then, click on ‘Save rules’. If you can’t find this IP, just open the EC2 page in a new tab, select your instance, and copy the IP under “Private IPv4 addresses”.

Allowing your instance to connect to the database
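At this point, you can optionally test the connectivity from the EC2 instance itself. If the psql client is installed there (it is not by default, so treat this as an optional check), a command like the one below, with your RDS endpoint in place of your-db-host, should prompt for the password and return a row:

$ psql -h your-db-host -U postgres -d postgres -c 'select 1;'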

Now, connect to your new database (I suggest using DBeaver, but you’re free to choose your own tool).

Creating the connection on DBeaver is pretty simple

Within the default database, postgres, create two schemas, one named ‘data_warehouse’ and the other named ‘data_lake’. If you prefer, you can also create a specific database for this, but I’ll keep it simple and use the postgres database.
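If you would rather do it with SQL than with the DBeaver UI, these two statements do the same thing:

create schema data_lake;
create schema data_warehouse;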

The database connection on Dbeaver, with the two new schemas

Also, create a table in your database and add some data. Just run this on DBeaver:

create table data_lake.user (name varchar(256), birth_date DATE, email varchar(256));

insert into data_lake.user values
('Ted Mosby', '1978-07-04', 'ted.mosby@mosbiusdesign.com'),
('Robin Scherbatsky', '1980-12-04', 'bus.lady@world.com'),
('Barney Stinson', '1976-11-02', 'suitup@gnb.com'),
('Marshall Eriksen', '1978-03-16', 'fudge@supreme.com'),
('Lily Aldrin', '1979-09-22', 'comisisoner@slapbet.com');

5. Configuring DBT

The first thing we need to configure is the profiles file. In this file, we put the information to connect to our Data Warehouse, so when we have different databases and schemas, this is where we are going to add them. The default path to store this file is a hidden folder named .dbt. You may also want to change the permissions to avoid unauthorized access to this file, because, for now, we are going to keep the password in it. So, to create the directory and the file:

$ mkdir ~/.dbt
$ vi ~/.dbt/profiles.yml

And add the information to connect to your database, replacing the values below with your own where needed:

dw_dev:
  target: data_warehouse
  outputs:
    data_warehouse:
      type: postgres
      host: your-db-host
      user: postgres
      pass: postgres
      port: 5432
      dbname: postgres
      schema: data_warehouse
      threads: 4
      keepalives_idle: 0
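Since the password is stored in plain text here, you may want to tighten the file permissions, as mentioned above:

$ chmod 600 ~/.dbt/profiles.yml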

Now, we need to create the dbt_project file. This file contains the information about the project. Let’s also create a folder to keep the files we are going to use with DBT:

$ mkdir ~/dbt
$ cd ~/dbt
$ vi ~/dbt/dbt_project.yml

On the DBT documentation page, you can find an example of this file. For now, we are going to add just a few properties. As you advance in your usage of DBT, you can add other properties. So, add the following to this file:

name: 'project_dbt_airflow'
config-version: 2
version: 1.0
profile: dw_dev

source-paths: ['source']
data-paths: ['data']
test-paths: ['test']
analysis-paths: ['analysis']
macro-paths: ['macro']
snapshot-paths: ['snapshots']

models:
  +materialized: table
  +enabled: true

Now, create the folder to store the models (the .sql files). As you can see in the content of the previous file, we could also have other directories (for tests, macros etc.), but we are not going to use them for now. Also, create a file to run our first model on DBT:

$ mkdir ~/dbt/source
$ vi ~/dbt/source/user.sql

And add just this one line:

select * from data_lake.user
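Before the first run, you can optionally check that dbt can parse the project and reach the database; dbt debug validates both the profiles.yml connection and the project files when run from the project folder:

$ cd ~/dbt
$ dbt debug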

Now, we are ready to run dbt for the first time. Go to the dbt project folder and run the dbt command:

$ cd ~/dbt
$ dbt run

The output after running the dbt run command

What DBT just did was run the query and then create a model (in this case, a table) with its result. If you have an architecture in which the source data is accessible from the data warehouse connection (like a data lake on S3 accessed by a Redshift data warehouse with Spectrum), you can build your models by querying your source like this. If that is not the case, you would first have to move the data from your source into your data warehouse (as the DBT documentation points out, DBT handles only the T of ELT). But you can always add a step on Airflow to do this as part of your pipeline.

Anyway, you can check the data in your data warehouse on DBeaver, using:

select * from data_warehouse.user;

Now we have both DBT and Airflow installed on our instance. In the next parts, we are going to integrate them, using Airflow to orchestrate DBT jobs, and also use DBT Cloud, to check another integration option.

Read all the parts of this article:

Part 1: Launching an instance and installing Airflow
Part 2: Installing DBT and some settings to make the work easier
Part 3: Using DBT Cloud and integrating Airflow with DBT

References:

https://www.getdbt.com/
https://airflow.apache.org/
https://github.com/apache/airflow/tree/master/scripts/systemd
https://docs.getdbt.com/reference/dbt_project.yml

Sources that helped me with the basis of most of what I did here:
https://www.datascienceacademy.com.br
https://docs.getdbt.com/docs/introduction
