Building Spotify Discover Weekly Email Alert with Luigi

A simple Luigi data pipeline

Ricky Kim
Towards Data Science

Photo by Ryan Quintal on Unsplash

First of all, I would like you to understand that this is a record of my personal learning. You might find a better way to implement things than what I have tried and shown here. If you have any suggestions to improve the code, I’d be very happy to hear your advice and comments.

If you are a Spotify user, you may have heard of their feature called “Discover Weekly”. Discover Weekly is a playlist of 30 songs that Spotify recommends based on your listening history. I absolutely love Discover Weekly, and sometimes even get a little scared by how well Spotify knows me. And it appears it is not just me.

The only problem I have with Discover Weekly is that I can’t access my historical Discover Weekly playlists, since the playlist automatically refreshes every Monday. When I forget to save the songs I like to my library or playlists, the list is completely gone the next week, and I have no way to figure out what that song was that I absolutely loved but forgot to save.

The small private project I am sharing here started from the above problem. The solution I came up with is to extract the list of songs from Discover Weekly every Monday and send the list as an email to myself. It was a perfect opportunity to try a simple data pipeline with Luigi.

Requires:

  • Spotify account
  • Gmail account
  • AWS account

(BTW I am using a Macbook for this project, some of the steps might be slightly different if you are on Windows.)

Getting Spotify API Access

To access Spotify programmatically, you need a client ID and a client secret from Spotify. You can get them easily by going to the Spotify for Developers page. Once you are on the dashboard page, click the green “CREATE A CLIENT ID” button, and you will be asked for details such as the app name and description.

Next, you will be asked if the app is for commercial use. In my case, I am just building this for myself, so I clicked “No”.

Finally, tick the checkboxes, then submit.

Then you will be given a client ID and a client secret that you can use to access the Spotify API.

Note them down or copy and paste them somewhere, because we will need them later. Click “EDIT SETTINGS”, and add “https://localhost:8080” to Redirect URIs. In a proper app, this would redirect the user to the app after confirming the API access, but in this case, it will only be used as part of the authentication params.

One last thing you need to do is to follow Discover Weekly on your Spotify. This makes it possible to retrieve Discover Weekly from our Python program.

Launching EC2 Instance on AWS

Sign in to your AWS Management Console, and click into EC2. I am writing this assuming that you already have an AWS account. Click “Launch Instance”. For this project, I chose Amazon Linux AMI 2018.03.0.

Make sure the instance type chosen is t2.micro, which is eligible for the free tier. One important step is to open the port for Luigi so that we can access Luigi’s central scheduler.

Keep the default settings for the rest except for “6. Configure Security Group”. Once you get here, click “Add Rules”, choose Custom TCP from the “Type” dropdown, and type “8082” in “Port Ranges”. Luigi’s central scheduler GUI uses port 8082 by default, so this step enables us to access the Luigi GUI in a web browser. As an additional step, you can add your own IP address in the “Sources” section, so that you only allow inbound traffic from a certain IP address. If you want to allow only your own IP address, type “your-IP-address/32” in the Sources section. Now click “Review and Launch”.
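To see why the “/32” suffix restricts the rule to a single address, here is a minimal Python sketch using the standard ipaddress module. The address 203.0.113.7 is a placeholder from a documentation range and the helper name is_allowed is made up for illustration; neither comes from the project.

```python
import ipaddress

# Placeholder address from a documentation range; substitute your own public IP.
my_ip = "203.0.113.7"

# A /32 network contains exactly one IPv4 address.
only_me = ipaddress.ip_network(f"{my_ip}/32")

# For comparison, 0.0.0.0/0 matches every IPv4 address.
everyone = ipaddress.ip_network("0.0.0.0/0")


def is_allowed(source_ip: str, rule: ipaddress.IPv4Network) -> bool:
    """Return True if source_ip falls inside the range of the given rule."""
    return ipaddress.ip_address(source_ip) in rule


print(only_me.num_addresses)                 # 1
print(is_allowed("203.0.113.8", only_me))    # False
print(is_allowed("203.0.113.8", everyone))   # True
```

In other words, a “/32” source lets only that one address reach port 8082, while a wide-open source would expose the Luigi GUI to anyone.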

Next, you will be prompted to either choose an existing key pair or create a new one. The key will be used when communicating with your instance from your local machine. Let’s create a new key pair for the project.

First, download the key pair, then launch the instance. Go back to the EC2 section of the AWS console, and you might see that the instance is not yet in the “running” state. Give it a few moments, and when it turns to “running”, take note of its Public DNS (IPv4) and IPv4 Public IP.

Additional AWS Preparation

Either from terminal or on AWS web console, create an S3 bucket named “luigi-spotify”. This will be later used to store the list of songs extracted from Spotify as TSV.

Connecting to EC2 Instance

I hope there was nothing too complicated so far. Now that we have launched the instance, we can ssh into it. Before we do that, we need to change the file permission of the key pair we downloaded above, because the EC2 instance will not accept a key file that is publicly viewable. Open your terminal and run the command below after replacing the “directory…” part with your own directory.

chmod 400 directory-where-you-downloaded-the-key-file/luigi_spotify.pem

There are 3 permissions (Read, Write, Execute) for 3 types of users (User, Group, Others). The above command changes the file permission so that the key file has only one permission (Read) allowed to one type of user (User). Now we are ready to ssh into our instance. Again, please replace the parts “directory…” and “your-instance…” with your own directory and public DNS.

ssh -i directory-where-you-downloaded-the-key-file/luigi_spotify.pem ec2-user@your-instance's-public-DNS
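As a small aside on the chmod 400 step above, the following Python sketch uses the standard stat module to show how an octal mode maps onto the Read/Write/Execute bits for User, Group, and Others. The helper names describe and open_to_group_or_others are made up for this illustration.

```python
import stat


def describe(mode: int) -> str:
    """Render octal permission bits the way `ls -l` shows them for a regular file."""
    return stat.filemode(stat.S_IFREG | mode)


def open_to_group_or_others(mode: int) -> bool:
    """Return True if Group or Others hold any permission at all."""
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


print(describe(0o400))  # -r--------  (read for User only)
print(describe(0o644))  # -rw-r--r--  (a common default, readable by everyone)
print(open_to_group_or_others(0o400))  # False
print(open_to_group_or_others(0o644))  # True
```

Each octal digit covers one type of user, so 400 means “4 = read” for User and “0 = nothing” for Group and Others.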

Preparing EC2 Instance for Luigi Tasks

Once in your EC2 instance, let’s first install Git so that we can clone the repository I prepared for this project.

sudo yum -y install git

Now clone the repository using git clone command.

git clone https://github.com/tthustla/luigi_spotify.git

Go to the cloned directory, and let’s first take a look at files.

cd ~/luigi_spotify
ls

ec2-prep.sh will be used to install the required packages. luigi.cfg is a configuration file where you will put all the API keys and credentials. luigi_cron.sh is a bash script that runs the Luigi pipeline defined in run_luigi.py.

Make both of the bash scripts executable by running the command below.

chmod +x *.sh

Now let’s run ec2-prep.sh.

./ec2-prep.sh

Luigi

Before we actually run the pipeline, it’d be good to have an understanding of what the pipeline does and how it does it. The code lives in run_luigi.py in the cloned repository.

On a high level, it performs two tasks. First, it gets the list of songs from the Discover Weekly playlist and stores them in S3 as a TSV file. Once the TSV is stored, it uses the TSV to create an email message that shows each song as [Song Title] - [Artist] with a Spotify link, then sends the message to yourself. The first task is defined in the GetWeeklyTracks class, and the second in the SendWeeklyEmail class. To run, these tasks need credentials, which are retrieved from the luigi.cfg file using the luigi.Config class. Getting the Spotify API token and establishing the connection to Spotify are done outside of the Luigi tasks.
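To make the second step more concrete, here is a minimal sketch of turning TSV rows into an HTML email body, using only the standard library. It is not the code from run_luigi.py: the column names (title, artist, url), the subject line, and the build_email helper are assumptions made for illustration, and actually sending the message (for example through Gmail’s SMTP server) is left out.

```python
import csv
import html
import io
from email.message import EmailMessage


def build_email(tsv_text: str, sender: str, recipient: str) -> EmailMessage:
    """Build an HTML email listing each track as a linked 'Title - Artist' line."""
    # Assumed TSV layout: a header row with title, artist, and url columns.
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    links = [
        '<a href="{url}">{title} - {artist}</a>'.format(
            url=html.escape(row["url"], quote=True),
            title=html.escape(row["title"]),
            artist=html.escape(row["artist"]),
        )
        for row in rows
    ]
    msg = EmailMessage()
    msg["Subject"] = "Discover Weekly"  # hypothetical subject line
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("<br>\n".join(links), subtype="html")
    return msg


# Hypothetical one-row sample; the track URL is a placeholder.
sample_tsv = "title\tartist\turl\nSong A\tArtist A\thttps://open.spotify.com/track/abc\n"
message = build_email(sample_tsv, "me@example.com", "me@example.com")
print(message.get_content())
```

Escaping the title and artist with html.escape keeps characters such as “&” in a song name from breaking the markup.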

Running Luigi on Local Scheduler

The next thing we need to do is fill in the information in the luigi.cfg file. First, open the file with Nano.

nano luigi.cfg
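Before filling in the values, it may help to see how an INI-style file like luigi.cfg is typically parsed. The sketch below uses Python’s standard configparser module rather than Luigi itself, and the section and key names are made up for illustration. It shows that quote characters are not stripped, so they would end up as part of the credential.

```python
import configparser

# Hypothetical section and keys; the real names are in the repository's luigi.cfg.
SAMPLE = """
[spotify]
client_id = abc123
client_secret = "xyz789"
"""


def load(text: str) -> configparser.ConfigParser:
    """Parse INI-style text into a ConfigParser."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def find_quoted_values(parser: configparser.ConfigParser) -> list:
    """Return (section, key) pairs whose value starts with a quote character."""
    return [
        (section, key)
        for section in parser.sections()
        for key, value in parser.items(section)
        if value.startswith(("'", '"'))
    ]


cfg = load(SAMPLE)
print(repr(cfg["spotify"]["client_id"]))      # 'abc123'
print(repr(cfg["spotify"]["client_secret"]))  # '"xyz789"' (quotes kept in the value)
print(find_quoted_values(cfg))                # [('spotify', 'client_secret')]
```

A secret stored with its surrounding quotes would be sent to the API with the quotes included, which is an easy way to get an authentication failure.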

Fill in each value with your own credentials, without single or double quotes. Now we are finally ready to do a local test run of the pipeline within our EC2 instance.

python3 run_luigi.py SendWeeklyEmail --local-scheduler

Due to the authentication flow of Spotipy (a Python library for the Spotify API), you will see instructions like below.

If tested on your local machine, this would have opened the web browser, but it doesn’t open on EC2 since there is no web browser installed. Copy the URL (the blue highlighted part in the above screenshot), then paste it into the web browser on your local machine and open it.

If you see a screen like above, click “AGREE”, and it will show an error page like below. You don’t have to worry about this error page. It happens because the redirect URI we provided is just a localhost port without anything running on it. Copy the URL of the error page; there is a code embedded in the URL that will be used by Spotipy’s authentication flow.

Now go back to your EC2 terminal and paste the URL into the console, where it shows “Enter the URL you were redirected to:”. Chances are high that this won’t succeed on the first try, because Google may block the Gmail login from an unknown IP address at first. If this happens, log in to your Gmail using the web browser on your local machine, then try running the command again. If everything goes well, you will receive the email sent from the trial run.
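For the curious, the “code embedded in the URL” mentioned above is a query parameter that Spotify’s authorization flow appends to the redirect URI. Spotipy extracts it for you; the sketch below only illustrates the idea with the standard library, and the extract_code helper and the sample code value are made up.

```python
from urllib.parse import parse_qs, urlparse


def extract_code(redirected_url: str) -> str:
    """Pull the authorization code out of the URL the browser was redirected to."""
    query = parse_qs(urlparse(redirected_url).query)
    if "error" in query:
        # For example, the user declined access on the consent screen.
        raise ValueError(f"authorization failed: {query['error'][0]}")
    if "code" not in query:
        raise ValueError("no authorization code found in the URL")
    return query["code"][0]


# Hypothetical redirected URL; a real code is much longer.
print(extract_code("https://localhost:8080/?code=AQD_example"))  # AQD_example
```

This is why the browser error page does not matter: the only thing needed from it is the address in the URL bar.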

Running Luigi on Central Scheduler

We are almost there. The trial run succeeded, and now we have a few more steps to go. Let’s do a proper run of the pipeline with Luigi’s Central Scheduler, so that we can access the Luigi GUI. First, create a directory for log files.

mkdir ~/luigi_spotify/log

When we do a background run, console output will be stored in log files in the directory we just created. Let’s launch the Luigi Central Scheduler.

luigid --background --logdir ~/luigi_spotify/log

Since we have opened port 8082 to access the GUI, we can now open the GUI in a web browser. Open the page with the IPv4 Public IP and “:8082” attached as below.

your-EC2-public-IP:8082

We haven’t run any tasks yet, so you won’t see any tasks now. Now let’s run the pipeline without the “--local-scheduler” param at the end. You might want to delete the folder created in the S3 bucket during the trial run to see the whole pipeline running from scratch. Otherwise, Luigi will see that the output files are already in the S3 bucket and mark the tasks as successful without running any of them. Now you will see that the two tasks ran successfully.
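The behaviour described above, where a task is skipped because its output already exists, can be illustrated with a toy model. The sketch below does not use Luigi at all: ToyTask and schedule are made-up names, and a local temporary file stands in for the TSV in the S3 bucket.

```python
import tempfile
from pathlib import Path


class ToyTask:
    """Toy stand-in for a pipeline task: complete when its output file exists."""

    def __init__(self, output_path: Path):
        self.output_path = output_path
        self.run_count = 0

    def complete(self) -> bool:
        return self.output_path.exists()

    def run(self) -> None:
        self.run_count += 1
        self.output_path.write_text("title\tartist\n")


def schedule(task: ToyTask) -> str:
    """Run the task only if its output is missing."""
    if task.complete():
        return "skipped"
    task.run()
    return "ran"


with tempfile.TemporaryDirectory() as tmp:
    task = ToyTask(Path(tmp) / "tracks.tsv")
    first = schedule(task)      # output missing, so the task runs
    second = schedule(task)     # output exists, so the task is skipped
    task.output_path.unlink()   # like deleting the folder in the S3 bucket
    third = schedule(task)      # runs again from scratch

print(first, second, third)  # ran skipped ran
```

This idempotence is handy for a weekly job, since re-running the pipeline for the same week does no extra work unless the output is removed.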

Hooray!!

Image courtesy of vernonv on Redbubble

Setting Up a Cron Job

The very last part is to set up a Cron job so that we can decide when and how frequently these tasks run. One thing you have to consider here is that your EC2 instance’s Linux time might be different from your local time. Run the commands below to set the time zone in your EC2 instance.

cd /usr/share/zoneinfo
tzselect

After you select the right timezone, copy the part that’s highlighted in blue from your own terminal and run it. Since I will finish setting up my EC2 instance without restarting, I just run the code directly in the terminal without appending it to ‘.profile’.

TZ='Europe/London'; export TZ

We will be setting up a Cron job with luigi_cron.sh, which will run run_luigi.py. As you will see from the Cron command below, I am specifying LC_CTYPE with the correct value for the EC2 instance. This small part took me a while to figure out. The same file and the same tasks were throwing an encoding error when run as a Cron job, while they worked perfectly fine without Cron. After a lot of googling, I finally found a way that works. You can find the EC2 instance’s LC_CTYPE value by running “locale”.

locale
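A plausible explanation for that encoding error, stated here as an assumption rather than something verified on this instance, is that Cron starts jobs with a minimal environment in which no UTF-8 locale is set, so Python may fall back to ASCII when writing text. The sketch below shows the difference between the two encodings; the track title is a made-up example.

```python
def to_bytes(text: str, encoding: str) -> bytes:
    """Encode text the way a file or stream with the given encoding would."""
    return text.encode(encoding)


title = "Déjà Vu"  # hypothetical track title with non-ASCII characters

# With a UTF-8 locale, the title encodes without trouble.
utf8_bytes = to_bytes(title, "utf-8")

# With an ASCII fallback, the same title cannot be encoded.
try:
    to_bytes(title, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(len(utf8_bytes))  # 9
print(ascii_failed)     # True
```

Song titles and artist names often contain accented or non-Latin characters, which would be consistent with the error only showing up under Cron.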

Once you have that LC_CTYPE info, open the Crontab with the command below.

crontab -e

You won’t find anything there yet. Press “i” to go into “insert” mode, paste the code below, then press Esc and type “:wq” to write the changes and exit.

0 8 * * MON LC_CTYPE="en_GB.UTF-8" ~/luigi_spotify/luigi_cron.sh

The above Crontab expression schedules the bash script to run at 08:00 AM every Monday, but you can change it to your own preference. If you need help with Crontab expressions, you can try your own expression at https://crontab.guru/.
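To double-check what “0 8 * * MON” means in practice, here is a small sketch that computes the next time such a schedule would fire after a given moment. It handles only this one expression and is not a general Cron parser; the function name next_monday_8am is made up.

```python
from datetime import datetime, timedelta


def next_monday_8am(now: datetime) -> datetime:
    """Return the first Monday 08:00 strictly after `now` (same timezone as `now`)."""
    days_ahead = (0 - now.weekday()) % 7  # Monday is weekday 0
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=8, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        # It is Monday but 08:00 has already passed, so wait a week.
        candidate += timedelta(days=7)
    return candidate


print(next_monday_8am(datetime(2019, 1, 2, 12, 0)))  # 2019-01-07 08:00:00
print(next_monday_8am(datetime(2019, 1, 7, 7, 0)))   # 2019-01-07 08:00:00
print(next_monday_8am(datetime(2019, 1, 7, 9, 0)))   # 2019-01-14 08:00:00
```

Note that Cron interprets the time in the timezone it is running with, which is why the timezone of the instance matters.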

If you want to check whether the Cron job works, you can first set the Crontab value as below (it will run the task every minute), check that it works, and then change it back to the weekly Crontab value you want. Again, if you do this check, please don’t forget to delete the folder from your S3 bucket.

*/1 * * * * LC_CTYPE="en_GB.UTF-8" ~/luigi_spotify/luigi_cron.sh

That is it! Now the Luigi pipeline will run every Monday to fetch songs from my Discover Weekly and will send me an email!

I know this is not a very complicated task, but it was such a wonderful learning experience for me. Of course, there is room for improvement in my code, but I am one happy man today, having solved one of my daily problems using data and Luigi!

Thank you for reading. You can find the Git repository at the link below.

https://github.com/tthustla/luigi_spotify

The Rickest Ricky. Love data, beer, coffee, and good memes in no particular order.