
Simulating Continuous Learning Models with Airflow

Configure a workflow management tool to recreate models at regular intervals

Photo by Arnie Chou from Pexels

Continuous learning is a process by which a system adaptively learns about the external world. In the context of machine learning, a model continuously learns more about the subject domain it is trained on through autonomous and incremental development. This allows the model to learn more complex logic, knowledge, and features that can help make better predictions.

Continuous learning is especially important for updating models. Suppose you’re hired as a data scientist for a manufacturing company that produces dumbbells, like Nautilus. Your goal is to predict revenue and sales in the next quarter. From March 2017 to March 2020, your model achieved 95% accuracy on revenue projections. However, from March 2020 to January 2021, its accuracy dropped to 40%. What happened?

The major event between March 2020 and January 2021 was the coronavirus pandemic. Quarantine regulations forced people to stay indoors and not go out frequently. Fitness enthusiasts who could no longer go to the gym had to set up home gyms to keep up with their exercise regimens. Nautilus is a leading company in dumbbells and home fitness equipment, so sales skyrocketed during the pandemic¹. In this case, the model you created in 2017 would severely underestimate the sales predicted for 2020. This is a prime example of why a model needs to be retrained on recent and emerging data. The world moves incredibly fast, and insights from data collected three years ago may not apply to data that is still evolving.

You might have realized that three separate models need to be created: one trained on pre-pandemic data (before March 2020), one trained on data from during the pandemic (March 2020 to May 2021), and one trained on post-pandemic data (May 2021 to present). In this case, because you can make reasonable assumptions about the pandemic’s start and end dates (e.g., treating the vaccine rollout as the end of the pandemic), you know roughly where to split your dataset. However, you won’t always have the luxury of cleanly splitting your dataset by specific time periods.

Many datasets change subtly over time, and you’d need a continuous learning approach so that your model can keep learning as new information is presented. Those changes reflect both the dynamic environment in which the data were collected and the form and format in which the data are provided. As an example of the latter, consider an image classifier that determines whether a page from a 1,000-page PDF is text or a form. Over time, the form’s format changes to the point where the original classifier recognizes it as text. Because you don’t know exactly when the forms changed, you’d need to train one model on all of the data collected so far. Furthermore, the image classifier has to be retrained on newer form data while keeping track of older form data.

A continuous learning approach is needed for a model to retrain and update itself based on newer data. That said, implementing this approach mathematically for deep learning models and recurrent neural networks can be very tricky, and many data scientists do not have the expertise to build such custom models within a company deadline. However, there is a way to simulate continuous learning using a workflow management tool such as Airflow². On a scheduled basis, Airflow can fetch new training data and rebuild the model using both prior and new data. Data scientists can then build simple models (whether hand-built PyTorch³ neural networks or models from existing Python machine learning libraries such as scikit-learn) and focus on updating the training data on a continuous basis.

This article will go over an example of how to apply continuous learning with Airflow to time series models. Our models will use ARIMA to analyze historical stock market data for two stocks: Fastly and Nautilus.

DISCLAIMER: This is not investment advice. That being said, I do own these two stocks. So I’m inclined to promote them just so I can benefit. Do your research before spending your hard-earned money.

Also, please read up on time series before you decide to make stock market predictions with ARIMA. Time series models are effective when there is a recognizable pattern in the dataset, and I have not taken the time to analyze whether ARIMA is appropriate for forecasting Fastly and Nautilus stock history. This article discusses building a continuous learning architecture from a data engineering standpoint; ARIMA and time series are used to show that this architecture works with various types of models (time series, classification, neural networks). Analyzing the forecast metrics of these time series models is out of the scope of this article.

Why Airflow?

Airflow is an open-source workflow management tool written in Python. It has become an industry standard and is fairly easy to set up. Airflow has drawbacks when dealing with complex and dynamic workflows, in which case a different workflow tool such as Prefect is a better choice. For this tutorial, we’re creating a simple workflow, so Airflow is perfect.

We’ll set up Airflow on AWS EC2. You can also use AWS MWAA (Amazon Managed Workflows for Apache Airflow) if you wish. However, keep in mind that AWS MWAA can be expensive for some developers’ budgets. For example, I was charged $27 per day just to leave it up and running, plus $80 a month for the other AWS services it keeps in use (NAT gateways, for example).

Terminology

  • AWS MWAA – Amazon Managed Workflows For Apache Airflow. Handles all setup and configurations for Airflow, but is expensive to maintain.
  • AWS EC2 – A virtual server to run applications on.
  • AWS S3 – A bucket for storing pickled models.
  • ARIMA – AutoRegressive Integrated Moving Average. It is a class of model that captures a suite of different standard temporal structures in time series data.

Suggested Reading

This tutorial assumes that you have a basic understanding of Airflow and DAGs. If not, please read Airflow: how and when to use it.

To understand the architecture of Airflow, read An Overview of Apache Airflow Architecture.

This tutorial uses ARIMA for time series analysis. This topic is out of the scope of this article. If you want to understand time series and ARIMA more, I recommend Introduction to Arima: Nonseasonal models.

Architecture

Airflow architecture diagram I created using Cloudcraft

Tutorial

For this tutorial, we will install Airflow on Ubuntu Server 20.04 LTS. We can create an AWS EC2 instance of type t2.medium. To read more about AWS EC2, I recommend Amazon’s tutorial available here.

Airflow needs to run both a webserver and a scheduler, so it needs more memory than the free tier provides. A t2.medium instance has enough resources to install and run Airflow.

We also use Ubuntu because it’s one of the few Linux operating systems that is well suited for machine learning. Other Linux operating systems (including AWS Linux) will not have the exact packages and dependencies needed for some machine learning libraries, including Facebook Prophet. They have alternatives for those packages, but that doesn’t help if Facebook Prophet can’t compile against them correctly. Facebook Prophet requires the C++14 standard just to compile all of its dependencies; C++14 is supported out of the box on Ubuntu (and macOS), but not on CentOS or AWS Linux. Ubuntu and macOS are therefore convenient for downloading future machine learning packages and compiling them correctly.

Create an EC2 Instance

Sign into the AWS Console and navigate to EC2 > Instances > Launch Instance. Configure your instance so that it matches the configuration below.

NOTE: I removed my IP address under Source in the Security Groups screenshot. To secure your instance and make sure no one else has access to it, select My IP instead of Anywhere.

EDIT: In addition to these security groups, you should ALSO add an inbound rule for HTTP over TCP on port 8080, which the Airflow webserver uses by default. Change the rule if you want to run the webserver on a different port.

When launching the instance, you will be prompted to download a .pem key file. Name it airflow_stock.pem and store it in a folder of your choosing.

Wait until the EC2 instance is running, and then run this command from your local terminal (replace the host with your instance’s Public IPv4 DNS):

ssh -i "airflow_stock.pem" ubuntu@ec2-11-111-111-11.us-east-1.compute.amazonaws.com

SSH (Secure Shell) is a protocol that lets you log into the cloud instance from your local command line. It uses the .pem file to verify that you are allowed to log into the server.

NOTE: If you get an Unprotected Private Key error, it means your .pem file is readable by everyone. You just need to restrict the file’s permissions by running this command before executing ssh again:

chmod 400 airflow_stock.pem

For more information on this error, see this article here.

Install Pip and Airflow

Assuming you have a clean Ubuntu instance, you’ll want to install both pip and apache-airflow. To install pip, run the following commands

sudo apt update
sudo apt install python3-pip

We will be using Python 3 for the tutorial. To install apache-airflow, run this command

pip3 install apache-airflow

The next step is to create the files that make up the Airflow project.

Create Folder Structure

You can create the following files below, or fetch them from the git repo.

This folder has five components: a bash script that adds Airflow environment variables (add_airflow_env_variables.sh), a requirements file listing the pip packages to install (requirements.txt), a script to start Airflow (start_airflow.sh), a script to stop Airflow (stop_airflow.sh), and a dags folder containing the DAG that pulls stock data and rebuilds ARIMA time series models from all of the data.

Screenshot of folder structure for AIRFLOW in Visual Studio Code

The add_airflow_env_variables.sh script contains all the environment variables we need to set before running Airflow. For now, it just sets AIRFLOW_HOME, the path where Airflow stores its logs, output, and database. I cloned the repo onto my EC2 instance, so I’m using the file path of that GitHub repo.
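For reference, here is a minimal sketch of what this script might look like; the exact path is an assumption, so point it at wherever you cloned the repo on your EC2 instance.

#!/bin/bash
# Tell Airflow where to keep its logs, metadata database, and dags/ folder.
# The path below is an assumption -- use the location of your cloned repo.
export AIRFLOW_HOME=/home/ubuntu/Data-Science-Projects/Medium/Airflow_CL/airflow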

The requirements.txt file contains all the pip packages we need.

statsmodels
pandas
pandas_datareader
boto3
awscli

To install all the packages in requirements.txt, simply run

pip3 install -r requirements.txt

The start_airflow.sh contains all the commands to start airflow on the EC2 server.
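A minimal sketch of such a start script, assuming the default port 8080 and daemon mode (the repo’s version may differ):

#!/bin/bash
# Start the Airflow webserver and scheduler as background daemons.
# -D detaches each process and writes a .pid file under $AIRFLOW_HOME.
airflow webserver -p 8080 -D
airflow scheduler -D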

The stop_airflow.sh contains all the commands to stop the current airflow webserver and scheduler running on the EC2 server.
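One way such a stop script could work (a sketch, not necessarily the repo’s exact approach) is to kill any running webserver and scheduler processes:

#!/bin/bash
# Stop the daemonized scheduler and webserver started by start_airflow.sh.
pkill -f "airflow scheduler"
pkill -f "airflow webserver"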

The stock_data_dag.py file contains the workflow logic. The workflow is broken up into two functions: extract and load.

Extract (_get_stockdata) will fetch all of the historical prices of a stock from TIINGO from January 1st, 2015 to the day of the trigger. For purposes of this tutorial, this platform is sufficient. However, in a production environment, the platform would require getting only the current day’s data and append to prior historical data stored in an S3.
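As a rough sketch of what such an extract task could look like, here is one possible implementation using pandas_datareader’s Tiingo reader; the TIINGO_API_KEY value and the /tmp output path are illustrative assumptions, and the repo’s implementation may differ.

from pandas_datareader import data as web

TIINGO_API_KEY = "your-tiingo-api-key"  # assumption: replace with your own key

def get_stock_data(ticker, start="2015-01-01"):
    # Pull daily historical prices for one ticker from TIINGO.
    df = web.DataReader(ticker, "tiingo", start=start, api_key=TIINGO_API_KEY)
    # Flatten the (symbol, date) index and hand the data off as JSON
    # for the downstream task (illustrative path).
    df.reset_index().to_json(f"/tmp/{ticker}.json")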

Load (store_arima_model_in_s3) creates ARIMA models from the historical stock prices and stores the pickled files in an S3 bucket. We can then fetch those new models, which have been retrained on the new training data.
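A corresponding sketch of the load task, assuming the JSON file written above, an illustrative ARIMA order of (1, 1, 1), and a key name of <ticker>_arima.pkl:

import pickle

import boto3
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def store_arima_model_in_s3(ticker, bucket="stock-model"):
    # Rebuild the model from the full price history saved by the extract task.
    prices = pd.read_json(f"/tmp/{ticker}.json")["adjClose"].values
    model = ARIMA(prices, order=(1, 1, 1)).fit()
    # Pickle the fitted results and upload them to the stock-model bucket.
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=f"{ticker}_arima.pkl",
        Body=pickle.dumps(model),
    )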

NOTE: You would need to create a TIINGO account to get the API key used in get_stock_data (line 25).

Consider the Python functions get_stock_data and store_arima_model_in_s3 as tasks. In order to call those tasks in the DAG, we’ll need to use operators. Operators are the main building blocks of a DAG; they execute specific functions, scripts, API calls, queries, and so on. In this example, we’re creating 4 different Python operators (a minimal sketch of how they are wired together follows the list):

  • get_NLS_stock_data
  • store_NLS_arima_model_in_s3
  • get_FSLY_stock_data
  • store_FSLY_arima_model_in_s3
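Here is a minimal sketch of how those four operators might be declared and chained in stock_data_dag.py. It assumes the get_stock_data and store_arima_model_in_s3 functions sketched earlier can be imported (the module name stock_functions is hypothetical), and the real DAG in the repo may be structured differently.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Assumption: the two task functions sketched earlier live in this module.
from stock_functions import get_stock_data, store_arima_model_in_s3

with DAG(
    dag_id="stock_data",
    start_date=datetime(2021, 7, 1),   # illustrative start date
    schedule_interval="0 17 * * 1-5",  # 5 pm every weekday (explained below)
    catchup=False,
) as dag:
    for ticker in ["NLS", "FSLY"]:
        get_data = PythonOperator(
            task_id=f"get_{ticker}_stock_data",
            python_callable=get_stock_data,
            op_kwargs={"ticker": ticker},
        )
        store_model = PythonOperator(
            task_id=f"store_{ticker}_arima_model_in_s3",
            python_callable=store_arima_model_in_s3,
            op_kwargs={"ticker": ticker},
        )
        # Each stock's data must be fetched before its model is rebuilt.
        get_data >> store_model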

The workflow schedule is written as a crontab expression: 0 17 * * 1-5. This means the workflow will be triggered at 5 pm every weekday.

The stock_data DAG lists two tasks for each stock: get_stock_data and store_arima_model_in_s3. get_stock_data fetches TIINGO historical stock data going back to January 1st, 2015 and stores it as JSON. store_arima_model_in_s3 reads that JSON, extracts the adjusted closing prices for all days, builds an ARIMA model from those prices, and stores the pickled model in an S3 bucket called stock-model. We can then download the models from S3 and use them to predict the next day’s price.
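For example, here is a rough sketch of how you might pull a model back out of S3 and forecast the next day’s closing price; the key name matches the illustrative sketches above and is an assumption.

import pickle

import boto3

def predict_next_close(ticker, bucket="stock-model"):
    # Download the latest pickled ARIMA results and forecast one step ahead.
    obj = boto3.client("s3").get_object(Bucket=bucket, Key=f"{ticker}_arima.pkl")
    fitted = pickle.loads(obj["Body"].read())
    return fitted.forecast(steps=1)[0]

print(predict_next_close("FSLY"))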

NOTE: This architecture is not just for ARIMA models. Various other scikit-learn models and algorithms, such as XGBoost, RandomForest, NaiveBayes, LinearRegression, LogisticRegression, and SVM, can benefit from this continuous learning approach. You can simply reuse the store_arima_model_in_s3 function in the stock_data_dag.py file, or edit the Python DAG directly and change line 44 to your desired model of choice (a rough example of such a swap follows below). You can even add your own neural network logic in lieu of a scikit-learn model. That being said, you’d have to rethink what data you’re training your regression/classification models on, and you may have to change get_stock_data to fetch a different dataset for your model. Regardless of the business problem you’re trying to solve, this Airflow continuous learning architecture can be adapted to a variety of use cases.
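As a rough illustration of such a swap (an assumption for illustration, not the repo’s code), the ARIMA fit could be replaced with a scikit-learn regressor trained on lagged closing prices, while the pickling and S3 upload stay the same:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_lagged_regressor(prices, n_lags=5):
    # Predict each close from the previous n_lags closes.
    X = np.array([prices[i:i + n_lags] for i in range(len(prices) - n_lags)])
    y = np.array(prices[n_lags:])
    return RandomForestRegressor(n_estimators=100).fit(X, y)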

We now have all the files needed for this tutorial. The next step is to run a few scripts and configure Airflow.

Create AWS S3 Bucket

We want to create an S3 bucket to store our models in. Navigate to S3 and create a new bucket called stock-model. We’ll then configure AWS credentials on the EC2 instance so that we can use boto3 to store the models from Airflow in S3.
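If you prefer the command line, the bucket can also be created with the AWS CLI once the credentials in the next section are configured (bucket names are globally unique, so you may need a different name):

aws s3 mb s3://stock-model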

Configure AWSCLI

We’ll want to store our ARIMA models in an S3 bucket. To do so, we’ll need to configure the local AWS settings on the EC2 instance so that it’ll recognize the desired AWS account.

First, navigate to the Security Credentials page of the IAM Management Console: IAM Management Console (amazon.com).

Security Credentials Page for IAM

If you haven’t done so yet, scroll down to Access Keys and click Create New Access Key. This generates a CSV file that you can download to get both your access key and secret access key. DO NOT SHARE THIS FILE WITH ANYONE ELSE. IF YOU DO, YOUR AWS ACCOUNT WILL BE AT RISK OF HACKING AND POSSIBLE OVERCHARGING.

On the EC2 instance, type in

aws configure

AWS will prompt you for your Access Key and Secret Access Key. Enter the values from the CSV file. You can press Enter to accept the defaults for Default region name and Default output format.

AWS credentials are now set up on the EC2 instance, so you can create boto3 sessions.
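As a quick, optional sanity check, you can run a few lines of Python on the EC2 instance to confirm boto3 picks up the credentials:

import boto3

# Should print your bucket names, including stock-model once it exists.
print([b["Name"] for b in boto3.client("s3").list_buckets()["Buckets"]])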

Configure Airflow

First, add the environment variables. Since we have a script for that, simply source it so that the exports apply to your current shell:

source add_airflow_env_variables.sh

If you want to check whether the AIRFLOW_HOME environment variable has been set, run env on the command line.

The next step is to create the Airflow database for storing DAG runs. If this is the first time you’re creating the Airflow database, run the command below from the AIRFLOW_HOME path.

airflow db init

You’ll also want to add a user, as Airflow will prompt for a sign-in. The command below is taken from the example in the Running Airflow locally documentation.

airflow users create \
    --username admin \
    --firstname Peter \
    --lastname Parker \
    --role Admin \
    --email spiderman@superhero.org

It’ll prompt you for a password of your choice.

Next, if you haven’t already done so, install all the packages in requirements.txt using

pip3 install -r requirements.txt

Finally, run the following command to start the Airflow webserver and scheduler as background daemons.

bash start_airflow.sh

That will run the Airflow scheduler and webserver. To check that the UI works, navigate to this URL:

http://ec2-11-111-111-11.us-east-1.compute.amazonaws.com:8080

ec2-11-111-111-11.us-east-1.compute.amazonaws.com is a made-up EC2 Public IPv4 DNS name; you can find yours in the Details tab under EC2 -> Instances in the console. Make sure you use http (NOT https) and port 8080 (or whatever port you configured the Airflow webserver to use).

You should see a login page pop up. Input the user credentials used to create a user account in Airflow (username: admin, password: whatever you typed in). You should now see all the Airflow DAGs listed, including the stock_data DAG we created earlier.

List of all dags in Airflow, with stock_data turned on.

Click on the stock_data DAG. You can then see the DAG in graph view.

DAG view of stock_data

In the top right-hand corner (under the schedule), you will see a play button. Clicking that button manually triggers the DAG. If everything works as intended, you should see all operators marked with a green Success. If there are errors (failed, up for retry), click on the operators to inspect the logs.
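If you’d rather stay in the terminal, the same manual run can be kicked off on the EC2 instance with the Airflow CLI (Airflow 2.x syntax):

airflow dags trigger stock_data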

Successful DAG run

Now, let’s navigate to the S3 bucket stock-model and see if our models have been added.

Stock-model bucket with models added

If you leave the EC2 instance running overnight with the Airflow scheduler and webserver uninterrupted, you’ll see these two models with a last-modified time of around 5:00 pm (UTC-05:00) on July 23.

Note: Make sure to terminate the EC2 instance when you’re done.

Conclusion

This tutorial explained how to construct a continuous learning architecture using a workflow tool to rebuild ARIMA models on current and new training data. We learned how to utilize Airflow to create this continuous learning architecture for predicting closing prices for two different stocks: Nautilus and Fastly. We learned how to deploy Airflow on Amazon EC2, and store the newly built models on S3.

Continuous learning does not have to involve creating recurrent neural networks from scratch. This article presented a simpler way to create a continuous learning model using a popular data engineering tool. Entry-level data scientists who are still learning the intricacies of neural networks can fall back on this solution to create a robust, sustainable architecture for models that "learn" from newer data. Furthermore, this architecture can be replicated across different kinds of algorithms: time series, linear regression, and classification. While the logic for reading data and training models will need to change, the overall architecture should accommodate various business problems.

Github Repo: Data-Science-Projects/Medium/Airflow_CL/airflow at master · hd2zm/Data-Science-Projects (github.com)

References:

  1. Nautilus reports 94% sales jump as pandemic fuels home-exercise trend, by Matthew Kish
  2. Apache Airflow – Open Source Workflow Management Tool
  3. PyTorch – Open Source Machine Learning Framework

Thanks for reading! If you want to read more of my work, view my Table of Contents.

If you’re not a Medium paid member, but are interested in subscribing to Towards Data Science just to read tutorials and articles like this, click here to enroll in a membership. Enrolling in this link means I get paid for referring you to Medium.

