How I Built a Real-Time Weather Data Pipeline Using AWS, Entirely Serverless

A practical guide to leveraging AWS Lambda, Kinesis, and DynamoDB for real-time insights

Photo by Brett Sayles: https://www.pexels.com/photo/clouds-1431822/

Table of Contents

  • Introduction
  • Data
  • Proposed Workflow
  • AWS Cloud Components
  • Collecting the Data (Lambda Function 1)
  • Writing the Data to the Table (Lambda Function 2)
  • Converting the Data to CSV Format (Lambda Function 3)
  • The Final AWS Cloud Solution
  • Visualizing the Data (Bonus)
  • Advantages
  • Limitations
  • Conclusion


Introduction

A couple of months ago, I worked on a project in which I built a solar energy forecasting model aimed at aiding Germany’s transition to clean energy. The results were promising, but the model’s performance was limited by a crucial factor: the lack of weather data, which is a strong indicator of available solar energy.

The experience underscored a key lesson – real-time data is essential for practical forecasting models. Yet, collecting such data demands a robust data engineering solution.

Here, I propose a workflow that leverages AWS to build a high-performing and scalable pipeline for real-time data collection. The tools and techniques shown here carry over to other real-time analytics challenges as well.


Data

The pipeline retrieves its data from OpenWeatherMap, an online service that provides weather forecasts and nowcasts for locations around the world.

OpenWeatherMap provides an API that enables quick and easy access to weather data. The API updates its nowcast every 10 minutes, so the data pulls for a given location run at the same frequency.

For simplicity, the data’s scope is limited to the weather in Berlin, Germany.
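To give a feel for the raw data, here's a minimal sketch of a direct call to OpenWeatherMap's current-weather endpoint. The API key placeholder and the units parameter are my assumptions; the pipeline itself uses the pyowm client, shown later.

```python
import requests

# OpenWeatherMap's current-weather ("nowcast") endpoint.
URL = "https://api.openweathermap.org/data/2.5/weather"

params = {
    "q": "Berlin,DE",         # scope limited to Berlin, Germany
    "appid": "YOUR_API_KEY",  # placeholder; substitute a real key
    "units": "metric",        # return temperatures in Celsius
}

response = requests.get(URL, params=params, timeout=10)
response.raise_for_status()
print(response.json()["main"]["temp"])  # current temperature in Berlin
```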


Proposed Workflow

Workflow (Created by Author)

The data collection workflow aims to update the database with weather data in real time. Since the OpenWeatherMap API updates every 10 minutes, API calls are made at the same frequency, with the responses stored immediately in the database.

Whenever the data is needed for downstream tasks (e.g., visualization, machine learning), it is read, transformed into tabular format, and exported to a flat file.

Note: The following sections delve into the individual AWS services used to implement the workflow, including the configuration (e.g., code, logic) that makes them work as needed. If you want to skip ahead to the final solution and discussion, jump to The Final AWS Cloud Solution section below.


AWS Cloud Components

The workflow utilizes the following AWS components to ensure efficient data collection, processing, and storage:

1. Amazon DynamoDB

Amazon DynamoDB offers the NoSQL database needed to house the API responses, which arrive in a semi-structured format. The project creates a DynamoDB table named WeatherData for this purpose.

DynamoDB table (Created by Author)

The table uses city as the partition key and timestamp as the sort key. This arrangement allows data to be queried efficiently by place and time.
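To illustrate this access pattern, here's a minimal boto3 sketch; the timestamp format and the temperature attribute name are assumptions for illustration.

```python
import boto3
from boto3.dynamodb.conditions import Key

# Query all Berlin records inside a time window, using the city
# partition key and timestamp sort key described above.
table = boto3.resource("dynamodb").Table("WeatherData")

response = table.query(
    KeyConditionExpression=(
        Key("city").eq("Berlin")
        & Key("timestamp").between("2024-06-01 00:00:00", "2024-06-02 00:00:00")
    )
)

for item in response["Items"]:
    print(item["timestamp"], item.get("temperature"))
```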

2. Amazon S3

The project uses an S3 bucket named owm-data-bucket to store the file objects used throughout the project.

3. Amazon Kinesis

The pipeline utilizes Amazon Kinesis to create a data stream named WeatherDataStream, which serves as an intermediary between the data-fetching Lambda function and the DynamoDB table.

Kinesis Data Stream (Created by Author)
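Creating the stream is a one-time setup step. A minimal boto3 sketch, where a single shard is an assumption suited to the single-city workload:

```python
import boto3

kinesis = boto3.client("kinesis")

# One shard is plenty for a single city reporting every 10 minutes;
# more shards would be needed at a national scale.
kinesis.create_stream(StreamName="WeatherDataStream", ShardCount=1)
```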

4. Lambda functions

The workflow relies on three Lambda functions to carry out the data collection and transformation tasks.

The first Lambda function, named fetch_data, pulls the data using the OpenWeatherMap API and sends it to the Kinesis stream. It is triggered every 10 minutes by an EventBridge rule.

fetch_data Function (Created by Author)
fetch_data Function Overview (Created by Author)

The second Lambda function, named write_to_ddb, reads the data from the Kinesis stream and writes it to the WeatherData table. It runs as soon as new data is added to the stream.

write_to_ddb Function (Created by Author)
write_to_ddb Function Overview (Created by Author)

The third Lambda function, named send_data_to_s3, reads the data from the DynamoDB table and exports it as a CSV file to the owm-data-bucket bucket. Unlike the first two functions, this one has no trigger and runs on demand.

send_data_to_s3 Function (Created by Author)
send_data_to_s3 Function Overview (Created by Author)

5. Amazon EventBridge

Amazon EventBridge provides a schedule that automates the execution of the fetch_data function every 10 minutes.

Event Schedule (Created by Author)
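For reference, an equivalent schedule can be defined programmatically. Here's a boto3 sketch; the rule name and the Lambda ARN are placeholders, and EventBridge must separately be granted permission to invoke the function:

```python
import boto3

events = boto3.client("events")

# A rate expression that fires every 10 minutes, matching the
# OpenWeatherMap nowcast refresh interval.
events.put_rule(
    Name="fetch-weather-every-10-minutes",  # hypothetical rule name
    ScheduleExpression="rate(10 minutes)",
    State="ENABLED",
)

# Point the rule at the fetch_data Lambda (placeholder account/ARN).
# Note: lambda.add_permission must also be called so EventBridge
# is allowed to invoke the function.
events.put_targets(
    Rule="fetch-weather-every-10-minutes",
    Targets=[{
        "Id": "fetch-data-target",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:fetch_data",
    }],
)
```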

The next few sections focus on the code and logic used for each Lambda function.


Collecting the Data (Lambda Function 1)

The code for the fetch_data function is shown below.
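A minimal sketch of the function's logic; the environment-variable name, payload fields, and timestamp format are illustrative assumptions rather than the exact original:

```python
import json
import os
from datetime import datetime

import boto3
from pyowm import OWM


def lambda_handler(event, context):
    # Fetch the current ("nowcast") weather for Berlin via pyowm.
    owm = OWM(os.environ["OWM_API_KEY"])  # key from an env variable (assumption)
    weather = owm.weather_manager().weather_at_place("Berlin,DE").weather

    # Record the execution time manually, as described below.
    timestamp = datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")

    payload = {
        "city": "Berlin",
        "timestamp": timestamp,
        "temperature": weather.temperature("celsius").get("temp"),
        "clouds": weather.clouds,
        "wind_speed": weather.wind().get("speed"),
        "status": weather.detailed_status,
    }

    # Push the record to the Kinesis stream; the city doubles as the partition key.
    boto3.client("kinesis").put_record(
        StreamName="WeatherDataStream",
        Data=json.dumps(payload).encode("utf-8"),
        PartitionKey=payload["city"],
    )
```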

In short, the function uses the pyowm library to access the nowcast data for Berlin and the boto3 library to push the data to the Kinesis stream. I manually create the timestamp variable, which records the time of the program’s execution.


Writing the Data to the Table (Lambda Function 2)

The code for the write_to_ddb function is shown below:
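A minimal sketch of the handler; the item attributes simply mirror whatever the upstream payload contains:

```python
import base64
import json
from decimal import Decimal

import boto3

table = boto3.resource("dynamodb").Table("WeatherData")


def lambda_handler(event, context):
    # Kinesis delivers records base64-encoded inside event["Records"].
    for record in event["Records"]:
        payload = json.loads(
            base64.b64decode(record["kinesis"]["data"]),
            parse_float=Decimal,  # DynamoDB requires Decimal, not float
        )
        # city (partition key) and timestamp (sort key) travel with the item.
        table.put_item(Item=payload)
```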

These two functions facilitate the data pulls and database writes. After they run every 10 minutes, the records become visible in the WeatherData table in DynamoDB.

WeatherData Table in DynamoDB (Created by Author)

Converting the Data to CSV Format (Lambda Function 3)

The send_data_to_s3 function reads all of the data in the WeatherData table, flattens it, and saves it to the S3 bucket as a CSV file.
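A minimal sketch of that logic; the output key weather_data.csv is an assumption:

```python
import csv
import io

import boto3


def lambda_handler(event, context):
    table = boto3.resource("dynamodb").Table("WeatherData")

    # Scan the full table, following pagination until every item is read.
    items = []
    response = table.scan()
    items.extend(response["Items"])
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])

    # Flatten the items into a CSV held in memory.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=sorted({k for i in items for k in i}))
    writer.writeheader()
    writer.writerows(items)

    # Upload the flat file to the project's bucket.
    boto3.client("s3").put_object(
        Bucket="owm-data-bucket",
        Key="weather_data.csv",  # hypothetical object key
        Body=buffer.getvalue(),
    )
```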

Unlike the first two Lambda functions, this one runs on demand (i.e., whenever the tabulated data is needed for analysis). Below is a screenshot of the resulting flat file retrieved from the bucket.

CSV File (Created by Author)

The Final AWS Cloud Solution

AWS Architecture (Created by Author)

The diagram above showcases how the individual AWS services interact with each other to power the weather data pipeline.

The first Lambda function, run every 10 minutes by the EventBridge schedule, collects data from OpenWeatherMap via its API and sends it to the Kinesis stream. The second Lambda function (triggered as soon as new data is added to the stream) takes the API responses from the stream and writes them to the DynamoDB table. The third function (run on user demand) reads and tabulates the data in the database, then stores the output in a bucket as a flat file. This file can be used for subsequent analyses or visualization.


Visualizing the Data (Bonus)

I built a Tableau dashboard to track Berlin weather using the flat file created above. For simplicity’s sake, I limited my visualization to a few metrics:

  • Temperature over time
  • Average % Cloud Coverage
  • Average Duration of Daily Sunlight
  • Average Wind Speed

The following is one way to display these metrics:

Dashboard (Created by Author)

Here’s the link to the dashboard for those interested: https://public.tableau.com/views/BerlinWeatherTracker/Dashboard1?:language=en-US&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link


Advantages of the Solution

Leveraging AWS solutions allowed me to create a high-performing, resilient real-time data collection pipeline. Here are a few of the main advantages of the presented setup.

Note: This project only focused on data collection in Berlin, Germany. However, I evaluate the solution based on its ability to expand the data collection to every city in Germany (~2000 cities in total).

  1. Real-time Data Collection

The Lambda functions can handle the volume of API calls (~2000 calls per 10 minutes), ensuring continuous, real-time collection across Germany.

2. Serverless Architecture

The current solution is entirely serverless, meaning that users don’t need to create or manage resources manually to accommodate an increase in traffic. The Lambda functions scale automatically to handle traffic, while the DynamoDB table can automatically adjust its read and write throughput.

3. Fault Tolerant

The use of the Kinesis stream as an intermediary layer ensures that data is preserved even during database downtime, minimizing data loss.


Limitations of the Solution

While the real-time data collection solution works well, I’d be hesitant to propose it to a client due to its limitations. Here are a few that stand out to me:

  1. Dependence on OpenWeatherMap

The solution relies entirely on the OpenWeatherMap API, making it a single point of failure. Issues such as API rate limits or outages can disrupt the entire data collection process.

2. Cost

Although the solution only provisions the necessary resources, services like Lambda, Kinesis, and DynamoDB charge per use. Data collection at a larger scale would increase costs roughly in proportion to the number of cities, potentially making the solution infeasible.

This project (i.e., collecting data for Berlin alone) has cost about 50 cents per day. Thus, expanding the data collection to every German city (~2000 cities) would cost hundreds of dollars per day.

3. Limited Historical Data

The solution only collects nowcasts, rendering analyses requiring past data impossible. One possible fix could involve building a second pipeline that collects all historical weather records from OpenWeatherMap.


Conclusion

Photo by Alexas_Fotos on Unsplash

My previous project approached solar energy forecasting by focusing primarily on forecasting algorithms. However, this current project has deepened my appreciation for the crucial role Data Engineering plays in forecasting applications. For time-sensitive use cases like solar energy generation, real-time data collection and processing are essential.

If you have any ideas for improving the performance or reducing the cost of this solution, feel free to share them!

You can access the repository with the lambda functions here:

anair123/Berlin-Real-Time-Weather-Data-Collection-With-AWS

Also, if you’re curious about my energy forecasting project that inspired this project, you can check it out here:

Forecasting Germany’s Solar Energy Production: A Practical Approach with Prophet | by Aashish Nair | Towards Data Science

Thank you for reading!

