
Table of Contents
∘ Introduction ∘ Data ∘ Proposed Workflow ∘ AWS Cloud Components ∘ Collecting the Data (Lambda Function 1) ∘ Writing the Data to the Table (Lambda Function 2) ∘ Converting the Data to CSV Format (Lambda Function 3) ∘ The Final AWS Cloud Solution ∘ Visualizing the Data (Bonus) ∘ Advantages ∘ Limitations ∘ Conclusion
Introduction
A couple of months ago, I worked on a project in which I built a solar energy forecasting model aimed at aiding Germany’s transition to clean energy. The results were promising, but the model’s performance was limited by a crucial factor: the lack of weather data, which is a strong indicator of available solar energy.
The experience underscored a key lesson – real-time data is essential for practical forecasting models. Yet, collecting such data demands a robust data engineering solution.
Here, I propose a workflow that leverages AWS to build a high-performing and scalable data pipeline for real-time data collection. The tools and techniques used here can also be applied to other real-time analytics challenges.
Data
The pipeline retrieves its data from OpenWeatherMap, an online service that provides forecasts and nowcasts for locations around the world.
OpenWeatherMap provides an API that enables quick and easy access to weather data. The API updates its nowcast every 10 minutes, meaning that the data pulls for a given location will be run at the same frequency.
For simplicity, the data’s scope is limited to the weather in Berlin, Germany.
Proposed Workflow

The data collection workflow aims to update the database with weather data in real time. Since the OpenWeatherMap API updates every 10 minutes, API calls are made at the same frequency, with the responses stored immediately in the database.
Whenever the data is needed for downstream tasks (e.g., visualization, machine learning), it is read, transformed into tabular format, and exported to a flat file.
Note: The following sections delve into the individual AWS services used to implement the workflow, including the configuration (e.g., code, logic) that makes them work as needed. If you want to skip ahead, jump to The Final AWS Cloud Solution section below.
AWS Cloud Components
The workflow utilizes the following AWS components to ensure efficient data collection, processing, and storage:
1. AWS DynamoDB
AWS DynamoDB offers the NoSQL database needed to house the API responses, which arrive in a semi-structured format. The project creates a DynamoDB table named WeatherData to serve this purpose.

The table uses city as the partition key and timestamp as the sort key. This arrangement allows data to be queried efficiently by place and time.
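For reference, here is a minimal sketch of how such a table could be provisioned with boto3. The table name and key schema come from the workflow above; the on-demand billing mode is an assumption.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Create the WeatherData table with city as the partition (HASH) key
# and timestamp as the sort (RANGE) key.
dynamodb.create_table(
    TableName="WeatherData",
    KeySchema=[
        {"AttributeName": "city", "KeyType": "HASH"},
        {"AttributeName": "timestamp", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "city", "AttributeType": "S"},
        {"AttributeName": "timestamp", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",  # assumption: on-demand capacity
)
```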
2. Amazon S3
The project uses an S3 bucket named owm-data-bucket to store the file objects used throughout the project.
3. AWS Kinesis
The pipeline utilizes Amazon Kinesis to create a data stream named WeatherDataStream that serves as an intermediary between the Lambda functions and the DynamoDB table.
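Creating the stream is a single boto3 call; the shard count below is an assumption (one shard is more than enough for one city polled every 10 minutes):

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Create the stream that buffers API responses between the Lambda functions.
kinesis.create_stream(
    StreamName="WeatherDataStream",
    ShardCount=1,  # assumption: a single shard handles this volume
)
```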

4. Lambda functions
The workflow relies on three Lambda functions to carry out the data collection and transformation tasks.
The first Lambda function, named fetch_data, pulls the data using the OpenWeatherMap API and sends it to the Kinesis stream. It is triggered every 10 minutes by an EventBridge rule.


The second Lambda function, named write_to_ddb, reads the data from the Kinesis stream and writes it to the WeatherData table. It runs as soon as new data is added to the stream.


The third Lambda function, named send_data_to_s3, reads the data from the DynamoDB table and exports it as a CSV file to the owm-data-bucket bucket. Unlike the first two functions, this one has no trigger and runs on demand.


5. AWS EventBridge
AWS EventBridge creates a schedule that automates the execution of the fetch_data function every 10 minutes.
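A minimal sketch of wiring up that schedule with boto3; the rule name and the Lambda ARN are illustrative placeholders:

```python
import boto3

events = boto3.client("events", region_name="us-east-1")

# Hypothetical rule name; fires every 10 minutes.
events.put_rule(
    Name="fetch-weather-every-10-min",
    ScheduleExpression="rate(10 minutes)",
    State="ENABLED",
)

# Point the rule at the fetch_data function (placeholder account ID and ARN).
events.put_targets(
    Rule="fetch-weather-every-10-min",
    Targets=[{
        "Id": "fetch-data-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:fetch_data",
    }],
)
```

Note that the Lambda function also needs a resource-based permission allowing events.amazonaws.com to invoke it; creating the schedule through the console adds this automatically.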

The next few sections focus on the code and logic used for each Lambda function.
Collecting the Data (Lambda Function 1)
The code for the fetch_data function is shown below.
In short, the function uses the pyowm library to access the nowcast data for Berlin and the boto3 library to push the data to the Kinesis stream. I manually create the timestamp variable, which records the time of the program's execution.
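A minimal sketch of such a handler, assuming pyowm 3.x and an OWM_API_KEY environment variable (the exact field selection may differ from the version in the repository linked at the end):

```python
import json
import os
from datetime import datetime, timezone

import boto3
from pyowm import OWM

kinesis = boto3.client("kinesis")

def lambda_handler(event, context):
    # Fetch the current weather (nowcast) for Berlin via pyowm.
    owm = OWM(os.environ["OWM_API_KEY"])  # assumption: key stored as an env var
    weather = owm.weather_manager().weather_at_place("Berlin,DE").weather

    # Build the record, manually stamping the execution time.
    payload = {
        "city": "Berlin",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "weather": weather.to_dict(),
    }

    # Push the record to the Kinesis stream.
    kinesis.put_record(
        StreamName="WeatherDataStream",
        Data=json.dumps(payload),
        PartitionKey="Berlin",
    )
    return {"statusCode": 200}
```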
Writing the Data to the Table (Lambda Function 2)
The code for the write_to_ddb function is shown below:
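A sketch of what this handler looks like. Two details matter here: Kinesis delivers each record base64-encoded, and DynamoDB rejects Python floats, so they are parsed as Decimal:

```python
import base64
import json
from decimal import Decimal

import boto3

table = boto3.resource("dynamodb").Table("WeatherData")

def lambda_handler(event, context):
    for record in event["Records"]:
        # Kinesis delivers the payload base64-encoded.
        raw = base64.b64decode(record["kinesis"]["data"])

        # DynamoDB does not accept floats, so parse them as Decimal.
        item = json.loads(raw, parse_float=Decimal)

        # city (partition key) and timestamp (sort key) are already
        # part of the payload produced by fetch_data.
        table.put_item(Item=item)
```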
These two functions facilitate the data pulls and database writes. With them running every 10 minutes, new records appear in the WeatherData table in DynamoDB.

Converting the Data to CSV Format (Lambda Function 3)
The send_data_to_s3 function reads all of the data in the WeatherData table, flattens it, and saves it in the S3 bucket as a CSV file.
Unlike the first two Lambda functions, this one is run on demand (i.e., whenever the tabulated data is needed for analysis). Below is a screenshot of the resulting flat file retrieved from the bucket.

The Final AWS Cloud Solution

The diagram above showcases how the individual AWS services interact with each other to power the weather data pipeline.
The first Lambda function, run every 10 minutes by the EventBridge schedule, collects data from OpenWeatherMap via its API and sends it to the Kinesis stream. The second Lambda function (triggered as soon as new data is added to the stream) takes the API responses from the stream and writes them to the DynamoDB table. The third function (run on user demand) reads and tabulates the data in the database and then stores the output in the S3 bucket as a flat file. This file can be used for subsequent analyses or visualization.
Visualizing the Data (Bonus)
I built a Tableau dashboard to track Berlin weather using the flat file created above. For simplicity’s sake, I limited my visualization to a few metrics:
- Temperature over time
- Average % Cloud Coverage
- Average Duration of Daily Sunlight
- Average Wind Speed
The following is one way to display these metrics:

Here’s the link to the dashboard for those interested: https://public.tableau.com/views/BerlinWeatherTracker/Dashboard1?:language=en-US&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link
Advantages of the Solution
Leveraging AWS solutions allowed me to create a high-performing, resilient real-time data collection pipeline. Here are a few of the main advantages of the presented setup.
Note: This project only focused on data collection in Berlin, Germany. However, I evaluate the solution based on its ability to expand the data collection to every city in Germany (~2000 cities in total).
1. Real-time Data Collection
The Lambda functions can handle the volume of API calls (~2000 calls per 10 minutes), ensuring continuous, real-time collection across Germany.
2. Serverless Architecture
The current solution is entirely serverless, meaning that users don't need to create or manage resources manually to accommodate an increase in traffic. The Lambda functions can scale automatically to handle traffic, while the DynamoDB table can automatically adjust its read and write throughput.
3. Fault Tolerance
The use of the Kinesis stream as an intermediary layer ensures that data is preserved even during database downtime, minimizing data loss.
Limitations of the Solution
While the real-time data collection solution works well, I’d be hesitant to propose this to a client, due to its limitations. Here are a few that stand out to me:
1. Dependence on OpenWeatherMap
The solution relies entirely on the OpenWeatherMap API, making it a single point of failure. Issues such as API rate limits or outages can disrupt the entire data collection process.
2. Cost
Although the solution only provisions necessary resources, services like Lambda, Kinesis, and DynamoDB charge per use. Data collection at a larger scale would increase the cost roughly in proportion to the number of cities, potentially making the solution infeasible.
This project (i.e., collecting data for Berlin alone) has cost about 50 cents per day. Expanding the data collection to every German city (~2,000 cities) would therefore cost on the order of $0.50 × 2,000 ≈ $1,000 per day.
3. Limited Historical Data
The solution only collects nowcasts, rendering analyses requiring past data impossible. One possible fix could involve building a second pipeline that collects all historical weather records from OpenWeatherMap.
Conclusion

My previous project approached solar energy forecasting by focusing primarily on forecasting algorithms. However, this current project has deepened my appreciation for the crucial role Data Engineering plays in forecasting applications. For time-sensitive use cases like solar energy generation, real-time data collection and processing are essential.
If you have any ideas for improving the performance or reducing the cost of this solution, feel free to share them!
You can access the repository with the lambda functions here:
anair123/Berlin-Real-Time-Weather-Data-Collection-With-AWS
Also, if you’re curious about my energy forecasting project that inspired this project, you can check it out here:
Thank you for reading!