Real-time analytics in the aviation industry - Part 1

Our challenges introducing real-time analytics over flight data to deliver valuable insights

Maikel Penz
Towards Data Science


Photo by Joseph Bradshaw on Unsplash

At Spidertracks we work to help make aviation safer by providing real-time aircraft tracking as well as transforming flight data into valuable insights. As part of the company’s journey to become a data-driven organisation, I am currently involved in a project that empowers customers to define what a typical flight looks like and gives our platform the responsibility of triggering events - also referred to as safety events - if the unexpected were to occur.

This project comes with interesting problems to solve, and I will focus this series of two articles on the technical aspects of the event generation engine: our requirements, picking a stream processing technology, and the challenges and limitations we encountered.

To start with, there are four pieces of the puzzle to be aware of:

Insights data pipeline
  • The data source is composed of flight metrics, like aircraft attitude and location, captured from the aircraft during the flight and ingested into AWS.
  • Customer configurations (e.g. what an acceptable pitch value is under a given altitude during a rescue flight) are stored in a NoSQL database.
  • The event generation engine joins customer configurations with incoming flight metrics to decide whether or not an event must be generated.
  • The destination is where generated events are placed for consumption.
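To make the join between customer configurations and flight metrics concrete, here is a minimal Python sketch of the check the engine performs on each row. The field names (`altitude_ft`, `pitch_deg`) and threshold values are hypothetical, not our actual schema:

```python
# Hypothetical customer configuration: an acceptable pitch range that
# only applies below a given altitude (names and values are illustrative).
config = {
    "max_altitude_ft": 500,   # the rule applies below this altitude
    "min_pitch_deg": -10.0,
    "max_pitch_deg": 15.0,
}

def breaches_config(row, config):
    """Return True if a flight-metric row violates the customer configuration."""
    if row["altitude_ft"] >= config["max_altitude_ft"]:
        return False  # the rule only applies under the configured altitude
    pitch = row["pitch_deg"]
    return not (config["min_pitch_deg"] <= pitch <= config["max_pitch_deg"])
```

The engine applies a check like this to every incoming row; whether a breach actually becomes an event depends on state, as the requirements below explain.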

The data source was already in place before the project started, and we had a clear understanding of how to store customer configurations in the database. The complexity of this project’s architecture comes from building the event generation engine.

The Event Generation Engine

Requirements

  • Low latency between data being ingested to AWS and events made available to our customers;
  • The engine must be stateful, for two main reasons:
    a) Events should only be generated if the current incoming row meets the configuration criteria and the previous row does not. This guarantees that duplicate events are not generated while a threshold remains breached.
    b) During a flight the aircraft may not be under full coverage, which means data may be ingested sparsely. However, the pipeline should behave the same way as if the data had been made available in real-time.
  • The architecture should be future-proof, but we understand the product needs to reach a small group of customers as quickly as possible to validate our hypothesis.
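The two statefulness requirements can be sketched together in a few lines of Python: events fire only on the rising edge (requirement a), and rows are processed in event-time order regardless of how sparsely they arrived (requirement b). The function and field names are illustrative, not our production code:

```python
def generate_events(rows, breaches):
    """Emit an event timestamp only when a row breaches the criteria and
    the previous row (in event time) did not - the rising edge.

    Rows are sorted by their event timestamp first, so sparsely
    delivered data produces the same events as real-time delivery.
    """
    events = []
    previous_breached = False
    for row in sorted(rows, key=lambda r: r["timestamp"]):
        breached = breaches(row)
        if breached and not previous_breached:
            events.append(row["timestamp"])
        previous_breached = breached
    return events
```

For example, with a criterion of `value > 10`, the sequence `5, 12, 13, 4, 11` (in event-time order) produces exactly two events: one when the value first crosses the threshold, and another after it drops back below and crosses again.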

Selecting the stream processing technology: the options

With the requirements defined, we started an investigation to identify which stream processing technology would fit our needs. From a technical standpoint we were looking for something that:

  • Runs on AWS
  • Enables us to deliver fast and iterate on our approach
  • Demands low maintenance
  • Is reliable

After some research we came across a few good candidates:

  • Spark Streaming: framework for near real-time stream processing with good documentation and community support. On AWS, we could run it through AWS Glue (serverless) or AWS EMR (managed cluster).
  • Apache Flink: a true real-time stream processing framework - an advantage over Spark Streaming’s near real-time micro-batches - but in my opinion with weaker community support. Apache Flink also runs on AWS EMR (managed cluster), but its serverless offering on AWS goes through AWS Kinesis Analytics.
  • Kinesis Analytics for SQL: a service to analyse streaming data in real-time using SQL. Two key differences from the other candidates: Kinesis Analytics for SQL is a managed cloud service rather than a framework, and the stream processing is expressed in SQL rather than in a general-purpose programming language.

Kafka Streams would only have been a viable option if we were already running our ingestion pipeline through Kafka.

Selecting the stream processing technology: the decision

We decided to go with Kinesis Analytics for SQL. I will start by explaining why we discarded Spark Streaming and Apache Flink for our first iteration:

  • We realised that the learning curve to develop and troubleshoot Apache Flink applications is quite steep, and it was not feasible given our deadline. For context, Apache Flink applications are primarily developed in Java, while our stack has been built using Python. PyFlink is a library that enables Python development with Flink, but it is not available through a serverless service on AWS (Kinesis Analytics for Apache Flink only supported Java at the time of writing this article).
  • Spark Streaming was a better fit - and we do have in-house experience with Apache Spark - but AWS’s serverless offering through AWS Glue doesn’t seem to be as cost-effective as running streaming applications through Kinesis Analytics.

Why Kinesis Analytics for SQL:

  • Serverless: with the deadline for our first iteration in mind, we understood our efforts should go into the logic to generate events rather than into learning how to maintain a platform. This proved to be very important to us.
  • With good SQL experience in the team, we felt that learning streaming SQL was achievable.
  • Automatic scaling: Kinesis Analytics scales automatically to match the volume and throughput of incoming data.

What is coming next?

In Part 2 of this series I will dive into Kinesis Analytics for SQL: how it fits into our data pipeline, what surprises we had along the way and what the future of our event generation engine might look like.
