Building a realtime dashboard with Flink: The Backend
With the demand for “realtime” low latency data growing more data scientists will likely have to become familiar with streams. One good place to start is Apache Flink. Flink is a distributed streaming framework that is built specifically for realtime data analysis. Unlike Spark, which runs in batches (even Spark “Streaming” is technically micro-batch), Flink is built on a streaming model (Spark vs. Flink is a long discussion that I will not go into here). Apache Flink can handle very low latency high throughput data. As such, an interesting use case of Apache Flink is to create a realtime data analytics dashboard.
For simplicity sake I have chosen to use the Twitter streaming API as our data source as other data sources often require deserialization schemas that further complicate things. In this first article I will show you how to preprocess Tweets with Flink and perform some basic aggregations/filtering. Then in the second article I will show you how to connect this with a D3.js frontend to make a realtime dashboard. Since there is a lot of code, I have decided to only include the most relevant excerpts, but you can find the GitHub repository with all the code at the bottom of the article.
In order to get setup we define our basic login credentials (you will need a Twitter developer account) and then set the execution environment. For more details about this you can see the official example code from Flink. The only exception is that here we will be using the Filter.java class as the FilterEndpoint. This will allow us to only get Tweets containing specific words. You can do this by declaring
Filter.FilterEndPoint i = new Filter(wordsList,locationsList,null);and then declaring
twitterSourceA.setCustomEndPointInitializer(i);. Where wordsList is a list of words that you want to include and twitterSourceA is the TwitterSource object that you declared with your credentials (as seen in the example). Optionally you can also filter by either location or user ids (the null value).
Apache Flink follows the same functional approach to programming as Spark and MapReduce. You have three major functions to work with Map, FlatMap, and Filter. First we will preform a FlatMap in order to get the relevant information. This part is fairly similar to the example with the exception that I also include a custom tokenizer (i.e. tokenize) to get rid of punctation and numbers. The code for the flatMap is below:
Next we want to filter out the stop words. Defining a filter for this purpose is quick and easy; we simply define a new filter function and then do
tweets.filter(aFilterFunction) see below for full details. Next we define a time window of ten seconds. The time window allows us to see how many occurrences of a word occur in a ten second window. This stream will then feed into sink which will update our charts every ten seconds (in step two).
Additionally, I chose to include one last filter in order to filter out words that appeared less than ten times in the window. Now we have a suitable stream that we can feed into our database and use to update our charts directly. However, there are still many design choices to be made with respect to the storage and display of the data.
In part two of this series I will discuss how to connect our new data stream to our frontend in order to create realtime updating charts as well as some architectural structures for handling realtime streams and achieving persistency.
Full code is available at the following link.