BigData/Streaming: Amazon S3 Data Lake | Storing & Analyzing the Streaming Data on the go (in near real-time) | A Serverless Approach

Building an Amazon S3 data lake by storing streaming data and analyzing it on the go, in near real-time…

Burhanuddin Bhopalwala
Towards Data Science


S3 data lake serverless architecture | photo-credit: aws.amazon.com

Table of contents

  1. What is streaming data, and what are its main challenges (3V’s)?
  2. What is the serverless approach, and why should we use it?
  3. Prerequisites — AWS + AWS Kinesis Firehose Delivery Stream + AWS Kinesis Producer/Consumer + AWS Lambda + AWS S3 Storage + AWS Athena — Bird’s-eye view
  4. AWS Kinesis Delivery Stream Setup — Step by Step
  5. BONUS — Pro Tips!

1: What is streaming data, and what are its main challenges (3V’s)?

Streaming data simply means a continuous flow of data. Today, in the age of the Internet, devices like smartphones, smartwatches, and GPS sensors are popular sources of streaming data.

Note: The ecosystem created by all of these devices connecting to each other over the Internet is what we call the Internet of Things (IoT).

3 main challenges with streaming data (3V’s):

1: Velocity (Throughput/min): From Gigabytes/min (GB/min) to Terabytes/min (TB/min). Consuming streaming data at this velocity without any loss of information is non-trivial.

2: Variety (Types): If the data is unstructured (JSON/audio/video formats), it becomes much harder to analyze, especially while the data is in flight (to be covered in future articles) as compared to when the data is at rest!

3: Volume (DB Size): From Terabytes (TBs) to Petabytes (PBs) to Exabytes (EBs), storing streaming data requires a lot of space, and scanning such raw data to query over it also becomes a challenging task.

Tip: These 3V’s are the key parameters that decide the data category (small, medium, or big data) and hence also play a key role in deciding the type of database/storage solution we need. Confused? See below:

Difference between small, medium & big-data based on 3V’s

2: What is the serverless approach, and why should we use it?

Let me start by clearing up a common myth and confusion around the word “serverless”: serverless does not mean performing computation without servers. It simply means delegating the responsibility of managing the servers to a cloud service provider (AWS, Google Cloud Platform (GCP), or Microsoft Azure), so that we can always focus on the business logic!

So, why should we use a serverless approach?

I have already mentioned the 3 main challenges with streaming data. Handling them ourselves requires not only a lot of team effort but also constant maintenance. Auto-scaling/elasticity is also not trivial to build. Eventually, more cost!

But in serverless, it’s the opposite: we require minimal maintenance, the cloud service provider auto-scales for us, and we end up with far less maintenance and cost!

Note: In BONUS — Pro Tips, I will also share how to configure the delivery streams so that they cost as little as possible.

3: Prerequisites — AWS + AWS Kinesis Firehose Delivery Stream + AWS Kinesis Producer/Consumer + AWS Lambda + AWS S3 Storage + AWS Athena — Bird’s-eye view

AWS Kinesis Delivery Stream — Lambda — S3 — Athena(Data Analysis) | photo-credit: SDS

AWS: Amazon Web Services (AWS) is the cloud provider we are using. One can use Google Cloud Platform (GCP) or Microsoft Azure and their respective equivalent services.

  • AWS Kinesis Firehose Delivery Stream: Kinesis is essentially a managed (serverless) counterpart of Apache Kafka. AWS offers 2 options for using Kinesis: Kinesis Data Streams (for real-time) and Kinesis Firehose Delivery Streams, a near-real-time (~60-sec delay) service. This blog will use AWS Kinesis Firehose delivery streams going ahead.
  • AWS Kinesis Producer: The AWS Kinesis Producer SDK (preferred for high performance) and the AWS Kinesis Agent are 2 popular ways of shipping data to AWS Kinesis.
  • AWS Kinesis Consumer: If you want to consume the data apart from storing it, you can do that using the AWS Kinesis Consumer SDK, the AWS Kinesis Client Library (KCL, which uses AWS DynamoDB under the hood), or even the AWS Kinesis Connector Library if you are a Java lover. A minimal consumer sketch follows this list.
Kinesis uses shards, just like Apache Kafka, for data distribution | photo-credit: SDS
  • AWS Lambda: We will transform our data stream records in flight using AWS Lambda, a virtual function from AWS that we use as a service, a.k.a. `function-as-a-service`. It’s easy to use and cost-efficient. https://aws.amazon.com/lambda/
  • AWS S3: We will use AWS S3 for our data lake. It’s one of the simplest, most reliable, and most cost-efficient AWS services.
  • AWS Athena: For analyzing the data stored in AWS S3, we will use AWS Athena, which is generally used for analytical and ad-hoc queries over S3.
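
As mentioned in the consumer bullet above, here is a minimal consumer sketch using boto3 against a Kinesis Data Stream (the stream name and region are placeholder assumptions; note that Firehose delivery streams deliver straight to S3 and are not consumed this way, and a production consumer would normally use the KCL for sharding and checkpointing):

# Minimal Kinesis Data Streams consumer sketch using boto3.
# "my-demo-stream" and the region are illustrative placeholders.
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream_name = "my-demo-stream"

# Pick the first shard; a real consumer iterates over all shards (or lets the KCL do it).
shard_id = kinesis.describe_stream(StreamName=stream_name)["StreamDescription"]["Shards"][0]["ShardId"]

# Start reading from the oldest record still retained in the shard.
shard_iterator = kinesis.get_shard_iterator(
    StreamName=stream_name,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

# Fetch one batch of records and print the payloads.
for record in kinesis.get_records(ShardIterator=shard_iterator, Limit=100)["Records"]:
    print(record["Data"].decode("utf-8"))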

4: AWS Kinesis Firehose Delivery Stream setup — Step by Step

Step 1: AWS Kinesis Producer: I am using the AWS Kinesis Agent here, as in my example the data is being written directly to a file. The AWS Kinesis Agent will transfer these file objects to S3 via the AWS Kinesis Firehose delivery stream.

The Kinesis Agent needs to be installed on the machine where you are receiving the streaming data or generating the log data.

$ sudo yum install -y aws-kinesis-agent
...
1081 packages excluded due to repository priority protections
Package aws-kinesis-agent-1.1.3-1.amzn1.noarch installed and on latest version

$ cd /etc/aws-kinesis
$ sudo vim agent.json

Open agent.json (from the commands above) and add your IAM credentials and the path of the file where the streaming or log data is being written. Below is a sketch of what the agent.json file looks like:
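
The original gist is not reproduced here, but a minimal agent.json, assuming a hypothetical log path and delivery stream name (credentials can also come from an attached IAM role instead of being hard-coded), looks roughly like this:

{
  "cloudwatch.emitMetrics": true,
  "firehose.endpoint": "firehose.us-east-1.amazonaws.com",
  "awsAccessKeyId": "YOUR_ACCESS_KEY_ID",
  "awsSecretAccessKey": "YOUR_SECRET_ACCESS_KEY",
  "flows": [
    {
      "filePattern": "/var/log/myapp/*.log",
      "deliveryStream": "my-firehose-delivery-stream"
    }
  ]
}

The "flows" section is what maps a local file pattern to the Firehose delivery stream we create in Step 3.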

You can also do the same programmatically using the AWS Kinesis SDK; a Python (boto3) sketch is included after the note below.


Note: I have also used JavaScript for a Kinesis Producer Library example, using the npm package aws-kinesis-producer.
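
Since the original code embed is not shown here, below is a minimal boto3 producer sketch instead (the delivery stream name and the record payload are illustrative assumptions, not the original code):

# Minimal Kinesis Firehose producer sketch using boto3.
# The stream name and payload fields are illustrative placeholders.
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Newline-delimited JSON, so Athena can later read one object per line.
records = [
    {"Data": (json.dumps({"device_id": i, "temperature": 20 + i}) + "\n").encode("utf-8")}
    for i in range(10)
]

# put_record_batch ships up to 500 records per call, which is cheaper and
# faster than calling put_record once per event.
response = firehose.put_record_batch(
    DeliveryStreamName="my-firehose-delivery-stream",
    Records=records,
)
print("Failed records:", response["FailedPutCount"])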

Step 2: Let’s set up AWS S3 now. We just need to create an AWS S3 bucket. While creating the bucket, you can also choose default S3 Server-Side Encryption (SSE-S3) for encryption at rest.
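
For reference, a minimal boto3 sketch of the same two actions (the bucket name is a placeholder):

# Minimal sketch: create the data-lake bucket and enable default SSE-S3.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.create_bucket(Bucket="my-streaming-data-lake")

# Default server-side encryption (AES-256) so every object is encrypted at rest.
s3.put_bucket_encryption(
    Bucket="my-streaming-data-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)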

Step 3: Now we set up the AWS Kinesis Firehose delivery stream. You can find the options below. Partitioning is a must so that Athena can scan the data faster; I am partitioning the data hour-wise below.

AWS Kinesis Delivery — Options
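
If you prefer to configure the same options programmatically rather than in the console, a rough boto3 sketch is below (the stream name, bucket ARN, and IAM role ARN are placeholders); the custom prefix is what produces the hourly dt=YYYY-MM-dd-HH/ folders that Athena will partition on later:

# Rough sketch: Firehose delivery stream with an hourly partitioned S3 prefix
# and GZIP compression. ARNs and names are illustrative placeholders.
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

firehose.create_delivery_stream(
    DeliveryStreamName="my-firehose-delivery-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-streaming-data-lake",
        # Hour-wise partitioning: produces keys like data/dt=2020-01-20-08/...
        "Prefix": "data/dt=!{timestamp:yyyy-MM-dd-HH}/",
        "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/dt=!{timestamp:yyyy-MM-dd-HH}/",
        # Buffer up to 64 MB or 60 seconds, whichever comes first.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 60},
        "CompressionFormat": "GZIP",
    },
)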

If you want to use AWS Lambda for record transformation in flight, you can do that as well. You can find the manual transformation code below:
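
The original transformation gist is not reproduced here; a minimal Firehose transformation Lambda in Python follows the standard recordId/result/data contract (the actual field manipulation shown is only an illustrative assumption):

# Minimal Kinesis Firehose transformation Lambda. Firehose passes in
# base64-encoded records and expects recordId/result/data back.
import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Illustrative transformation only: normalize one field and append a
        # newline so Athena can read one JSON object per line.
        payload["device_id"] = str(payload.get("device_id", "")).upper()
        transformed = json.dumps(payload) + "\n"

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}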

Note: A standard record transformation to JSON is also available. You can find it here: https://github.com/aws/aws-lambda-java-libs

Step 4: Finally, AWS Athena for querying over our AWS S3 data lake. Excited? Below are both the query for creating a database/table in Athena over the S3 data and a SELECT query for reading the data.

AWS Athena — For creating the Database from AWS S3
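
The exact DDL is not reproduced here, but for newline-delimited JSON records under the hourly dt= prefix it would look roughly like the following (database, table, column names, and bucket path are placeholder assumptions):

-- Create a database and an external table over the Firehose output in S3.
CREATE DATABASE IF NOT EXISTS streaming_lake;

CREATE EXTERNAL TABLE IF NOT EXISTS streaming_lake.device_events (
  device_id string,
  temperature double
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-streaming-data-lake/data/';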

If your data is already partitioned inside S3, you can also register the existing S3 partitions in Athena by running the following ALTER TABLE command in the Athena console.

AWS Athena — ALTER command over S3 Partitioning
AWS Athena — SELECT query
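
For completeness, the partition registration and the read query would look roughly like this (again with placeholder names, and a dt value matching the hourly folders written by Firehose):

-- Register one existing hourly partition that Firehose already wrote to S3.
ALTER TABLE streaming_lake.device_events
ADD IF NOT EXISTS PARTITION (dt = '2020-01-20-08')
LOCATION 's3://my-streaming-data-lake/data/dt=2020-01-20-08/';

-- Query only that partition so Athena scans (and bills for) less data.
SELECT device_id, temperature
FROM streaming_lake.device_events
WHERE dt = '2020-01-20-08'
LIMIT 10;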

Final Results:

Partitioned data — S3 via AWS Kinesis Firehose Delivery Streams
$ aws s3 cp s3://<bucket-key-name>/<sub-bucket-key-name>/dt=2020-01-20-08/ . --recursive | xargs -rn 1 gzip -d *
data-1-2020-01-20-08-10-00-04453c-3e98-47a3-b28d-521ae9ff9b3d.log
data-1-2020-01-20-08-10-15-04453c-3e98-47a3-b28d-521ae9ff9b3d.log
data-1-2020-01-20-08-10-30-04453c-3e98-47a3-b28d-521ae9ff9b3d.log
data-1-2020-01-20-08-10-45-04453c-3e98-47a3-b28d-521ae9ff9b3d.log
AWS Athena — Results after query over S3 data

Note: You might be wondering how AWS Athena crawls and scans the S3 data. It uses the AWS Glue Data Catalog and its crawlers (similar to an Extract, Transform, Load (ETL) job), which do all the heavy lifting under the hood. https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html
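
If you would rather let Glue infer the schema instead of writing the DDL by hand, a crawler can also be created and started programmatically; a rough boto3 sketch (crawler name, IAM role, database, and S3 path are placeholder assumptions):

# Rough sketch: create and run a Glue crawler over the S3 prefix so the table
# and partitions land in the Glue Data Catalog that Athena queries.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="streaming-lake-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="streaming_lake",
    Targets={"S3Targets": [{"Path": "s3://my-streaming-data-lake/data/"}]},
)
glue.start_crawler(Name="streaming-lake-crawler")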

That’s it. It’s that easy…

Mention: Machine learning can also be invoked through SQL queries in this ecosystem. RANDOM_CUT_FOREST (for outlier detection) and HOTSPOTS (for finding dense areas) are popular functions for that. That’s easy too…

5: As promised initially, BONUS — Pro Tips!

  • Performance: Always prefer the AWS Kinesis SDK libraries for high performance and throughput. They also support batching.
  • Performance: If you want to consume data directly from a Kinesis Data Stream, you can use Kinesis Enhanced Fan-Out, which gives each consumer a dedicated 2 MB/s of read throughput per shard instead of sharing that capacity with all other consumers.
  • Cost: Use GZIP compression to reduce object size. To restore, just run “gzip -d <s3-file-keyname>” to get the data back in its raw format. GZIP can compress this kind of data by roughly 75%, so you can end up saving up to ~75% of the S3 cost. AWS S3 Standard storage usually costs about $0.03/GB per month.

Note: GZIP offers good compression ratios but slower decompression and higher CPU usage compared to lighter-weight codecs; for S3 cost savings, GZIP is usually preferred!

  • Cost: Use AWS S3 Lifecycle Rules. By default, AWS S3 stores each object in the Standard storage class (i.e., the frequent-access class). Lifecycle rules can transition objects to Standard-IA (Infrequent Access) over time, which is significantly cheaper than the Standard class (a minimal sketch follows this list).
  • Security: Always enable S3 Server-Side Encryption (SSE) for security. For even more sensitive data, you can use SSE-C (server-side encryption with customer-provided keys), in which you pass your own encryption key over HTTPS with each request.
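
As referenced in the lifecycle tip above, a minimal boto3 sketch of such a rule (bucket name, prefix, and the 30-day threshold are placeholder assumptions):

# Minimal sketch: move objects under the data/ prefix to Standard-IA after 30 days.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-streaming-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "move-old-stream-data-to-ia",
                "Status": "Enabled",
                "Filter": {"Prefix": "data/"},
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            }
        ]
    },
)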

Thanks for reading. I hope you find this blog helpful. Please stay tuned for more such upcoming blogs on cutting-edge Big Data, Machine Learning & Deep Learning. The best is yet to come :)

That’s pretty much it!
