A system design for a scalable data pipeline covering ingestion, NLP, and storage, implemented on AWS.

I am a Data Scientist at a manufacturing company, and more often than not I am tasked with building the computational units, database schemas, and infrastructure for the data platform where the ML model is deployed, frequently involving edge processing.
As a curious soul, I always wanted to build a scalable big data pipeline for an end-to-end Machine Learning model on the cloud. I usually deal with controller-level or sensor data, which is numerical, so I wanted to work on something different from my daily work: text data.
I applied for the Twitter developer API. After a few days and some back-and-forth inquiries from Twitter, I was granted access to the developer API. That means I can now use the API to get a stream of live tweets and query historical tweets going back as far as a week.
The entire project can be found on GitHub.
Project Overview
The Twitter API is used to fetch tweet data. The data is fetched in 15-minute bursts, looking back at up to 100 tweets for every 15-minute window within a specified timeframe. The tweet text, along with some user metadata, is stored in the staging area. The `tweet_id` and `text` (the tweet itself) are pushed to the Kinesis data stream for further processing.
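A minimal sketch of that producer step (the stream name and the shape of the fetched tweets are assumptions; the actual logic lives in `ingest.py` in the repo):

```python
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def push_to_stream(tweets, stream_name="tweets-stream"):
    """Push the tweet_id and text of each fetched tweet to Kinesis.

    `tweets` is assumed to be a list of dicts with `id` and `text` keys;
    the stream name and region are placeholders.
    """
    for tweet in tweets:
        record = {"tweet_id": tweet["id"], "text": tweet["text"]}
        kinesis.put_record(
            StreamName=stream_name,
            Data=json.dumps(record),
            PartitionKey=str(tweet["id"]),  # spreads records across shards
        )
```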
The Kinesis data stream is consumed, and for each tweet, sentiment analysis is performed using AWS Comprehend, an ML-as-a-service offering. The sentiment analysis API returns scores for the Positive, Negative, Neutral, and Mixed categories, along with a final verdict as a label. The label is simply a string naming the category with the highest probability. A consumer consumes these results and writes the data to the sentiment bucket.
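A sketch of that consumer step, assuming a hypothetical `tweet-sentiment-bucket` for the results (the `detect_sentiment` call is the real `boto3` Comprehend API; everything else is illustrative):

```python
import json

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")
s3 = boto3.client("s3")

def analyze_and_store(record, bucket="tweet-sentiment-bucket"):
    """Run Comprehend on one tweet and write the labelled result to S3."""
    result = comprehend.detect_sentiment(Text=record["text"], LanguageCode="en")
    payload = {
        "tweet_id": record["tweet_id"],
        "label": result["Sentiment"],        # POSITIVE / NEGATIVE / NEUTRAL / MIXED
        "scores": result["SentimentScore"],  # probability per category
    }
    s3.put_object(
        Bucket=bucket,
        Key=f"sentiment/{record['tweet_id']}.json",
        Body=json.dumps(payload),
    )
```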
Here, each Kinesis data stream holds at most 100 tweets. Once the data is ingested and all Kinesis data streams are processed, the Elastic MapReduce (EMR) service is used to process the data. EMR processes data in a distributed manner with the Hadoop and Spark frameworks plus Hive, all in a fully managed cluster, and we can SSH into the master instance to perform operations on the cluster. The EMR script, which uses `pyspark`, is first copied from the S3 bucket to the EMR master instance via the AWS CLI. EMR loads the staging and sentiment data from the buckets and merges them on `tweet_id`. It then aggregates over 15-minute time windows to calculate the average sentiment scores and the percentage of tweets in each sentiment category within that timeframe. That way, the aggregated data can be visualized without re-aggregating thousands of tweets every time a user wants to see the analytics. EMR stores all the tweet data with the merged results in an S3 bucket, and the aggregated data is stored in a separate location (a different S3 folder or a different S3 bucket). This way, we have a transactional view of all tweets and their prevailing sentiments, plus a data mart of windowed aggregates that can be used to visualize the data with time as a parameter.
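A condensed sketch of what such a `pyspark` script might look like, with the bucket paths and column names (`created_at`, `label`, `scores`) as assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tweet-aggregation").getOrCreate()

# Load staging (tweet text + user metadata) and sentiment results from S3;
# paths and schemas are assumptions.
staging = spark.read.json("s3://tweet-staging-bucket/") \
    .withColumn("created_at", F.to_timestamp("created_at"))
sentiment = spark.read.json("s3://tweet-sentiment-bucket/sentiment/")

# Merge the two datasets on tweet_id and persist the transactional view.
merged = staging.join(sentiment, on="tweet_id")
merged.write.mode("overwrite").parquet("s3://tweet-results-bucket/transactional/")

# Aggregate over 15-minute windows: average score per sentiment category
# plus tweet counts per label, from which percentages can be derived.
aggregated = (
    merged
    .groupBy(F.window("created_at", "15 minutes"), "label")
    .agg(
        F.avg("scores.Positive").alias("avg_positive"),
        F.avg("scores.Negative").alias("avg_negative"),
        F.avg("scores.Neutral").alias("avg_neutral"),
        F.avg("scores.Mixed").alias("avg_mixed"),
        F.count("tweet_id").alias("tweet_count"),
    )
)
aggregated.write.mode("overwrite").parquet("s3://tweet-results-bucket/aggregated/")
```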
AWS Redshift is a fully managed data warehouse where tables are created using SQL commands. These tables hold the transactional and aggregated data stored in the buckets. To load the data from the S3 buckets into the Redshift data warehouse, COPY commands are used: a connection is made to the Redshift cluster (via SQL Workbench or the Redshift query editor), and COPY commands are run on the cluster to pull data from the buckets into the tables.
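As an illustration, the COPY step could also be issued from Python through the Redshift Data API; every identifier below (cluster, database, table, bucket, IAM role) is a placeholder:

```python
import boto3

client = boto3.client("redshift-data", region_name="us-east-1")

# COPY pulls the Parquet files EMR wrote to S3 into a Redshift table.
client.execute_statement(
    ClusterIdentifier="tweets-cluster",
    Database="tweets",
    DbUser="admin",
    Sql="""
        COPY tweets_aggregated
        FROM 's3://tweet-results-bucket/aggregated/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
        FORMAT AS PARQUET;
    """,
)
```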
Grafana, running on an EC2 instance, is used for visualization.
System Architecture
The project runs on AWS, using the following services:
- S3: For storage of staging data, sentiment data, transactional tweet data, and aggregated data
- EC2: To run `ingest.py` and `consumer.py`. You can also run them on your own laptop; this is optional.
- EMR Cluster: To process all data produced in batches, merging data and running aggregations before storing the results back to S3.
- Redshift Cluster: Ultimately, the data ends up in Redshift.
- AWS Comprehend: Used to get sentiment from the tweet text. It can be replaced with a local NLP model, or with an open-source or enterprise ML model behind an HTTP endpoint. Comprehend is accessed through the `boto3` Python API here.

Here, S3 is used as the primary storage for each processing unit. AWS S3 provides very cheap storage for flat files, with high availability and use-case flexibility.
Ingestion is kept decoupled from the NLP workflow because NLP is a separate concern altogether; this way, ingestion is independent of the ML workflow. The decoupling also allows more than one NLP service to run in parallel off the ingested data without delaying data acquisition.
AWS Kinesis works as a temporary storage mechanism that lets downstream components and the NLP step retrieve data quickly. Once data is produced, it is consumed by the consumer. There can be more than one consumer, depending on the kind of analytics performed on the text, and all of them are independent of one another. Moreover, if a task fails, it is easy to run it again in complete isolation, in terms of both processing and storage.
EMR provides a fleet of high-powered EC2 instances running widely used distributed processing frameworks such as Hadoop and Spark, with the capacity to process terabytes or even petabytes of data. EMR writes its output to S3 buckets rather than directly to Redshift for several reasons. A number of different sub-systems may want to consume the processed and aggregated data. S3 storage is far cheaper than Redshift, where we pay for space by the hour; likewise, S3 reads and writes, billed per request and per data volume, are cheaper than Redshift reads. Redshift's primary goal is to provide a big-picture view of the data and fast queries over historical data, and its querying is much faster than S3's. Hence, S3 is used to keep costs down when this system is part of a bigger architecture with many microservices.
Once the data is loaded into the Redshift databases, data visualization systems like Grafana can pull the data and visualize it.
Installation and Usage
I am a fan of Docker, as it makes deploying several computational modules easy without my having to worry about the host environment. For this project, I have created a `docker-compose` setup. I will not go into much detail on this segment, as it is already described in my GitHub repo.
Dashboard
I hosted the dashboard with Grafana, as it comes naturally to me to use Grafana for any project. It has a rich set of plugins and goes well with time-series data. I created a quick dashboard using it.
Provision a Linux/Windows machine on AWS and make sure it has access to the Redshift cluster. Install Grafana on it, connect it to the Redshift cluster, and start visualizing the data.
I ran the entire pipeline for 4 days and collected around 33K tweets with the query string `biden`.
The interactive dashboard below shows the percentage of tweets in each category in the first row, using gauge panels.
In the second row, I show the total number of tweets as a label and the tweets collected in each 15-minute interval over the timeline.
In the last row, I show a pie chart with the average sentiment scores (from the NLP step) for each category over the period, along with their timeline.

Conclusion
Ingestion works sequentially. Upon a fault, the system may be left in a state where we have to run the ingestion again starting from the last partition. It should be parallelized, since previous loads are independent of the current ongoing loads.
Possible Solution: A scheduler like Airflow can be used to run partitioned, parallel data ingestion.
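A sketch of what that could look like as an Airflow DAG, where `ingest_partition` is a hypothetical wrapper around the existing ingestion logic for a single 15-minute window:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_partition(window_start, **context):
    """Hypothetical wrapper: fetch and stage tweets for one 15-minute window."""
    ...

with DAG(
    dag_id="tweet_ingestion",
    start_date=datetime(2021, 1, 1),
    schedule_interval=timedelta(minutes=15),
    catchup=True,        # missed windows backfill as independent runs
    max_active_runs=8,   # how many windows may ingest concurrently
) as dag:
    PythonOperator(
        task_id="ingest_window",
        python_callable=ingest_partition,
        op_kwargs={"window_start": "{{ ts }}"},  # logical timestamp of the window
    )
```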
The consumer works as a stand-alone consuming unit and runs sequentially. It is heavily dependent on a third-party API (an API that is not part of this system). As a result, any recurring delay in the third-party API can cause a massive delay in consumption, and with each consuming cycle the backlog can grow as the producer keeps producing data.
Possible Solution: There should be a 1:1 relationship between produced and consumed data packets. AWS Kinesis Firehose works as a data stream delivery system, and a Lambda function can be attached to each produced data stream to handle its processing in a serverless manner.
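A sketch of such a Lambda handler, triggered per Kinesis batch (the event structure is the standard shape for Kinesis triggers; the bucket name and record shape are assumptions carried over from the consumer above):

```python
import base64
import json

import boto3

comprehend = boto3.client("comprehend")
s3 = boto3.client("s3")

def handler(event, context):
    # Each invocation receives a batch of Kinesis records to process serverlessly.
    for record in event["Records"]:
        data = json.loads(base64.b64decode(record["kinesis"]["data"]))
        result = comprehend.detect_sentiment(Text=data["text"], LanguageCode="en")
        s3.put_object(
            Bucket="tweet-sentiment-bucket",  # assumed bucket name
            Key=f"sentiment/{data['tweet_id']}.json",
            Body=json.dumps({
                "tweet_id": data["tweet_id"],
                "label": result["Sentiment"],
                "scores": result["SentimentScore"],
            }),
        )
```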
Moreover, you can run ingestion under a scheduler like Airflow to execute it regularly, making it a routine batch process. With a slight change in the design, you can also make it work for real-time tweet sentiment analysis using Spark Streaming.