Toxic Comment Classification using LSTM and Deployment using AWS EC2

A step-by-step guide to deploy a Deep Learning model as a web application on AWS EC2.

Shaunak Varudandi
Towards Data Science


Photo by Paulo Silva on Unsplash

Introduction

Online forums and social media platforms have provided individuals with the means to put forward their thoughts and freely express their opinion on various issues and incidents. In some cases, these online comments contain explicit language which may hurt the readers. Comments containing explicit language can be classified into myriad categories such as Toxic, Severe Toxic, Obscene, Threat, Insult, and Identity Hate. The threat of abuse and harassment means that many people stop expressing themselves and give up on seeking different opinions.

To protect users from being exposed to offensive language on online forums or social media sites, companies have started flagging comments and blocking users who are found guilty of using unpleasant language. Several Machine Learning models have been developed and deployed to filter out the unruly language and protect internet users from becoming victims of online harassment and cyberbullying.

This article can be regarded as Part 2 of the Toxic Comment Classifier project, and it elaborates on the steps needed to successfully deploy a Deep Learning model on AWS EC2. If you haven't read Part 1 yet, you can do so via the following link.

Problem Statement

“To develop a web application which accurately calculates the toxicity of a statement that has been provided as an input by the user.”

Work Flow

Part 1 of the Toxic Comment Classifier project helped me compare the performance of two different Deep Learning architectures, LSTM and LSTM-CNN. It gave me a good basis and the necessary evidence to conclude that the LSTM model was the right choice for my use case, and I therefore chose to deploy it as the backend of a web application on AWS EC2. For the front-end of the web application, I chose Gradio, a wonderful library that provides all the tools needed to create a responsive front-end for any Machine Learning or Deep Learning model.

Step 1: Creating a directory structure for the entire project.

Before I proceeded with deployment, I finalized the directory structure of my project and divided the complete work done in Part 1 of the Toxic Comment Classifier project into separate Python scripts. Below, I provide a brief description of each script and the purpose it serves.

  • config.py — The first script in the source folder is config.py. This script contains the relative paths to all the data files, saved model files, and fastText’s word-embedding file. It also defines the toxicity classes into which the Deep Learning model classifies a comment, along with important training parameters such as the number of epochs and the batch size.
  • data_cleaning.py — As the name suggests, this script handles data cleaning and normalization. It converts the data to lower-case and removes punctuation, extra whitespace, “\n” characters, emojis, non-English characters, and numbers. The script is used during model training and plays a crucial role in our web application, as it ensures that clean data is sent to the LSTM model for toxicity classification.
  • data_preprocessing.py — This script performs two crucial tasks: first, it converts the cleaned data into sequence vectors and saves the fitted tokenizer as a .pickle file; second, it uses fastText’s word embeddings to create the embedding layer for our Deep Learning model.
  • model_training.py — With the data now clean and ready, the model_training.py script trains our LSTM model. We also use this script to save the trained model, which is later used to classify the toxic comments submitted on our web application.
  • website.py — Present in the website folder, website.py hosts the code for the front-end of our web application, along with relative paths to the tokenizer file and the saved model file. It also imports the components of data_cleaning.py so that a comment submitted on the web application is cleaned and normalized before classification.
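To make the cleaning step concrete, here is a minimal sketch of the kind of normalization data_cleaning.py performs. The function name and the exact regular expressions are my own illustrative choices, not the project's actual code:

```python
import re

def clean_text(text: str) -> str:
    """Normalize a raw comment before it is tokenized (illustrative sketch)."""
    text = text.lower()                        # lower-case characters
    text = text.replace("\n", " ")             # strip embedded "\n"
    text = re.sub(r"\d+", "", text)            # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)      # drop punctuation, emojis, non-English characters
    text = re.sub(r"\s+", " ", text).strip()   # collapse extra whitespace
    return text

print(clean_text("You are SO\nwrong!!! 100%"))  # → you are so wrong
```

Applying the same function at training time and inside website.py is what keeps the two pipelines consistent.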

Step 2: Saving the sequence vectors and the trained LSTM model.

Before we proceed with deployment, we must save the sequence vectors and the trained LSTM model, since both are needed to perform toxic comment classification on the hosted web application. To achieve this, I switched the do_load_existing_tokenizer flag in the “data_preprocessing.py” script to False and then executed the “model_training.py” script. This ensures that a new file is created which saves the sequence vectors. Note: once the sequence vectors are created and saved, switch the do_load_existing_tokenizer flag back to True. The “model_training.py” script also saves the trained LSTM model in .h5 format at the location defined in the “config.py” script.

Step 3: Model Deployment.

Having created and generated all the requisite files for the deployment process, I downloaded the necessary applications, created an account on AWS, and lastly, deployed my application on an AWS EC2 instance. Every step followed to achieve a successful deployment of my LSTM model is elaborated upon below:

1 - Create an account on Amazon Web Services and log into your account. As soon as you log in, search for EC2 in the search bar that appears on top of the AWS Management Console. Upon choosing “EC2”, you are redirected to the EC2 management console wherein you can click on “Instances” from the Resources tab.

(Image by Author)

2 - Clicking on Instances redirects you to a page that contains all your EC2 instances which are either running or stopped. If it’s your first time deploying a Machine Learning or Deep learning Model on EC2, you can create a new instance by clicking on “Launch Instance”, situated on the top-right corner of the screen.

(Image by Author)

3 - The first step in instance creation is to select the desired OS for your EC2 instance; in my case, Ubuntu Server 18.04 LTS (HVM). Next, select the instance type: t2.micro in my case, as it is free-tier eligible. Finally, click on “Review and Launch”, situated in the bottom-right corner of your screen, and then click on “Launch”.

(Image by Author)

4 - A window will now pop up giving you the option to either use an existing key pair or create a new one. I chose “Create a new key pair”, gave it a name, and downloaded the key pair, which comes as a .pem file. After this is done, click on “Launch Instances” and lastly click on “View Instances”.

(Image by Author)

5 - While the instance is being created, we download two important applications: PuTTY and PuTTYgen. PuTTY helps you connect to the Ubuntu server that we just created, and PuTTYgen helps create a private key from the .pem file that we downloaded in the previous step.

6 - Install and Open PuTTYgen, locate your .pem file, and from inside PuTTYgen “load” the .pem file. Next, click on “Save Private Key” and then click on “Yes”. Give a name to your Private Key and “Save” it.

7 - Next, download another application known as WinSCP. This software is important as it lets us connect to the EC2 instance, and later drag and drop the required web application files into our EC2 instance.

8 - Install and open WinSCP. In the login prompt, first enter the hostname, which can be retrieved by selecting your newly created instance and copying the Public IPv4 DNS located in the Details tab that opens in the bottom half of your screen. Paste the hostname into the login prompt. For the user name, type in ubuntu, then click on “Advanced”. Choose “Authentication” from the menu on the left-hand side, upload the private-key file you generated using PuTTYgen under “Authentication parameters”, and press “OK”. Lastly, click on “Login” and then select “Yes”.

(Images by Author)

9 - As soon as you gain access to your Ubuntu server, you can drag and drop the files required to deploy the web application. In my case, I transferred the website.py, data_cleaning.py, config.py, tokenizer.pickle, toxicity_classifier.h5, and requirements.txt files.

(Image by Author)

10 - Once the above operation is complete, install and open PuTTY. PuTTY lets us communicate with our Ubuntu server and install all the requisite libraries that ensure the web application works correctly. To connect to the Ubuntu server, enter the hostname (Public IPv4 DNS) and give the session a name. Once this is done, navigate to the “SSH” option in the side menu, expand it, click on “Auth”, then “Browse”, and upload your private-key file. Navigate back to “Session” using the side menu and “Save” your session. Lastly, click on “Open” and you will be able to access the command prompt of your Ubuntu server.

(Images by Author)

11 - Enter the username as ubuntu, and you can now run commands on your Ubuntu server. Before we proceed with installing the requisite libraries, we need to install pip. Type the command below into the PuTTY prompt and let it do its work.

sudo apt-get update && sudo apt-get install python3-pip
(Images by Author)

12 - One crucial step remains, related to the configuration of your EC2 instance: we need to make sure the instance is accessible from anywhere. To achieve this, navigate to “Network & Security” in the side menu of your EC2 Management Console and click on “Security Groups”. Next, click on “Create Security Group”. Give it a name and a description, then click on “Add rule” to add a new inbound rule. Choose “All Traffic” as the Type and “Anywhere” as the Source, and lastly, click on “Create Security Group”. Now navigate back to your instances, choose the instance we created, right-click, and under “Security”, choose the option to “Change Security Groups”. Search for the security group you just created and click on “Add Security Group”. Do not forget to “Save” these settings.

(Image by Author)

13 - Switching back to PuTTY, with pip3 now installed, run the command below to install all the requisite libraries.

pip3 install -r requirements.txt

14 - Once the requisite libraries have been installed, run the command below in PuTTY and the web application should start running on port 8080.

python3 website.py

15 - Now fetch the Public IPv4 DNS of your EC2 instance, copy it into your browser, append “:8080” at the end, and hit Enter. The web application will now be running in your browser, a clear indication that our application has been successfully deployed on AWS EC2.
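For context, the part of website.py that binds the Gradio front-end to port 8080 presumably looks something like the sketch below. The classify function here is a placeholder standing in for the real cleaning, tokenizing, and LSTM inference code; the key detail is server_name="0.0.0.0", which exposes the app on all network interfaces of the instance so it is reachable via the Public IPv4 DNS:

```python
import gradio as gr

def classify(comment: str) -> dict:
    # Placeholder: the real app cleans the comment, converts it to a
    # padded sequence with the saved tokenizer, and runs the .h5 LSTM model.
    labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
    return {label: 0.0 for label in labels}

iface = gr.Interface(fn=classify, inputs="text", outputs="label")

# Bind to all interfaces on port 8080 so the app is reachable from outside.
iface.launch(server_name="0.0.0.0", server_port=8080)
```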

Warning: once you are done testing or playing around with your deployed web application, it is advisable to terminate your EC2 instance. This prevents the risk of incurring unnecessary service charges. To do so, select your instance, right-click on it, and choose “Terminate Instance” from the menu that pops up.

Conclusion

Deploying the LSTM model using a cloud technology, AWS EC2 in my case, helped me complete an end-to-end Deep Learning project. In the process, I came across Gradio, a wonderful library that helps create a minimal-effort front-end for your Deep Learning application. Essentially, I acquired knowledge of the key steps needed to successfully deploy an application onto a cloud server and ensure that it runs without bugs or glitches. The process might seem time-consuming or a hard slog, but in my opinion, the satisfaction of deploying your Machine Learning or Deep Learning model on the cloud and seeing it produce the desired output is unparalleled.

All the requisite project files for this end-to-end project can be found on my GitHub profile. I hope you enjoyed reading my blog.

