
Deep learning pipeline for Natural Language Processing (NLP)


Practical implementation of NLP, unsupervised machine learning and deep learning concepts on unlabeled text data.

Photo by h heyerlein on Unsplash

In this article, I will explore the basics of Natural Language Processing (NLP) and demonstrate how to implement a pipeline that combines a traditional unsupervised learning algorithm with a deep learning algorithm to train on a large unlabeled text dataset. The main objective is to show how to set up a pipeline that covers collecting and creating raw text data, preprocessing and categorizing the unlabeled text, and finally training and evaluating deep learning models in Keras.

After reading this tutorial, you will know how to:

  1. Collect data from Twitter via the Twitter API and the Tweepy Python package
  2. Efficiently read and clean up a large text dataset with pandas
  3. Preprocess text data using basic NLP techniques and generate features
  4. Categorize unlabeled text data
  5. Train, compile, fit and evaluate deep learning models in Keras

Find my Jupyter notebooks with Python code in my GitHub here.

Roll up your sleeves, we have a lot of work to do. Let’s get started…


Data

At the time of this project, the 2020 US election was just around the corner, so it made sense to do a sentiment analysis of election-related tweets to learn about the kinds of opinions and topics being discussed on Twitter roughly two weeks before election day. Twitter is a great source of unfiltered opinions, as opposed to the typically filtered news we see from major media outlets. As such, we are going to build our own dataset by collecting tweets through the Twitter API and the Tweepy Python package.


Step 1: Data collection

Prerequisites

Before getting started with streaming data from Twitter, you must have the following:

  1. Twitter account and Twitter API consumer keys (access token key, access token secret key, consumer key and consumer secret key)
  2. Tweepy package installed in your Jupyter notebook

Setting up a Twitter account and retrieving your Twitter API consumer keys is out of scope of this article. Should you need help with those, check out this post.

Tweepy can be installed via pip in a Jupyter notebook; the following one-line command will do the trick.

# Install Tweepy
!pip install tweepy

Once installed, go ahead and import the package into your notebook.
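Besides Tweepy itself, the snippets in this section also use the json and pandas packages, so a minimal set of imports, collected here for convenience, looks like this:

# Imports used throughout the data collection step
import json
import pandas as pd
import tweepy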

1.1 Set up data streaming pipeline

In this section, I will show you how to set up your data streaming pipeline using the Twitter API, Tweepy and a custom stream listener. We can achieve this in 3 steps:

  1. Set up your Twitter API consumer keys
  2. Set up a Twitter API authorization handler
  3. Write a custom listener class that listens to and streams live tweets
# Twitter API consumer keys
access_token = "  insert your key here  "
access_token_secret = "  insert your key here  "
consumer_key = "  insert your key here  "
consumer_secret = "  insert your key here  "
# Twitter API authorization
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
# Custom listener that streams data from Twitter (20,000 tweets at most per instance)
class MyStreamListener(tweepy.StreamListener):
    """Listener that streams live tweets into a text file"""
    def __init__(self, api=None):
        super(MyStreamListener, self).__init__()
        self.num_tweets = 0
        self.file = open("tweets.txt", "w")

    def on_status(self, status):
        # Write each incoming tweet as one JSON object per line
        tweet = status._json
        self.file.write(json.dumps(tweet) + '\n')
        self.num_tweets += 1
        if self.num_tweets < 20000:
            return True
        else:
            self.file.close()
            return False

    def on_error(self, status):
        print(status)

1.2 Start streaming live tweets

Now that the environment is set up, you are ready to start streaming live tweets from Twitter. Before doing that, identify some keywords that you would like to use to collect the relevant tweets of interest to you. Since I will be streaming tweets related to the US election, I have picked relevant keywords such as "US election", "Trump", "Biden", etc.

Our goal is to collect at least 400,000 tweets so that the text dataset is large enough, and it is computationally taxing to collect all of that in one go. Thus, I will set up the pipeline to stream the data efficiently in chunks. Notice that the custom listener above streams at most 20,000 tweets per chunk. To collect over 400,000 tweets, we will therefore need to run at least 20 chunks.

Here is how the code looks for the chunk that listens and streams live tweets into a pandas DataFrame:

# Listen and stream live tweets
listener = MyStreamListener()
stream = tweepy.Stream(auth, listener)
stream.filter(track = ['US Election', 'election', 'trump', 'Mike Pence', 'biden', 'Kamala Harris', 'Donald Trump', 'Joe Biden'])
# Read the tweets into a list
tweets_data_path = 'tweets.txt'
tweets_data=[]
tweets_file_1 = open(tweets_data_path, 'r')
# Read in tweets and store in list: tweets_data
for line in tweets_file_1:
    tweet = json.loads(line)
    tweets_data.append(tweet)
# Close connection to file
tweets_file_1.close()
# Print the keys of the first tweet dict
print(tweets_data[0].keys())
# Read the data into a pandas DataFrame
names = tweets_data[0].keys()
df1 = pd.DataFrame(tweets_data, columns= names)

As mentioned, in order to collect 400,000 tweets, you will have to run the above code at least 20 times, saving the tweets collected in each run in a separate pandas DataFrame; these DataFrames will later be concatenated to consolidate all of the tweets into a single dataset.
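For illustration, the 20 runs could also be scripted instead of executed by hand. Here is a minimal sketch of that idea, assuming the MyStreamListener class and the auth object defined earlier; this loop is my own addition and not part of the original notebook:

# Hypothetical loop that repeats the streaming run 20 times (sketch, not from the original notebook)
keywords = ['US Election', 'election', 'trump', 'Mike Pence', 'biden', 'Kamala Harris', 'Donald Trump', 'Joe Biden']
chunk_frames = []
for i in range(20):
    listener = MyStreamListener()          # each listener stops itself after 20,000 tweets
    stream = tweepy.Stream(auth, listener)
    stream.filter(track=keywords)          # blocks until the listener returns False
    with open('tweets.txt', 'r') as f:     # the listener writes (and overwrites) tweets.txt
        tweets_data = [json.loads(line) for line in f]
    chunk_frames.append(pd.DataFrame(tweets_data))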

1.3 Combine all of the data chunks into a single dataset

# Concatenate dataframes into a single pandas dataframe
list_of_dataChunks = [df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12, df13, df14, df15, df16, df17, df18, df19, df20]
df = pd.concat(list_of_dataChunks, ignore_index=True)
# Export the dataset into a CSV file
df.to_csv('tweets.csv', index=False)

Now that you have exported your combined dataset into a CSV file, you can use it for the next steps of cleaning and visualizing the data.

I have made the dataset I created publicly available on Kaggle.


Step 2: Data wrangling

In this section, we will clean the data we just collected. Before the data can be visualized, the dataset must be cleaned and transformed into a format that can be visualized efficiently. Given a dataset with roughly 440,000 rows, we need an efficient way of reading and cleaning it. To do that, the chunksize parameter of pandas read_csv can be used to read the CSV file into a DataFrame in chunks. We can also specify just the columns we are interested in, rather than reading the dataset with all of its columns. With a chunksize and a smaller set of columns of interest, the large dataset can be read into a DataFrame efficiently and quickly, without having to resort to alternatives such as distributed computing with PySpark on a cluster.

To transform the dataset into a shape required for visualization, the following basic NLP techniques will be applied:

  1. Extract only the tweets that are in English
  2. Drop duplicates if any
  3. Drop missing values if any
  4. Tokenize (break the tweets into single words)
  5. Convert the words into lowercase
  6. Remove punctuations
  7. Remove stopwords
  8. Remove URLs, the word "twitter" and other acronyms

The approach I will follow to implement the above steps is as follows:

  1. Write a custom function that tokenizes the tweets
  2. Write another custom function that applies all of the above cleaning steps to the data
  3. Finally, read the data in chunks and apply these wrangling steps, via the custom functions, to each chunk of data as it is read

Let’s see all of these in action…

# Function to tokenize the tweets
def custom_tokenize(text):
    """Tokenize a tweet, defaulting to a blank string for missing values"""
    from nltk.tokenize import word_tokenize
    if not text:
        print('The text to be tokenized is a None type. Defaulting to blank string.')
        text = ''
    return word_tokenize(text)

# Function that applies the cleaning steps
def clean_up(data):
    """Clean a chunk of raw tweets into a shape that can be further used for modeling"""
    english = data[data['lang'] == 'en']                 # extract only tweets in English
    english = english.drop_duplicates()                  # drop duplicate tweets
    english = english.dropna(subset=['text'])            # drop any rows with missing tweets
    tokenized = english['text'].apply(custom_tokenize)   # tokenize tweets
    lower_tokens = tokenized.apply(lambda x: [t.lower() for t in x])           # convert tokens to lower case
    alpha_only = lower_tokens.apply(lambda x: [t for t in x if t.isalpha()])   # remove punctuation and numbers
    stop_words = set(stopwords.words('english'))
    no_stops = alpha_only.apply(lambda x: [t for t in x if t not in stop_words])  # remove stop words
    # remove the acronyms "rt" and "https" and the words "twitter" and "retweet"
    unwanted = {'rt', 'https', 'twitter', 'retweet'}
    no_stops = no_stops.apply(lambda x: [t for t in x if t not in unwanted])
    return no_stops

# Read and clean the data in chunks
warnings.filterwarnings("ignore")
use_cols = ['text', 'lang']  # specify the columns of interest
path = 'tweets.csv'          # path to the raw dataset
data_iterator = pd.read_csv(path, usecols=use_cols, chunksize=50000)
chunk_list = []
for data_chunk in data_iterator:
    filtered_chunk = clean_up(data_chunk)
    chunk_list.append(filtered_chunk)
tidy_data = pd.concat(chunk_list)

The chunksize in this case is 50,000, meaning pandas reads 50,000 tweets in each chunk and applies the cleaning steps to them before reading the next batch, and so on.

After this process, the dataset will be clean and ready for visualization. To avoid repeating the data wrangling steps each time you open your notebook, you can simply export the tidy data to an external file and use that in the future. For a large dataset, it is more efficient to export to a JSON file rather than a CSV.

# Export the tidy data to a JSON file for ease of use in the next steps
tidy_data.to_json('tidy_tweets.json', orient='table')

Here is what the tidy data looks like:

Step 3: Exploratory Data Analysis (Visualization)

Now that the data is clean, let’s visualize and understand the nature of our data. A few obvious things we can look at are as follows:

  1. Number of words in each tweet
  2. Average length of word in a tweet
  3. Unigram
  4. Bigram
  5. Trigram
  6. Wordcloud

It appears that the number of words per tweet ranges from 1 to 19, with most tweets falling between 10 and 12 words.

The average word length in a tweet appears to range from 3 to 14 characters, most commonly between 5 and 7 characters. People probably choose short words to express their opinions as best they can within the 280-character limit set by Twitter.
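The code behind these summary statistics is not shown in the article; a minimal sketch, assuming the tidy data is loaded back from the JSON file and holds one list of tokens per tweet in a 'text' column, could look like this:

# Sketch: words per tweet and average word length per tweet (assumes a 'text' column of token lists)
import pandas as pd

tidy_data = pd.read_json('tidy_tweets.json', orient='table')
words_per_tweet = tidy_data['text'].apply(len)
avg_word_length = tidy_data['text'].apply(
    lambda tokens: sum(len(t) for t in tokens) / len(tokens) if tokens else 0)
print(words_per_tweet.describe())
print(avg_word_length.describe())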

Unigram

As expected, the words "trump" and "biden" dominate the 2020 US election related tweets that were pulled between Oct 15 and Oct 16.

Bigram (most occurring pair of consecutive words)

Trigram (most occurring sequence of three words)
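The plotting code for the n-gram charts is likewise not included; a rough sketch of how the unigram, bigram and trigram counts could be computed, reusing the tidy_data frame from the sketch above, is:

# Sketch: most common unigrams, bigrams and trigrams across all tweets
from collections import Counter
from nltk import ngrams

all_tokens = [t for tokens in tidy_data['text'] for t in tokens]
print(Counter(all_tokens).most_common(10))             # unigrams
print(Counter(ngrams(all_tokens, 2)).most_common(10))  # bigrams
print(Counter(ngrams(all_tokens, 3)).most_common(10))  # trigrams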

Wordcloud

From visualizing the data, notice that the words are not lemmatized. Lemmatization is the process of reducing words to their base or dictionary form. It is a common technique in NLP and in machine learning in general. So in the next step, we are going to lemmatize the tokens with the following code.

# Convert tokens into the format required for lemmatization
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(word):
    """Map POS tag to the first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# Lemmatize tokens
lemmatizer = WordNetLemmatizer()
tidy_tweets['lemmatized'] = tidy_tweets['text'].apply(lambda x: [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in x])
# Convert the lemmatized words back to the text format
tidy_tweets['tokens_back_to_text'] = [' '.join(map(str, l)) for l in tidy_tweets['lemmatized']]

Now, let’s save the lemmatized tokens into another JSON file to make it easy to use in the next step in the pipeline.

# Export the lemmatized data to a JSON file for ease of use in the next steps
tidy_tweets.to_json('lemmatized.json', orient='table')

Approach

Before undertaking the preprocessing and modeling steps, let’s review and be clear about our approach for the rest of the pipeline. Before we can predict which category a tweet belongs to, we must first tag the raw tweets with categories. Remember, we streamed raw tweets from Twitter, so the data did not come labeled. Therefore, it is appropriate to implement the following approach:

  1. Label the dataset with k-means clustering algorithm
  2. Train deep learning models to predict the categories of the tweets
  3. Evaluate the models and identify potential improvements

Step 4: Labeling the unlabeled text data and preprocessing

In this section, the objective is to tag the tweets with two labels corresponding to positive or negative sentiment, and then to further preprocess and transform the labeled text data into a format that can be used to train deep learning models.

There are many different ways to categorize unlabeled text data; such methods include, but are not limited to, SVMs, hierarchical clustering, cosine similarity and even Amazon Mechanical Turk. In this example, I will show you a simpler, perhaps less accurate, quick-and-dirty way of categorizing the text data. First, I will conduct a sentiment analysis with VADER to determine whether the tweets are positive, negative or neutral. Next, I will use a simple k-means clustering algorithm to cluster the tweets based on the compound score drawn from how positive, negative and neutral each tweet is.

4.1 Create sentiments

Let’s look at the dataset first.

The column "tokens_back_to_text" contains the lemmatized tokens joined back into text format, and I am going to use this column from the tidy dataset to create the sentiments with SentimentIntensityAnalyzer from the VADER package.

# Extract lemmatized text into a list
tweets = list(df['tokens_back_to_text'])
# Create sentiments with SentimentIntensityAnalyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 
sid = SentimentIntensityAnalyzer()
sentiment = [sid.polarity_scores(tweet) for tweet in tweets]
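The polarity scores come back as a list of dictionaries. To inspect them, and to reuse the compound score for clustering in the next step, they can be loaded into a DataFrame; here is a small sketch (the variable names are illustrative):

# Sketch: put the VADER scores into a DataFrame and pull out the compound score
sentiment_df = pd.DataFrame(sentiment, index=df.index)
print(sentiment_df.head())
compound = sentiment_df[['compound']]  # 2-D input for k-means in the next step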

Here is what the first 5 rows of the sentiments look like:

4.2 Label the unlabeled data with k-means clustering

Now, I will take the "compound" column from the above DataFrame and feed it into a k-means clustering algorithm to categorize the tweets as 0 or 1, representing "negative" or "positive" sentiment respectively. That is, tweets with a compound value greater than or equal to 0.05 end up tagged as positive sentiment, while tweets with a value below 0.05 are tagged as negative. There is no hard rule here; it is just how I set up my experiment.

Here is how you can implement a text labeling job with the k-means clustering algorithm from scikit-learn in Python. Remember to give the same index to both the labels and the original DataFrame that holds your tweets/texts.

# Tag the tweets with labels using k-means clustering algorithm
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(compound)
labels = pd.DataFrame(kmeans.labels_, columns=['label'], index=df.index)
Label counts

Looking at the counts of labels 0 and 1, notice that the dataset is imbalanced: more than twice as many tweets are labeled 1 as 0. This will impact the performance of the model, so we have to balance our dataset prior to training our models.
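For reference, the label counts shown above can be reproduced with a quick check (a sketch, assuming the labels DataFrame from the previous step):

# Sketch: inspect the class balance of the k-means labels
print(labels['label'].value_counts())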

In addition, we could also identify the topics of the tweets in each category with the help of a powerful NLP algorithm called Latent Dirichlet Allocation, which could provide an intuition about the topics present in the negative and positive tweets. I will show this in a separate article at a later time. For now, let’s use the categories 0 and 1 for the sake of this exercise. We have now successfully converted our problem into a supervised learning problem, and next we will proceed to training deep learning models on the labeled text data.


Step 5: Modeling

We have a pretty large dataset, with over 400,000 tweets containing more than 60,000 unique words. Training RNNs with multiple hidden layers on such a large dataset is computationally taxing and may take days (if not weeks) if you attempt to train them on a CPU. One common approach for training deep learning models is to use GPU-optimized machines for higher training performance. In this exercise, we are going to use an Amazon SageMaker p2.xlarge instance that comes pre-loaded with the TensorFlow backend and CUDA. We will be using the Keras interface to TensorFlow.

Let’s get started; we will apply the following steps.

Training steps

  1. Tokenize, pad and sequence the dataset
  2. Balance the dataset with SMOTE
  3. Split the dataset into training and test sets
  4. Train SimpleRNN and LSTM models
  5. Evaluate models

The dataset must be transformed into a numerical format as machine learning algorithms do not understand natural language. Before vectorizing the data, let’s look at the text format of the data.

tweets.head()

5.1 Tokenize, pad and sequence the dataset

# prepare the tokenizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(tweets)
# integer encode the documents
sequences = tokenizer.texts_to_sequences(tweets)
# pad documents to a max length of 14 words
maxlen = 14
X = pad_sequences(sequences, maxlen=maxlen)

5.2 Balance the imbalanced data with SMOTE

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
# define pipeline
over = SMOTE(sampling_strategy=0.5)
under = RandomUnderSampler(sampling_strategy=0.8)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)
# transform the dataset
X, y = pipeline.fit_resample(X, labels['label'])
# One-hot encoding of labels
from keras.utils.np_utils import to_categorical
y = to_categorical(y)

As seen from the distribution of the labels after resampling, the data now looks fairly balanced compared to what it was before.

5.3 Split the data into training and test sets

Now that the data is balanced, we are ready to split the data into training and test sets. I am going to put away 30% of the dataset for testing.

# Split the dataset into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=43)

5.4 Training the RNNs

In this section, I will show you how to implement two variants of the RNN architecture: a 3-layer SimpleRNN and a 3-layer LSTM. The activation function is set to "tanh" by default for both SimpleRNN and LSTM layers, so let’s leave it at its default setting. I will use all 65,125 unique words as the size of the vocabulary, limit the maximum length of each input to 14 words to be consistent with the padded tweet length, and set the output dimension of the embedding matrix to 32.

SimpleRNN

Dropout layers will be used as a regularization measure to control overfitting. As my dataset is labeled with binary classes, I will use binary cross-entropy as the loss function. In terms of an optimizer, Adam is a good choice, and I will include accuracy as the metric. I will run 10 epochs on the training set, in which 70% of the training set is used to train the model while the remaining 30% is used for validation. This is not to be mixed up with the test set we kept aside.

# SimpleRNN
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dropout, Dense
from keras.constraints import max_norm as maxnorm

vocab_size = len(tokenizer.word_index) + 1  # vocabulary size (all unique words, plus 1 for padding)
output_dim = 32                             # dimension of the embedding vectors

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=output_dim, input_length=maxlen, embeddings_constraint=maxnorm(3)))
model.add(SimpleRNN(output_dim, return_sequences=True, kernel_constraint=maxnorm(3)))
model.add(Dropout(0.2))
model.add(SimpleRNN(output_dim, return_sequences=True, kernel_constraint=maxnorm(3)))
model.add(Dropout(0.2))
model.add(SimpleRNN(output_dim))
model.add(Dense(2, activation='softmax'))
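The compile and fit calls are not shown in the article; a minimal sketch matching the settings described above (binary cross-entropy loss, Adam optimizer, accuracy metric, 10 epochs, 30% validation split) could look like the following, where the batch size is my own assumption:

# Sketch: compile and fit with the settings described above (batch size is an assumed value)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
history = model.fit(X_train, y_train, epochs=10, batch_size=128, validation_split=0.3)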

The model summary is as follows:

The SimpleRNN model results over 10 epochs are shown below:

LSTM

A 3-layer LSTM model will be trained with dropout layers. Again, I will run 10 epochs on the training set, with 70% of the training set used to train the model and the remaining 30% used for validation; this is not to be mixed up with the test set we kept aside.

# LSTM
from keras.layers import LSTM

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=output_dim, input_length=maxlen, embeddings_constraint=maxnorm(3)))
model.add(LSTM(output_dim, return_sequences=True, kernel_constraint=maxnorm(3)))
model.add(Dropout(0.2))
model.add(LSTM(output_dim, return_sequences=True, kernel_constraint=maxnorm(3)))
model.add(Dropout(0.2))
model.add(LSTM(output_dim, kernel_constraint=maxnorm(3)))
model.add(Dense(2, activation='softmax'))

The model summary is shown as follows:

The LSTM model results over 10 epochs are shown below:


Step 6: Model evaluation

Now, let’s plot the models’ performances over time and look at their accuracies and losses across 10 epochs.
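The plotting code itself is not included in the article; a minimal sketch using matplotlib and the Keras History object returned by model.fit could look like this (older Keras versions name the keys 'acc'/'val_acc' instead of 'accuracy'/'val_accuracy'):

# Sketch: plot training vs. validation accuracy and loss from the Keras History object
import matplotlib.pyplot as plt

for metric in ['accuracy', 'loss']:
    plt.plot(history.history[metric], label='training ' + metric)
    plt.plot(history.history['val_' + metric], label='validation ' + metric)
    plt.xlabel('Epoch')
    plt.ylabel(metric.capitalize())
    plt.legend()
    plt.show()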

SimpleRNN: Accuracy

SimpleRNN: Loss

Notice from the training accuracy that the SimpleRNN model quickly starts overfitting, and the validation accuracy shows high variance for the same reason.

LSTM: Accuracy

LSTM: Loss

As seen from the LSTM accuracy and loss plots, this model is also overfitting, and the validation accuracy not only has high variance but is also dropping quickly for the same reason.

Conclusion

In this project, I attempted to demonstrate how to set up a deep learning pipeline that predicts the sentiments of the tweets related to the 2020 US election. To do that, I first created my own dataset by scraping raw tweets via Twitter API and Tweepy package.

Over 440,000 tweets were streamed via the Twitter API and stored in a CSV file. After wrangling and visualizing the data, a traditional clustering algorithm, k-means clustering in this case, was used to tag the tweets with two different labels representing positive or negative sentiment. That is, the problem was converted into a supervised learning problem before training the deep learning models on the data. Then the dataset was split into training and test sets.

Later, the training set was used to train the SimpleRNN and LSTM models, which were evaluated using the loss and accuracy curves from each training epoch. Overall, both models appear to learn as expected but are likely overfitting the data, judging from the accuracy plots, so I suggest the following recommendations for the next steps.

Recommendations:

  • Find another approach or a different learning algorithm to label the dataset
  • Try Amazon Mechanical Turk or Ground Truth to label the dataset
  • Try different RNN architectures
  • Perform more advanced hyperparameter tuning of the RNN architectures
  • Perform cross-validation
  • Turn the problem into a multi-class problem

Skills Practiced During This Project/Tutorial:

  1. How to efficiently collect data from Twitter via Tweepy and the Twitter API
  2. How to efficiently work with a large dataset
  3. How to build, compile and fit deep learning architectures in Keras
  4. How to apply basic NLP concepts and techniques to text data

Again, find my Jupyter notebooks with Python code in my GitHub here and let’s connect on LinkedIn.

Enjoy deep learning 🙂 I’d love to hear your feedback and suggestions, so please either use the clap button or comment in the section below.

Thank you!
