Tweet Preprocessing!

Basic Tweet Preprocessing in Python

Learn how to preprocess tweets using Python

Parthvi Shah
Towards Data Science
5 min read · May 19, 2020


Image source: https://hdqwalls.com/astronaut-hanging-on-moon-wallpaper

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.

Just to give you a little background as to why I am preprocessing tweets: given the situation as of May 2020, I am interested in the political discourse of US governors with respect to the ongoing pandemic. I would like to analyse how the two parties, Republican and Democratic, reacted to COVID-19. What were their main goals at this time? Who focused more on what? What did they care about the most?

After collecting tweets from the governors of all the states, starting from the day of the first COVID-19 case, we merged them into a DataFrame (How to merge various JSON files into a DataFrame) and performed preprocessing.

We had a total of ~30,000 tweets. A tweet carries a lot of opinion about the topic it discusses, but raw, unprocessed tweets are highly unstructured and contain redundant information. To overcome these issues, preprocessing is performed in multiple steps.

Almost every social media site is known for the topics its users discuss in the form of hashtags. Particularly for our case, hashtags played an important part since we were interested in #Covid19, #Coronavirus, #StayHome, #InThisTogether, etc. Hence, the first step was forming a separate feature based on the hashtag values and segmenting them.

1. Hashtag Extraction using Regex

List of all hashtags added to a new column as a new feature ‘hashtag’

import re
#extracting all hashtags into a new column 'hashtag'
tweets['hashtag'] = tweets['tweet_text'].apply(lambda x: re.findall(r"#(\w+)", x))
After Hashtag Extraction
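As a quick check, the pattern captures every word that follows a # (the sample tweet here is invented purely for illustration):

#quick check of the hashtag pattern on a made-up tweet
sample = "Stay safe and #StayHome during #Covid19"
print(re.findall(r"#(\w+)", sample))
#-> ['StayHome', 'Covid19']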

However, hashtags with more than one word had to be segmented. We segmented those hashtags into their component words using the library ekphrasis.

#installing ekphrasis
!pip install ekphrasis

After installing it, I selected a segmenter built on a Twitter corpus:

#segmenter using the word statistics from Twitter
from ekphrasis.classes.segmenter import Segmenter
seg_tw = Segmenter(corpus="twitter")
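As a quick sanity check, the segmenter splits a concatenated hashtag into separate words (the hashtag below is just an illustrative input, and the exact output depends on the corpus statistics):

#example: segmenting a multi-word hashtag
print(seg_tw.segment("inthistogether"))
#should give something like 'in this together'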

For cleaning the tweet text itself, the most useful package I found was tweet-preprocessor, a tweet preprocessing library in Python.

It deals with —

  • URLs
  • Mentions
  • Reserved words (RT, FAV)
  • Emojis
  • Smileys
#installing tweet-preprocessor
!pip install tweet-preprocessor
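As a small illustration on an invented tweet, p.clean() strips the RT marker, the mention and the URL (the handle and link below are made up):

import preprocessor as p
#mentions, URLs and reserved words like RT are removed by p.clean()
print(p.clean("RT @GovExample Stay home, stay safe https://example.com"))
#roughly: 'Stay home, stay safe'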

2. Text Cleaning (URLs, Mentions, etc.)

Adding the cleaned tweets (after removal of URLs and mentions) to a new column as a new feature ‘text’

Cleaning is done using the tweet-preprocessor package.

#forming a separate feature for cleaned tweets
import preprocessor as p
for i, v in enumerate(tweets['tweet_text']):
    tweets.loc[i, 'text'] = p.clean(v)
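The same cleaning can also be written as a single vectorized apply, which is a bit more idiomatic pandas (just an alternative sketch of the loop above):

#equivalent to the loop above, as a vectorized apply
tweets['text'] = tweets['tweet_text'].apply(p.clean)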

3. Tokenization, Removal of Digits, Stop Words and Punctuation

Further preprocessing of the new feature ‘text’

NLTK (Natural Language Toolkit) is one of the best libraries for preprocessing text data.

#important libraries for preprocessing using NLTK
import nltk
from nltk import word_tokenize, FreqDist
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer
nltk.download('wordnet')
nltk.download('stopwords')
  • Remove digits and lowercase the text (makes it easier to deal with)
#remove digits and lowercase; regex=True tells pandas to treat '\d+' as a pattern
data = data.astype(str).str.replace(r'\d+', '', regex=True)
lower_text = data.str.lower()
  • Remove punctuation
def remove_punctuation(words):
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words
  • Lemmatization + Tokenization — used the built-in TweetTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()
w_tokenizer = TweetTokenizer()
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
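To see what the tokenizer and lemmatizer do together, here is a tiny check on an invented, already-lowercased sentence:

#tokenize with TweetTokenizer, then lemmatize each token
print(lemmatize_text("governors closing schools across the states"))
#roughly: ['governor', 'closing', 'school', 'across', 'the', 'state']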

The last preprocessing step is

  • Removing stop words — NLTK ships a pre-defined stop word list for English. However, you can modify the stop word list by simply appending your own words to it.
stop_words = set(stopwords.words('english'))
tweets['text'] = tweets['text'].apply(lambda x: [item for item in x if item not in stop_words])
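For example, tokens that are frequent on Twitter but carry little signal (the additions below are purely illustrative) can be added to the set before filtering:

#extend the default English stop word set with custom tokens
custom_stops = {'amp', 'rt', 'u'}
stop_words = set(stopwords.words('english')) | custom_stops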

4. Word Cloud

Frequency Distribution of the Segmented Hashtags

After the preprocessing steps, we excluded all place names and abbreviations from the tweets, because they acted as leakage variables (a sketch of this filtering follows). We then performed a frequency distribution of the most frequently occurring hashtags and created a word cloud.
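A minimal sketch of how that exclusion could look, assuming a hypothetical set of place names and abbreviations (only a few illustrative entries shown here):

#hypothetical lookup of place names/abbreviations to drop (illustrative subset)
place_names = {'california', 'ca', 'texas', 'tx'}
tweets['text'] = tweets['text'].apply(lambda toks: [t for t in toks if t not in place_names])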

The resulting word cloud, produced by the code below, was quite expected.

#Frequency of words
from wordcloud import WordCloud
import matplotlib.pyplot as plt
#drop rows without hashtags (no 'Segmented#' value)
fdist = FreqDist(tweets['Segmented#'].dropna())
#WordCloud
wc = WordCloud(width=800, height=400, max_words=50).generate_from_frequencies(fdist)
plt.figure(figsize=(12, 10))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
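Since FreqDist is essentially a Counter, the top segmented hashtags can also be inspected as plain numbers before (or instead of) plotting:

#print the ten most frequent segmented hashtag strings
for tag, count in fdist.most_common(10):
    print(tag, count)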

The final dataset —

The final code —

import pandas as pd
import numpy as np
import json
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import re, string, unicodedata
import nltk
from nltk import word_tokenize, sent_tokenize, FreqDist
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer
from nltk.tokenize import TweetTokenizer
nltk.download('wordnet')
nltk.download('stopwords')
!pip install ekphrasis
!pip install tweet-preprocessor
import preprocessor as p
from ekphrasis.classes.segmenter import Segmenter

#hashtag extraction into a new feature 'hashtag'
tweets['hashtag'] = tweets['tweet_text'].apply(lambda x: re.findall(r"#(\w+)", x))

#cleaning URLs, mentions and reserved words into a new feature 'text'
for i, v in enumerate(tweets['tweet_text']):
    tweets.loc[i, 'text'] = p.clean(v)

def preprocess_data(data):
    #removes numbers and lowercases the text
    data = data.astype(str).str.replace(r'\d+', '', regex=True)
    lower_text = data.str.lower()
    lemmatizer = nltk.stem.WordNetLemmatizer()
    w_tokenizer = TweetTokenizer()

    def lemmatize_text(text):
        return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

    def remove_punctuation(words):
        new_words = []
        for word in words:
            new_word = re.sub(r'[^\w\s]', '', word)
            if new_word != '':
                new_words.append(new_word)
        return new_words

    words = lower_text.apply(lemmatize_text)
    words = words.apply(remove_punctuation)
    return pd.DataFrame(words)

pre_tweets = preprocess_data(tweets['text'])
tweets['text'] = pre_tweets

#stop word removal
stop_words = set(stopwords.words('english'))
tweets['text'] = tweets['text'].apply(lambda x: [item for item in x if item not in stop_words])

#segmenter using the word statistics from Twitter
seg_tw = Segmenter(corpus="twitter")
for i in range(len(tweets)):
    if tweets['hashtag'][i] != []:
        listToStr1 = ' '.join([str(elem) for elem in tweets['hashtag'][i]])
        tweets.loc[i, 'Segmented#'] = seg_tw.segment(listToStr1)

#frequency of segmented hashtags and word cloud
fdist = FreqDist(tweets['Segmented#'].dropna())
wc = WordCloud(width=800, height=400, max_words=50).generate_from_frequencies(fdist)
plt.figure(figsize=(12, 10))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

Hope I helped y’all.

Text classification in general works better if the text is preprocessed well. Give it some extra time; it will all be worth it in the end.
