Scraping Tweets and Analyzing Social Sentiments

Amardeep Chauhan
8 min read · Nov 23, 2018

Natural Language Processing (NLP) is a fascinating and broad field of Artificial Intelligence. Here I am going to use it for processing text data, and I will give you a crash course on Scraping and Sentiment Analysis. For scraping I have used Selenium and tweepy, and for Sentiment Analysis I have used NLTK classes and methods and a Naive Bayes model. I have tried my best to cover most of the steps that should be performed while working on a text data set, and let me assure you that it will be worth your time.

So what is Scraping and Sentiment Analysis?

Scraping — the process of collecting small fragments of something. In our case it is web scraping, so here we are extracting fragments of information available on a website.

Sentiment Analysis — you can deduce from the term itself that it is the process of analyzing people’s views or opinions on some subject. The subject can be anything: a product, a movie, a political or social issue, a technology, an event, or some kind of trend.

People usually prefer social media to express their views and opinions, be it Facebook, Twitter, Quora or any other blogging site. In this tutorial, I am going to use Twitter as the source of information, and the subject I have chosen is ‘AI and Deep Learning’, though the code I will be sharing is completely generic, so you can choose any other interesting topic as well.

From the title and the description above, you must have figured out that the text data required to perform Sentiment Analysis needs to be scraped from Twitter. So below are the major operations that I am going to perform:

1. Scraping Tweets

2. Identifying Sentiments

3. Text Pre-processing

4. Feature Extraction

5. Model Building

1. Scraping Tweets

If you have performed scraping in Python before, then you must have used ‘Requests’ and ‘Beautiful Soup’; for those who have not heard of these before, Requests is a Python HTTP library for sending HTTP requests, and Beautiful Soup is an HTML parser for parsing the DOM and getting the desired information out of it. But we cannot use these libraries to scrape tweets from Twitter, because tweets are generated dynamically and loaded progressively as you scroll. Now we are left with two options:

a). Selenium

b). tweepy python library

I will show you the implementation of both. By the way, Selenium is a browser automation tool usually used for testing web pages, and tweepy, as I mentioned, is a Python library that provides access to the various Twitter APIs.

a). Scraping using Selenium:

Assuming you have already imported numpy and pandas. Below is the SeleniumClient class which will perform scraping:

import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class SeleniumClient(object):
    def __init__(self):
        # Initialization method: configure a headless Chrome browser
        self.chrome_options = webdriver.ChromeOptions()
        self.chrome_options.add_argument('--headless')
        self.chrome_options.add_argument('--no-sandbox')
        self.chrome_options.add_argument('--disable-setuid-sandbox')

        # you need to provide the path of chromedriver in your system
        self.browser = webdriver.Chrome('D:/chromedriver_win32/chromedriver', options=self.chrome_options)

        self.base_url = 'https://twitter.com/search?q='

    def get_tweets(self, query):
        '''
        Function to fetch tweets for the given search query.
        '''
        try:
            self.browser.get(self.base_url + query)
            time.sleep(2)

            body = self.browser.find_element_by_tag_name('body')

            # scroll down repeatedly so that more tweets get loaded
            for _ in range(3000):
                body.send_keys(Keys.PAGE_DOWN)
                time.sleep(0.3)

            timeline = self.browser.find_element_by_id('timeline')
            tweet_nodes = timeline.find_elements_by_css_selector('.tweet-text')

            return pd.DataFrame({'tweets': [tweet_node.text for tweet_node in tweet_nodes]})

        except Exception as e:
            print(f"Selenium - An error occurred while fetching tweets: {e}")

In the above code, you need to specify the path of the desired browser’s webdriver. Alternatively, you can add the driver’s location to your PATH environment variable and skip the path argument in webdriver.Chrome().
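For example, here is a minimal sketch assuming chromedriver is already available on your PATH (no explicit executable path is passed):

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')

# chromedriver is resolved from the PATH environment variable
browser = webdriver.Chrome(options=chrome_options)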

You can use this class:

selenium_client = SeleniumClient()

tweets_df = selenium_client.get_tweets('AI and Deep learning')

In tweets_df, you will get a data-frame containing all the scraped tweets.

b). Fetching tweets using tweepy:

We can create a TwitterClient class:

import pandas as pd
import tweepy
from tweepy import OAuthHandler

class TwitterClient(object):
    def __init__(self):
        # Access Credentials (obtained from the Twitter developer console)
        consumer_key = 'XXXX'
        consumer_secret = 'XXXX'
        access_token = 'XXXX'
        access_token_secret = 'XXXX'
        try:
            # OAuthHandler object
            auth = OAuthHandler(consumer_key, consumer_secret)
            # set access token and secret
            auth.set_access_token(access_token, access_token_secret)
            # create tweepy API object to fetch tweets
            self.api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

        except tweepy.TweepError as e:
            print(f"Error: Twitter Authentication Failed - \n{str(e)}")

    # Function to fetch tweets
    def get_tweets(self, query, maxTweets=1000):
        # empty list to store parsed tweets
        tweets = []
        sinceId = None
        max_id = -1
        tweetCount = 0
        tweetsPerQry = 100

        while tweetCount < maxTweets:
            try:
                if (max_id <= 0):
                    if (not sinceId):
                        new_tweets = self.api.search(q=query, count=tweetsPerQry)
                    else:
                        new_tweets = self.api.search(q=query, count=tweetsPerQry,
                                                     since_id=sinceId)
                else:
                    if (not sinceId):
                        new_tweets = self.api.search(q=query, count=tweetsPerQry,
                                                     max_id=str(max_id - 1))
                    else:
                        new_tweets = self.api.search(q=query, count=tweetsPerQry,
                                                     max_id=str(max_id - 1),
                                                     since_id=sinceId)
                if not new_tweets:
                    print("No more tweets found")
                    break

                for tweet in new_tweets:
                    parsed_tweet = {}
                    parsed_tweet['tweets'] = tweet.text

                    # appending parsed tweet to tweets list
                    if tweet.retweet_count > 0:
                        # if tweet has retweets, ensure that it is appended only once
                        if parsed_tweet not in tweets:
                            tweets.append(parsed_tweet)
                    else:
                        tweets.append(parsed_tweet)

                tweetCount += len(new_tweets)
                print("Downloaded {0} tweets".format(tweetCount))
                max_id = new_tweets[-1].id

            except tweepy.TweepError as e:
                print("Tweepy error : " + str(e))
                break

        return pd.DataFrame(tweets)

In the above code, we need ‘Access Credentials’ to make API calls. These can be obtained from Twitter’s developer console; you just need to register your app and give valid reasons to get access. This class can be used in the same way as SeleniumClient, and in response we will get a data-frame containing all the fetched tweets.
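For instance, a minimal usage sketch (the query string and tweet count here are only illustrative):

twitter_client = TwitterClient()

# fetch tweets on the same topic we used with SeleniumClient
tweets_df = twitter_client.get_tweets('AI and Deep learning', maxTweets=7000)
print(f'Fetched tweets: {tweets_df.shape[0]}')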

Which one should you use?

Yes, it is an obvious question. The answer is tweepy, because it is faster and more reliable. However, if you don’t have access credentials for the Twitter API and don’t want to wait for Twitter’s approval, you can go with SeleniumClient. It is always good to know more than one approach to a task.

2. Identifying Sentiment type

Sentiment type is simply the overall reaction: it can be positive, negative or neutral. In our case, we are only going to consider positive (which includes neutral) and negative.

Q. Why should we identify Sentiment type?

Because eventually, we will be training a model which should be capable of classifying negative and positive sentiments on tweets. For this classification, we will be using some supervised learning model, so we need to have a target variable. Sentiment type is going to be our target variable.

There are two ways to identify the sentiment type:

a. Using NLTK’s SentimentIntensityAnalyzer (we’ll refer to it as SIA)
b. Using TextBlob

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

# the VADER lexicon is needed by SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

def fetch_sentiment_using_SIA(text):
    sid = SentimentIntensityAnalyzer()
    polarity_scores = sid.polarity_scores(text)
    if polarity_scores['neg'] > polarity_scores['pos']:
        return 'negative'
    else:
        return 'positive'

def fetch_sentiment_using_textblob(text):
    analysis = TextBlob(text)
    # set sentiment based on the polarity score
    if analysis.sentiment.polarity >= 0:
        return 'positive'
    else:
        return 'negative'

We can choose either of them; I personally prefer TextBlob, as it gives better categorization.
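The later steps assume a ‘sentiment’ column in our data-frame, so here is a minimal sketch of how it could be created with the TextBlob-based function (the column name is just a choice; anything consistent works):

# label every tweet with its sentiment type; this becomes our target variable later
tweets_df['sentiment'] = tweets_df['tweets'].apply(fetch_sentiment_using_textblob)
print(tweets_df['sentiment'].value_counts())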

3. Text Pre-processing

Text obtained from tweets is not clean enough to be used for model training, so it needs to be pre-processed first. We may not be able to make it completely clean, but we should pre-process it as much as possible.

a. Removing ‘@names’:

All the ‘@username’ mentions are of no use, since they don’t convey any meaning.

import re
import numpy as np

def remove_pattern(text, pattern_regex):
    # find every match of the pattern and strip it from the text
    r = re.findall(pattern_regex, text)
    for i in r:
        text = re.sub(i, '', text)

    return text

# We are keeping cleaned tweets in a new column called 'tidy_tweets'
tweets_df['tidy_tweets'] = np.vectorize(remove_pattern)(tweets_df['tweets'], "@[\w]*: | *RT*")

b. Removing links (http | https)

Links in the text are of no use either, because they don’t convey any useful information.

cleaned_tweets = []

for index, row in tweets_df.iterrows():
    # Here we are filtering out all the words that contain a link
    words_without_links = [word for word in row.tidy_tweets.split() if 'http' not in word]
    cleaned_tweets.append(' '.join(words_without_links))

tweets_df['tidy_tweets'] = cleaned_tweets

c. Dropping duplicate rows

We may have duplicate tweets in our data-frame; they need to be taken care of:

# keep=False drops every copy of a duplicated tweet; reassign so the change takes effect
tweets_df = tweets_df.drop_duplicates(subset=['tidy_tweets'], keep=False)

d. Removing Punctuations, Numbers and Special characters

tweets_df['absolute_tidy_tweets'] = tweets_df['tidy_tweets'].str.replace("[^a-zA-Z# ]", "")

This step should be skipped if we also want to do sentiment analysis on key phrases, because the semantic meaning of a sentence needs to be preserved for that. So here we create an additional column ‘absolute_tidy_tweets’, which will contain absolutely tidy words that can be used further for sentiment analysis on key words.

e. Removing Stop Words

Stop words are words that are used only for the sake of correct sentence formation. They don’t carry any meaningful information, so they need to be removed to make our text records cleaner.

from nltk.corpus import stopwords

nltk.download('stopwords')
stopwords_set = set(stopwords.words("english"))
cleaned_tweets = []

for index, row in tweets_df.iterrows():

    # filtering out all the stopwords
    words_without_stopwords = [word for word in row.absolute_tidy_tweets.split() if word not in stopwords_set]

    # re-joining the remaining words into a cleaned tweet
    cleaned_tweets.append(' '.join(words_without_stopwords))

tweets_df['absolute_tidy_tweets'] = cleaned_tweets

f. Tokenization and lemmatization:

from nltk.stem import WordNetLemmatizer

# the WordNet corpus is needed for lemmatization
nltk.download('wordnet')

# Tokenization
tokenized_tweet = tweets_df['absolute_tidy_tweets'].apply(lambda x: x.split())

# Finding the lemma for each word
word_lemmatizer = WordNetLemmatizer()
tokenized_tweet = tokenized_tweet.apply(lambda x: [word_lemmatizer.lemmatize(i) for i in x])

# joining words back into sentences (where they came from)
for i, tokens in enumerate(tokenized_tweet):
    tokenized_tweet[i] = ' '.join(tokens)

tweets_df['absolute_tidy_tweets'] = tokenized_tweet

4. Feature Extraction

We need to convert the textual representation into numeric features. There are two popular techniques for feature extraction:

  1. Bag of words (Simple vectorization)
  2. TF-IDF (Term Frequency — Inverse Document Frequency)

We will use the features extracted by each technique, one by one, to perform sentiment analysis, and compare the results at the end.
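As a quick illustration, here is a toy sketch on two made-up sentences (independent of our tweets data) showing what the two techniques produce:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["deep learning is amazing", "deep learning needs lots of data"]

# Bag of words: raw term counts per document
bow = CountVectorizer()
print(bow.fit_transform(toy_docs).toarray())
print(bow.vocabulary_)

# TF-IDF: counts re-weighted so that terms appearing in every document
# (like 'deep' and 'learning') get a lower weight
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(toy_docs).toarray())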

Check out my kernel below to properly understand the intuition behind these feature extraction techniques, with examples:
https://www.kaggle.com/amar09/text-pre-processing-and-feature-extraction

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# BOW features
bow_word_vectorizer = CountVectorizer(max_df=0.90, min_df=2, stop_words='english')
# bag-of-words feature matrix
bow_word_feature = bow_word_vectorizer.fit_transform(tweets_df['absolute_tidy_tweets'])

# TF-IDF features
tfidf_word_vectorizer = TfidfVectorizer(max_df=0.90, min_df=2, stop_words='english')
# TF-IDF feature matrix
tfidf_word_feature = tfidf_word_vectorizer.fit_transform(tweets_df['absolute_tidy_tweets'])

5. Model Building

Let’s map the target variable to {0, 1} first.

target_variable = tweets_df['sentiment'].apply(lambda x: 0 if x=='negative' else 1 )

We are going to use a Naive Bayes model for sentiment classification. I tried SVM, Logistic Regression and Decision Tree as well, but got the best results with Naive Bayes.

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def naive_model(X_train, X_test, y_train, y_test):
    # GaussianNB works on dense arrays, so the sparse feature matrices are converted
    naive_classifier = GaussianNB()
    naive_classifier.fit(X_train.toarray(), y_train)

    # predictions over test set
    predictions = naive_classifier.predict(X_test.toarray())

    # calculating f1 score
    print(f'F1 Score - {f1_score(y_test, predictions)}')

Training for features extracted using Bag of Words:

X_train, X_test, y_train, y_test = train_test_split(bow_word_feature, target_variable, test_size=0.3, random_state=870)
naive_model(X_train, X_test, y_train, y_test)

It gives F1 Score — 0.9387254901960784

Now let’s train on the features extracted with TF-IDF:

X_train, X_test, y_train, y_test = train_test_split(tfidf_word_feature, target_variable, test_size=0.3, random_state=870)
naive_model(X_train, X_test, y_train, y_test)

I got F1 Score — 0.9400244798041616

TF-IDF features give a slightly better score.
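As a final sanity check, here is a sketch of using the same setup to classify new, unseen text. The example tweets are made up, the classifier is refit because naive_model does not return it, and in practice the new text should get the same pre-processing as the training tweets:

# refit the classifier on the full TF-IDF feature matrix
naive_classifier = GaussianNB()
naive_classifier.fit(tfidf_word_feature.toarray(), target_variable)

# made-up examples, just to illustrate prediction on unseen text
new_tweets = ["Deep learning keeps making healthcare better",
              "All this AI hype is getting really annoying"]
new_features = tfidf_word_vectorizer.transform(new_tweets)

print(naive_classifier.predict(new_features.toarray()))  # 1 = positive, 0 = negative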

Conclusion: For sentiment analysis we have used only ‘key words’ here; we could use ‘key phrases’ as well. There are many other steps we could perform; to learn about them in detail, check out my complete kernel.

In this post, I have assumed that you already have basic knowledge of text processing using NLTK. If you don’t, I suggest you go through the kernel linked above, which explains all the basic text operations and gives a detailed explanation of Feature Extraction using BOW and TF-IDF.

Please comment if you want more explanation on anything. All suggestions and feedback are always welcome.

Thanks for reading, Happy Learning ;)

