
Visualisation of Information from Raw Twitter Data – Part 2

Want to analyse user activity, check whether certain users are Bots, build a time series of the tweet publications and much more? Read on then!

The previous post covered how to download data from Twitter on a certain topic, how to prepare this data in a Jupyter Notebook, how to discover insights from it, and some very cool visualisation techniques. If you have not read it, you can find it here:

Visualization of Information from Raw Twitter Data – Part 1

This second post will describe other awesome visualisations, while also exploring some more information that can be obtained from the downloaded tweets.

We will start by discovering information about the users who are posting the tweets:

#Let's take a look at the users who are posting these tweets:
print("There are {} different users".format(tweets['Username'].nunique()))

In my case, the tweets were posted by 59,508 different users.

Using our neatly prepared dataframe, we can see which users posted the most tweets, and something even cooler: estimate the chance that these highly active users are Bots!

#Going to see which users have tweeted or retweeted the most,
#and how likely it is that they are Bots
usertweets = tweets.groupby('Username')
#Taking the top 25 tweeting users
top_users = usertweets.count()['text'].sort_values(ascending = False)[:25]
top_users_dict = top_users.to_dict()
user_ordered_dict = sorted(top_users_dict.items(), key=lambda x: x[1])
user_ordered_dict = user_ordered_dict[::-1]
#Now, like in the previous hashtag and mention cases, make two lists:
#one with the usernames and one with the tweet counts
dict_values = []
dict_keys = []
for item in user_ordered_dict[0:25]:
    dict_keys.append(item[0])
    dict_values.append(item[1])

This block of code is very similar to the ones we used in the previous post to find the most used hashtags and the most mentioned users. Now, like in those earlier cases, we will plot the results.
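As an aside, pandas can produce the same ranking much more compactly with value_counts, which returns the counts already sorted in descending order. A minimal sketch that yields the same dict_keys and dict_values as the block above:

#Compact equivalent: value_counts already sorts in descending order
top_users = tweets['Username'].value_counts()[:25]
dict_keys = list(top_users.index)
dict_values = list(top_users.values)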

#Plot these results
fig = plt.figure(figsize = (15,15))
index = np.arange(25)
plt.bar(index, dict_values, edgecolor = 'black', linewidth=1)
plt.xlabel('Most active Users', fontsize = 18)
plt.ylabel('Nº of Tweets', fontsize=20)
plt.xticks(index,dict_keys, fontsize=15, rotation=90)
plt.title('Number of tweets for the most active users', fontsize = 20)
plt.savefig('Tweets_of_active_users.jpg')
plt.show()
Bar chart of the number of tweets produced by the top 25 tweeting users.

As we can see, the most active user is @CryptoKaku, with more than 400 posted tweets. That is a lot! Is he/she a Bot? Let's check it out!

For this, we need to download and import the Botometer Python library and get a key to be able to use their API. Information on how to do this can be found at the following link:

Botometer API Documentation (OSoMe) | RapidAPI

We will also need our Twitter API keys, as Botometer uses them to access the information from the accounts whose activity we want to study.

First, we import both libraries. Note that Botometer and Tweepy both have to be installed beforehand, using a package manager of your choice.

#Now we will see the probability of each of the users being a Bot,
#using the Botometer API:
import botometer
import tweepy

After this, we will input the API keys that are needed:

#Key from BOTOMETER API
mashape_key = "ENTER BOTOMETER API KEY"
#Dictionary with the credentials for the Twitter APIs
twitter_app_auth = {
    'access_token' : "ENTER ACCESS TOKEN",
    'access_token_secret' : "ENTER ACCESS TOKEN SECRET",
    'consumer_key' : "ENTER CONSUMER KEY",
    'consumer_secret' : "ENTER CONSUMER SECRET",    
}

As in the previous posts, replace the ‘ENTER…‘ placeholders with the corresponding keys and you’re good to go.
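If you prefer not to paste keys directly into the notebook, they can be read from environment variables instead. A small sketch (the variable names here are hypothetical; use whatever names you exported):

import os
#Read the credentials from environment variables instead of hardcoding
#them (hypothetical variable names)
mashape_key = os.environ["BOTOMETER_KEY"]
twitter_app_auth = {
    'access_token' : os.environ["TWITTER_ACCESS_TOKEN"],
    'access_token_secret' : os.environ["TWITTER_ACCESS_TOKEN_SECRET"],
    'consumer_key' : os.environ["TWITTER_CONSUMER_KEY"],
    'consumer_secret' : os.environ["TWITTER_CONSUMER_SECRET"],
}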

Run the following block of code to access the Botometer API, and let's see which of the top 25 tweeting users have the highest chance of being Bots!

#Connecting to the botometer API
bom = botometer.Botometer(wait_on_ratelimit = True, mashape_key = mashape_key, **twitter_app_auth)
#Returns a dictionary with the most active users and the percentage
#chance of each of them being a Bot, according to Botometer
bot_dict = {}
top_users_list = dict_keys
for user in top_users_list:
    user = '@' + user
    try:
        result = bom.check_account(user)
        bot_dict[user] = int((result['scores']['english'])*100)
    except tweepy.TweepError:
        bot_dict[user] = 'None'
        continue

The output of this block is a dictionary (bot_dict) where the keys are the names of the accounts we are checking and the values are scores derived from Botometer. The raw score is a number between 0 and 1 depicting the probability of the user being a Bot, which our code converts to a percentage. It takes into account factors like the ratio of followers to followees, the description of the account, the frequency of publications, the type of publications, and other parameters.

For some users, the Botometer API returns a rejected-request error, so these will have ‘None‘ as their value.

For me, checking bot_dict gives the following results:

{'@CryptoKaku': 25,
 '@ChrisWill1337': 'None',
 '@Doozy_45': 44,
 '@TornadoNewsLink': 59,
 '@johnnystarling': 15,
 '@brexit_politics': 42,
 '@lauramarsh70': 32,
 '@MikeMol1982': 22,
 '@EUVoteLeave23rd': 66,
 '@TheStephenRalph': 11,
 '@DavidLance3': 40,
 '@curiocat13': 6,
 '@IsThisAB0t': 68,
 '@Whocare31045220': 'None',
 '@EUwatchers': 34,
 '@c_plumpton': 15,
 '@DuPouvoirDachat': 40,
 '@botcotu': 5,
 '@Simon_FBFE': 42,
 '@CAGeurope': 82,
 '@botanic_my': 50,
 '@SandraDunn1955': 36,
 '@HackettTom': 44,
 '@shirleymcbrinn': 13,
 '@JKLDNMAD': 20}
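Rather than scanning the dictionary by eye, we can pick out the highest score programmatically, skipping the accounts whose request failed. A quick sketch:

#Find the account with the highest bot score, ignoring the 'None' entries
numeric_scores = {user: score for user, score in bot_dict.items() if score != 'None'}
most_bot_like = max(numeric_scores, key=numeric_scores.get)
print(most_bot_like, numeric_scores[most_bot_like])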

Out of these, the account with the highest chance of being a Bot is @CAGeurope, with a probability of 82%. Let's check out this account to see why Botometer assigns it such a high probability of being a Bot.

Twitter account of @CAGeurope

It looks like a legitimate account; however, there are various reasons why Botometer gave it such a high probability of being a Bot. First, the account follows almost three times as many accounts as follow it. Secondly, if we look at the periodicity of its publications, it consistently produces several tweets every hour, sometimes at 5-minute intervals, which is a LOT of tweets. Lastly, the content of its tweets is always very similar: a short text, a URL, and some hashtags.
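We can verify the first point directly with Tweepy, reusing the credentials defined earlier. A sketch, assuming the Tweepy 3.x API:

#Check the ratio of accounts followed vs. followers for a given user
auth = tweepy.OAuthHandler(twitter_app_auth['consumer_key'],
                           twitter_app_auth['consumer_secret'])
auth.set_access_token(twitter_app_auth['access_token'],
                      twitter_app_auth['access_token_secret'])
api = tweepy.API(auth)
user = api.get_user(screen_name='CAGeurope')
print("Follows {} accounts, followed by {}".format(user.friends_count,
                                                   user.followers_count))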

In case you don’t want to code anything or get an API key, Botometer also offers a web-based solution, where you can likewise check the probability of an account being a Bot:

Web-based solution offered by Botometer

Looks like I’m going to have to stop spamming the retweet button and mass following people in order to make my Twitter account more human-like 😛

Cool! We could obtain a lot more information about the users through the ‘user‘ object in each tweet’s JSON; however, that will be left for a different post.

Now, let's make a time series of the tweet publications, so we can see on which days more tweets about the chosen topic were produced, and try to find out which events caused these spikes in tweet production.

We will plot the number of tweets published on each day of a specific month. To cover a longer period of time, some additional code is needed (see the sketch after the grouping code below).

First, we need to modify the ‘Timestamp‘ field of our dataframe to convert it to a Datetime object, using Pandas' built-in function to_datetime.

tweets['Timestamp'] = pd.to_datetime(tweets['Timestamp'], infer_datetime_format = True, utc = False)

Then, we create a function that returns the day of the Datetime object and apply it to our ‘Timestamp’ field, creating a new dataframe column that stores the day when each tweet was published. We then group the tweets by day, count the number of tweets produced on each day (using the ‘text’ field), and create a dictionary (timedict) with the results, where the keys are the day-of-month numbers and the values are the number of tweets published on that day.

def giveday(timestamp):
    #Extract the day of the month from the Datetime object
    return timestamp.day

tweets['day'] = tweets['Timestamp'].apply(giveday)
days = tweets.groupby('day')
daycount = days['text'].count()
timedict = daycount.to_dict()
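If your collection spans more than one month, grouping by the day number would mix days from different months. A minimal sketch that groups by the full calendar date instead, producing a timedict that works with the same plotting code below:

#Group by the full date rather than the day of the month
tweets['date'] = tweets['Timestamp'].dt.date
timedict = tweets.groupby('date')['text'].count().to_dict()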

After doing this, we are ready to plot our results!

fig = plt.figure(figsize = (15,15))
plt.plot(list(timedict.keys()), list(timedict.values()))
plt.xlabel('Day of the month', fontsize = 12)
plt.ylabel('Nº of Tweets', fontsize=12)
plt.xticks(list(timedict.keys()), fontsize=15, rotation=90)
plt.title('Number of tweets on each day of the month', fontsize = 20)
plt.show()
Time series of a 2-day tweet collection for #Brexit (left) and a whole month for the #Oscars (right)

If, like me, you only collected tweets for a couple of days, you will get a very short time series, like the image on the left. The one on the right, however, shows a full-month time series built from a dataset of tweets about the #Oscars, collected by querying the Streaming API for more than one month. In this second time series, we can see how very few tweets are produced at the beginning of the month; as the day of the ceremony comes closer, tweet production starts going up, reaching its peak on the night of the event.

Awesome! Now, we will make a plot of the devices the tweets are published from.

As the code is pretty much the same code that was used for the previous bar plots, I will just post it here with no further explanation:

#Now let's explore the different devices the tweets are produced from
#and plot the results
devices = tweets.groupby('device')
devicecount = devices['text'].count()
#Same procedure as for the mentions, hashtags, etc.
device_dict = devicecount.to_dict()
device_ordered_list = sorted(device_dict.items(), key=lambda x: x[1])
device_ordered_list = device_ordered_list[::-1]
device_dict_values = []
device_dict_keys = []
for item in device_ordered_list:
    device_dict_keys.append(item[0])
    device_dict_values.append(item[1])

Now we plot and see the results:

fig = plt.figure(figsize = (12,12))
index = np.arange(len(device_dict_keys))
plt.bar(index, device_dict_values, edgecolor = 'black', linewidth=1)
plt.xlabel('Devices', fontsize = 15)
plt.ylabel('Nº tweets from device', fontsize=15)
plt.xticks(index, list(device_dict_keys), fontsize=12, rotation=90)
plt.title('Number of tweets from different devices', fontsize = 20)

plt.show()
Plot of tweet production from different devices

By looking at this chart, we can see that most tweets are published from smartphones, and that within this category, Android devices beat iPhones by a small margin.

Tweets published from the web could also come from mobile devices, but they are produced from a browser rather than from the Twitter app. Aside from these web-published tweets (for which we cannot tell whether they were posted from a PC, Mac, or mobile browser), there are very few tweets coming from recognised Mac or Windows devices. These results fit very well with the relaxed, easy-going nature of the social network.
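To put a number on this, we can compute the share of tweets coming from the two main mobile apps. A quick sketch (the source names used here are the ones Twitter usually reports, but check the labels in your own data):

#Share of tweets published from the official mobile apps
mobile = (devicecount.get('Twitter for Android', 0)
          + devicecount.get('Twitter for iPhone', 0))
print("{:.1f}% of tweets come from the mobile apps".format(
    mobile / devicecount.sum() * 100))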

Lastly, let's look at some additional information that can easily be obtained from the gathered data:

#Let's see other useful information that can be gathered:
#MEAN LENGTH OF THE TWEETS
print("The mean length of the tweets is:", np.mean(tweets['length']))
#TWEETS WITH AN URL
url_tweets = tweets[tweets['text'].str.contains("http")]
print(f"The percentage of tweets with Urls is {round(len(url_tweets)/len(tweets)*100)}% of all the tweets")
#MEAN TWEETS PER USER
print("Number of tweets per user:", len(tweets)/tweets['Username'].nunique())

For me, the mean length of the tweets is 145 characters, 23% of the tweets contain a URL, and the mean tweet production per user is 2.23 tweets.

That's it! You can find the Jupyter Notebook used for this post and the previous one here, along with the scripts and notebooks for my other posts on Twitter data collection.

Also, for more awesome resources about Natural Language Processing and Machine Learning, check out this awesome blog: How to Learn Machine Learning.

Thanks a lot for reading, please clap, keep tweeting and see you soon!

