Incompatibility and predictability — Salvini vs Di Maio: a Twitter-based analysis

Manuel Offidani
Towards Data Science
9 min read · Nov 10, 2018


Luigi Di Maio (left) and Matteo Salvini (right). Image from liberoquotidiano.it

Recent general elections in Italy gave birth to the so-called ‘yellow-green’ hybrid coalition, with MPs from the 5-Star Movement (5SM) and Lega Nord (LN) parties holding the majority in both chambers. However, their very different political backgrounds and priorities are making for a troubled coexistence, calling into question the compatibility of the two factions in such a unique body. Nevertheless, what unites the penta-starred and the Northern League positions is a common concern for the lower classes, usually addressed via unattainable pledges and disregard for other parties and/or older national and supranational institutions, e.g. the EU; in one word: populism.

Is this enough to guarantee a long-lived (theoretically, five-year) government in Italy? To answer this question we need to quantify, in some way, the overlap between the two forces, represented by their leaders Luigi Di Maio (5SM) and Matteo Salvini (LN). What I will do here is perform a simple yet meaningful analysis of recent tweets from the two leaders using Python. We will see that the overlap comes out as low as ~15%. To supplement the study, taking inspiration from the post by 3mi1y here, I will measure (or rather, let you measure) the predictability of the two leaders. More precisely, I will employ a first-order Markov model trained on their historical Twitter records to build a Salvini and/or Di Maio tweet generator.

Obviously, Italian speakers have an edge in fully grasping the faithfulness of my results. However, the post might still be relevant to readers who want to perform a similar analysis on Twitter data.

Importing tweets

To import the tweets of a specific user, it is mandatory to apply for a Twitter developer account, which can be done here. Then, it suffices to create a personal app (motivating its usage etc., as per the new regulations) to obtain:

  • consumer key
  • consumer secret
  • access token
  • access token secret

Once that’s done, we’re all set to import the tweets. I used the tweepy package and followed the procedure reported here:

import tweepy

# fill below with your credentials
consumer_key = '......'
consumer_secret = '...yours...'
access_token = '...'
access_token_secret = '...'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

collect_tweets_Salvini = []
collect_tweets_DiMaio = []

for status in tweepy.Cursor(api.user_timeline, screen_name='@matteosalvinimi').items():
    # split and re-join to get rid of \n and similar characters
    temp = status._json['text']
    temp = ' '.join(temp.split())
    collect_tweets_Salvini.append(temp)

for status in tweepy.Cursor(api.user_timeline, screen_name='@luigidimaio').items():
    temp = status._json['text']
    temp = ' '.join(temp.split())
    collect_tweets_DiMaio.append(temp)

This code returns two lists collect_tweets_... containing a large number of tweets (~3200 each) from the two politicians. There are many alternative ways to get those, for instance see here.

Part I: Exploratory Data Analysis and overlap measure

Data cleansing and functions definition

Before performing any type of analysis, we need to clean our data. Some types of tweets are not relevant to us, for instance retweets. Also, we want to get rid of @, #, emoji, etc. I define the following functions:
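The helper functions themselves are not reproduced here; a minimal sketch of what they might look like, with the function names taken from the text and the exact pruning rules and stop-word list being my own assumptions, is:

import string

# a small, assumed list of frequent Italian words to drop (articles, prepositions, ...)
ITALIAN_STOPWORDS = {'il', 'lo', 'la', 'i', 'gli', 'le', 'un', 'una', 'di', 'a',
                     'da', 'in', 'con', 'su', 'per', 'tra', 'fra', 'e', 'che', 'non'}

def goodWord(word):
    # keep a word only if it is not a mention, a link or a retweet marker
    return not (word.startswith('@') or word.startswith('http') or word == 'RT')

def prune_tweet(tweet):
    # drop mentions/links/RT markers and strip the leading '#' from hashtags
    words = [w.lstrip('#') for w in tweet.split() if goodWord(w)]
    return ' '.join(words)

def tweet_process(tweet):
    # lower-case, remove punctuation, drop frequent Italian words and duplicates
    tweet = tweet.lower().translate(str.maketrans('', '', string.punctuation))
    words = [w for w in tweet.split() if w not in ITALIAN_STOPWORDS]
    return list(dict.fromkeys(words))  # remove duplicates, keep order

def clean_tweets(tweets):
    # full cleaning pipeline: skip retweets, prune each tweet, then process it
    return [tweet_process(prune_tweet(t)) for t in tweets if not t.startswith('RT')]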

The process of cleaning a tweet with the function clean_tweets() goes as follows: first the tweet is pruned with prune_tweet(), which checks every word in the tweet and discards or modifies it if it doesn’t meet certain requirements (for instance if it starts with @), a check carried out by goodWord(). Then, by means of tweet_process(), punctuation and frequent words of the Italian language (e.g. articles) are removed. Finally, we also get rid of duplicate words. The whole process is run with:

clean_Salvini = clean_tweets(tweets_Salvini)
clean_DiMaio = clean_tweets(tweets_DiMaio)

creating a list of lists, where each element is a different cleaned tweet.

A very useful tool to check the composition of the corpus of tweets for each politician is the wordcloud. We flatten the lists of lists to gather all the words used by each politician, and use the result as input to create the wordclouds.
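A sketch of this step, using the wordcloud and matplotlib packages (the styling options are my choice, not necessarily the ones used for the figures below):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

def show_wordcloud(clean_tweets_list, title):
    # flatten the list of cleaned tweets into a single string of words
    all_words = ' '.join(word for tweet in clean_tweets_list for word in tweet)
    wc = WordCloud(width=800, height=400, background_color='white').generate(all_words)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
    plt.show()

show_wordcloud(clean_Salvini, 'Salvini')
show_wordcloud(clean_DiMaio, 'Di Maio')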

Here are the results:

Figure 1: Salvini Twitter wordcloud.
Figure 2: Di Maio Twitter wordcloud.

The wordclouds show some degree of similarity (‘governo’, ‘oggi’, ‘diretta’ have roughly the same size for both of them). The most evident difference concerns the appearance of people’s names: heavy usage of the word ‘Salvini’ for the LN leader and of ‘Renzi’ (former PM and Democratic Party leader) for the 5SM one. I’ll let you draw your own conclusions here, which becomes easier by looking at the word-appearance counts below.

Figure 3: Salvini Top 20 words.
Figure 4: Di Maio Top 20 words.

Ok, I would have really liked to be as politically unbiased as possible by not drawing any conclusion at all here, but the fact that the usage of the word ‘salvini’ outdoes all the others, across the two lists, by roughly an order of magnitude deserves some comment. That is due to two main reasons: the first is that the LN leader loves to start his tweets with the hashtag #Salvini. The second is a victim-like attitude towards communication, in that many posts report other people’s comments against him, which Salvini in turn is able to recast in a new light to fuel his electorate’s support. You can check this explicitly by scanning through the list of tweets.

Now, how do we measure the overlap between the two? The most common technique in Natural Language Processing is to vectorize each tweet to make it suitable for numerical analysis. This involves tokenizing the tweets and related preprocessing (see e.g. here). That kind of analysis is particularly relevant for classification problems, a classic example being the classification of an email as spam or not. Trying to force this approach onto our case is obviously not going to work: the chance that Di Maio is classified as Salvini in terms of tweet similarity is very low. In the end we are talking about two different human beings… Therefore the method I’m going to use is as follows:

  • Take a politician as a reference and get his top 20 words (Figs. 3, 4);
  • Find his 5 tweets containing the largest occurrence of words from the top-20 list, weighted by a function yielding their relative importance (the top-5 list);
  • Compare a tweet t2 of the second politician with one tweet t1 from the top-5 list above. Measure their overlap as the fraction of words in t2 appearing in t1;
  • Average the 5 overlaps so obtained.

It is easier to understand the process by illustrating it in practice. Let us take Salvini as the reference politician. His top 20 words are reported in Fig. 3. The associated top-5 list of his tweets is obtained by scanning through the corpus of tweets and counting how many times each of the top-20 words appears in a tweet, weighting each occurrence, to start, by the word’s count in the whole corpus. As an example, the original tweet (classified as the most important; seems like it’s working…)

'#Salvini: dopo mesi di lavoro offro agli italiani, che mi pagano lo stipendio, non solo il decreto, ma anche un....'

which corresponds to the pruned tweet

'decreto lavoro offro solo dopo salvini pagano stipendio italiani pro mesi'

contains the words (and their relative weights, see Fig. 3): (‘lavoro’, 115), (‘solo’, 109), (‘dopo’, 98), (‘Salvini’, 864), (‘italiani’, 186). The partial ‘score’ of this tweet is hence: score_sum = 115 + 109 + 98 + 864 + 186 = 1372. We need to take into account that shorter tweets have a lower probability of reaching a high score, so we divide the score by the length of the pruned tweet, length = 11 in the case above. In addition, to perform a meaningful analysis, we want to set a lower bound on the length of the pruned tweet, say 10. The score of each tweet then results as

score = score_sum / length * \Theta(length - 10)

where \Theta(x) is the Heaviside step function, equal to 1 for positive argument and 0 otherwise.

The code:
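The gist with the original implementation is not reproduced here; a minimal sketch of this scoring step, assuming the cleaned tweets are lists of words (function and variable names below are my own), could look like this:

from collections import Counter

def top_words(clean_tweets_list, n=20):
    # count word occurrences over the whole cleaned corpus and keep the top n
    counts = Counter(word for tweet in clean_tweets_list for word in tweet)
    return dict(counts.most_common(n))

def top_tweets(clean_tweets_list, top20, n=5, min_length=10):
    # score each cleaned tweet by the summed corpus weight of its top-20 words,
    # normalised by the tweet length; tweets shorter than min_length get score 0
    scored = []
    for i, tweet in enumerate(clean_tweets_list):
        length = len(tweet)
        score_sum = sum(top20.get(word, 0) for word in tweet)
        score = score_sum / length if length >= min_length else 0.0  # \Theta(length - 10)
        scored.append((score, i))
    scored.sort(reverse=True)
    return [clean_tweets_list[i] for score, i in scored[:n]]

top20_Salvini = top_words(clean_Salvini)
top5_Salvini = top_tweets(clean_Salvini, top20_Salvini)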

Now that we have our top-5 list, it’s time to compare each of its elements with the tweets by Di Maio. For a single Di Maio tweet t2, the five overlaps are quantified as the fraction of words in t2 contained in each of the five t1. This gives rise to a (total number of Di Maio tweets = 3191) x 5 matrix. We take the maximum value of the overlap in each column and average these maxima, which comes out at roughly 15%, as stated in the introduction (see Fig. 5 below). Well, if you believe this analysis…
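A possible implementation of this comparison (again a sketch rather than the original code; it reuses the hypothetical top5_Salvini list from above, with numpy and pandas for the bookkeeping):

import numpy as np
import pandas as pd

def overlap(t2, t1):
    # fraction of the words of tweet t2 that also appear in tweet t1
    if len(t2) == 0:
        return 0.0
    return sum(word in t1 for word in t2) / len(t2)

# (number of Di Maio tweets) x 5 matrix of overlaps
overlap_matrix = np.array([[overlap(t2, t1) for t1 in top5_Salvini]
                           for t2 in clean_DiMaio])

print(pd.DataFrame(overlap_matrix).describe())  # statistical properties, cf. Fig. 5
print(overlap_matrix.max(axis=0).mean())        # average of the column maxima (the ~15% quoted above)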

Figure 5: Statistical properties of the overlap matrix described above.

Part II: Markov chain

Finally, we employ a discrete-time Markov chain to mimic the leaders’ tweet generation. A (first-order) Markov chain is a stochastic process in which the likelihood of an event X_m at time t_m depends only on the previous time t_{m-1}, with no memory of the times t_{m-2}, t_{m-3}, etc. That means the probability of a series of events from X_1 to X_n can be factorized as

p(X_1, X_2, …, X_n) = p(X_1) p(X_2|X_1) p(X_3|X_2) … p(X_n|X_{n-1}),

where p(X_m|X_{m-1}) is the conditional probability of having X_m given X_{m-1}.

To train the Markov chain on our Twitter data, we do the following:

  • Take the set of all words in the corpus;
  • Fixing one of them, count the number of occurrences of every other word appearing right after it;
  • Normalize to turn the occurrences into probabilities;
  • Iterate over the whole set.

For instance, in the sentence ‘Hello world, hello moon: today is Saturday’, we would have the set of unique words (hello, world, moon, today, is, Saturday). Then, we fix the word ‘hello’ and count the occurrences of each word of the sentence appearing right after it:

hello → [hello: 0, world:1, moon: 1, today:0, is:0, Saturday:0],

which gives, dividing each count by the total number of occurrences (= 2), the frequencies (alias probabilities) of having a word given another one:

p(world | hello) = p(moon | hello) = 0.5,

p(hello | hello) = p(today | hello) = p(is | hello) = p(Saturday | hello) =0.
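As a quick sanity check, the toy example above can be reproduced in a few lines (a plain illustration, not the code used for the tweets):

from collections import defaultdict

# the toy sentence, lowercased and with punctuation removed
sentence = 'hello world hello moon today is saturday'.split()

# count, for each word, which word follows it
follows = defaultdict(lambda: defaultdict(int))
for current_word, next_word in zip(sentence[:-1], sentence[1:]):
    follows[current_word][next_word] += 1

# normalise the counts for the word 'hello' into probabilities
total = sum(follows['hello'].values())
print({word: count / total for word, count in follows['hello'].items()})
# {'world': 0.5, 'moon': 0.5}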

The Python way to create such objects is to define dictionaries. We perform this operation on our corpus of tweets by means of the following functions.
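The functions themselves are in a gist that is not reproduced here; a sketch along these lines could do the job (the names firstMarkov and normalizeFirstMarkov come from the snippet below; the generator follows the description in the text, stopping after four punctuation marks; everything else is an assumption):

import random
from collections import defaultdict

PUNCTUATION = {'.', ',', ';', ':', '!', '?'}

def firstMarkov(pruned_tweets):
    # nested dictionary counting, for every word in the corpus,
    # how often each other word appears right after it
    counts = defaultdict(lambda: defaultdict(int))
    for tweet in pruned_tweets:
        words = tweet.split()
        for current_word, next_word in zip(words[:-1], words[1:]):
            counts[current_word][next_word] += 1
    return counts

def normalizeFirstMarkov(counts):
    # turn the raw counts into conditional probabilities p(next | current)
    model = {}
    for word, followers in counts.items():
        total = sum(followers.values())
        model[word] = {w: c / total for w, c in followers.items()}
    return model

def generate_tweet(model, max_punctuation=4, max_words=50):
    # random walk on the chain, stopping after max_punctuation punctuation
    # marks have been drawn (or after max_words words, as a safety net)
    word = random.choice(list(model.keys()))
    tweet, n_punct = [word], 0
    while n_punct < max_punctuation and len(tweet) < max_words and word in model:
        followers = model[word]
        word = random.choices(list(followers.keys()),
                              weights=list(followers.values()))[0]
        tweet.append(word)
        n_punct += sum(word.count(p) for p in PUNCTUATION)
    return ' '.join(tweet)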

Note that here we want to preserve articles, punctuation, etc. in the corpora of tweets. Therefore we only apply the function prune_tweet() rather than clean_tweets(). As can be seen above, I stop the tweet generation as soon as four punctuation marks have been extracted. We run:

clean_Salvini_WP = list(map(prune_tweet,tweets_Salvini))
clean_DiMaio_WP = list(map(prune_tweet,tweets_DiMaio))
dict_S = firstMarkov(clean_Salvini_WP)
Salvini = normalizeFirstMarkov(dict_S)
dict_DM = firstMarkov(clean_DiMaio_WP)
DiMaio = normalizeFirstMarkov(dict_DM)
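With the two trained chains stored in Salvini and DiMaio, sample sentences can then be drawn, for instance with the hypothetical generate_tweet() helper sketched above:

print(generate_tweet(Salvini))
print(generate_tweet(DiMaio))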

Here are a couple of sentences generated by this method:

Salvini: ‘ Pasolini scriveva in Italia si è rinchiuso nel suo orgoglio anti-salvini. 7 mesi del centro di invalidità. Questa mattina deve rispettare l’Italia, rozzo’.

Di Maio: ‘ Risparmiati da noi, l’Italia sarebbe stata oggi in Italia. Un subappalto, renzopoli’.

Do they sound like them? hahaha.

Obviously this simple model can be improved by taking higher-order Markov chains, or by weighting the probability function with some additional one to take into account common patterns in the language (e.g. adding an extra weight to distinguish articles, adjectives, etc.). One can also vary the output by mixing words from the two leaders in arbitrary proportions. I’ll do that as a follow-up to this simple project.

Suggestions are welcome!

Thanks
