Visualizing Electoral Data: Polarization and Mobilization During the 2021 Chilean Presidential Elections

A case study from Chile’s last presidential elections

Published in

Towards Data Science

13 min read · Jun 20, 2022

On December 19th, 2021, Chile concluded one of the most disputed presidential races in its history. The two-round voting system eliminated all traditional parties, in a context of political antagonism that had not been seen for quite a long time.

But the purpose of this article is not to discuss politics. Instead, it’s about investigating one interesting fact about this election: the number of expressed votes (i.e., ballots that are neither blank nor invalid) strongly increased between the two rounds, from 7,028,345 in the first round to 8,271,893 in the second, according to data from the National Electoral Service (SERVEL). That’s an increase of 17.7%, almost 1.25 million votes!
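The increase can be checked directly from the two totals quoted above. This is only a quick arithmetic sketch; the variable names are mine.

```python
# Expressed votes per round, as quoted above from SERVEL
first_round_votes = 7_028_345
second_round_votes = 8_271_893

gain = second_round_votes - first_round_votes
gain_pct = 100 * gain / first_round_votes

print(f"Gain: {gain:,} votes ({gain_pct:.1f}%)")
# prints "Gain: 1,243,548 votes (17.7%)"
```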

Abstention is usually high in Chile. The electoral register counts 15 million voters, of whom barely 47.5% participated in the first round and 55.9% in the second, which is nonetheless considered one of the country’s best turnouts. However, I’ll focus on the increase in expressed votes only, as the reasons behind abstention are a whole different matter that I won’t deal with here.

As I said, these elections were highly polarized. The increase in expressed ballots in the second round benefited the winning candidate, Gabriel Boric. To scrutinize this phenomenon, I analyzed the polling data from SERVEL, which gives detailed information on the ballot in each of the 46,888 polling places throughout the country and abroad: number of votes obtained by each candidate, blank and invalid votes, abstention, location, gender, etc.

Before anything else, let’s list all candidates in both rounds with their respective scores. The second round saw the victory of Gabriel Boric Font with 55.87% of the ballots, defeating José Antonio Kast, who scored 44.13%.

In the first round, José Antonio Kast came first with 27.91%, whereas Gabriel Boric took second place with 25.83% of the ballots. The remaining candidates were Francisco Parisi (12.80%), Sebastián Sichel (12.78%), Yasna Provoste (11.60%), Marco Enríquez-Ominami (7.60%), and Eduardo Artés (1.47%).

# libraries used
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import cm
from matplotlib.ticker import PercentFormatter
import seaborn as sns  # aliased as sns, the name used in the plotting calls

# preparation of the data includes:
#   - importing first-round and second-round datasets
#   - joining data from polling places within the country and abroad
#   - cleaning the data
#   - counting total number of expressed ballots per polling station
#     (abstention and blank/invalid votes are not taken into account)
#   - computing percentage of each candidate in each polling station
#   - returning a NumPy array of all 46,888 scores (one per polling
#     station), for both the 7 first-round candidates and the 2
#     second-round candidates
#   - calculating the difference of expressed votes between the two
#     rounds to measure the electoral mobilization in each polling
#     station. Outliers are set to a limit to narrow the
#     spread of the array when plotted as a heatmap

# all operations are regularly asserted throughout the script to be
# consistent with the totals provided by SERVEL in a separate sheet.
# the detailed script is available on GitHub.
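To make the preparation steps above more concrete, here is a minimal sketch on toy numbers. It is not the script from GitHub: the array layout (one row per polling station, one column per candidate) and all variable names except `diff_votes_perc` are assumptions made for illustration.

```python
import numpy as np

# Toy counts, NOT SERVEL data: rows are polling stations, columns are candidates
first_round_votes = np.array([[60, 40], [30, 90], [0, 0]])
# Toy total of expressed votes per station in the second round
second_round_total = np.array([120, 90, 10])

# Expressed votes per station in the first round
first_round_total = first_round_votes.sum(axis=1)

# A station without expressed votes gets NaN instead of a division by zero
safe_total = np.where(first_round_total > 0, first_round_total, np.nan)

# Score of each candidate in each station, in %
scores_pct = 100 * first_round_votes / safe_total[:, None]

# Gain/loss of expressed votes between the two rounds, in %
diff_votes_perc = 100 * (second_round_total - first_round_total) / safe_total
```

Marking empty stations as NaN is one possible choice; it fits with NaN-aware functions such as `np.nanmean`, which the plotting script uses later on.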

Comparing first- and second-round results in each polling station throughout the country.

For each of the seven candidates, I drew a scatter plot in which every point is a polling place. Its position on the x-axis is the score the candidate obtained there in the first round. Its position on the y-axis is the score of a runoff candidate there in the second round, which shows how the voters of this polling place reacted to the second-round duel.

# candidate is an array of scores of the 1st-round candidate
# candidate2 is an array of scores of the 2nd-round candidate
# diff_votes_perc is an array of the differences of expressed votes between the two rounds

# create the figure and the axes to draw on
fig, ax = plt.subplots()

# plot
sns.scatterplot(
    x=candidate,
    y=candidate2,
    hue=diff_votes_perc,
    palette='coolwarm',
    alpha=0.5,
    ax=ax
)

# legend
ax.legend(
    title='Gain/loss of votes between\nthe two rounds (in %)',
    title_fontsize='small',
    fontsize='small'
)

A candidate’s electorate consists of the polling places furthest to the right, where the best first-round scores are. Where this electorate sits on the y-axis reveals how it behaved in the second round. In other words, it indicates how strongly a first-round candidate’s electorate supported one of the two runoff candidates.

The graphs also show the electoral mobilization in every polling station. The color of a point indicates whether the number of expressed votes increased or decreased between the first and the second round, following a heatmap logic: a warm red indicates a strong mobilization, whereas a cold blue means the participation decreased.

# check distribution of the difference of expressed ballots array
plt.hist(diff_votes_perc)
plt.show()

# narrow range of the array to avoid outliers
diff_votes_perc = np.where(
    diff_votes_perc < -30, -30, diff_votes_perc
)
diff_votes_perc = np.where(
    diff_votes_perc > 55, 55, diff_votes_perc
)
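As a side note, the two `np.where` calls can be written as a single `np.clip` call. The sketch below compares both on a few made-up values to show that they give the same result.

```python
import numpy as np

# Made-up values, including outliers on both sides
diff_votes_perc = np.array([-80.0, -30.0, 0.0, 12.5, 55.0, 300.0])

# Two-step version, as in the script
clipped_where = np.where(diff_votes_perc < -30, -30, diff_votes_perc)
clipped_where = np.where(clipped_where > 55, 55, clipped_where)

# One-step equivalent: values below -30 become -30, values above 55 become 55
clipped = np.clip(diff_votes_perc, -30, 55)
```

Either way, the limits only affect the coloring: a polling station that doubled its expressed votes is drawn with the same red as one that gained 55%.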

That way, the graphs not only tell us whether a first-round candidate’s electorate rallied behind a runoff candidate, but also whether that support was enthusiastic.

Let’s look at an example to clarify this point: the graph comparing the scores obtained by the traditional right candidate Sebastián Sichel in the first round and the far-right candidate José Antonio Kast in the second round.

There’s a heap of blue points in the upper right of the figure, from which we can draw two conclusions:

First, regarding the existence or lack of support for José Antonio Kast in the second round: the polling places where Sichel registered his best results (right of the x-axis) voted in favor of Kast in the second round (top of the y-axis). Put differently, the electorate of Sichel rallied behind Kast in the second round.
Second, regarding the enthusiasm of this support: the prevalence of blue shows electoral demobilization, as there was a loss of expressed ballots between the two rounds. In other words, the electorate of Sichel offered a “demobilized support” to Kast.
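The same two readings can be put into numbers. The sketch below is not part of the original script: `stronghold_summary` is a hypothetical helper, the 70% quantile threshold is an arbitrary choice, and the arrays are toy values rather than SERVEL data.

```python
import numpy as np

def stronghold_summary(first_round, second_round, mobilization, quantile=0.9):
    """Summarize the stations where a 1st-round candidate scored best."""
    first_round = np.asarray(first_round, dtype=float)
    second_round = np.asarray(second_round, dtype=float)
    mobilization = np.asarray(mobilization, dtype=float)

    # Stations at or above the chosen quantile of first-round scores
    threshold = np.nanquantile(first_round, quantile)
    stronghold = first_round >= threshold

    return {
        "n_stations": int(stronghold.sum()),
        "mean_second_round": float(np.nanmean(second_round[stronghold])),
        "mean_mobilization": float(np.nanmean(mobilization[stronghold])),
    }

# Toy values: 1st-round scores, 2nd-round scores, gain/loss of votes (all in %)
sichel = np.array([5, 10, 15, 20, 40, 50])
kast2 = np.array([30, 35, 45, 50, 80, 90])
diff = np.array([20, 15, 10, 5, -5, -10])

summary = stronghold_summary(sichel, kast2, diff, quantile=0.7)
```

With these toy values, the two best stations average 85% for the runoff candidate with a mobilization of -7.5%, which is the "demobilized support" pattern described above.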

Comparing runoff candidates with themselves

It gets even more interesting when we compare the evolution of the scores of the same candidate between the first and the second round. Of course, we can only make such a comparison for the candidates who made it to the second round.

What would be the point of seeing if a candidate’s electorate did support him in both rounds?

It might seem pointless, as we can reasonably expect polling places that voted massively for a candidate in the first round to vote for the same candidate in the second round.

As a demonstration, let’s first look at the polling places that voted massively for Kast, the defeated runoff candidate.

It looks like a dense swarm, with a swelling on the top.

The linear shape of the scatter plot suggests that his electorate was stable: the more a polling place voted for him in the first round, the more likely it is to have voted massively for him in the second round.
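One way to quantify this "linear shape" is the Pearson correlation between first-round and second-round scores. The helper below is a sketch with a hypothetical name and toy values; it was not used to produce the figures in this article.

```python
import numpy as np

def score_correlation(first_round, second_round):
    """Pearson correlation between two score arrays, ignoring NaN pairs."""
    first_round = np.asarray(first_round, dtype=float)
    second_round = np.asarray(second_round, dtype=float)

    # Keep only the stations where both scores are available
    valid = ~(np.isnan(first_round) | np.isnan(second_round))
    return float(np.corrcoef(first_round[valid], second_round[valid])[0, 1])

# Toy values lying exactly on a line once the NaN pair is dropped
x = np.array([10.0, 20.0, np.nan, 40.0])
y = np.array([25.0, 45.0, 60.0, 85.0])
r = score_correlation(x, y)
```

A value close to 1 would support the idea of a stable electorate, while a lower value would point to a more scattered swarm.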

There are exceptions. Some places strongly voted for Kast in the second round, with scores as high as 100%, even though his first-round score was low. But we’re talking about very few polling stations out of a total of 46,888, and generally at the cost of demobilized voters, as suggested by the blue-colored dots.

The most important conclusion to draw from this graph is less striking. The size of the swelling is not that big, especially on the left. But that upper-left / upper-middle part may be precisely where victory lies. It is vital in an election to gather voters who had not voted for you in the first round. Those voters should appear precisely within the area outlined in red.

The electoral outbreak to victory

To better illustrate the idea of an electoral outbreak, let’s now look at the winning candidate’s scores.

The swelling is more significant and filled with dots. It is composed of two parts: a blue one on the left, which voted strongly for Boric in the second round but demobilized, and a larger red one, which strongly mobilized in favor of Boric.

Let’s display the very same figure with the national average score obtained by Boric in each round.

# get max values of the data to get limit coordinates
X_max = float(max(candidate))
Y_max = float(max(candidate2))

# plot, same as before
sns.scatterplot(
    x=candidate,
    y=candidate2,
    hue=diff_votes_perc,
    palette='coolwarm',
    alpha=0.5,
    ax=ax
)

# compute national averages of candidates
cand2_mean = float(np.mean(candidate2))
cand_mean = float(np.mean(candidate))

# compute number of polling places
nb_pp = int(len(SERVEL_data) / 7)

# plot national average of 2nd-round candidate
X_plot = np.linspace(0, X_max, nb_pp)
Y_plot = np.linspace(cand2_mean, cand2_mean, nb_pp)
ax.plot(
    X_plot,
    Y_plot,
    color='black',
    linestyle='-.',
    label=f'{candidate2name}\n2nd round: {round(cand2_mean,1)}%'
)

# plot national average of 1st-round candidate
X_plot2 = np.linspace(cand_mean, cand_mean, nb_pp)
Y_plot2 = np.linspace(0, Y_max, nb_pp)
ax.plot(
    X_plot2,
    Y_plot2,
    color='black',
    linestyle=':',
    label=f'{candidate1name}\n1st round: {round(cand_mean, 1)}%'
)
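One caveat about these averages, assuming `candidate` and `candidate2` hold one percentage per polling station as described earlier: a plain mean gives every station the same weight, so it need not match the official national score, which weights each station by its number of expressed votes. The toy example below (made-up numbers and hypothetical variable names) shows how the two can differ.

```python
import numpy as np

# Toy values, NOT SERVEL data
scores_pct = np.array([80.0, 40.0])      # candidate score per station, in %
expressed_votes = np.array([100, 300])   # expressed votes per station

# Mean of station scores: every station counts the same
unweighted = float(np.mean(scores_pct))

# Vote-weighted mean: equals total votes for the candidate / total expressed votes
weighted = float(np.average(scores_pct, weights=expressed_votes))

print(unweighted, weighted)  # prints "60.0 50.0"
```

Both are reasonable reference lines for the plot. The unweighted mean describes the typical polling station, the weighted one the national result.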

The upper-right quarter gathers all polling places that voted more for Boric than the national average in both rounds. In contrast with Kast’s figure, we can see that it is not linear-shaped. Instead, the red burst expanding to the top highlights the mobilization in favor of Boric.

The upper-left quarter is also quite insightful. It gathers all places that did not vote much for Boric in the first round, less than the national average. However, these polling places voted significantly in his favor during the second round. They mobilized more, as the red color indicates.
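To go beyond the visual impression, the polling places in each quarter can simply be counted. The function below is a sketch with a hypothetical name, run on toy values; the thresholds stand for the two national averages.

```python
import numpy as np

def quadrant_counts(first_round, second_round, first_mean, second_mean):
    """Count polling stations in each quarter defined by the two averages."""
    first_round = np.asarray(first_round, dtype=float)
    second_round = np.asarray(second_round, dtype=float)

    # Drop stations with a missing score in either round
    valid = ~(np.isnan(first_round) | np.isnan(second_round))
    right = first_round[valid] >= first_mean
    upper = second_round[valid] >= second_mean

    return {
        "upper_right": int((right & upper).sum()),
        "upper_left": int((~right & upper).sum()),
        "lower_right": int((right & ~upper).sum()),
        "lower_left": int((~right & ~upper).sum()),
    }

# Toy scores with averages of 25% (1st round) and 55% (2nd round)
counts = quadrant_counts([10, 30, 20, 40], [60, 70, 40, 50], 25, 55)
```

The same masks could be combined with `diff_votes_perc` to compare the average mobilization of each quarter.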

The fact that there’s a lot of red in the upper parts of the graph emphasizes that Boric was elected thanks to a crucial electoral mobilization that went way beyond his original electorate. This conclusion is consistent with the fact that he won the election by a wide margin even though he came second in the first round.

By contrast, remember that Kast’s swelling was almost empty and full of demobilized polling places, meaning he had failed to attract voters beyond the boundaries of his electorate.

Putting it all together

Here are the complete figures for all seven first-round candidates. For each one of them, there are three views of the same data:

Expressed votes per polling station, with the national average score of the first-round and second-round candidates. This view carries no color-based information, to keep the focus on the shape of the swarm and on where the averages sit.
Heatmap of the electoral mobilization between the two rounds. That’s the visualization we’ve been seeing so far.
Vote per region. Another way of displaying the information, for which I faced the strangest challenge: ranking regions numbered under two different systems (some by their geographical position, others by their creation date).

Images by the author

Here’s the script to generate these plots. First, we define candidates and set the general features of the figure. As only three subplots are displayed, we can put a customized legend in the upper-right quarter instead of a fourth plot.

for i, candidate in enumerate(
    [Boric, Kast, Provoste, Sichel, Artés, Ominami, Parisi]
):
    fig, axs = plt.subplots(2, 2, figsize=[15,10])

    # extract name of the 1st round candidate
    candidate1name = names[i].title()

    # define candidate 2nd round to compare to
    if i == 1 or i == 3:
        candidate2name = 'José Antonio Kast Rist'
        candidate2 = Kast2
    else:
        candidate2name = 'Gabriel Boric Font'
        candidate2 = Boric2

    # format x and y axis in percentages
    for a, b in [(0,0), (0,1), (1,0), (1,1)]:
        axs[a][b].xaxis.set_major_formatter(PercentFormatter())
        axs[a][b].yaxis.set_major_formatter(PercentFormatter())

    # put the title in the second plot
    axs[0][1].annotate(
        text=f"2nd round behavior of\n{candidate1name}'s electorate",
        xy=[0.5,0.8],
        horizontalalignment='center',
        fontsize=20,
        fontweight='bold'
    )

    # add general description
    axs[0][1].annotate(
        text='Comparison of the results obtained at each round of the\n2021 Chilean presidential elections (by polling station)',
        xy=[0.5,0.6],
        horizontalalignment='center',
        fontsize=12,
        fontstyle='italic'
    )

    # annotate customized legend
    axs[0][1].annotate(
        'Legend:\n'
        '1 - Expressed votes per polling station (in %)\n'
        '2 - Electoral mobilization between the two rounds\n'
        '3 - Vote per region',
        xy=[0.05,0.05],
        horizontalalignment='left',
        fontsize=12,
        fontweight='light',
        backgroundcolor='white',
        bbox=dict(edgecolor='black', facecolor='white', boxstyle='round')
    )

    # fetch limit coordinates of each plot
    X_max = float(max(candidate))
    Y_max = float(max(candidate2))

    # put numbered references of the legend in the upper-right corner of each subplot
    axs[0][0].annotate(
        text='1',
        xy=[X_max,90],
        color='darkred',
        fontsize=20,
        fontweight='black'
    )
    axs[1][0].annotate(
        text='2',
        xy=[X_max,90],
        color='darkred',
        fontsize=20,
        fontweight='black'
    )
    axs[1][1].annotate(
        text='3',
        xy=[X_max,90],
        color='darkred',
        fontsize=20,
        fontweight='black'
    )

    # hide axis
    axs[0][1].axis('off')

    # set labels of the general figure
    fig.supylabel(
        f'{candidate2name} - 2nd round results',
        fontsize=16,
        ha='center',
        va='center'
    )
    fig.supxlabel(
        f'{candidate1name} - 1st round results',
        fontsize=16,
        ha='center',
        va='center'
    )

We now generate the scatter plot with national averages. Remember that we are still inside the same for loop.

    # plot comparison of expressed votes in the first subplot
    sns.scatterplot(
        x=candidate,
        y=candidate2,
        color=colors[i],
        alpha=0.3,
        ax=axs[0][0]
    )

    # define variables to plot national averages of candidates
    cand2_mean = float(np.nanmean(candidate2))
    cand_mean = float(np.nanmean(candidate))
    nb_pp = int(len(SERVEL_data) / 7)

    # plot national average of 2nd-round candidate
    X_plot = np.linspace(0, X_max, nb_pp)
    Y_plot = np.linspace(cand2_mean, cand2_mean, nb_pp)
    axs[0][0].plot(
        X_plot,
        Y_plot,
        color='black',
        linestyle='-.',
        label=f'{candidate2name}\n2nd round: {round(cand2_mean,1)}%'
    )

    # plot national average of first-round candidate
    X_plot2 = np.linspace(cand_mean, cand_mean, nb_pp)
    Y_plot2 = np.linspace(0, Y_max, nb_pp)
    axs[0][0].plot(
        X_plot2,
        Y_plot2,
        color='black',
        linestyle=':',
        label=f' {candidate1name}\n1st round: {round(cand_mean, 1)}%'
    )
    axs[0][0].legend(
        fontsize='small',
        title='National average',
        title_fontsize='small'
    )
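The two average lines can also be drawn without building arrays of points, using Matplotlib’s `axhline` and `axvline`. This is an alternative sketch rather than the code behind the figures, and the two averages below are placeholder numbers. Unlike the `np.linspace` version, these lines span the whole axes instead of stopping at `X_max` and `Y_max`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so the sketch runs without a display
import matplotlib.pyplot as plt

# Placeholder averages for illustration, not values computed from the data
cand_mean, cand2_mean = 25.8, 55.9

fig, ax = plt.subplots()

# horizontal line: average of the 2nd-round candidate
h_line = ax.axhline(
    cand2_mean, color="black", linestyle="-.",
    label=f"2nd round: {cand2_mean}%"
)

# vertical line: average of the 1st-round candidate
v_line = ax.axvline(
    cand_mean, color="black", linestyle=":",
    label=f"1st round: {cand_mean}%"
)

ax.legend(fontsize="small", title="National average", title_fontsize="small")
```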

Then the electoral mobilization heatmaps that we’ve already seen.

    # plot electoral mobilization in the third subplot
    sns.scatterplot(
        x=candidate,
        y=candidate2,
        hue=diff_votes_perc,
        palette='coolwarm',
        alpha=0.5,
        ax=axs[1][0]
    )

    # legend showing the gain/loss of expressed votes between the two rounds, in %
    axs[1][0].legend(
        title='Gain/loss of votes between\nthe two rounds (in %)',
        title_fontsize='small',
        fontsize='small'
    )

Chile is one of the few countries where regions can be a numerical variable, and not a categorical one.

And last, a bonus region plot. For those unfamiliar with Chile’s geography, it’s the world’s longest country from north to south, stretching over 4,270 km.

It goes from the world’s driest desert in the north to the Antarctic in the south and gathers all kinds of climates. But it’s narrow, stuck between the sea and the mighty Andes. So on average, it’s only 177 km wide.

Regions pile up one on top of the other, and no precise east/west layering appears on the map. We can take advantage of this peculiar geography and attribute shades of colors to the dots according to their position on a north/south axis. After all, Chile is in some way shaped like an axis!

In other words, we can order Chile’s regions numerically. There aren’t many countries where you can do that! In most countries, a colored plot of the regional layout will be categorical and is less likely to provide significant visual insights.

So, we can display geographical data in Chile as shades from north to south. This kind of scatter plot gives us a hint of the approximate location of polling stations at a glance. It brings interesting insights for some candidates, such as Parisi and Provoste, whose electorates are concentrated in northern Chile.

So, back to the script. We want regions ordered from north to south. The good thing is, they are numbered, apparently from north to south. But if you look at the map of Chile’s regions above, you’ll see that not all of the numbering makes sense.

That’s a tricky one! Initially, regions were numbered according to their geographical position. But Chile has created new regions on several occasions since then, and these are numbered by order of creation, not by geographical position.

We can imagine many ways of ordering all regions correctly from north to south, but it might as well be done manually through indexing, zipping, and NumPy.

    # plot votes according to region in the last subplot
    # reordering the regions from the north is necessary to create a readable heatmap

    # list the unique region names (np.unique returns them sorted)
    regions = np.unique(location_array)

    # zip the region list with a list of their respective position starting from the north
    north_to_south = [3, 1, 4, 15, 5, 12, 14, 13, 16, 2, 6, 10, 11, 8, 9, 17, 7]
    region_position = zip(regions, north_to_south)

    # create an array of the regional position of each polling place
    position_array = np.empty(len(location_array))
    for region, position in region_position:
        position_array[location_array == region] = position

    # stack all arrays of interest into a single one
    ordered_array = np.column_stack(
        [candidate, candidate2, position_array]
    )

    # sort array according to the regional position
    sorted_array = ordered_array[
        np.argsort(ordered_array[:,2])
    ]

    # create plot
    sns.scatterplot(
        x=sorted_array[:,0],
        y=sorted_array[:,1],
        hue=sorted_array[:,2].astype('<U44'),
        palette='Spectral',
        alpha=0.4,
        ax=axs[1][1]
    )

    # readjust labels from north to south
    location_labels = [
        'ARICA',
        'TARAPACA',
        'ANTOFAGASTA',
        'ATACAMA',
        'COQUIMBO',
        'VALPARAISO',
        'METROPOLITANA',
        "O'HIGGINS",
        'MAULE',
        'ÑUBLE',
        'BIOBIO',
        'ARAUCANIA',
        'LOS RIOS',
        'LOS LAGOS',
        'AYSEN',
        'MAGALLANES',
        'EXTRANJERO'
    ]
    axs[1][1].legend(
        labels=location_labels,
        ncol=4,
        fontsize='xx-small'
    )
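One of the other ways to order the regions is an explicit mapping from region name to north-south rank, which avoids relying on the alphabetical order returned by `np.unique`. The sketch below is an assumption-laden alternative, not the script used for the figures: it uses the short legend labels as keys, whereas the region names in the SERVEL files may be spelled differently, and the `location_array` here is a toy example.

```python
import numpy as np

# Regions from north to south, using the short labels of the legend
north_to_south_labels = [
    'ARICA', 'TARAPACA', 'ANTOFAGASTA', 'ATACAMA', 'COQUIMBO',
    'VALPARAISO', 'METROPOLITANA', "O'HIGGINS", 'MAULE', 'ÑUBLE',
    'BIOBIO', 'ARAUCANIA', 'LOS RIOS', 'LOS LAGOS', 'AYSEN',
    'MAGALLANES', 'EXTRANJERO',
]

# Map each region name to its rank, starting at 1 in the north
rank = {name: i + 1 for i, name in enumerate(north_to_south_labels)}

# Toy example: region of four polling places
location_array = np.array(['MAULE', 'ARICA', 'EXTRANJERO', 'MAULE'])

# Rank of each polling place, then the order that sorts them north to south
position_array = np.array([rank[name] for name in location_array])
order = np.argsort(position_array, kind='stable')

print(location_array[order])
# prints "['ARICA' 'MAULE' 'MAULE' 'EXTRANJERO']"
```

A dictionary lookup raises a `KeyError` on an unexpected region name, which can be a useful safeguard when the data changes.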

To go further

There’s a lot that this analysis does not cover, such as the reasons behind abstention. But this data could be used to provide another study about the gender-based vote, as the gender data is available except for the polling places abroad.

Comparing the gender data with the mobilization plot could be interesting because Boric is said to have won the second round thanks to a strong mobilization of young female voters.

Even though the elections took place last December and this case study comes a bit late, it will still be interesting to analyze this data in light of another crucial vote coming soon in Chile: the constitutional referendum on September 4th.

A translation to Spanish is available here.

A visualization by Le Monde following the last French presidential elections inspired this case study.