
Understanding Voting Outcomes through Data Science

See Interactive Data Viz

After the surprising results of the 2016 presidential election, I wanted to better understand the socio-economic and cultural factors that played a role in voting behavior. With the election results in the books, I thought it would be fun to reverse-engineer a predictive model of voting behavior based on some of the widely available county-level data sets.

The end result is a web app that lets you dial the top 10 most influential features up or down and visualize county-level voting outcomes for your hypothetical scenario: Random Forest Model of 2016 Presidential Election.

For example, if you want to answer the question "how could the election have been different if the percentage of people with at least a bachelor’s degree had been 2% higher nationwide?" you can simply toggle that parameter up to 1.02 and click "Submit" to find out.

The predictions are driven by a random forest classification model that has been tuned and trained on 71 distinct county-level attributes. On real data, the model has a predictive accuracy of 93.6% and an ROC AUC score of 96.1%. When you adjust the tunable parameters and hit "Submit," the model takes in the new data and updates the predicted outcome.
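Mechanically, each slider adjustment can be thought of as scaling a feature column and re-running the classifier. Here is a minimal sketch of that idea, assuming a fitted scikit-learn model and a pandas DataFrame of county features (the column name is illustrative, not the app's actual schema):

```python
import pandas as pd

def predict_scenario(model, county_features, scale_factors):
    """Scale selected feature columns, then re-run the classifier."""
    adjusted = county_features.copy()
    for column, factor in scale_factors.items():
        adjusted[column] = adjusted[column] * factor
    # Predicted party for each county under the hypothetical scenario
    return pd.Series(model.predict(adjusted), index=county_features.index)

# e.g., a 2% nationwide rise in bachelor's-degree attainment:
# predict_scenario(forest, counties, {"pct_bachelors": 1.02})
```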

It’s important to point out that the interactive visualization above is better described as "hypothetical" than "predictive." Voting Democrat or Republican isn’t a static classification; party leaders and platforms change significantly over time, so the same model can’t be used to predict voting outcomes across multiple elections. With the proper context, however, the model can yield valuable insights. By understanding how voters would have reacted to the 2016 Democratic and Republican platforms under different circumstances, we can better understand how those platforms could be adjusted to account for projected socioeconomic trends over the next four years. The tool is intended for exploratory data analysis, and the interpretation of its output is more of an art than a science.

So how does it work?

I approached this as a binary classification problem; the model needed to predict whether each county would vote Republican or Democrat. In order to create a predictive model I needed a robust "explanatory" data set (socioeconomic and cultural data) and a "response" data set (voting outcomes).

The first step in this process was scraping a number of sources for county-level data on education, population density, income, race, gender, age and religion (see Acquisition and cleaning of US county-level data sources). This required data cleaning and formatting in Python, as well as imputing missing values with weighted averages.
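As one example of the imputation step, missing county values can be filled with a population-weighted state average. This is a sketch of that idea; the column names ("state", "population") and the exact weighting scheme are illustrative rather than the precise ones used:

```python
import numpy as np
import pandas as pd

def impute_state_weighted(df, column):
    """Fill missing county values in `column` with the
    population-weighted mean of the county's state."""
    def weighted_mean(group):
        known = group[column].notna()
        if not known.any():
            return np.nan
        weights = group.loc[known, "population"]
        return (group.loc[known, column] * weights).sum() / weights.sum()

    state_means = df.groupby("state").apply(weighted_mean)
    return df[column].fillna(df["state"].map(state_means))
```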

Once the data was combined into a single, clean data set, I tested a number of supervised learning techniques, including k-nearest neighbors, logistic regression, and random forest classification. For each model I tuned the necessary hyperparameters using a combination of grid search and cross-validation techniques:
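As a rough sketch of what that tuning looks like with scikit-learn (the grid below is illustrative; the actual grids were specific to each model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data; in the real analysis X holds the 71 county-level
# attributes and y the binary party outcome.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 3, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # the same metric reported below
    cv=5,               # 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```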

With the hyperparameters optimized, I evaluated the techniques based on their predictive accuracy and ROC AUC scores:

Random Forest Classification

  • ROC AUC Score: 96.1%
  • Accuracy: 93.6%

Logistic Regression

  • ROC AUC Score: 91.7%
  • Accuracy: 89.4%

K-nearest Neighbors

  • ROC AUC Score: 80.6%
  • Accuracy: 87.4%

The random forest classifier scored highest on both metrics and is therefore the model used in the data visualization.

Decision Trees & Random Forests

In order to understand random forest models, we first need to understand how decision trees work. A decision tree is a tree-like structure consisting of nodes and branches. Each node can be thought of as a test on a predictive feature. If the feature is quantitative (as all of ours are in this exercise), a value is set as the split point: samples above the value follow one branch, while samples below it follow the other. In a decision tree classifier, each terminal (leaf) node yields a prediction of the class.

In our example, the features and their split points are determined by their Gini impurity values. For a complete explanation, see Decision Trees and the Gini Index, but in short, Gini impurity is the probability that a randomly selected member of the subset will be incorrectly classified (this can be thought of as the "expected error rate"). If this value is 0, the subset has been completely purified; only members of one class remain. At each step, the tree prioritizes the feature whose resulting subsets yield the biggest reduction in Gini impurity, known as "Gini gain."
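The quantity itself is simple to compute: for class proportions p_k within a subset, Gini impurity is 1 - sum(p_k^2). A minimal sketch:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity = 1 - sum(p_k^2): the probability that a randomly
    drawn member of the subset is misclassified if labeled at random
    according to the subset's class distribution."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

print(gini_impurity(np.array(["R", "R", "R"])))      # 0.0 (pure subset)
print(gini_impurity(np.array(["R", "D", "R", "D"]))) # 0.5 (50/50 split)
```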

A single decision tree has the advantage of being interpretable, but can often grow quite large in order to purify the sample data. This can result in a model that fits the sample data perfectly, but doesn’t generalize well to out-of-sample data ("overfitting" with low bias / high variance). Decision trees are also "greedy" in that they look for the locally optimal decision at each node (based on gini gain), but don’t take into account combinations of features that may provide the most efficient route.

In order to get around these limitations, we can use an ensemble method called a random forest, which grows many decision trees. Each tree is trained on a bootstrap sample of the data, drawn randomly with replacement, a technique known as bootstrap aggregating ("bagging"). The trees are allowed to grow fully, without pruning, and each split in a tree is evaluated against only a random subset of the features. Once the forest has grown, the model classifies each data point in the testing set by taking the majority vote across all of its decision trees.
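In scikit-learn terms, each of those design choices maps onto a RandomForestClassifier parameter. A brief sketch (the specific values shown are illustrative; the real ones came out of the grid search above):

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=500,     # number of trees in the forest
    bootstrap=True,       # each tree trains on a bootstrap sample (bagging)
    max_features="sqrt",  # random subset of features considered at each split
    max_depth=None,       # let each tree grow fully, without pruning
    random_state=42,
)
# After forest.fit(X, y), forest.predict(X_new) returns the majority vote
# across the trees, and forest.predict_proba(X_new) the class probabilities
# averaged over them.
```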

What did we learn?

Race, education, religion, and population density were the most important signals in predicting voting outcomes

In order to get a sense of the data, a great place to start is building a random forest and visualizing the feature importances of the predictors, as determined by each feature's Gini gain. Here we can see the top 10 features for predicting the party winner in the 2016 Presidential Election:
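With a fitted scikit-learn forest, these importances come straight off the model. A minimal sketch, assuming the forest from the previous section has been fitted and `feature_names` lists the 71 columns in training order:

```python
import pandas as pd

# feature_importances_ is the mean decrease in Gini impurity per feature,
# normalized to sum to 1 across all features.
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))
```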

Since these are quantitative variables, each feature has a split point at which the difference in voting behavior is significant. When we examine the split points for each feature, we see the following:

  • Percent of adults with a bachelor’s degree or higher, 2011–2015: 29.95%
  • Percent of adults with a high school diploma only, 2011–2015: 26.55%
  • Housing density per square mile of land area: 201.9
  • Population density per square mile of land area: 483.7
  • Population: 4,223

To better visualize the impact of these split points, it can be helpful to examine the distribution of the outcome variable above and below the split point. Here’s an example using population density – you can see that the voting trends are inverted above and below 483 people per square mile:
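Comparing outcomes on either side of a split point is straightforward with pandas. A sketch, where `counties` is a hypothetical frame with one row per county and "pop_density" and "winner" are illustrative column names:

```python
split_point = 483.7
above = counties[counties["pop_density"] > split_point]
below = counties[counties["pop_density"] <= split_point]
print(above["winner"].value_counts(normalize=True))  # party vote shares above the split
print(below["winner"].value_counts(normalize=True))  # ...and below it
```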

To examine the rest of the features, see the following Tableau workbook: 2016 Election Split Point Dashboard.

The key variable for voting outcomes in the US South is the racial makeup of the population

Here’s what the model predicts the Southern US might look like if the percentage of Caucasian Americans were reduced by 25%:

Actual vs. hypothetical county-level outcomes

To explore more hypothetical scenarios, play with the following web app and see what the random forest predicts would have happened: Hypothetical Model of the 2016 US Presidential Election

Descriptive Stats

Even without creating a model from our data set, some insights can be gleaned just from doing standard exploratory analysis:

Among counties with high population density, education was a key factor in voting outcomes

Counties with population densities above ~200 people per square mile tended to vote Democrat. Among those that did vote Republican, the percentage of the population with a bachelor’s degree or higher was more than 12.5% lower than in Democratic counties (see Split Points with Heat Map):

Counties with older populations saw higher voter turnout

Average age and voter turnout were positively correlated for both the Republican and Democratic parties. Average age is a tricky variable, though, as it doesn’t exclude those below voting age; counties with many young families might have a higher average age among voters but a lower average age overall:

See the following Tableau dashboard to explore more relationships among the explanatory variables: 2016 Presidential Election – Data Exploration

Counties with higher voter turnout tended to vote Republican

Key battleground states, including Michigan, Ohio, and Wisconsin, saw high voter turnout overall, but especially from the Republican base:

Counties with low turnout voted Democrat

Democrats tended to win the popular vote in counties with the lowest voter turnout. This was especially true of densely populated counties in the Southwest. This was more likely to happen in non-battleground counties, as voters were less engaged and likely viewed the outcomes as inevitable:

Explore on your own

Interactive visualization built using D3 and Flask

Tableau Workbooks for exploratory analysis

  • Split Points of Top 10 Features: In this workbook, you can view several split points for features of high importance and examine voting trends above and below the split point.
  • Split Points with Heat Map: Same as above but a heat map of the US replaces the histogram. The heat map can be filtered by clicking the pie chart.
  • Population Distribution by Voter Turnout: Shows population distribution by voter turnout rate as a histogram. The histogram serves as a filter for the bar chart.
  • Interrelationships Among Explanatory Variables: This workbook allows you to choose variables for the x and y axes, as well as apply filters based on the split points in the previous workbook. This allows us to examine the interplay of several features at once.

Jupyter Notebooks

