Trader Joe’s Democrats and Walmart Republicans

Modeling US elections using chain stores

Aaron Lee
Towards Data Science


If your county has more Dollar Tree stores than Starbucks, you’re likely a Republican. If you have even a single Trader Joe’s store in your county, you’re probably a Democrat.

Photo by Elliott Stallion on Unsplash

The chain stores found in a community can tell us a lot about the people that live there. Think about Target and Walmart; what political party do you associate with each? I wanted to investigate that perception, so I built an election model based solely on the number of chain stores in your area.

My hypothesis is that the total number of chain stores will correlate with an increase in Democratic voting, as chains are more frequent in urban areas. Additionally, I expect individual brands (identified later in this story) to indicate a lean toward either Republican or Democratic voters.

My Election Data

To investigate this problem, I decided to build a classification model using the electoral results from the 2018 midterm elections.

The election data for this project came from the great people at MIT Election Lab. I supplemented the county information with political boundaries files (geojson) and other county information from data.gov so that I could create maps using Plotly.

I chose to look at county level data. There are more than 3000 counties in the United States, which I felt would be plenty to build a robust machine learning model, rather than looking at state or congressional district data. I designated each county as either Democratic or Republican based on the total number of votes for all party candidates (both local and national). Unfortunately, Alaska was not included in my election data (my apologies to the boroughs of Alaska).
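As a rough sketch of that labeling step, the snippet below sums votes by party within each county and assigns the county to whichever party received more. The rows, FIPS codes, and field layout are made up for illustration; they are not the MIT Election Lab schema, and this is not the project's actual code.

```python
from collections import defaultdict

# Hypothetical rows: (county FIPS code, party, votes for one candidate).
rows = [
    ("17031", "democrat", 1200),
    ("17031", "republican", 400),
    ("17031", "democrat", 300),
    ("31109", "republican", 900),
    ("31109", "democrat", 850),
    ("31109", "republican", 100),
]


def label_counties(rows):
    """Sum votes per party in each county and return the party with the most votes."""
    totals = defaultdict(lambda: defaultdict(int))
    for fips, party, votes in rows:
        totals[fips][party] += votes
    return {fips: max(by_party, key=by_party.get) for fips, by_party in totals.items()}


labels = label_counties(rows)
print(labels)
```

Summing across every race on the ballot, rather than picking a single race, gives each county one label even where individual contests split between parties.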

About US Counties

In the United States, most states are divided into political and administrative districts called counties. Most of the 3000+ counties in the United States are Republican, making my dataset skewed at the outset. Democratic counties number around 600 and are generally clustered in and around major cities.

Although Democratic counties only account for 20% of the total counties, nearly 60% of the US population live in them. It’s also where we find a majority of the retail locations used in this project.

Chain Store Data

After some research about the politics of brands, I found that a lot has been written on the subject, and I recommend checking out some of the work (NBC, Time, Washington Post, and The New York Times).

I chose 20 features (retail chains) based on information found in my research, and some for my own curiosity. The stores I investigated are shown in the graphic below (features not used in the final model are grayed out).

Selected features for my project

All features are national chains that have expanded to all or nearly all of the United States. Based on my research, I aimed for a balance between stores perceived as Republican leaning and stores perceived as Democratic leaning.

Now all I needed was the location of each store, and I could begin the project in earnest. Some data was readily available from Kaggle and similar sites, but for many of the store chains, I had to resort to creative and ethical web scraping to compile the location data. This part took me more time than I care to admit; web scraping is a challenging sport sometimes.

In all, I collected the Lat/Long information for more than 45,000 store locations from 20 chains. Using my shape files obtained from data.gov and Python’s Shapely library, I was able to determine the correct county for each of the stores and compile a single dataset containing the count of each chain in every US county. I then added in the election data to determine which way those counties voted in 2018.
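To give a sense of how a store gets assigned to a county, here is a minimal sketch of the point-in-polygon idea. It uses a hand-written ray-casting test from the standard library as a stand-in for Shapely's containment check, and the rectangular "counties" and store coordinates are invented for illustration. Real county boundaries are far more complex, so treat this as an illustration of the technique rather than the project's code.

```python
from collections import Counter


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) falls inside the polygon.

    polygon is a list of (lon, lat) vertices; the last vertex connects back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed by a horizontal ray.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangular "counties" as lists of (lon, lat) vertices.
counties = {
    "County A": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "County B": [(2.0, 0.0), (5.0, 0.0), (5.0, 2.0), (2.0, 2.0)],
}

# Hypothetical stores: (chain, longitude, latitude).
stores = [
    ("Starbucks", 0.5, 0.5),
    ("Starbucks", 1.5, 1.0),
    ("Walmart", 3.0, 1.0),
    ("Walmart", 9.0, 9.0),  # outside every county, so it is skipped
]


def count_stores(stores, counties):
    """Count stores per (county, chain) by testing each store against each county."""
    counts = Counter()
    for chain, lon, lat in stores:
        for name, polygon in counties.items():
            if point_in_polygon(lon, lat, polygon):
                counts[(name, chain)] += 1
                break
    return counts


store_counts = count_stores(stores, counties)
print(store_counts)
```

Testing every store against every county is fine for a toy example; with 45,000 stores and 3000+ counties, a spatial index or a geospatial join library would be the more practical choice.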

In the graph below, we can see all of the brands I chose, and where a majority of their stores are located.

Each chain store’s red/blue distribution

Preliminary Mapping

First, I plotted the results of the 2018 election on a choropleth map. As expected, Democratic (blue) counties are predominantly around population centers, particularly along both coasts. Blue dots speckle the sea of Republican red across the middle of the country, but most large cities show a splash of blue.

US Midterm Elections by County (2018)

I also plotted choropleth maps showing the density of each chain store across all counties. The two below show extremes of store distributions. More than 93% of Whole Foods stores are in Democratic counties in major population centers. Moreover, a major metropolitan area like Cook County (Chicago) has 15 of the 500 Whole Foods locations in America. Los Angeles County has 26.

Whole Foods Choropleth map

Compare that with the 1900 Tractor Supply Company locations, which seem to have consistent coverage across rural areas with only minimal penetration into urban markets. If you’ve never heard of or been to a Tractor Supply Company store, it’s likely that you live in a large urban area. Chicago’s Cook County has zero Tractor Supply Co. stores, and Los Angeles County (population 10 million) has just one. For comparison, Lincoln, Nebraska (population 330,000) has three.

Tractor Supply Choropleth

Certain chains are overrepresented or underrepresented in urban and rural areas. Since our political divide falls at least partly on this urban/rural split, these features should be useful to our model.

Our Model

To simplify the model and account for the different sizes of each county, I calculated the number of stores per unit area for each county. The model is therefore based on the number of stores per square mile rather than the raw number of stores. This makes my data continuous and a little easier to model. It also accounts for some of the rather large counties as you look West. (A similar approach would be to calculate the stores per population.)
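The density feature is just a division, but a small sketch makes the effect concrete. The county names, areas, and store counts below are hypothetical.

```python
# Hypothetical counties: land area in square miles and raw store counts per chain.
counties = {
    "Urban County": {"area_sq_mi": 500.0, "stores": {"Starbucks": 50, "Walmart": 5}},
    "Rural County": {"area_sq_mi": 2000.0, "stores": {"Starbucks": 1, "Walmart": 4}},
}


def stores_per_square_mile(county):
    """Convert raw store counts into stores per square mile for one county."""
    area = county["area_sq_mi"]
    return {chain: count / area for chain, count in county["stores"].items()}


densities = {name: stores_per_square_mile(county) for name, county in counties.items()}
for name, by_chain in densities.items():
    print(name, by_chain)
```

In this toy data the two counties have nearly the same number of Walmart stores, yet the urban county's Walmart density is five times higher because it is a quarter of the size.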

Model at a glance:

  • Random Forest (tree based algorithm)
  • 10 features selected using the Boruta Python library algorithm
  • Weighted to account for imbalance in red/blue counties
  • Tuned to achieve maximum accuracy
  • Used test/train split of 0.25
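The list above does not say which weighting scheme was used. One common choice, shown here as an assumption rather than the project's actual setting, is inverse-frequency "balanced" weighting, where each class gets a weight of n_samples / (n_classes * class_count). This is the formula behind scikit-learn's class_weight="balanced" option. The toy labels mirror the roughly 80/20 red/blue split in the county data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels: 80 Republican ("R") and 20 Democratic ("D") counties.
labels = ["R"] * 80 + ["D"] * 20
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, misclassifying a county from the minority class costs the model four times as much as misclassifying one from the majority class, which discourages it from simply predicting Republican everywhere.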

The final model achieved an accuracy of 85.7%.

Feature Importance

The most important features are shown below. By looking at individual trees in the random forest model, I determined the direction of each feature (does more of this feature indicate a red or a blue county?). Features whose higher store counts helped classify a county as Republican have a red box around them; those that helped classify a county as Democratic have a blue box.

The top four features were ones used to identify Democratic counties. The most important ‘Republican’ feature was Walmart, and the most important ‘Democratic’ feature was Starbucks. In general, features with more stores provided more benefit to the model.

Our model values Democratic ‘upmarket’ brands (Starbucks, Trader Joe’s and Whole Foods), and Republican ‘value’ brands (Dollar Tree, Walmart, Tractor Supply Co.).

Confusion matrix

When making predictions with the test data (n=776), we see that the model leaned a little towards predicting Republican. 92% of the Republican counties were identified correctly, compared to only 52% of Democratic counties. Democratic counties are the minority so we would expect it to be the more difficult task for the model.

The link above gives a great way of simply measuring the effectiveness of your model. If we were to randomly guess the politics of each county at an 81% to 19% rate (the actual percentages of Republican and Democratic counties), we would achieve only about 69% accuracy (0.81² + 0.19² ≈ 0.69). Our model gains roughly 17 additional percentage points through machine learning.
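The baseline arithmetic can be checked in a few lines. A random guesser that picks each class at its true frequency is right when the guess and the truth match, so its expected accuracy is the sum of the squared class shares. The function name below is my own.

```python
def random_guess_accuracy(class_shares):
    """Expected accuracy when guessing each class at its true frequency."""
    return sum(share ** 2 for share in class_shares)


# Class shares and model accuracy as reported in this article.
baseline = random_guess_accuracy([0.81, 0.19])
model_accuracy = 0.857
lift = model_accuracy - baseline

print(f"baseline: {baseline:.3f}, lift: {lift:.3f}")
```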

Model Prediction

Our model predicts almost 86% of the counties correctly. Let’s look at the map to see what we missed, and identify the weaknesses in our model.

Our model predicts:

Model prediction for 2018 midterm election

The actual results:

Where the model missed:

Incorrect predictions from our model

There are clear areas on our map where the model missed significant numbers of counties. Some areas stand out:

  • In the New England area, particularly Vermont and New Hampshire, Democrats significantly outperformed the model. Rural areas in these states have a Democratic lean, which goes against the model’s assumption that rural areas are Republican.
  • New Mexico is an outlier. It is one of three states where White voters are the minority. Hispanic voters in New Mexico have shown enthusiastic support for Democrats in recent years.
  • Western Mississippi has one of the greatest concentrations of African American voters in the US, and they overwhelmingly (90%) support the Democratic party. Similar demographics exist in parts of Alabama which likely account for those misses.
  • The model also misses counties in the Pacific Northwest and the Deep South where conservatism and demographics play outsized roles in the voting habits.
  • Other big misses occurred in the swing states of Florida and Wisconsin, where political leanings are split and in flux.

This Washington Post piece does a nice job covering the politics of each state just before the 2018 election.

Conclusion

Brands have been politicized more than ever in the past few years, and every day new companies are deciding to test the waters, letting their politics be known. Just look at outwardly political corporations like Nike, Ben & Jerry’s, Hobby Lobby, Chick-fil-A, and Patagonia. We can’t seem to separate our shopping and our politics, and I expect this binary classification of stores and brands to strengthen as the political divide widens. If it does, a model like this will likely improve over time.

The model and methodology have the advantage of being independent of any census or polling. It is a ‘fundamentals-based’ approach. The data is also readily available and updates more frequently than census data, allowing a political party to nimbly respond to demographic changes due to natural disasters, climate migration, and a swiftly changing economy. Capitalism, with the opening and closing of businesses, responds quickly to market changes, and a political party can take advantage of that.

With an accuracy of 85.7%, the model can help a political party identify areas where it is underperforming the model’s predictions. These areas may warrant intervention and investment as potentially flippable districts.

Next Steps

This approach can be improved in the following ways:

  • Build several smaller targeted models based on regional or more homogeneous voting blocs (Midwest, Arizona/New Mexico, California, Pacific Northwest, Florida, etc.)
  • Evaluate data at the state level, congressional district level, or down to the precinct level with similar methodology. A preliminary model at the congressional district level predicted US House races with over 80% accuracy.
  • Add or change features that target the shortcomings of our model. Find what businesses are prevalent in the missed counties.
  • Build models specifically for identifying purple districts where further investment might yield political seats.

The code for this project can be found at the repo below. Thanks for reading!
