Feature Engineering for Election Result Prediction (in Python)

Thamindu Dilshan Jayawickrama
Towards Data Science
13 min read · Apr 6, 2020

Photo by James Harrison on Unsplash

I recently competed in a Kaggle competition in which we had to predict election results using machine learning. The dataset was from the 2019 India general election (see here). This article explains how to clean and prepare the dataset, create new features out of the existing ones and then predict the results using a popular machine learning algorithm. Most of the basic preprocessing, data visualization and machine learning steps aren’t explained in detail here; rather, I focus on feature engineering and how it affects the performance of the model. If you don’t have a clear understanding of what feature engineering is, please refer to my previous article, “Basic Feature Engineering to Reach More Efficient Machine Learning”.

First, we should load the dataset into a notebook. Many ML engineers prefer notebooks for the initial stages of machine learning and data engineering work because they make data visualization and step-by-step exploration easy. I use the pandas library to load the dataset because it simplifies most of the data processing steps. Below are the Python libraries I used in the notebook.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectKBest, chi2

First, let’s run the commands below to get a basic idea of the dataset and what the raw data looks like.

dataset = pd.read_csv('/kaggle/input/indian-candidates-for-general-election-2019/LS_2.0.csv')
dataset.head()
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2263 entries, 0 to 2262
Data columns (total 19 columns):
STATE                                        2263 non-null object
CONSTITUENCY                                 2263 non-null object
NAME                                         2263 non-null object
WINNER                                       2263 non-null int64
PARTY                                        2263 non-null object
SYMBOL                                       2018 non-null object
GENDER                                       2018 non-null object
CRIMINAL\nCASES                              2018 non-null object
AGE                                          2018 non-null float64
CATEGORY                                     2018 non-null object
EDUCATION                                    2018 non-null object
ASSETS                                       2018 non-null object
LIABILITIES                                  2018 non-null object
GENERAL\nVOTES                               2263 non-null int64
POSTAL\nVOTES                                2263 non-null int64
TOTAL\nVOTES                                 2263 non-null int64
OVER TOTAL ELECTORS \nIN CONSTITUENCY        2263 non-null float64
OVER TOTAL VOTES POLLED \nIN CONSTITUENCY    2263 non-null float64
TOTAL ELECTORS                               2263 non-null int64
dtypes: float64(3), int64(5), object(11)
memory usage: 336.0+ KB

As the output above shows, the dataset contains a few numerical columns, while most of the columns are non-numeric. The WINNER column contains the label indicating whether the candidate won or lost the election. Also, note that the dataset contains some ‘NaN’ values, which are missing values. Some column names contain ‘\n’ characters, which is annoying to work with, and the ASSETS and LIABILITIES column values contain ‘\n’ characters as well.

Considering only the available data, we can treat state, constituency, party, gender, category and education as categorical features. Additionally, you can run dataset.describe() to get a statistical summary of the numerical columns.

Next, we should clean the dataset, fix the column names and treat the missing values. First, I fixed the incorrect column names and replaced all the spaces in column names with underscores (‘_’). This is not a mandatory step, but the malformed column names annoyed me.

# rename invalid column names
dataset = dataset.rename(columns={'CRIMINAL\nCASES': 'CRIMINAL_CASES', 'GENERAL\nVOTES': 'GENERAL_VOTES', 'POSTAL\nVOTES': 'POSTAL_VOTES', 'TOTAL\nVOTES': 'TOTAL_VOTES', 'OVER TOTAL ELECTORS \nIN CONSTITUENCY': 'OVER_TOTAL_ELECTORS_IN_CONSTITUENCY', 'OVER TOTAL VOTES POLLED \nIN CONSTITUENCY': 'OVER_TOTAL_VOTES_POLLED_IN_CONSTITUENCY', 'TOTAL ELECTORS': 'TOTAL_ELECTORS'})

Then let’s search for the missing values in each column.

dataset.isna().sum()

STATE 0
CONSTITUENCY 0
NAME 0
WINNER 0
PARTY 0
SYMBOL 245
GENDER 245
CRIMINAL_CASES 245
AGE 245
CATEGORY 245
EDUCATION 245
ASSETS 245
LIABILITIES 245
GENERAL_VOTES 0
POSTAL_VOTES 0
TOTAL_VOTES 0
OVER_TOTAL_ELECTORS_IN_CONSTITUENCY 0
OVER_TOTAL_VOTES_POLLED_IN_CONSTITUENCY 0
TOTAL_ELECTORS 0
dtype: int64

You can see that about 10% of the rows (245 of 2,263) have missing values. There are multiple ways to treat missing values, such as deleting them, back-fill or forward-fill, constant value imputation, and mean, median or mode imputation. For simplicity I just delete these rows here (only about 10% are affected), but remember that removing data can make the prediction model less accurate. Where possible, try to impute missing values instead.

# drop rows with NA values
dataset = dataset[dataset['GENDER'].notna()]
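As an alternative to dropping rows, the median and mode imputation mentioned above could look roughly like the sketch below. The toy frame and its values are made up for illustration; only the column names come from the dataset, and whether imputation is appropriate for the real missing rows is an assumption you would need to check.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "AGE": [52.0, np.nan, 41.0, 63.0],
    "GENDER": ["MALE", "FEMALE", None, "MALE"],
})

imputed = toy.copy()
# Numeric column: fill gaps with the median of the observed values.
imputed["AGE"] = imputed["AGE"].fillna(imputed["AGE"].median())
# Categorical column: fill gaps with the most frequent value (the mode).
imputed["GENDER"] = imputed["GENDER"].fillna(imputed["GENDER"].mode()[0])

print(imputed)
```

Median imputation is less sensitive to outliers than the mean, which matters for skewed columns such as money amounts.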

Let’s convert the ASSETS, LIABILITIES and CRIMINAL_CASES columns to numeric types because they represent money and counts, and numeric values make sense to the models. To do that, we have to remove the ‘Rs’ sign, the ‘\n’ character and the commas in each value. Those columns also contain ‘Nil’ and ‘Not Available’ as values, so before converting we have to replace those with a meaningful value (for the moment I replaced them with 0).

# replace Nil values with 0
dataset['ASSETS'] = dataset['ASSETS'].replace(['Nil', '`', 'Not Available'], '0')
dataset['LIABILITIES'] = dataset['LIABILITIES'].replace(['NIL', '`', 'Not Available'], '0')
dataset['CRIMINAL_CASES'] = dataset['CRIMINAL_CASES'].replace(['Not Available'], '0')

# clean ASSETS and LIABILITIES column values
dataset['ASSETS'] = dataset['ASSETS'].map(lambda x: x.lstrip('Rs ').split('\n')[0].replace(',', ''))
dataset['LIABILITIES'] = dataset['LIABILITIES'].map(lambda x: x.lstrip('Rs ').split('\n')[0].replace(',', ''))

# convert ASSETS, LIABILITIES and CRIMINAL_CASES column values into numeric
dataset['ASSETS'] = dataset['ASSETS'].astype(str).astype(float)
dataset['LIABILITIES'] = dataset['LIABILITIES'].astype(str).astype(float)
dataset['CRIMINAL_CASES'] = dataset['CRIMINAL_CASES'].astype(str).astype(int)
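To see what the cleaning steps above do to a single value, here is a small sketch that applies the same string operations to one raw string. The helper name clean_amount and the sample string are hypothetical; the sample only imitates the style of the ASSETS values.

```python
def clean_amount(raw: str) -> float:
    """Apply the same cleaning steps as above to one raw string."""
    # Placeholder values are treated as zero.
    if raw in ("Nil", "NIL", "`", "Not Available"):
        raw = "0"
    # lstrip('Rs ') removes any leading 'R', 's' or space characters,
    # split('\n')[0] keeps only the first line,
    # replace(',', '') drops the thousands separators.
    return float(raw.lstrip("Rs ").split("\n")[0].replace(",", ""))


sample = "Rs 30,99,414\n ~ 30 Lacs+"  # hypothetical value in the dataset's style
print(clean_amount(sample))
print(clean_amount("Nil"))
```

Note that lstrip takes a set of characters rather than a prefix, which is safe here only because the amounts start with a digit.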

Now let’s make the non-numeric columns numeric. Many model types cannot work with non-numerical data, and the scikit-learn classifier I use here expects numerical input. So I label encoded those non-numerical columns using the sklearn LabelEncoder.

# label encode categorical columns

lblEncoder_state = LabelEncoder()
lblEncoder_state.fit(dataset['STATE'])
dataset['STATE'] = lblEncoder_state.transform(dataset['STATE'])

lblEncoder_cons = LabelEncoder()
lblEncoder_cons.fit(dataset['CONSTITUENCY'])
dataset['CONSTITUENCY'] = lblEncoder_cons.transform(dataset['CONSTITUENCY'])

lblEncoder_name = LabelEncoder()
lblEncoder_name.fit(dataset['NAME'])
dataset['NAME'] = lblEncoder_name.transform(dataset['NAME'])

lblEncoder_party = LabelEncoder()
lblEncoder_party.fit(dataset['PARTY'])
dataset['PARTY'] = lblEncoder_party.transform(dataset['PARTY'])

lblEncoder_symbol = LabelEncoder()
lblEncoder_symbol.fit(dataset['SYMBOL'])
dataset['SYMBOL'] = lblEncoder_symbol.transform(dataset['SYMBOL'])

lblEncoder_gender = LabelEncoder()
lblEncoder_gender.fit(dataset['GENDER'])
dataset['GENDER'] = lblEncoder_gender.transform(dataset['GENDER'])

lblEncoder_category = LabelEncoder()
lblEncoder_category.fit(dataset['CATEGORY'])
dataset['CATEGORY'] = lblEncoder_category.transform(dataset['CATEGORY'])

lblEncoder_edu = LabelEncoder()
lblEncoder_edu.fit(dataset['EDUCATION'])
dataset['EDUCATION'] = lblEncoder_edu.transform(dataset['EDUCATION'])
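The repeated fit and transform calls above can also be written as a loop. The sketch below does this on a made-up two-column frame and keeps each fitted encoder in a dictionary (the name encoders is my own) so the original labels can be recovered later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with made-up rows; only the column names match the dataset.
toy = pd.DataFrame({
    "STATE": ["Kerala", "Goa", "Kerala"],
    "PARTY": ["INC", "BJP", "IND"],
})

encoders = {}
for column in ["STATE", "PARTY"]:
    encoder = LabelEncoder()
    # fit_transform learns the sorted set of labels and maps each to an integer
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder

print(toy)
# Recover the original labels from the integer codes.
print(encoders["PARTY"].inverse_transform([0, 1, 2]))
```

LabelEncoder assigns integers in sorted label order, so the codes carry no meaning beyond identity.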

Now let’s train a K-Nearest Neighbors model and see the accuracy. KNN is a supervised machine learning algorithm used for classification. It works by taking a data point, finding the k closest data points in the training set and assigning the class that is most common among them.

# separate train features and label
y = dataset["WINNER"]
X = dataset.drop(labels=["WINNER"], axis=1)
# split dataset into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
# train and test knn model
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn.predict(X_test)
print("Testing Accuracy is: ", knn.score(X_test, y_test)*100, "%")
Testing Accuracy is: 70.79207920792079 %
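To make the nearest-neighbor idea concrete, here is a minimal from-scratch sketch using Euclidean distance and a majority vote. It is an illustration on made-up points, not how scikit-learn implements KNeighborsClassifier internally, and the function name knn_predict is hypothetical.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Made-up 2-D points: three near the origin (class 0), two near (1, 1) (class 1)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8], [0.2, 0.1]])
y_toy = np.array([0, 0, 1, 1, 0])

print(knn_predict(X_toy, y_toy, np.array([0.95, 0.9])))
print(knn_predict(X_toy, y_toy, np.array([0.05, 0.05])))
```

Because the prediction depends entirely on distances, columns with large numeric ranges dominate the result unless the features are scaled.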

The model has achieved about 70% accuracy without much effort. Let’s normalize the dataset and see how the accuracy improves. I used MinMaxScaler from the scikit-learn library to scale all the values into the 0–1 range.

# scaling values into 0-1 range
scaler = MinMaxScaler(feature_range=(0, 1))
features = [
    'STATE', 'CONSTITUENCY', 'NAME', 'PARTY', 'SYMBOL', 'GENDER', 'CRIMINAL_CASES', 'AGE', 'CATEGORY', 'EDUCATION', 'ASSETS', 'LIABILITIES', 'GENERAL_VOTES', 'POSTAL_VOTES', 'TOTAL_VOTES', 'OVER_TOTAL_ELECTORS_IN_CONSTITUENCY', 'OVER_TOTAL_VOTES_POLLED_IN_CONSTITUENCY', 'TOTAL_ELECTORS']
dataset[features] = scaler.fit_transform(dataset[features])

As you can see below, the accuracy improved considerably just by normalizing the column values. However, we can improve it further by applying a few other feature engineering techniques.

Testing Accuracy is:  90.5940594059406 %
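Min-max scaling maps each column with (x - min) / (max - min). The sketch below checks that formula against MinMaxScaler on made-up values. As a variation on the article's code, which scales the whole dataset at once, it fits the scaler on the training rows only and reuses it for the test rows; that split is my assumption about a common practice, not the article's method.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values; columns stand for AGE and ASSETS.
X_train_toy = np.array([[30.0, 1_000_000.0],
                        [50.0, 5_000_000.0],
                        [70.0, 9_000_000.0]])
X_test_toy = np.array([[40.0, 3_000_000.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
# Learn min and max from the training rows only ...
X_train_scaled = scaler.fit_transform(X_train_toy)
# ... and apply the same mapping to the test rows.
X_test_scaled = scaler.transform(X_test_toy)

# The same result computed by hand: (x - min) / (max - min) per column.
col_min = X_train_toy.min(axis=0)
col_max = X_train_toy.max(axis=0)
manual = (X_train_toy - col_min) / (col_max - col_min)

print(X_train_scaled)
print(X_test_scaled)
```

After scaling, a 20-year age difference and a 4,000,000 difference in assets contribute equally to a KNN distance, instead of assets overwhelming age.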

Encode Existing Features more Meaningfully

If we consider the EDUCATION column, it contains 11 categorical values that a candidate can have.

Illiterate, Literate, 5th Pass, 8th Pass, 10th Pass, 12th Pass, Graduate, Post Graduate, Graduate Professional, Doctorate, Others

If we think about it, each of these values represents a particular level of education. Label encoding assigns an arbitrary integer to each value without regard to its position in the hierarchy. If we instead assign integers according to the level of the qualification, the model will hopefully perform better. Note that there is a value named “Others” whose position in the hierarchy is unknown, so I assign it a value from the middle of the range.

dataset['EDUCATION'].value_counts()

Post Graduate 502
Graduate 441
Graduate Professional 336
12th Pass 256
10th Pass 196
8th Pass 78
Doctorate 73
Others 50
Literate 30
5th Pass 28
Not Available 22
Illiterate 5
Post Graduate\n 1
Name: EDUCATION, dtype: int64

# encode education column
encoded_edu = []
# iterate through each row in the dataset
for row in dataset.itertuples():
    education = row.EDUCATION
    if education == "Illiterate":
        encoded_edu.append(0)
    elif education == "Literate":
        encoded_edu.append(1)
    elif education == "5th Pass":
        encoded_edu.append(2)
    elif education == "8th Pass":
        encoded_edu.append(3)
    elif education == "10th Pass":
        encoded_edu.append(4)
    elif education == "12th Pass":
        encoded_edu.append(7)
    elif education == "Graduate":
        encoded_edu.append(8)
    elif education == "Post Graduate":
        encoded_edu.append(9)
    elif education == "Graduate Professional":
        encoded_edu.append(10)
    elif education == "Doctorate":
        encoded_edu.append(11)
    else:
        # "Others", "Not Available" and the stray "Post Graduate\n" value all land here
        encoded_edu.append(5)
dataset['EDUCATION'] = encoded_edu
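The same ordinal encoding can be written more compactly with a dictionary and Series.map. This is a sketch of an equivalent approach on a made-up Series, reusing the integer values from the loop above; the name education_rank is my own.

```python
import pandas as pd

# Same integer levels as the if/elif chain above.
education_rank = {
    "Illiterate": 0, "Literate": 1, "5th Pass": 2, "8th Pass": 3,
    "10th Pass": 4, "12th Pass": 7, "Graduate": 8, "Post Graduate": 9,
    "Graduate Professional": 10, "Doctorate": 11,
}

# Made-up sample, including two values that are not in the mapping.
toy = pd.Series(["Graduate", "Others", "10th Pass", "Post Graduate\n"])

# Unmapped values become NaN, which we replace with the middle value 5.
encoded = toy.map(education_rank).fillna(5).astype(int)
print(encoded.tolist())
```

Keeping the mapping in one dictionary makes it easier to review and to adjust a level without touching control flow.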

Let’s analyze the PARTY column. If we list the number of candidates for each party, we can see that only a few parties have a significant number of candidates. The purpose of including the PARTY column is to capture the effect of a candidate’s party on winning the election. If a party doesn’t have a significant number of candidates, its effect on a candidate’s win is low, so we can merge all such parties into one common category (I encoded them as “Other”).

dataset['PARTY'].value_counts()

BJP 420
INC 413
IND 201
BSP 163
CPI(M) 100
...
AINRC 1
SKM 1
ANC 1
YKP 1
AJSUP 1
Name: PARTY, Length: 132, dtype: int64

# change party of the less frequent parties as Other
# 'BJP','INC','IND','BSP', 'CPI(M)', 'AITC', 'MNM': high frequent
# 'TDP', 'VSRCP', 'SP', 'DMK', 'BJD': medium frequent
dataset.loc[~dataset["PARTY"].isin(['BJP','INC','IND','BSP', 'CPI(M)', 'AITC', 'MNM', 'TDP', 'VSRCP', 'SP', 'DMK', 'BJD']), "PARTY"] = "Other"
dataset['PARTY'].value_counts()
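Instead of listing the frequent parties by hand, the grouping can be driven by a frequency threshold. The sketch below is one way to do that on a made-up Series; the threshold name min_candidates and its value are assumptions for illustration, not the cut-off used in the article.

```python
import pandas as pd

# Made-up party column: two frequent parties and three one-off parties.
toy = pd.Series(["BJP"] * 4 + ["INC"] * 3 + ["SKM", "ANC", "YKP"])

min_candidates = 3  # hypothetical threshold
counts = toy.value_counts()
# Parties that reach the threshold keep their own label.
frequent = counts[counts >= min_candidates].index
# Everything else is replaced with the common "Other" category.
grouped = toy.where(toy.isin(frequent), "Other")

print(grouped.value_counts())
```

A threshold keeps the rule reproducible if the dataset is updated, at the cost of less manual control over which parties are kept.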

Both of these steps should be performed before label encoding the columns. Let’s train our model again and see the accuracy.

Testing Accuracy is:  92.07920792079209 %

You can see that the accuracy has improved a little more. So far, we have only considered the existing features. However, we can also make new features out of the existing ones.

Making New Features

To make new features, we need a general idea of the available raw features and how they affect the given scenario, as well as a good understanding of the problem domain. So here, we should think about which factors affect a candidate’s chance of winning an election and what we consider when voting for a candidate or a party. In this case, I’m not familiar with the country-specific factors voters consider (and which of them affect a candidate’s chance to win). For example, I have no idea how CATEGORY affects election results. I therefore came up with the following features by assuming the general case.

The dataset contains many raw features that may directly affect the winning chance of a candidate (for example, state, constituency, party, criminal records, educational qualifications, assets, etc.). However, a candidate may also win an election because of the standing of the nominating party. Below are some features that could represent the importance of the party to a candidate’s winning probability.

1. Total number of seats won by the party

We might consider the party of a candidate before voting for him or her. If a party has a high chance of winning, a candidate from that party might have a similar chance (not always true). We can capture that by making a feature representing the number of seats won by the party (a higher count indicates a higher winning probability).

2. Number of seats won in the constituency by the party

Some parties consistently win a particular state or constituency. I have considered the constituency here because it is the smallest area available.

3. Criminal cases against the party

We often consider the number of criminal records against a party before voting. If a party has a higher number of criminal cases (a higher number of corrupt politicians), that party might have a lower chance of winning the election (or vice versa ;-) ). We can capture this by making a feature representing the number of criminal records against a party.

4. Number of criminal cases for the party in each constituency

We can also break the above feature down by considering the criminal records in each constituency. That might capture the effect more precisely.

5. Education level of the candidates in the party

We also consider the educational qualifications of a party’s candidates before voting for them. If a party nominates more educated candidates, it might have a higher probability of winning the election. We have already encoded educational qualifications meaningfully in an earlier step, so if we sum the encoded values of all candidates in a party, that feature will represent the educational level of the party.

6. Education level of the candidates in the party for each constituency

As we did with the criminal cases, we can also consider the educational level of a party for each constituency.

7. Number of candidates for the constituency by the party

As I have already mentioned, a party might have a higher chance of winning depending on the constituency. If a party nominates a higher number of candidates for a constituency, votes will be divided among them and each candidate’s share will be smaller. That directly affects the winning probability.

Below are some more features I created that could affect the winning probability of a candidate.

1. Total number of voters in each state

2. Total number of voters in each constituency

3. Total number of constituencies in each state

So I have created a total of 10 new features based on the existing ones. Some of these features may not affect the performance of the model.

# Preparing feature values
cons_per_state = {}
voters_per_state = {}
party_winningSeats = {}
party_criminal = {}
party_education = {}
party_totalCandidates_per_cons = {}
party_winningSeats_per_cons = {}
party_criminal_per_cons = {}
party_education_per_cons = {}
voters_per_cons = {}

# group by state
subset = dataset[['STATE', 'CONSTITUENCY', 'TOTAL_ELECTORS']]
gk = subset.groupby('STATE')

# for each state
for name, group in gk:
    # total constituencies per state
    # (note: len(group) counts candidate rows in the state;
    # group['CONSTITUENCY'].nunique() would count distinct constituencies)
    cons_per_state[name] = len(group)

    # total voters per state
    voters_per_state[name] = group['TOTAL_ELECTORS'].sum()

# group by party
subset = dataset[['PARTY', 'CONSTITUENCY', 'CRIMINAL_CASES', 'EDUCATION', 'WINNER']]
gk = subset.groupby('PARTY')

# for each party
for name, group in gk:
    # winning seats by party
    party_winningSeats[name] = group[group['WINNER'] == 1.0].shape[0]

    # criminal cases by party
    party_criminal[name] = group['CRIMINAL_CASES'].sum()

    # education qualification by party (sum of candidates)
    party_education[name] = group['EDUCATION'].sum()

    # group by constituency
    gk2 = group.groupby('CONSTITUENCY')

    # for each constituency
    for name2, group2 in gk2:
        key = name2 + '_' + name  # cons_party

        # total candidates by party in constituency
        party_totalCandidates_per_cons[key] = len(group2)

        # party winning seats in the constituency
        party_winningSeats_per_cons[key] = group2[group2['WINNER'] == 1.0].shape[0]

        # criminal cases by party in the constituency
        party_criminal_per_cons[key] = group2['CRIMINAL_CASES'].sum()

        # education qualification by party in constituency (sum of candidates)
        party_education_per_cons[key] = group2['EDUCATION'].sum()

# Total voters per constituency
subset = dataset[['CONSTITUENCY', 'TOTAL_ELECTORS']]
gk = subset.groupby('CONSTITUENCY')

# for each constituency
for name, group in gk:
    # (note: len(group) counts candidate rows in the constituency, not electors)
    voters_per_cons[name] = len(group)

# Applying feature values
# new feature columns
total_cons_per_state = []
total_voters_per_state = []
total_voters_per_cons = []
winning_seats_by_party = []
criminal_by_party = []
education_by_party = []
total_candidates_by_party_per_cons = []
winning_seats_by_party_per_cons = []
criminal_by_party_per_cons = []
education_by_party_per_cons = []

# iterate through each row in the dataset
for row in dataset.itertuples():
    subkey = row.CONSTITUENCY + '_' + row.PARTY
    total_cons_per_state.append(cons_per_state.get(row.STATE))
    total_voters_per_state.append(voters_per_state.get(row.STATE))
    total_voters_per_cons.append(voters_per_cons.get(row.CONSTITUENCY))
    winning_seats_by_party.append(party_winningSeats.get(row.PARTY))
    criminal_by_party.append(party_criminal.get(row.PARTY))
    education_by_party.append(party_education.get(row.PARTY))
    total_candidates_by_party_per_cons.append(party_totalCandidates_per_cons.get(subkey))
    winning_seats_by_party_per_cons.append(party_winningSeats_per_cons.get(subkey))
    criminal_by_party_per_cons.append(party_criminal_per_cons.get(subkey))
    education_by_party_per_cons.append(party_education_per_cons.get(subkey))

# append columns to dataset
dataset['total_cons_per_state'] = total_cons_per_state
dataset['total_voters_per_state'] = total_voters_per_state
dataset['total_voters_per_cons'] = total_voters_per_cons
dataset['winning_seats_by_party'] = winning_seats_by_party
dataset['criminal_by_party'] = criminal_by_party
dataset['education_by_party'] = education_by_party
dataset['total_candidates_by_party_per_cons'] = total_candidates_by_party_per_cons
dataset['winning_seats_by_party_per_cons'] = winning_seats_by_party_per_cons
dataset['criminal_by_party_per_cons'] = criminal_by_party_per_cons
dataset['education_by_party_per_cons'] = education_by_party_per_cons
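Group-level features like these can also be computed without dictionaries by using pandas groupby with transform, which returns one value per original row. The sketch below shows the idea for four of the features on a made-up frame; it is an alternative formulation, not the code used to produce the results in this article.

```python
import pandas as pd

# Made-up candidates: two constituencies, two parties.
toy = pd.DataFrame({
    "CONSTITUENCY": ["A", "A", "B", "B", "B"],
    "PARTY": ["X", "Y", "X", "Y", "Y"],
    "WINNER": [1, 0, 1, 0, 0],
    "CRIMINAL_CASES": [2, 0, 1, 3, 1],
})

# Party-level aggregates, broadcast back to every candidate row.
toy["winning_seats_by_party"] = toy.groupby("PARTY")["WINNER"].transform("sum")
toy["criminal_by_party"] = toy.groupby("PARTY")["CRIMINAL_CASES"].transform("sum")

# Aggregates per (constituency, party) pair.
pair = ["CONSTITUENCY", "PARTY"]
toy["total_candidates_by_party_per_cons"] = toy.groupby(pair)["WINNER"].transform("count")
toy["winning_seats_by_party_per_cons"] = toy.groupby(pair)["WINNER"].transform("sum")

print(toy)
```

Grouping on the two columns directly also avoids building string keys such as constituency + '_' + party.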

Let’s train our model again and see how the accuracy improves. There is no need to label encode the created features, as all of them are numeric. However, I normalized them into the 0–1 range before training the model.

Testing Accuracy is:  96.78217821782178 %

Feature Importance

Removing irrelevant or less important features can improve the accuracy of a model. So far we have trained our latest model on 28 features, including the 10 newly created ones. However, some of these features may contribute little to the conclusions drawn from the training data. Removing them should not decrease the accuracy of the model and will hopefully increase it, because irrelevant features can lead the model to invalid inferences.

First, we can remove some features just by reasoning about their contribution. The NAME column won’t support any useful inference, because ideally each name is unique. (The surname can matter in some cases, because some family names might have an impact on winning an election.) Also, PARTY and SYMBOL represent the same information, so we can remove one of them without any impact on the accuracy. The TOTAL_VOTES column contains the sum of the GENERAL_VOTES and POSTAL_VOTES columns, so we can remove those two as well. If we plot a heat map of the correlation matrix, we will see that those three features are highly correlated.

# remove unnecessary columns
X.drop(labels=["NAME"], axis=1, inplace=True)
X.drop(labels=["SYMBOL"], axis=1, inplace=True)
X.drop(labels=["POSTAL_VOTES"], axis=1, inplace=True)
X.drop(labels=["GENERAL_VOTES"], axis=1, inplace=True)

There are a few techniques available to select features based on their importance. I will use the univariate selection method to identify the most important features and remove the others. I’ll be using SelectKBest from the scikit-learn library with the chi-squared test to evaluate feature importance.

# apply SelectKBest class to extract top most features
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X, y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
# concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Specs', 'Score']
print(featureScores.nlargest(30, 'Score'))

    Specs                                    Score
25  winning_seats_by_party_per_cons          486.207788
16  OVER_TOTAL_VOTES_POLLED_IN_CONSTITUENCY  285.347141
15  OVER_TOTAL_ELECTORS_IN_CONSTITUENCY      262.177373
14  TOTAL_VOTES                              216.937788
12  GENERAL_VOTES                            216.138799
21  winning_seats_by_party                   199.662525
13  POSTAL_VOTES                             65.126864
22  criminal_by_party                        36.519437
3   PARTY                                    35.416433
23  education_by_party                       6.576548
11  LIABILITIES                              6.330339
24  total_candidates_by_party_per_cons       5.755538
20  total_voters_per_cons                    5.302656
4   SYMBOL                                   4.128283
8   CATEGORY                                 4.047031
10  ASSETS                                   3.755575
7   AGE                                      2.077768
9   EDUCATION                                0.888330
27  education_by_party_per_cons              0.840185
18  total_cons_per_state                     0.481673
6   CRIMINAL_CASES                           0.436667
19  total_voters_per_state                   0.292948
26  criminal_by_party_per_cons               0.178720
5   GENDER                                   0.145870
1   CONSTITUENCY                             0.143250
0   STATE                                    0.076833
17  TOTAL_ELECTORS                           0.054486
2   NAME                                     0.003039
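To get a feel for what the chi-squared scores measure, here is a small sketch on made-up data with one feature that tracks the label and one that does not. The column names are hypothetical. Note that chi2 requires non-negative feature values, which is one reason to scale into the 0–1 range first.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Made-up data: "informative" is high when the label is 1, "noise" is not related.
X_toy = pd.DataFrame({
    "informative": [9, 8, 9, 1, 0, 1, 8, 0],
    "noise":       [3, 4, 3, 4, 3, 4, 4, 3],
})
y_toy = pd.Series([1, 1, 1, 0, 0, 0, 1, 0])

# Keep only the single best-scoring feature.
selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)

# One score per column, labelled by column name and sorted high to low.
scores = pd.Series(selector.scores_, index=X_toy.columns).sort_values(ascending=False)
kept = X_toy.columns[selector.get_support()].tolist()

print(scores)
print(kept)
```

get_support returns a boolean mask over the input columns, so the selected names can be read off directly instead of comparing scores against a manual threshold.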

Let’s drop all the columns with a score of less than 3. Note that some of the features I created earlier also turn out to be unimportant for the KNN model.

X.drop(labels=["TOTAL_ELECTORS"], axis=1, inplace=True)
X.drop(labels=["STATE"], axis=1, inplace=True)
X.drop(labels=["CONSTITUENCY"], axis=1, inplace=True)
X.drop(labels=["GENDER"], axis=1, inplace=True)
X.drop(labels=["criminal_by_party_per_cons"], axis=1, inplace=True)
X.drop(labels=["total_voters_per_state"], axis=1, inplace=True)
X.drop(labels=["CRIMINAL_CASES"], axis=1, inplace=True)
X.drop(labels=["total_cons_per_state"], axis=1, inplace=True)
X.drop(labels=["EDUCATION"], axis=1, inplace=True)
X.drop(labels=["education_by_party_per_cons"], axis=1, inplace=True)
X.drop(labels=["AGE"], axis=1, inplace=True)

Now let’s train our model again with the most important features and evaluate the accuracy.

Testing Accuracy is:  99.5049504950495 %

Conclusion

As you can see from the above results, the feature engineering steps have increased the performance of the model drastically. We reached an accuracy of 0.99, starting from an initial accuracy of 0.7. First, we cleaned the dataset and converted all the values into a numerical format. Then we created some new features from the existing ones and removed the less important features. Finally, we scaled the values into the 0–1 range and trained a K-Nearest Neighbors model. Note that applying cross-validation would give a more reliable estimate of the accuracy than a single train/test split and could help tune the model further. To conclude, the accuracy after each major step is summarized below.

Feature Engineering Step                     Testing Accuracy
Initially without any data processing        0.7079
Scale down values into 0-1 range             0.9059
Basic encoding of two columns                0.9208
Creating new features                        0.9678
Removing unnecessary features                0.9950
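The cross-validation mentioned in the conclusion could be sketched as follows. The data here is randomly generated (seeded) rather than the election dataset, and wrapping the scaler and the classifier in a pipeline is my assumption about a sensible setup, so that scaling is refit on each training fold.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)

# Made-up data: 100 rows, 3 features; the label depends on the first feature.
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Scaling and KNN chained together, so each fold scales with its own training data.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier())

# 5-fold cross-validation: five accuracy scores, one per held-out fold.
scores = cross_val_score(model, X_toy, y_toy, cv=5)
print(scores.mean(), scores.std())
```

The spread of the fold scores gives a rough sense of how much a single train/test accuracy figure could vary with a different split.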


Senior Software Engineer at WSO2 LLC. | B. Sc in Engineering (Hons), Computer Science and Engineering, University of Moratuwa