Getting Started

In the movie 8 Mile starring Eminem, the whole idea of the movie is captured by the song Lose Yourself. You know the one I’m referring to. The one that starts like this…
"His palms are sweaty, knees weak, arms are heavy.
There’s vomit on his sweater already, mom’s spaghetti.
He’s nervous, but on the surface he looks calm and ready"
Yeah, that’s how I felt when I sat down for the first time at my new cube in my new job. Name plate affixed with the coveted title "data scientist." Looking confident but feeling…well…nervous.
A little dramatic? Maybe.
But that feeling ain’t got nothing on how I felt when my manager gave me my first project assignment.
"Hey Brandon. Welcome to the company. Will you now develop our enterprise micro-segmentation model? Everyone is really excited to have you on board. Bye now."
I went from nervous to straight scared. My palms weren’t just sweaty, they were soaking wet. Why was I so insanely nervous? Well, here are just a few reasons.
One, I lived in the world of regression models in academia. I had never modeled data with cluster analysis before, and cluster analysis is necessary for segmentation. Yeah, I know…total noob…seriously, how did I even land this job?
Two, and most importantly here, I was handed the equivalent of a SELECT * FROM TABLE_A, TABLE_B, TABLE_C INNER JOIN blah, blah, blah data set that contained over 3 million rows and over 200 columns.
Sure, build a meaningful cluster analysis with 200 variables? Right? I’ll be in the back screaming.
As a quick aside, if you’ve been following me, you may also know that I actually tried to analyze something like 20 or 30 variables at a time but quickly learned it was painful and utterly meaningless.
Anyway, if it weren’t for the following EDA techniques I might still be working on that problem. The primary goal of this particular use case is to reduce the number of dimensions available in the data.
Thus, we want to evaluate the potential information value that each column might have for a data science problem like this. Because models are built to model variation in data, much of our data reduction and clean-up revolves around establishing whether variables have good variation.
In this article, I take you through a typical EDA process that provides some common tools used to perform data reduction using Python and Pandas. Ultimately these efforts allow us to prep data for more sophisticated data modeling at later stages of the data science lifecycle. Let’s get started!
Herein we use a common tabular-style data set, with rows and columns, to apply a series of important tasks that help us to develop a deeper understanding of the potential value of different columns in the data.
Step 0: Load the Data
Seems simple enough, but one thing to consider when doing data analysis is getting the data you need loaded into memory. Because the computers we use have limited capacity, it is important to size our systems appropriately for the data at hand.
If you are working with data that are too large for your system’s memory, either downsample to pull a random sample of the data or leverage other techniques that are covered well in this article here.
import pandas as pd
import numpy as np
path = '/path/to/data/'
df = pd.read_csv(path+'data.csv')
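If the full file had been too large for memory, one option is chunked reading plus random sampling. Here is a minimal sketch, assuming pandas' chunksize reader and an illustrative 10% sampling fraction (numbers you would tune to your own memory budget):
sampled_chunks = []
for chunk in pd.read_csv(path + 'data.csv', chunksize=100_000):
    # keep roughly 10% of each chunk so the full file never sits in memory
    sampled_chunks.append(chunk.sample(frac=0.10, random_state=42))
df_sample = pd.concat(sampled_chunks, ignore_index=True)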
Here is what the data we are working with looks like:

Step 1: Signals
Once the data are in memory, we need meta-data signals that allow us to quickly identify the information value of variables. The initial checklist should include examining the data types of each column.
I like to build a dataframe that contains all the meta-data signals that allow me to quickly identify meaningful variables.
In the code below, we first create a dataframe where each row is a column in the original dataframe. The code then adds different meta-data signals for those variables, such as the number of null values, the percent of null values, and descriptive statistics from the describe() method. Here is the complete list of signals included in our meta-data dataframe:
- Variable name
- Number of Nulls
- Percent of Nulls
- Data types
- Count
- Mean
- Standard deviation
- Minimum value
- 25%, 50%, and 75% quartiles
- Maximum values
- The value for the variable that occurs the most frequently
- The percent of the time that the most frequent value for that variable occurs
# one row per column of the original dataframe, starting with null counts
meta_df = df.isna().sum().reset_index()
meta_df['percent'] = meta_df[0]/len(df)
# add the data type of each column
meta_df = pd.concat([meta_df, pd.DataFrame(df.dtypes, columns=['dtypes']).reset_index(drop=True)], axis=1)
# add the descriptive statistics (count, mean, std, min, quartiles, max)
d = df.describe().T.reset_index()
meta_df = meta_df.merge(d, on=['index'], how='left')
# for each column, find the most frequent value and how often it occurs
vcs = [pd.DataFrame(df[x].value_counts(normalize=True)).T for x in list(df.columns)]
vcs = [pd.DataFrame((x.idxmax(axis=1), x.max(axis=1))).T.reset_index() for x in vcs if len(list(x.columns)) > 0]
meta_df = meta_df.merge(pd.concat(vcs), on=['index'], how='left')
meta_df.columns = ['index', '0_x', 'percent', 'dtypes', 'count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max', 'max_value', 'max_percent']
Once our meta-data dataframe is built, we then build a function that includes a series of rules for determining whether any variable might be bad.
The first rule ensures that there is at least some variability, and so we look for any columns with standard deviations equal to 0.
The second rule ensures some distribution of data, and so we compare the quartiles to ensure they are not all equal to the same value.
The next rule flags variables that have more than 10% null values, as per this recommendation here. Similarly, we also flag variables that have a single value that occurs more than 80% of the time.
Finally, we also flag the variable if no max value is present, meaning the variable had no values. Obviously these rules can be tailored to other use-cases with different values or signals.
def bad_flag(row):
    # rule 1: no variability at all
    if row['std'] == 0:
        return True
    # rule 2: the quartiles all collapse to the same value
    elif (row['25%'] == row['50%']) and (row['50%'] == row['75%']):
        return True
    # rule 3: more than 10% null values
    elif row['percent'] > .10:
        return True
    # rule 4: a single value accounts for more than 80% of rows
    elif row['max_percent'] > .80:
        return True
    # rule 5: no max value present, meaning the variable had no values
    elif pd.isnull(row['max_percent']):
        return True
    else:
        return False

meta_df['bad_var'] = meta_df.apply(bad_flag, axis=1)
Now that we have our flags in place, the next step is to remove those variables from the main dataframe. To do so, we make a list of the column names where the ‘bad_var’ value is set to NOT True (!=). Next, we replace our dataframe with a new dataframe that only includes the columns of interest at this point.
keep = (meta_df['index'].loc[meta_df['bad_var'] != True]).tolist()
df = df[keep]
How’d we do? We started with 150 variables and are now down to 113; in other words, we were able to remove 37 columns from our dataframe.
Step 2: Visualizations
The above step was quick and easy to accomplish because we relied on numeric representations to help assess the degree of variability present in each variable. For some variables, however, we may still be unsure of their information value.
For example, some variables have a single value that occurs in 70% of the rows. If there are only two possible values, this may be okay; if there are more than two, there may not be enough variability in the remaining values to be useful to a model. In these cases, visual representations of the data can help us decide what to keep and what to get rid of.
The first step is to look at our meta-data dataframe and inspect anything out of the ordinary or suspect. For example, one of our variables has over 75% of the rows labeled as "Unknown."

Let’s look at the bar chart for this variable:

Based on the bar chart, the variable "Variable_2" has two values, "Unknown" and "Yes." In this instance, it is unclear what the difference between a value of "Unknown" and a value of "Yes" would mean (some of the "Unknowns" could be "Yes") and so we may decide to eliminate this variable.
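For reference, a bar chart like this one can be produced directly from pandas; here is a minimal sketch, assuming matplotlib is available and using the "Variable_2" column from above:
import matplotlib.pyplot as plt
# plot how often each value of the suspect variable occurs
df['Variable_2'].value_counts().plot(kind='bar')
plt.title('Variable_2')
plt.show()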
Our visual examination of the data also reveals that "Age" is heavily populated by a default value of "-1" and that quite a few variables have a value of "Unknown." Although a proper treatment of feature engineering is out of scope for this article, these observations suggest the need for additional engineering of features to yield more useful data for a model.
When performing EDA for data reduction, it may also be useful to look at a correlation matrix, as some variables can be so highly correlated as to suggest they are essentially the same variable and therefore one or more can be removed from the analysis. In other words, the variables are not adding anything unique to the model.
A quick visual inspection would be to generate the heatmap. To do so all one needs to do is follow the code here. This is what our heatmap looks like using the same code:

For example, there appear to be a chunk of variables in the Variable_30 – Variable_50 range that have very high correlations and may not be useful to keep.
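For readers who want a self-contained snippet, here is a minimal sketch of how such a heatmap, and the corr matrix used in the next code block, can be generated; the use of seaborn is my own assumption, and any heatmap routine will do:
import matplotlib.pyplot as plt
import seaborn as sns  # assumption: seaborn is installed for the heatmap

# pairwise correlations across the numeric columns
corr = df.select_dtypes(include='number').corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.show()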
In the code block below, we dive a little deeper into the correlation matrix, extracting only those variables that have correlations above .90.
# keep only the correlations above .90 (everything else becomes NaN)
high_corrs = corr[corr > .90]
# count the NaNs per variable; fewer NaNs means more high correlations
meta_corr = high_corrs.isna().sum().reset_index()
# flag variables with fewer NaNs than is typical (the mode) as potential drops
potential_drops = meta_corr.loc[(meta_corr[0] < meta_corr[0].mode()[0])]
From here it is up to the data scientist to evaluate which of these variables should be deleted and which should be kept.
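To make that evaluation a little easier, here is a short sketch (assuming the corr matrix from the heatmap above and numpy imported as np) that lists each pair of variables correlated above .90 so they can be reviewed one at a time:
# keep only the upper triangle so each pair appears once, then stack into a
# Series of (variable_1, variable_2) -> correlation and filter above .90
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
print(high_pairs[high_pairs > .90].sort_values(ascending=False))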
Step 3: Dimension Reduction
If, after performing the steps above, we still find ourselves with too many variables, one final bit of EDA that can help reduce the number of variables while maintaining their potential information value is exploratory factor analysis or principal components analysis. These techniques are similar to the correlation heatmaps we generated in Step 2 but go a bit further, looking for groups of correlated variables rather than simple bivariate correlations.
Factor analysis helps us identify whether variables share enough variance that we could conceivably do one of two things: combine them into an index of their average value (exploratory factor analysis) or focus on only the most important variable in a factor and ignore the rest (principal components analysis). You could do more with the results from factor analysis, but these are the two most common things I do when working with data in the wild.
Although these techniques are often considered to be "unsupervised" in that we allow the algorithm to identify the natural groupings of data, the reality is that data scientists must apply some meaningful framework for grouping the data before ever performing the analysis.
What do I mean by meaningful framework?
For example, let’s say a group of variables represents different aspects of health whereas others are items from a customer satisfaction survey. It is not all that meaningful to look for factors that bridge these two separate domains of data. Thus, I often find it more useful to run the analysis in multiple iterations, one per domain, so that the groupings it identifies are actually meaningful.
In the code below, we round out our data reduction by performing an example factor analysis with the demo data set we have been working with. The code was forked from this great tutorial on factor analysis here and applied to our data. Here is a very brief overview of the steps involved:
First, we list out the specific columns we want to include.
Second, we perform an adequacy test to ensure factor analysis is a viable option for the given data. Note we are looking for a p-value less than .05, which informs us the correlation matrix is not an identity matrix.
Third, we perform the factor analysis.
Fourth, we visualize the eigenvalues and look for an "elbow" in the graph to help us decide how many factors our model should include.
Next, we re-do the factor analysis and set the number of factors based on our "elbow" plot analysis.
Finally, we extract the factor loadings and view them to determine how we might want to combine our variables.
#subset our dataframe to focus only on domain relevant variables
df_fac = df.iloc[:,98:]
import pandas as pd
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
# drop rows with missing values before running the factor analysis
df_fac = df_fac.dropna()
# Bartlett's test of sphericity: a p-value below .05 tells us the correlation
# matrix is not an identity matrix, so factor analysis is viable
chi_square_value, p_value = calculate_bartlett_sphericity(df_fac)
print(chi_square_value, p_value)
# Create factor analysis object and perform factor analysis
fa = FactorAnalyzer()
fa.fit(df_fac)
# Check eigenvalues (one per variable)
ev, v = fa.get_eigenvalues()
print(ev)
# Create scree plot using matplotlib
plt.scatter(range(1,df_fac.shape[1]+1),ev)
plt.plot(range(1,df_fac.shape[1]+1),ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()
# Create factor analysis object and perform factor analysis
fa = FactorAnalyzer()
fa.set_params(n_factors=2, rotation="varimax")
fa.fit(df_fac)
loads = pd.DataFrame(fa.loadings_)
loads_df = pd.concat([loads, pd.DataFrame(list(df_fac.columns))], axis=1)
loads_df.columns = ['Factor_1','Factor_2','Variable']
Here is what our "loads_df" dataframe looks like:

In the output of the factor loadings we see that most of the variables load more heavily on Factor 1 than on Factor 2. That said, we see that "Variable_141" and "Variable_143" both load more heavily on Factor 2.
A next step for dimension reduction from this point would be to create indexes for each factor by averaging the variables that load on each factor together.
For example, if we wanted to create an index for factor 2 using "Variable_141" and "Variable_143" we first need to note that "Variable_143" has a negative loading. Therefore, before averaging with "Variable_141" we need to reverse code that variable.
A quick trick for reverse coding is to multiply the variable by -1 before averaging. Please note, however, that we would only do this if we also intend to normalize the data before modeling (a highly recommended procedure for most data science models).
Here is some example code for creating the Factor 2 index:
df['Variable_143_R'] = -1*df['Variable_143']
df['Factor2'] = df[['Variable_141','Variable_143_R']].mean(axis=1)
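And because the reverse-coding trick assumes the data will be normalized, here is one hedged way to standardize the two variables before averaging; the simple z-score below is my own choice, and any standardization approach would work:
cols = ['Variable_141', 'Variable_143_R']
# z-score each variable so both sit on a common scale before averaging
df_z = (df[cols] - df[cols].mean()) / df[cols].std()
df['Factor2'] = df_z.mean(axis=1)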
Our Clean Act Can Now Take 2
Now that we have spent the time reducing the number of dimensions in our data set, we are ready to pass our data off to more sophisticated feature engineering and eventually modeling downstream.
With a total reduction of over 50% of the available data, my palms are now much less sweaty. And so I leave you with this:
"He better go capture this moment and hope it don’t pass him"
Onward!
Like engaging content about data science, career growth, or poor business decisions? Join me.