Exploring the census income dataset using bubble plot

Published in Towards Data Science · 6 min read · Jul 28, 2017

One of the nicest things about data science is data exploration and visualization. That’s when the data tells us a story, and we just have to take some time, explore and listen carefully.

When exploring a data set, we look at the connection between different features in the data and between the features and the target. This can give us a lot of insights about how we should formulate the problem, the required preprocessing (missing values, normalization), which algorithm should we use to build our model, should we segment our data and build different models for different subsets of our dataset, etc.

Census Income dataset

To demonstrate this, I’ve chosen the Census Income dataset, which has 14 attributes and 48,842 instances. The goal of this data set is to predict whether a person’s income exceeds $50K/yr based on census data.

The features of the data are: ‘age’, ‘workclass’, ‘fnlwgt’, ‘education’, ‘education-num’, ‘marital-status’, ‘occupation’, ‘relationship’, ‘race’, ‘sex’,
‘capital-gain’, ‘capital-loss’, ‘hours-per-week’, ‘native-country’.

The target is the income level: >50K or ≤50K.

Here is a sample of the Census Income dataset:
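A sample like this can be produced by loading the raw file with pandas. The sketch below assumes the UCI `adult.data` layout (no header row, a comma plus a space between fields, `?` for missing values) and names the label column `target` to match the code in this post. The helper name `load_census` and the file path in the usage note are placeholders of mine, not part of any package.

```python
import pandas as pd

# Feature names as listed above; 'target' is the label column used in this post.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "target",
]


def load_census(source):
    """Read the raw census file (a path or file-like object) into a DataFrame.

    Assumes no header row, ', ' separators and '?' for missing values.
    """
    return pd.read_csv(
        source,
        header=None,
        names=COLUMNS,
        skipinitialspace=True,  # drop the space that follows each comma
        na_values="?",
    )
```

With the file downloaded locally, `data = load_census("adult.data")` followed by `data.head()` would show the first few rows.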

The connection between the age (numerical) and the income (categorical)

Let’s look at the scatter plot of the age vs. the income.

y_axis: >50K → 1, ≤50K → 0; x_axis: age

from matplotlib import pyplot as plt
# Boolean Series: True (1) for >50K, False (0) for ≤50K
plt.scatter(data.age, data.target == '>50K')

**Scatter plot - Income vs. age - x_axis**: **age**, **y_axis: income** >50K → 1, ≤50K → 0

This doesn’t tell us much, right? All the points overlap on two horizontal lines. We can add some random noise to each level to spread the points out.

import numpy as np
plt.scatter(data.age, (data.target == '>50K') + 0.5 * np.random.rand(len(data)))

**Scatter plot - Income vs. age - x_axis**: age, **y_axis:** >50K → values above 1, ≤50K → values below 0.5

This is better, but still, it is hard to understand the patterns when there are many points.

Now let’s try the bubble plot (full documentation and code for the bubble plot are available at https://github.com/shirmeir/bubble_plot).

Install the package with:

pip install bubble_plot

We need to supply the dataframe, the x-axis and the y-axis.

from bubble_plot.bubble_plot import bubble_plot
bubble_plot(data, 'age', 'target', normalization_by_all=False)

**bubble plot** - target vs. age - P(y|x)

Now we can see an actual pattern!

For numerical features, such as age, the bubble plot creates buckets. The size of each bubble (and, in this case, also its color) is proportional to the number of points in the bucket.

We can see that people with high income (>50K) are mostly around the ages of 39–45 (the middle of this bucket is 42.5).

Setting the parameter normalization_by_all to False tells the function to plot P(y|x), that is, the distribution of the income (y) given the age (x). Each column in this plot is an independent (1D) histogram of the income values for that age bucket.
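To make the column-wise normalization concrete, here is a minimal sketch of the same quantity computed with pandas. It is not the package’s implementation; it runs on synthetic data (the `toy` frame is made up) and assumes five equal-width age buckets.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the census data (NOT the real dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(17, 91, size=2000),
    "target": rng.choice(["<=50K", ">50K"], size=2000, p=[0.75, 0.25]),
})

# Bucket the numerical feature into 5 equal-width intervals.
age_bucket = pd.cut(toy["age"], bins=5)

# normalize="columns" divides every count by its column total,
# so each age bucket becomes its own histogram of the income: P(y|x).
p_y_given_x = pd.crosstab(toy["target"], age_bucket, normalize="columns")

print(p_y_given_x.round(2))
```

Every column of `p_y_given_x` sums to 1, which is why columns in the P(y|x) plot can be compared by shape but not by how many people they contain.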

Setting the parameter normalization_by_all to True would plot the joint distribution of the age (x) and the income (y), P(x,y), which is in fact a 2D histogram with bubbles.

bubble_plot(data, 'age', 'target', normalization_by_all=True)

**bubble plot** - target vs. age — P(x,y)

Now we get the joint distribution of the income (y) and the age (x), P(x,y). From here we can see that most of the people in our data are in the younger age groups (around 20–30), and that only a small fraction of the young people (around age 20) have high income, since their bubble is very small. Among the high-income people, the largest age group is around the age of 42.

That was a plot of a categorical feature vs. a numerical feature. But what if we want to visualize the connection between two numerical features?
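The joint and the conditional views are linked by P(x,y) = P(y|x)·P(x). The sketch below illustrates this with pandas on synthetic data (again a made-up `toy` frame with five assumed age buckets, not the package’s own code).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the census data (NOT the real dataset).
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "age": rng.integers(17, 91, size=2000),
    "target": rng.choice(["<=50K", ">50K"], size=2000, p=[0.75, 0.25]),
})
age_bucket = pd.cut(toy["age"], bins=5)

# Joint distribution: every cell is divided by the grand total.
p_xy = pd.crosstab(toy["target"], age_bucket, normalize="all")

# Marginal distribution of the age buckets, P(x): sum over the income rows.
p_x = p_xy.sum(axis=0)

# Dividing each column by P(x) recovers the conditional view, P(y|x).
p_y_given_x = p_xy.div(p_x, axis="columns")
```

In the joint view the bubbles of a sparsely populated age bucket are all small, while in the conditional view the same bucket is rescaled to sum to 100%.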

Plotting the connection between two numerical features

Let’s review the working hours per week vs. the age.

plt.scatter(data.age, data['hours-per-week'])

**Scatter plot — working hours per week vs. age — x_axis**: age, **y_axis:** working hours

Again, since this data set has many points, the scatter plot tells us little about the connection between these two variables.

Using the bubble plot we can get something much clearer.

bubble_plot(data, 'age', 'hours-per-week', normalization_by_all=False)

**Bubble plot - working hours per week vs. age - P(y|x)**: **x_axis**: age, **y_axis:** working hours, distribution of the working hours given the age group. The bubbles are normalized by the x-axis (age), so in each column the bubble sizes sum to 100%.

We can see that if one’s age is around 20, one is most likely to work around 15–20 hours a week, while someone aged 35–45 is more likely to work 45–90 (!) hours a week.

The bubble plot creates buckets for both numerical features (age and working hours per week), and the bubble size is proportional to the frequency of each working-hours bucket, given the age bucket.
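The bucketing of two numerical features can be sketched with NumPy’s 2D histogram. The helper `bubble_sizes` below is hypothetical (it only mirrors the `normalization_by_all` parameter name for readability and is not the package’s code), and the ages and hours are synthetic.

```python
import numpy as np


def bubble_sizes(x, y, bins=10, normalization_by_all=False):
    """Bucket two numerical arrays; return bucket centers and bubble sizes.

    Sizes are P(x, y) when normalization_by_all is True, otherwise P(y | x).
    """
    counts, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
    if normalization_by_all:
        sizes = counts / counts.sum()
    else:
        # Rows of `counts` index the x buckets, so divide by each row total.
        row_totals = counts.sum(axis=1, keepdims=True)
        sizes = np.divide(
            counts, row_totals,
            out=np.zeros_like(counts), where=row_totals > 0,
        )
    x_centers = (x_edges[:-1] + x_edges[1:]) / 2
    y_centers = (y_edges[:-1] + y_edges[1:]) / 2
    return x_centers, y_centers, sizes


# Synthetic ages and weekly hours (NOT the real dataset).
rng = np.random.default_rng(2)
age = rng.integers(17, 91, size=5000)
hours = rng.normal(40, 12, size=5000).clip(1, 99)

x_c, y_c, p = bubble_sizes(age, hours, bins=(8, 6))
```

Each row of `p` holds the sizes for one age bucket; drawing one scatter marker per (x center, y center) pair with an area proportional to `p` gives a plot in the spirit of the one above.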

Now let’s look at the joint distribution.

bubble_plot(data, 'age', 'hours-per-week', normalization_by_all=True)

**Bubble plot — working hours per week vs. age, P(x,y)**: **x_axis**: age, **y_axis:** working hours

Now the bubble size is proportional to the frequency of the values of the hours per week and the age together. We can see that most of the people in this data set are around the ages of 20–30 and work about 30–45 hours a week.

Visualize three dimensions with bubble plot — age, working hours and income

Now let’s look at the age, hours per week, combined with the income level. How is that even possible? Can we visualize three dimensions of information in a two dimensional plot?

bubble_plot(data, x='age', y='hours-per-week', z_boolean='target')

**Bubble plot — working hours per week vs. age —P(x,y): x_axis**: age, **y_axis:** working hours, **color** — proportional to the rate of high income people within each bucket

In this bubble plot, we again see the joint distribution of the hours-per-week vs. the age, P(x,y), but here the color is proportional to the rate of high-income people, (#>50K)/((#>50K)+(#≤50K)), among all the people in the bucket. By supplying the z_boolean variable, we add another dimension to the plot through the color of the bubble.
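The size and color values described here can be sketched with a pandas group-by. This is an illustration on synthetic data, not the package’s implementation; the `high_income` column and the bucket counts (5 for age, 4 for hours) are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the census data (NOT the real dataset).
rng = np.random.default_rng(3)
n = 5000
toy = pd.DataFrame({
    "age": rng.integers(17, 91, size=n),
    "hours-per-week": rng.integers(1, 100, size=n),
})
toy["high_income"] = rng.random(n) < 0.25  # made-up boolean label

# Bucket both numerical features and group the rows by bucket pair.
buckets = [pd.cut(toy["age"], bins=5), pd.cut(toy["hours-per-week"], bins=4)]
grouped = toy.groupby(buckets, observed=True)["high_income"]

# Bubble size: share of all rows that fall in the bucket, P(x, y).
size = grouped.size() / n

# Bubble color: the mean of a boolean is #True / (#True + #False).
rate = grouped.mean()
```

`size` and `rate` share the same (age bucket, hours bucket) index, so each bubble gets one size and one color value.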

The pinker the color, the higher the ratio for the given boolean feature/target Z. See colormap in the image.

Cool colormap — Pink would stand for the higher ratios in our case, cyan would stand for the lower ratios

This plot clearly shows us that high income is much more common among people older than 30 who work more than 40 hours a week.

Workclass vs. age and income

**p(x,y): y-axis - workclass, x-axis - age**. **color** is proportional to the **rate of high incomes** within each bucket.

Here the y-axis is the workclass, the x-axis is the age, and the color is proportional to the rate of high incomes within each bucket.

You can see that the probability of an income >50K is highest in the self-emp-inc workclass between the ages of 30 and 60. Federal-gov also has a higher rate of incomes >50K, but between the ages of 40 and 60. State-gov has a higher rate of high incomes around the ages of 48–52.

Occupation vs relationship and income

**p(x,y): y-axis — occupation, x-axis — relationship**. **color** is proportional to the **rate of high incomes** within each bucket.

Prof-specialty, Exec-managerial and Tech-support have the highest rates of high-income people. Married people seem to have a much higher probability of high income. Notice that among Exec-managerial, Husband has a slightly higher rate of high income (more pink) than Wife.
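When both axes are categorical, no bucketing is needed and the per-cell rate is a plain cross-tabulation. The sketch below uses a handful of made-up rows purely to show the mechanics; the values say nothing about the real data.

```python
import pandas as pd

# A few made-up rows, only to show the mechanics (NOT real census records).
toy = pd.DataFrame({
    "occupation": ["Exec-managerial", "Exec-managerial", "Tech-support",
                   "Tech-support", "Sales", "Sales"],
    "relationship": ["Husband", "Wife", "Husband", "Husband", "Wife", "Wife"],
    "target": [">50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K"],
})

# 1 for high income, 0 otherwise, so the mean of a cell is its high-income rate.
toy["high_income"] = (toy["target"] == ">50K").astype(int)

rate = pd.crosstab(
    toy["occupation"], toy["relationship"],
    values=toy["high_income"], aggfunc="mean",
)
```

Cells with no rows come out as NaN, which corresponds to a missing bubble in the plot.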

Summary

While scatter plots and boxplots can give us a high level look at the data distribution and summary statistics, this is sometimes not enough, especially in the cases where the data has many points and the connection between variables is not a trivial function.

Using a bubble plot to visualize our data can help us clearly see the relations between features in our dataset, even with a large dataset, for categorical as well as numerical features. This can help us model the data in a more suitable way and find the right function for the connection between the features.

Install bubble_plot:

pip install bubble_plot

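Run bubble_plot in Python:

import pandas as pd
from bubble_plot.bubble_plot import bubble_plot
# Note: load_boston was removed in scikit-learn 1.2, so this snippet needs an older version.
from sklearn.datasets import load_boston
data = load_boston()
# Build a DataFrame of the features and append the target column
df = pd.DataFrame(columns=data['feature_names'], data=data['data'])
df['target'] = data['target']
bubble_plot(df, x='RM', y='target')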