Data Jam Session: How parent income determines a child’s probability of going to college

Inspired by a New York Times data visualization, I dug a little deeper into the relationship between college attendance rates and family incomes

Vanessa Boehm
Towards Data Science


Photo by Keira Burton from Pexels

On May 14, the New York Times newsletter featured a link to a data viz about how likely it is for children of ‘poor’ and ‘rich’ families to go to college. Readers were asked to make a guess by drawing a line into a plot with the share of children that go to college on the y-axis (a number between 0 and 1) and the percentile into which the parents’ income falls on the x-axis.
You can try it out here to see how good your best guess is.

The correct curve looks like this:

Figure 1: The relation between the share of children that go to college and the percentile into which their parents’ income falls is roughly linear.

It is more or less a straight line. This means that there is a linear relationship between the parents’ income percentile and the probability that their children go to college. The NYT article claims that this relationship surprises many readers, because many of them get it wrong.

I would argue that this surprise might not be owed to the data, but to the metric that was chosen: income percentile. What I, and probably most people, have in mind when reading the words ‘rich’ and ‘poor’ is the actual income. Depending on the income distribution, that is, how many people make how much money, neighboring income percentiles can correspond to very different incomes. Do you know how much an American in the 90th income percentile makes on average? And how this compares to the income in the 95th percentile? Probably not.

In this little data jam session, we will use the data that underlies the plot above [1], add income distribution information to better interpret it and construct a simple model for college attendance rates.

We will

  1. argue why we should look at college rate as a function of both metrics, parents’ income and parents’ income percentile,
  2. explore how the above plot translates to college rate versus income for different income distributions,
  3. look at and understand the true income distribution for this data sample,
  4. use the true college rate versus income relationship to predict how many more (or fewer) children could attend college if income were distributed differently. The result might actually surprise you!

Income percentile versus income

So which metric is the more illuminating one: the actual income or the income percentile? To start, let’s remind ourselves of the definition of percentile:

A family in the 10th percentile has a lower income than 90% of families in that sample and a higher income than 9% of families in that sample.
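As a minimal sketch of this definition, the snippet below sorts a made-up sample of families into 100 equally sized income bins and checks the two shares quoted above. The incomes are randomly generated and the helper name `percentile_bin` is hypothetical; neither comes from the original data set.

```python
import numpy as np

def percentile_bin(incomes):
    """Return the percentile bin (1..100) of every income in the sample.

    Families are ranked by income and split into 100 equally sized bins,
    so bin 10 holds families that out-earn 9% of the sample and are
    out-earned by 90% of it.
    """
    incomes = np.asarray(incomes, dtype=float)
    # argsort of argsort gives each element's rank (0 = lowest income)
    ranks = incomes.argsort().argsort()
    return ranks * 100 // len(incomes) + 1

# Hypothetical sample: 1000 families with made-up incomes
rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=11, sigma=0.8, size=1000)
bins = percentile_bin(incomes)

share_above = np.mean(bins > 10)  # families earning more than the 10th percentile
share_below = np.mean(bins < 10)  # families earning less than the 10th percentile
print(f"above: {share_above:.2f}, below: {share_below:.2f}")
```

The remaining 1% of families are the ones that share the 10th percentile bin itself.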

Using the percentile on the x-axis makes sense because it allows you to make quantitative statements about the population. We can see at a glance, for example, that children from 50% of all families have a less than 55% chance of attending college. This metric also has the nice property that the area under the curve is an estimate of the total share of children that attend college. For this data set we get 58.3%. (This assumes that the number of children is not too dependent on family income.)
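To make the area-under-the-curve argument concrete, here is a small sketch. It uses a made-up straight line as a stand-in for the real curve, so the numbers it prints are not the 58.3% quoted above. It also relies on the assumption already mentioned: every percentile bin contains roughly the same number of children.

```python
import numpy as np

# Hypothetical stand-in for the real curve: a straight line from a 30% to a
# 90% college rate across the 100 percentile bins (not the actual data).
percentile = np.arange(1, 101)
college_rate = np.linspace(0.30, 0.90, 100)

# Each percentile bin holds 1% of families, so the area under the curve is
# simply the average of the per-bin college rates.
total_share = college_rate.mean()

# Share of families whose children have a less than 55% chance of college
share_below_55 = np.mean(college_rate < 0.55)

print(f"overall college share: {total_share:.3f}")
print(f"families below a 55% chance: {share_below_55:.2f}")
```

Swapping in the real per-percentile college rates for `college_rate` would give the overall share for the actual sample.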

However, one can argue that the relevant number that determines whether a young person can go to college is not how their parents’ income compares to other parents’ incomes, but how their income compares to the costs of higher education.

To understand why the above curve is more or less a straight line, including at the higher income percentiles (where one would expect it to flatten out, because everyone should be able to afford college, right?), the crucial bit of information that we are missing is the relationship between income percentile and actual income.

A look at different income distribution scenarios

To illustrate, let’s make a few guesses for possible income distributions, that is, the parents’ household income in each percentile:

Figure 2: Different income distribution scenarios; one is the true income distribution in our data set. Can you guess which one it is?

Let’s take a minute to understand them.

  • Scenario 1 has very few people with low incomes, and many people with higher incomes. There is no big difference in income between the 50th percentile and the 90th percentile, meaning that the top 50% of earners make very similar amounts of money. You could say that this is the income distribution of a more equitable society.
  • Scenario 2 is a linear relationship. The income difference between the 20th and 30th percentile is the same as between the 80th and the 90th percentile. All income bins are equally spaced.
  • In Scenario 3 the difference in incomes between neighboring percentiles increases with the percentile. This means that the jump in income from the 20th to the 21st percentile is much smaller than the jump from the 90th to the 91st. In this scenario, the rich are making considerably more money than the poor.
  • Scenario 4 is a mix between a linear relationship at percentiles below 75 and an exponential relationship at percentiles above 75. In this scenario the upper few percent of earners make much more money than all other earners, including those well into the 90th percentiles, while lower income earners have more similar incomes.
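The four shapes described above could be sketched as follows. All functional forms and dollar values here are illustrative guesses chosen to match the verbal descriptions; they are not the curves behind Figure 2.

```python
import numpy as np

p = np.arange(1, 101, dtype=float)  # income percentile

# Illustrative guesses only: shapes follow the descriptions in the text,
# the parameters are made up.
scenarios = {
    "1: saturating": 100_000 * (1 - np.exp(-p / 15)),
    "2: linear": 1_500 * p,
    "3: exponential": 5_000 * np.exp(p / 25),
    "4: linear, then exponential": np.where(
        p <= 75, 1_200 * p, 1_200 * 75 * np.exp((p - 75) / 9)
    ),
}

for name, income in scenarios.items():
    step = np.diff(income)  # income jump between neighboring percentiles
    # step[19] is the jump from the 20th to the 21st percentile,
    # step[89] the jump from the 90th to the 91st
    print(f"{name:30s} 20->21: {step[19]:9.0f}   90->91: {step[89]:9.0f}")
```

Comparing the two printed jumps per scenario shows the behavior described in the list: a shrinking jump in Scenario 1, a constant one in Scenario 2, and a growing one in Scenarios 3 and 4.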

One of these scenarios is the true distribution (for this data set). Can you guess which one it is?

Yes, it is scenario 4 (the red curve)!

You might have noticed that there is something a little odd about these scenarios: They have different total amounts of money floating around (area under the curve). To fix this, we normalize them to a common area under the curve:

Figure 3: Making sure all of our income scenarios have the same total amount of money earned when summed over all income groups.

Notice that because the total amount of money is now conserved, the income of the highest earners is quite different between different scenarios.
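A minimal sketch of this normalization step is shown below. The two income curves are hypothetical, as is the helper name `normalize_to_total`; the choice of which total to use as the common reference is also an assumption, since any fixed total works.

```python
import numpy as np

p = np.arange(1, 101, dtype=float)  # income percentile

# Two illustrative income curves (hypothetical shapes and dollar values)
linear = 1_500 * p
convex = 5_000 * np.exp(p / 25)

def normalize_to_total(income, target_total):
    """Rescale a per-percentile income curve so its bins sum to target_total."""
    return income * (target_total / income.sum())

# Assumption: use the linear scenario's total as the common reference
target = linear.sum()
convex_normalized = normalize_to_total(convex, target)

print(f"top income before: {convex[-1]:,.0f}")
print(f"top income after:  {convex_normalized[-1]:,.0f}")
```

The rescaling multiplies every bin by the same factor, so the shape of the curve is preserved while the total amount of money is matched.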

College attendance rate versus actual income

Figure 6: In scenario 4, which is based on actual data, the college attendance rate starts to saturate at an income of about 200k.

If we plot the share of children attending college against the actual family income in US dollars, we see that the 90% mark is reached at a household income of almost $200 000, much more than the average American makes.

Plotting all of our scenarios against actual income, we get:

Figure 7: Plotting share of children that attend college versus family income for all income distribution scenarios.

This shows that in the other scenarios a 90% college attendance rate is already reached at much lower incomes.
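One way to read such a threshold off a curve is linear interpolation. The sketch below does this for two made-up income scenarios that share the same percentile-to-college-rate relation; the data, parameters, and the helper name `income_at_rate` are all assumptions for illustration.

```python
import numpy as np

def income_at_rate(income, college_rate, target=0.9):
    """Income at which the college rate reaches `target`.

    Assumes college_rate increases monotonically with income,
    which np.interp needs for its x-coordinates.
    """
    return float(np.interp(target, college_rate, income))

p = np.arange(1, 101, dtype=float)

# Made-up college rate, roughly linear in percentile (not the real data)
college_rate = np.linspace(0.30, 0.95, 100)

# Two hypothetical income scenarios
income_linear = 1_500 * p
income_skewed = np.where(p <= 75, 1_200 * p, 90_000 * np.exp((p - 75) / 9))

print(f"linear scenario: {income_at_rate(income_linear, college_rate):,.0f}")
print(f"skewed scenario: {income_at_rate(income_skewed, college_rate):,.0f}")
```

Because the college rate is tied to the percentile here, the 90% mark falls at the same percentile in both scenarios, but at a very different dollar income.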

This illustrates the importance of knowing the actual income distribution (both percentiles and income value) in order to fully grasp the information in the plot that was featured in the NYT newsletter (Figure 1).

Having this insight established, let’s move on and see if we can do some further analysis.

Predicting college rates for different income distributions

So far we have kept the relationship between income percentile and college attendance fixed. This is, as I argued in the beginning, unrealistic. Rather, one would expect that the probability of going to college depends on the actual income in US dollars and how it compares to the costs of a college education.

If we instead fix the relationship between actual income in US dollars and college rate (we can do this if we know the true income distribution), we can derive how many more children would be able to attend college in our different income distribution scenarios. (Note that this model makes the implicit assumption that the costs of higher education are independent of the income distribution.)
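A minimal sketch of this idea, under stated assumptions: the "true" income and college-rate curves below are made up, the helper name `predict_rates` is hypothetical, and linear interpolation is just one simple way to carry the income-to-rate relationship over to a new income distribution.

```python
import numpy as np

p = np.arange(1, 101, dtype=float)  # income percentile

# Hypothetical "true" data: income per percentile and observed college rate
true_income = np.where(p <= 75, 1_200 * p, 90_000 * np.exp((p - 75) / 9))
true_rate = np.linspace(0.30, 0.95, 100)

def predict_rates(new_income):
    """College rate per percentile if the rate depends only on dollar income.

    Incomes outside the observed range are clamped to the lowest or highest
    observed rate, which is how np.interp treats out-of-range inputs.
    """
    return np.interp(new_income, true_income, true_rate)

# Alternative distribution: same total income, spread linearly over percentiles
alt_income = p * (true_income.sum() / p.sum())
alt_rate = predict_rates(alt_income)

print(f"baseline share:    {true_rate.mean():.3f}")
print(f"alternative share: {alt_rate.mean():.3f}")
```

Averaging the predicted per-percentile rates gives the overall college share for the alternative distribution, in the same way as the area under the curve does for the real data.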

Figure 8: Share of children attending college in different income distribution scenarios, assuming that the probability of going to college is determined by the parents’ income and not by how it compares to other incomes.

In the red line we recognize the real data we started with in our very first plot. Scenario 4 is the truth: the income-to-percentile relationship in this data sample is linear except in the highest income bins, where it becomes exponential. A family in the highest income percentile makes on average 1.4 million US dollars, while a family in the second-to-last income bin makes less than a third of this amount. A family in the 90th percentile makes on average $150 000. This explains why the college attendance rate still depends linearly on the income percentile even at high income percentiles, and why many people get Figure 1 wrong when asked to draw it.

We already calculated in the beginning that the total percentage of children attending college in the real data sample is 58.3%. We can now compute the percentage for our other income distribution scenarios:

  • Scenario 1: 63.5%
  • Scenario 2: 62.5%
  • Scenario 3: 50.0%

Scenario 3, which has more families with higher incomes than our baseline scenario (the truth), has a college rate of only 50%. This is because the lower middle class ends up taking home less money in this scenario, which excludes many of their children from college attendance. Even in our most equitable income distribution scenario, we only improve college attendance by 4–5 percentage points.
Why is that?
If you go back to Figure 6, you will see that the probability of going to college at low to medium incomes increases steeply with income.

In order to push college attendance above 60% one would need to bring a majority of families over an income level of $60 000,

or

lower the costs of higher education.

Summary

We have seen that both parents’ income percentiles and income are important metrics that need to be considered when analyzing college attendance rates.

We have learned that the income of parents in the US rises linearly with percentile among the lower 75% and exponentially in the upper 25%.

Building on the assumption that college rate depends on actual income and how it compares to college costs, we have found that even a radically more equitable income distribution would raise the college rate by only about 5 percentage points!

For me this was the real surprise about this data set. It could suggest that the most efficient way to increase college attendance, especially for children in low-income families, would be to lower the costs of higher education. This idea, however, needs to be explored in another data jam session in the future!

For now, I hope you enjoyed this one. You can find the accompanying Jupyter notebook for this analysis here.

References:

The data I used is public and was downloaded from https://opportunityinsights.org/.

Link to data: https://opportunityinsights.org/wp-content/uploads/2018/04/Statistics_By_Parent_or_child_Income_Percentile.xlsx

It was initially compiled and used for the paper:

[1] R. Chetty, N. Hendren, P. Kline and E. Saez, “Where is the Land of Opportunity? The Geography of Intergenerational Mobility in the United States” (2014), The Quarterly Journal of Economics

Link to paper: https://www.nber.org/papers/w19843

The New York Times data visualization that sparked this analysis:

https://www.nytimes.com/interactive/2015/05/28/upshot/you-draw-it-how-family-income-affects-childrens-college-chances.html?campaign_id=9&emc=edit_nn_20210514&instance_id=30809&nl=the-morning&regi_id=119956320&segment_id=58039&te=1&user_id=7bfb48191e644dd8f497d044991861fb

Any ideas and figures (if not otherwise stated) are my own.


Astrophysicist with a soft spot for earthbound problems. Curiosity driven. Expert in statistics and machine learning. Avid cyclist.