
Insurance ‘Redlining’ Investigation

Analyzing Minority Access to Private Insurance in 1970s Chicago

Photo by Erol Ahmed on Unsplash

Abstract

This project seeks to investigate claims of ‘redlining’ in Chicago’s private insurance market through exploratory data analysis. The term ‘redlining’ was originally coined in the late 1960s to refer to lending practices wherein banks would refuse to grant credit to specific neighborhoods or regions within their service area. Over time, the term adopted a broader definition and now applies to discriminatory denial of service on the basis of race, gender, religion, nationality, and the like. All analysis was done with R in RStudio.


About The Data

The dataset used for this analysis contains observations on 47 different Chicago ZIP codes (c. 1970). Before delving into the information contained within each column, it’s worth explaining the proxy used to measure what will be referred to as "access" throughout the rest of this analysis: new FAIR policies and renewals. FAIR (Fair Access to Insurance Requirements) policies, in short, are government-funded insurance policies meant to help those who may otherwise struggle to get private coverage for a variety of reasons (neighborhood, local weather, fire damage, age, etc.). The logic behind using this as the response variable is straightforward: the more active FAIR policies in an area, the less access its residents have to private insurance coverage. This isn’t a perfect measure of "access" as it is conventionally defined, but more on that later.

For now, the columns are as follows:

policies: New FAIR policies and renewals per 100 housing units in a given ZIP code. This will be our main variable of interest, as a higher rate of FAIR policy enrollment implies less access to private insurance.

minority: The percentage of residents in a given ZIP code who self-identified as members of a racial/ethnic minority in the last U.S. Census.

fire: The number of confirmed fires per 100 housing units in a given ZIP code.

age: The percentage of housing units in a given ZIP code that were built before World War II.

income: Median family income (in thousands of USD).


Approach

To approach this analysis, I began by constructing a simple visualization to explore the relationship between new FAIR policies or renewals and the minority percentage of any given ZIP code:

library(ggplot2)

# Scatter plot of FAIR policy rate against minority percentage, with a fitted OLS line
ggplot(data = redline, mapping = aes(x = minority, y = policies)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") +
  labs(
    title = "FAIR Policies and Renewals as a Function of Neighborhood Minority Percentage",
    caption = "*Policies per 100 Housing Units",
    x = "Minority Percentage",
    y = "New FAIR Policies or Renewals"
  )

Running the above code block yields the following scatter plot:

Image by Author Using RStudio

There does appear to be a positive relationship between minority percentage and the rate of FAIR policies. However, it would be naive to treat a simple correlation like this as confirmation of any causal claim. Insurance companies may well have sensible, non-discriminatory reasons for denying coverage. Some neighborhoods may be more susceptible to fire or to structural problems due to age, factors that commonly play a role in actuarial calculations. Additionally, as mentioned earlier, new FAIR policies alone are an imperfect proxy for access to private insurance. It’s possible that some households forgo private insurance because they cannot afford it, rather than because they were denied coverage outright. Therefore, it’s necessary to adjust further estimates for differing income levels as well.

After taking these caveats into account, I fit a multiple regression model with adjusting variables to quantify the observed trend:

redline_fit <- lm(policies ~ minority + fire + age + income, data = redline)
round(coef(redline_fit), 3)

As you can see, this regression models policy filings as a function of the predictor of interest (minority) and three adjusting variables (fire, age, and income). Upon executing the above code block, the following coefficient estimates are output to the console:

(Intercept)    minority        fire         age      income 
     -0.170       0.008       0.023       0.006      -0.012

In mathematical terms, these estimates simply represent the following equation:

policies = -0.170 + 0.008 × minority + 0.023 × fire + 0.006 × age - 0.012 × income
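To make the equation concrete, here is a minimal sketch that plugs one set of values into it. The coefficients are the rounded estimates printed above; the ZIP code values are hypothetical and chosen only for illustration, not taken from the dataset.

```r
# Coefficients copied from the rounded console output above
coefs <- c(intercept = -0.170, minority = 0.008, fire = 0.023,
           age = 0.006, income = -0.012)

# Hypothetical ZIP code (illustrative values, NOT a row from the dataset):
# 50% minority, 10 fires per 100 units, 60% pre-war housing, $10k median income
zip <- c(minority = 50, fire = 10, age = 60, income = 10)

# Linear predictor: intercept plus the sum of coefficient * value
predicted <- unname(coefs["intercept"] + sum(coefs[names(zip)] * zip))
print(round(predicted, 3))
```

For these made-up inputs the model predicts about 0.70 FAIR policies per 100 housing units. Because the coefficients are rounded to three decimals, the result is only approximate.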

While these estimates are great for simple linear approximation, it’s still necessary to estimate confidence intervals as a measure of uncertainty. This can be done rather effectively through bootstrapping:

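library(mosaic)  # do(), resample(), and confint() for bootstrap results
library(dplyr)   # %>% and mutate_if()

lm_boot <- do(10000) * lm(policies ~ minority + fire + age + income, data = resample(redline))
confint(lm_boot) %>%
  mutate_if(is.numeric, round, 3)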

This will output the following 95% confidence intervals:

  name      lower  upper  level  method     estimate
1 Intercept -1.468  0.607  0.95 percentile   -0.170
2  minority  0.002  0.015  0.95 percentile    0.008
3      fire  0.002  0.058  0.95 percentile    0.023
4       age  0.000  0.012  0.95 percentile    0.006
5    income -0.063  0.075  0.95 percentile   -0.012
6     sigma  0.246  0.435  0.95 percentile    0.380
7 r.squared  0.557  0.854  0.95 percentile    0.672
8         F 13.194 61.174  0.95 percentile   21.477
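The "percentile" method in the table can be unpacked with a short base-R sketch: refit the model on rows resampled with replacement, collect the coefficient from each refit, and take the 2.5th and 97.5th percentiles. The data below are simulated stand-ins (the names `toy`, `x`, and `y` are hypothetical), so the numbers will not match the table; only the mechanics carry over.

```r
set.seed(1)  # make the resampling reproducible

# Simulated stand-in data (NOT the Chicago dataset): 47 rows, true slope 0.01
n <- 47
toy <- data.frame(x = runif(n, 0, 100))
toy$y <- 0.5 + 0.01 * toy$x + rnorm(n, sd = 0.3)

# Refit the model on 2000 resamples of the rows, drawn with replacement
boot_slopes <- replicate(2000, {
  rows <- sample(n, replace = TRUE)
  coef(lm(y ~ x, data = toy[rows, ]))[["x"]]
})

# Percentile interval: the 2.5th and 97.5th percentiles of the bootstrap slopes
ci <- quantile(boot_slopes, c(0.025, 0.975))
print(round(ci, 4))
```

Resampling whole rows keeps each ZIP code's response and predictors together, which is what resampling the data frame accomplishes in the original snippet.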

An analysis of variance (ANOVA) can also be conducted to attribute improvements in R² to specific variables, but because the attribution depends on the order in which the variables are entered, I opted against including one.
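The order dependence is easy to see in a small sketch. The data here are simulated (not the Chicago dataset), with two deliberately correlated predictors named `x1` and `x2` for illustration. Base R's `anova()` on an `lm` fit reports sequential sums of squares, so the credit a variable receives depends on what entered the model before it.

```r
set.seed(2)

# Simulated stand-in data (NOT the Chicago dataset) with two correlated predictors
n <- 47
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
y  <- x1 + x2 + rnorm(n)

# anova() on an lm fit reports sequential (Type I) sums of squares,
# so each term is credited only with what earlier terms left unexplained
ss_x1_first <- anova(lm(y ~ x1 + x2))["x1", "Sum Sq"]
ss_x1_last  <- anova(lm(y ~ x2 + x1))["x1", "Sum Sq"]

print(c(x1_entered_first = ss_x1_first, x1_entered_last = ss_x1_last))
```

The two fits are the same model with the same coefficients, yet `x1` is assigned a different share of the explained variation depending on its position in the formula. With correlated predictors such as minority percentage and income, the same ambiguity would plausibly arise in the redlining model.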


Conclusion

After analyzing the results of this quick regression analysis, there seems to be at least some evidence that the rate of new or renewed FAIR policies in a given ZIP code is associated, at least in part, with its ethnic makeup. More precisely, holding fire rate, housing age, and income fixed, we should expect roughly 0.008 additional policies per 100 housing units for every one-percentage-point increase in minority population (95% bootstrap interval: 0.002 to 0.015). While this estimate does suggest that Chicago’s minority population (at least in the 47 ZIP codes analyzed) had less access to private insurance, it is still unclear whether this is a direct result of discrimination.
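A per-point effect of 0.008 is hard to judge on its own, so it can help to scale it to a larger gap. The sketch below uses the point estimate and percentile bounds for minority from the bootstrap table; the 50-point gap between two otherwise identical ZIP codes is a hypothetical comparison, not something observed in the data.

```r
# Point estimate and 95% percentile bounds for `minority`, copied from the bootstrap table
minority_coef <- c(lower = 0.002, estimate = 0.008, upper = 0.015)

# Hypothetical comparison: two ZIP codes that differ by 50 percentage points
# in minority share and are identical in fire, age, and income
gap_points <- 50

# Implied difference in FAIR policies per 100 housing units
implied_gap <- minority_coef * gap_points
print(implied_gap)
```

Under the fitted linear model, that gap corresponds to roughly 0.4 additional FAIR policies per 100 housing units, with the interval bounds giving 0.1 to 0.75. This is an association under the model's assumptions, not a causal estimate.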

To come to a clearer conclusion on the matter and bolster the strength of this analysis, I would recommend the following:

  1. Consult with local experts in racial history and in municipal, state, and federal FAIR plan procedures, as well as private insurance providers that commonly serve the 47 ZIP codes in question. The estimated effect of minority percentage looks small in absolute terms, but as a non-expert, I may lack the context to make a fair assertion about the "size" or "significance" of this effect.
  2. Increase sample size. Bootstrapping is a great tool for exploratory analysis, but it will never replace observed data. This analysis relies on one year’s worth of observations on 47 Chicago ZIP codes to make an assertion about discriminatory business practices. To make a more serious assertion on the matter, I’d love to see data dating back to at least 1960. This would encompass a large portion of the Civil Rights movement and provide the best chance of observing changes in insurance coverage associated with customer ethnicity.
  3. Include more applicable parameters and interaction terms in a new model. Similar to my second point, the scope of this analysis is limited to the variables we have on hand in this dataset (minority, fire, age, income). Insurance companies will likely take other factors into account when evaluating risk (roof quality, presence of wood-burning stoves, likelihood of a break-in, etc.) that are not included in this analysis. It is also possible that insurance companies could refuse coverage because of an individual’s credit score. In this case, the distinction between redlining by the insurance provider and redlining by another financial institution can be blurred.
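As a minimal sketch of what the third recommendation could look like in R's formula syntax, the snippet below adds one interaction to the model formula used earlier. The choice of a minority-by-income interaction is an assumption made purely for illustration; the original analysis does not specify which interactions to try. `terms()` expands the formula without needing any data.

```r
# Hypothetical extension of the article's formula: `minority * income` expands to
# both main effects plus their interaction
interaction_formula <- policies ~ minority * income + fire + age

# terms() lists the model terms without needing any data
term_labels <- attr(terms(interaction_formula), "term.labels")
print(term_labels)
```

The `minority:income` term would let the association between minority percentage and FAIR policy rate vary with income level. With only 47 observations, each added term costs a degree of freedom, so any such extension would need to be weighed against the small sample.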

That being said, this was still a very enlightening historical investigation. Much can always be gleaned from interesting datasets, no matter how quick the analysis.

