
Dataset Biases: Institutionalized Discrimination or Adequate Transparency?

A Review of the Efforts Performed by the US Mortgage Disclosure Act

Composite image from photos by Shane and Binyamin Mellish from Pexels.

[…] Why would we need such a law? Prior to Congress’s enacting HMDA in 1975, the public raised considerable concerns about mortgages – or, more importantly, the lack thereof – in some urban, often minority, neighborhoods. Certain areas seemed to decline, in part because their residents were not able to obtain home mortgages. (ClevelandFed)

This was one of the sad realities of certain American population centers in the 1970s. Access to capital remained difficult, and social mobility for at-risk neighborhoods was virtually non-existent. This difficulty was accentuated by what some believed to be institutionalized racism in the banking system. "Congress believed that some financial institutions had contributed to the decline of some geographic areas by their failure to provide adequate home financing to qualified applicants on reasonable terms and conditions." (Wikipedia)

Therefore, a motion was brought forward to Congress to support transparency across all lending practices. They passed the Home Mortgage Disclosure Act of 1975 for mandatory reporting of all loan applications, and then the Community Reinvestment Act of 1977 for encouraging financial institutions to help meet the credit needs of their local communities.

There is a very clear mandate to the HMDA, as explained by Investopedia:

In general the primary purposes of the Home Mortgage Disclosure Act and Regulation C are to monitor the geographic targets of mortgage lenders, provide an identification mechanism for any predatory lending practices and to provide reporting statistics on the mortgage market to the government.

The HMDA helps to support the community investment initiatives sponsored by government programs, with HMDA contributing to the oversight of the initiatives through statistical reporting. HMDA also helps government officials to identify any predatory lending practices which may be affecting mortgage loan issuance.

HMDA submissions also provide a means for analyzing government resource allocations and ensuring that resources are appropriately allocated to fund community initiatives.

Financial institutions must therefore report on their lending practices: not only every loan issued, but also every loan application, along with metadata about the applicant(s) such as race, gender, and neighborhood, and whether the loan was approved or denied.

From a regulatory perspective, there is now a lens through which violations of social equality can be tracked and penalties applied. This also means that there is an intentional bias in the data: that of institutionalized racism across the banking sector, and all of its manifestations.

Exploring the Data

Let’s explore the Home Mortgage Disclosure Data Files (1981–2014) from the US Archives to get a sense of what has been reported, and why.

(If you want to explore this dataset at home, you’ll also need a few more datasets, such as the relevant census data. Luckily, the superstar team of librarians at the US Archives packaged everything up for us in a single page. Thank your local librarian.)

1981–1990

By exploring the data, we see that the 1981–1990 period was primarily focused on tracking veterans, many of them returning from Vietnam. Many more of the questions and data fields pertain to VA applicants, and to whether they had requested financial support to apply for the loan.

The encoded format is quite straightforward and just needs a bit of mapping:

[{'NAME': 'respondentName', 'START': 0, 'STOP': 28, 'LENGTH': 28},
 {'NAME': 'respondentID', 'START': 28, 'STOP': 36, 'LENGTH': 8},
 {'NAME': 'reportMSA', 'START': 36, 'STOP': 40, 'LENGTH': 4},
 {'NAME': 'censusTract', 'START': 40, 'STOP': 46, 'LENGTH': 6},
 {'NAME': 'state', 'START': 46, 'STOP': 48, 'LENGTH': 2},
 {'NAME': 'county', 'START': 48, 'STOP': 51, 'LENGTH': 3},
 {'NAME': 'supervisoryAgencyCode', 'START': 51, 'STOP': 52, 'LENGTH': 1},
 {'NAME': 'censusValidityFlag', 'START': 52, 'STOP': 54, 'LENGTH': 2},
 {'NAME': 'VA_FHA', 'START': 54, 'STOP': 55, 'LENGTH': 1},
 {'NAME': 'vaNumLoans', 'START': 55, 'STOP': 59, 'LENGTH': 4},
 {'NAME': 'vaTotalLoans', 'START': 59, 'STOP': 68, 'LENGTH': 9},
 {'NAME': 'convLoansFlag', 'START': 68, 'STOP': 69, 'LENGTH': 1},
 {'NAME': 'convNumLoans', 'START': 69, 'STOP': 73, 'LENGTH': 4},
 {'NAME': 'convTotalLoans', 'START': 73, 'STOP': 82, 'LENGTH': 9},
 {'NAME': 'hiFlag', 'START': 82, 'STOP': 83, 'LENGTH': 1},
 {'NAME': 'hiNumLoans', 'START': 83, 'STOP': 87, 'LENGTH': 4},
 {'NAME': 'hiTotalLoans', 'START': 87, 'STOP': 96, 'LENGTH': 9},
 {'NAME': 'multiFlag', 'START': 96, 'STOP': 97, 'LENGTH': 1},
 {'NAME': 'multiNumLoans', 'START': 97, 'STOP': 101, 'LENGTH': 4},
 {'NAME': 'multiTotalLoans', 'START': 101, 'STOP': 110, 'LENGTH': 9},
 {'NAME': 'nonFlag', 'START': 110, 'STOP': 111, 'LENGTH': 1},
 {'NAME': 'nonNumLoans', 'START': 111, 'STOP': 115, 'LENGTH': 4},
 {'NAME': 'nonTotalLoans', 'START': 115, 'STOP': 124, 'LENGTH': 9},
 {'NAME': 'recordQuality', 'START': 124, 'STOP': 125, 'LENGTH': 1}]
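
Applying a mapping like this is mostly string slicing. Here is a minimal sketch, assuming each record is one fixed-width line of text and that START/STOP are zero-based, end-exclusive offsets (which is how the list above reads). The helper name parse_record and the sample line are made up for illustration; they are not taken from the archive files.

```python
# The first three fields of the 1981-1990 layout; the full list works the same way.
LAYOUT_1981_HEAD = [
    {'NAME': 'respondentName', 'START': 0, 'STOP': 28, 'LENGTH': 28},
    {'NAME': 'respondentID', 'START': 28, 'STOP': 36, 'LENGTH': 8},
    {'NAME': 'reportMSA', 'START': 36, 'STOP': 40, 'LENGTH': 4},
]


def parse_record(line, layout):
    """Slice one fixed-width line into a dict of whitespace-stripped strings."""
    return {field['NAME']: line[field['START']:field['STOP']].strip()
            for field in layout}


# Made-up sample line, padded to the widths in the layout (not real HMDA data).
sample = 'EXAMPLE SAVINGS BANK'.ljust(28) + '00001234' + '1680'
record = parse_record(sample, LAYOUT_1981_HEAD)
```

Everything is kept as a string on purpose: codes with leading zeros (such as state or county codes) lose information if they are converted to integers too early.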

Overall, lots of VA considerations.

1990–2014

A big change was introduced in 1989 to the tracking priorities. From the FFIEC:

"[…] In 1989, the Federal Reserve Board revised Regulation C, to incorporate amendments contained in the Financial Institutions Reform, Recovery and Enforcement Act (FIRREA). The FIRREA amendments accomplished the following: expanded the coverage of HMDA to include mortgage lenders not affiliated with depository institutions or holding companies; required reporting of data regarding the disposition of applications for mortgage and home improvement loans in addition to data regarding loan originations and purchases; and required most lenders to identify the race, sex, and income of loan applicants and borrowers."

As expected, the file mapping changes significantly, but it now follows a pattern usable until 2014:

[{'NAME': 'ASOF_DATE', 'START': 0, 'STOP': 4, 'LENGTH': 4},
 {'NAME': 'RESP_ID', 'START': 4, 'STOP': 14, 'LENGTH': 10},
 {'NAME': 'AGENCY_CODE', 'START': 14, 'STOP': 15, 'LENGTH': 1},
 {'NAME': 'LOAN_TYPE', 'START': 15, 'STOP': 16, 'LENGTH': 1},
 {'NAME': 'LOAN_PURPOSE', 'START': 16, 'STOP': 17, 'LENGTH': 1},
 {'NAME': 'OCCUPANCY', 'START': 17, 'STOP': 18, 'LENGTH': 1},
 {'NAME': 'LOAN_AMOUNT', 'START': 18, 'STOP': 23, 'LENGTH': 5},
 {'NAME': 'ACTION_TYPE', 'START': 23, 'STOP': 24, 'LENGTH': 1},
 {'NAME': 'PROPERTY_MSA', 'START': 24, 'STOP': 28, 'LENGTH': 4},
 {'NAME': 'STATE_CODE', 'START': 28, 'STOP': 30, 'LENGTH': 2},
 {'NAME': 'COUNTY_CODE', 'START': 30, 'STOP': 33, 'LENGTH': 3},
 {'NAME': 'CENSUS_TRACT_NUMBER', 'START': 33, 'STOP': 40, 'LENGTH': 7},
 {'NAME': 'RACE_APPLICANT', 'START': 40, 'STOP': 41, 'LENGTH': 1},
 {'NAME': 'RACE_COAPPLICANT', 'START': 41, 'STOP': 42, 'LENGTH': 1},
 {'NAME': 'SEX_APPLICANT', 'START': 42, 'STOP': 43, 'LENGTH': 1},
 {'NAME': 'SEX_COAPPLICANT', 'START': 43, 'STOP': 44, 'LENGTH': 1},
 {'NAME': 'APPLICANT_INCOME', 'START': 44, 'STOP': 48, 'LENGTH': 4},
 {'NAME': 'PURCHASER_TYPE', 'START': 48, 'STOP': 49, 'LENGTH': 1},
 {'NAME': 'DENIAL_REASON_1', 'START': 49, 'STOP': 50, 'LENGTH': 1},
 {'NAME': 'DENIAL_REASON_2', 'START': 50, 'STOP': 51, 'LENGTH': 1},
 {'NAME': 'DENIAL_REASON_3', 'START': 51, 'STOP': 52, 'LENGTH': 1},
 {'NAME': 'EDIT_STATUS', 'START': 52, 'STOP': 53, 'LENGTH': 1},
 {'NAME': 'SEQUENCE_NUMBER', 'START': 53, 'STOP': 60, 'LENGTH': 7}]
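
Since every downstream field depends on these offsets, it is worth sanity-checking a layout before trusting it. The sketch below assumes the fields are meant to be contiguous (no filler columns) and that LENGTH should always equal STOP minus START; layout_problems is a hypothetical helper, not part of any HMDA tooling.

```python
def layout_problems(layout):
    """Return human-readable issues found in a fixed-width layout."""
    problems = []
    expected_start = 0
    for field in layout:
        # LENGTH is redundant with START/STOP, so the two must agree.
        if field['STOP'] - field['START'] != field['LENGTH']:
            problems.append(f"{field['NAME']}: LENGTH does not match STOP - START")
        # Assumption: fields are back to back, with no unused columns.
        if field['START'] != expected_start:
            problems.append(f"{field['NAME']}: gap or overlap before this field")
        expected_start = field['STOP']
    return problems


# The first four fields of the 1990-2014 layout above.
LAYOUT_1990_HEAD = [
    {'NAME': 'ASOF_DATE', 'START': 0, 'STOP': 4, 'LENGTH': 4},
    {'NAME': 'RESP_ID', 'START': 4, 'STOP': 14, 'LENGTH': 10},
    {'NAME': 'AGENCY_CODE', 'START': 14, 'STOP': 15, 'LENGTH': 1},
    {'NAME': 'LOAN_TYPE', 'START': 15, 'STOP': 16, 'LENGTH': 1},
]

# A deliberately broken layout, made up to show what gets reported.
BROKEN = [
    {'NAME': 'a', 'START': 0, 'STOP': 4, 'LENGTH': 4},
    {'NAME': 'b', 'START': 5, 'STOP': 9, 'LENGTH': 3},
]

clean_report = layout_problems(LAYOUT_1990_HEAD)
broken_report = layout_problems(BROKEN)
```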

This means that we can now tell, for every application:

  • The race and gender of the applicant;
  • the race and gender of the co-applicant;
  • the neighborhood that they were in;
  • the purpose of the loan (house or repair);
  • the list of reasons why a loan might’ve been denied;
  • the total number of forms required to be submitted; and
  • each and every one of the dates of the application process.

Was a loan application intentionally slowplayed? Was a home repair denied that would’ve otherwise been approved? What about same-sex co-applicants? All visible.

(Note: we haven’t made our processed data available yet, as we’re still processing and investigating it. There are many reporting errors, such as state acronyms instead of state codes and erroneous census tracts, along with many other issues that still require a tremendous amount of preprocessing to make sense of decades-long shifts in social norms. It would be unwise to come to any conclusions with it at this time.)
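
To give a flavor of that preprocessing, here is a small sketch of one repair: turning a state field that may hold either a numeric code or a postal acronym into a two-digit code. It assumes the numeric codes are FIPS state codes; the lookup table is deliberately partial and normalize_state is a hypothetical helper, so treat this as an illustration rather than our actual pipeline.

```python
# Partial postal-abbreviation -> FIPS state code table, for illustration only.
# A real cleanup would need every state and territory.
STATE_FIPS = {'CA': '06', 'NY': '36', 'OH': '39', 'TX': '48'}


def normalize_state(raw):
    """Return a two-digit state code, or None when the value cannot be repaired."""
    value = raw.strip().upper()
    if value.isdigit() and len(value) <= 2:
        return value.zfill(2)  # already numeric; restore a dropped leading zero
    return STATE_FIPS.get(value)  # None for anything we do not recognize


examples = [normalize_state(v) for v in ['NY', ' 6', '48', 'ZZ', '']]
```

Returning None instead of guessing keeps the unrepairable records visible, so they can be counted and reviewed rather than silently folded into the wrong state.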

Over the years, adherence to the program has grown across most banks. The digitization wave that spread across the US can be verified year over year. There’s also a distinct downward shift toward fewer and fewer mortgage loan applications, most likely stemming from the subprime mortgage crisis:

There’s a clear step change around 2008 for the total number of mortgage loan applications.
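
A count like this can be sketched in a few lines. The version below assumes the four-character ASOF_DATE field at the start of each 1990–2014 record holds the reporting year; the function name and the toy lines are made up and carry no real figures.

```python
from collections import Counter


def applications_per_year(lines):
    """Count records by the four-character ASOF_DATE field (columns 0 to 4)."""
    return Counter(line[0:4] for line in lines if len(line) >= 4)


# Toy 60-character records: only the leading year matters for this count.
toy_lines = ['2007' + 'x' * 56, '2007' + 'y' * 56, '2009' + 'z' * 56]
counts = applications_per_year(toy_lines)
```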

Great so far, isn’t it? Now, here’s the double-edged sword about this data:

The same data that can be used to protect citizens against predatory bank practices can be used as a discriminatory weapon against them if data scientists do not understand the data they are using.

How so? This dataset now has within it every bias, preferential treatment, and erroneously or maliciously declined loan application. Every possible discriminatory event is permanently recorded. No sanitizing of this data is possible.

The HMDA dataset is biased – absolutely.

It’s up to the data scientists to understand which questions are the right ones to ask and, more importantly, which ones not to ask.

Should this data even exist?

By any measure, this dataset can be considered invasive and discriminatory. Canada, usually considered a (slightly) more tolerant country, has a different approach to tracking race and racism. As explained in The Conversation:

Canada’s anti-racism strategy, which draws on decades’ worth of research, states that race is a social construct. There is no basis for classifying people according to race, but racial bias and discrimination have very real effects. The question is: How do we get relevant data from the census and other surveys on the impact of systemic racism?

Statistics Canada tries to gather this information without directly asking about race. Race-based data is needed, says Jean-Pierre Corbeil, a diversity specialist at Statistics Canada. But he wonders whether that actually requires referring to race on the census.

Historically, the government has been reluctant to ask directly about race, which has led to a lack of disaggregated data. After the Second World War, the census used indirect methods of estimating the non-white, non-Indigenous population through racial proxies like language or ethnocultural origin.

(Note: one of the major complaints against Canadian multiculturalism as a pillar of civil society is that it allows for people to classify others by ethnicity or nationality, under the veil of a self-assigned permission structure, but that’s a topic for later.)

So, this means some countries refuse to track the race and gender of bank loan applicants, on the grounds that measuring racism requires bucketing people into categories, which amplifies its divisive nature. And so, looking back at the HMDA data, the question that can be asked is, "should this data even exist?"

I believe the answer is a strong yes. The Act has yielded tremendous protection for the public, especially with a major investigation back in 2005. From the Buffalo News:

New York Attorney General Eliot Spitzer has fired his latest salvo, launching a preliminary probe into mortgage lending practices at eight major banks in the state, including HSBC Bank USA. […]

Investigators are trying to determine how the banks price their loans, and if fees and interest rates are being applied fairly, or whether there’s racial discrimination. […]

Citigroup already agreed in 2002 to pay $215 million to settle allegations by the Federal Trade Commission that Associates First Capital Corp. – which Citi bought in 2000 – had engaged in predatory lending. Associates’ rival Household International – acquired by HSBC Holdings Plc in 2003 – paid $484 million in 2002 to settle similar charges by all 50 states in the largest consumer settlement ever.

However, although this data has yielded success, it’s up to the data scientists to understand which questions are the right ones to ask and, more importantly, which ones not to ask. For instance, trying to build a mortgage loan approval pipeline from this data is a horrendous idea: you now have a racist, homophobic bot. However, if you wish to investigate the very nature of institutionalized discrimination by investigating outliers, then you have a tremendous weapon at your disposal.
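
As a rough sketch of what "investigating outliers" could look like, the snippet below computes a denial rate per lender and flags lenders sitting well above the rest. It assumes ACTION_TYPE '3' means the application was denied (check the code sheet for the year you are using), works on records already parsed into dicts, and uses an arbitrary z-score cutoff. The function names and the toy records are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

DENIED = '3'  # assumed ACTION_TYPE code for "application denied"


def denial_rates(records):
    """Map each RESP_ID to the share of its applications that were denied."""
    totals, denied = defaultdict(int), defaultdict(int)
    for rec in records:
        totals[rec['RESP_ID']] += 1
        if rec['ACTION_TYPE'] == DENIED:
            denied[rec['RESP_ID']] += 1
    return {lender: denied[lender] / totals[lender] for lender in totals}


def flag_outliers(rates, z_cutoff=1.5):
    """Return lenders whose denial rate is more than z_cutoff std devs above the mean."""
    values = list(rates.values())
    if len(values) < 2:
        return []
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return sorted(k for k, v in rates.items() if (v - mu) / sigma > z_cutoff)


# Toy data: four lenders deny 1 in 4 applications, lender 'E' denies all 4.
toy_records = []
for lender in ['A', 'B', 'C', 'D']:
    toy_records += [{'RESP_ID': lender, 'ACTION_TYPE': code} for code in '1113']
toy_records += [{'RESP_ID': 'E', 'ACTION_TYPE': '3'} for _ in range(4)]

flagged = flag_outliers(denial_rates(toy_records))
```

A flagged lender is a prompt for a closer look, not a finding: a high denial rate can come from the applicant mix, the loan types, or plain reporting errors.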

How you wield that weapon is up to you.




Happy Consulting!

-Matt.

If you have additional questions about this article or our AI consulting framework, feel free to reach out by LinkedIn or by email.
