Why Domain Expertise is Overrated — Part I

Sammy Lee
Towards Data Science
5 min read · Nov 28, 2018

One of the most fundamental misconceptions that Data Science is helping to dismantle is the idea of an all-powerful, all-knowing expert who likes to tell regular people what to do, how to eat, what to buy, how to raise our kids, and so on.

But there are countless instances where non-experts upended the old world order and proved the experts wrong.

Historical examples include the Wright Brothers who risked their lives by tinkering, instead of deriving theories of physics. Or the countless medical advancements that occurred through pure serendipity instead of top-down directed research.

Back in the early 2000s, the prevailing wisdom in baseball was that a team’s number of wins was highly correlated with its payroll.

But Billy Beane, general manager of the lowest salaried team in all of baseball, took the Oakland A’s to the playoffs four consecutive times — an astoundingly low probability event.

How did he do it? He did it by focusing on a statistic that none of the experts in baseball thought very much about, on-base percentage (OBP):

“An Island of Misfit Toys”
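For readers who haven't met the statistic, here is a minimal sketch of how OBP is usually computed. The formula is the standard definition rather than something taken from this article, and the season line in the example is made up.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """Standard OBP: times on base divided by the plate appearances that count."""
    times_on_base = hits + walks + hit_by_pitch
    opportunities = at_bats + walks + hit_by_pitch + sacrifice_flies
    if opportunities == 0:
        return 0.0  # avoid dividing by zero for a player with no appearances
    return times_on_base / opportunities


# Hypothetical season line, not a real player's statistics.
example = on_base_percentage(
    hits=150, walks=60, hit_by_pitch=5, at_bats=500, sacrifice_flies=5
)
print(round(example, 3))
```

Unlike batting average, walks count in the numerator, which is why a patient hitter can look far better by OBP than by the traditional numbers.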

A Non-Trivial Example

Real estate cannot be lost or stolen, nor can it be carried away. Purchased with common sense, paid for in full, and managed with reasonable care, it is about the safest investment in the world. (Franklin D. Roosevelt, U.S. president)

A myth that still persists in American society is the idea that single-family homes are good, safe investments that tend to go up in price over the long run.

As we saw in the last recession (circa 2009), this isn’t a trivial matter. I remember the anxiety in my coworkers’ voices as they talked about buying $500,000 houses in Las Vegas, or owning three homes at once (as a 24-year-old bartender).

We all know how The Great Recession ended, but what always looms in my mind is where were the experts who were supposed to warn us?

We’ve never had a decline in house prices on a nationwide basis — Ben Bernanke (former Fed Chairman on CNBC in 2005)

Ben Bernanke, arguably one of the most powerful people in the world, was telling Americans to keep calm and carry on.

Was Ben Bernanke right?

Let’s fire up a Jupyter Notebook.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
xlsx = pd.ExcelFile('HomePriceIndex.xls')
df = pd.read_excel(xlsx, skiprows=6)
df.rename({'Date': 'Year',
'Index': 'RealHomePriceIndex',
'Index.1': 'RealBuildingCost',
'Millions': 'Population',
'Long Rate': 'LongRate'}, inplace=True, axis=1)
df = df[['Year', 'RealHomePriceIndex', 'RealBuildingCost', 'Population', 'LongRate']]
df.head()

The dataset comes courtesy of Yale economist Robert Shiller who was one of the few economists to predict the home price crash in his book Irrational Exuberance. For those who want to skip the preprocessing, you can get the cleaned dataset here.

As an aside, this dataset is the same one that’s now used as the standard for U.S. national home prices by Standard & Poor’s. Prior to Shiller, there were no reliable home price indices that stretched as far back as the late 19th century. We stand on his shoulders.

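Let’s continue by checking the dtypes and converting the non-numeric column to numeric values.

df.dtypes  # Inspect the column types
# Entries that cannot be parsed become NaN; they are dropped in a later step
df['LongRate'] = pd.to_numeric(df['LongRate'], errors='coerce')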
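To see what errors='coerce' does, here is a small sketch on a made-up column. The values and the 'NA' placeholder are assumptions for illustration, not entries from Shiller's spreadsheet.

```python
import pandas as pd

# Hypothetical column with one non-numeric placeholder.
raw = pd.DataFrame({'LongRate': ['3.42', 'NA', '4.10']})

# Strings that parse become floats; anything else becomes NaN.
raw['LongRate'] = pd.to_numeric(raw['LongRate'], errors='coerce')
print(raw['LongRate'].dtype)

# Dropping rows on the DataFrame itself is what actually removes the NaN row.
clean = raw.dropna()
print(clean)
```

Note that calling .dropna() on the converted Series and assigning it straight back to the column would not shrink the DataFrame, because pandas realigns on the index and restores the NaN.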

If you examine the entire dataset, the major problem is that it’s a time series whose frequency changes in the middle of the set. Starting in 1953, the frequency switches from annual to monthly and stays monthly all the way to 2018.

Because this problem concerns just the ‘Year’ and ‘RealHomePriceIndex’ variables we will separate them out, resample and aggregate using the mean, and put the two DataFrames together into one that we can work on.

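# Work on a copy so the edits below don't trigger SettingWithCopyWarning
df_2 = df[['Year', 'RealHomePriceIndex']].copy()
df_2.drop([851, 852, 853], inplace=True)
# Replace 'Year' with a DatetimeIndex: 63 annual stamps, then 788 monthly stamps
# (on pandas >= 2.2, use freq='YE' and freq='ME' instead of 'A' and 'M')
annual_dates = pd.date_range('1890', periods=63, freq='A')
monthly_dates = pd.date_range('1953', periods=788, freq='M')
df_2.index = annual_dates.append(monthly_dates)
df_2.index.name = 'Year'
df_2.drop('Year', inplace=True, axis=1)
# Collapse the monthly part to annual means
df_2_bottom = df_2.loc['1953-01-31':].resample('A').mean()
df_2_bottom = df_2_bottom.astype('int')
df_2 = df_2.loc['1890-12-31': '1952-12-31']
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_2 = pd.concat([df_2, df_2_bottom])
df_2.index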
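If the index juggling above is hard to follow, here is a toy version of the same idea on made-up numbers (these are not Shiller's values). It is a sketch, not the exact code used above: it groups by calendar year instead of calling resample, which gives the same annual means and avoids the frequency-alias differences between pandas versions.

```python
import pandas as pd

# Made-up values for illustration only.
annual = pd.Series(
    [100.0, 102.0],
    index=pd.to_datetime(['1951-12-31', '1952-12-31']),
)
# 'MS' (month start) is accepted by both old and new pandas versions.
monthly = pd.Series(
    [float(v) for v in range(1, 25)],
    index=pd.date_range('1953-01-01', periods=24, freq='MS'),
)

# Collapse each piece to one mean value per calendar year.
annual_by_year = annual.groupby(annual.index.year).mean()
monthly_by_year = monthly.groupby(monthly.index.year).mean()

# Stitch the two pieces into a single annual series indexed by year.
combined = pd.concat([annual_by_year, monthly_by_year])
print(combined)
```

The result has one row per year from 1951 to 1954, so the annual and formerly monthly parts can be plotted on the same footing.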

Now we clean up the old df DataFrame that we separated ‘RealHomePriceIndex’ from.

# Drop these because we already have df_2
df.drop(['Year', 'RealHomePriceIndex'], inplace=True, axis=1)
df.dropna(inplace=True) # Drop NaN values
# Truncate by 3 rows to make it even with the rest of the DataFrame
df_2 = df_2.loc['1890-12-31':'2015-12-31']
df.index = df_2.index
# Finally add df_2 to df
df['RealHomePriceIndex'] = df_2['RealHomePriceIndex']
df.head()

Now let’s plot the data to see if Bernanke was right.

plt.style.use('fivethirtyeight')
ax = df.RealHomePriceIndex.plot.line(x=df.index, y=df.RealHomePriceIndex, legend=False, figsize=(14,8));
ax.axvline(x='2006', color='red', linewidth=2, alpha=0.8)
ax.axhline(y=0, color='black', linewidth=1.3, alpha=0.8)
ax.text(x='1880', y=185, s="U.S. Home Prices Stayed Roughly Flat for 100 Years",
fontsize=30, fontweight='bold');
ax.text(x='1890', y=170, s="Home Prices indexed to inflation, 1890=100",
fontsize=20);
ax.text(x='1992', y=25, s="2006 Peak of Bubble",
fontweight='bold', color='#CB4335');

The most important feature of the plot is that it’s indexed to inflation, so it’s an apples-to-apples comparison of U.S. home prices that spans more than 100 years.

If you take your index finger and trace the plot from 1890, when the index starts at 100, to 1995, there’s barely any increase in real price levels. That means homes weren’t that great of an investment for most people compared to the 9% equity premium benchmark that stocks returned throughout the 20th century.
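To put a number on "barely any increase", you can turn two index levels into a compound annual growth rate. The sketch below uses a made-up end level of 110 as a placeholder; read the actual 1995 value off df before drawing conclusions.

```python
def annualized_growth(start_level, end_level, years):
    """Compound annual growth rate between two index levels."""
    return (end_level / start_level) ** (1.0 / years) - 1.0


# Hypothetical end level for illustration; the real 1995 value is in df.
rate = annualized_growth(start_level=100.0, end_level=110.0, years=105)
print(f"{rate:.4%} per year")
```

Because the index is already adjusted for inflation, the result is a real growth rate, so it should be compared with real stock returns rather than nominal ones.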

And house prices did drop for about 20 years at the very beginning of the 20th century.
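One simple way to check a claim like "prices have never declined" is to measure the largest peak-to-trough drop in the series. Here is a minimal sketch on a toy series. The helper name largest_decline and the numbers are my own for illustration; the same function could be pointed at df['RealHomePriceIndex'].

```python
import pandas as pd


def largest_decline(index_levels: pd.Series) -> float:
    """Largest peak-to-trough drop, as a negative fraction of the prior peak."""
    running_peak = index_levels.cummax()
    drawdown = index_levels / running_peak - 1.0
    return float(drawdown.min())


# Toy series, not real data: rises to 110, falls to 90, then recovers.
toy = pd.Series([100.0, 110.0, 90.0, 95.0, 120.0])
print(f"{largest_decline(toy):.1%}")
```

A result of zero would mean the series never fell below an earlier high; anything negative contradicts the "never declined" claim.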

This type of analysis did not involve calculus or linear algebra or whatever quantitative methods economists like to use for their models.

Had Bernanke and others simply paid attention to the fact that house prices in general were stagnant for 100 years, the bubbly rise starting from 1995 might’ve prompted them to take action sooner rather than later.
