Data Analysis of New York City’s Condo Market using Python

Learn about the trends of New York City’s property market

Published in

Towards Data Science

8 min read · Nov 18, 2020

New York City is one of the most densely populated cities on the planet, with nearly 8.4 million people living within 302 square miles.

Alongside accommodating an incredibly large and diverse population, NYC possesses a property market of varied architectural styles. The market itself is regarded as one of the most expensive and competitive in the world. What is interesting when examining the American housing market as a whole is the enormous price disparity between NYC and the rest of America, and the growing gap in property prices is concerning. A modernized 350 square foot apartment in Manhattan’s SoHo neighbourhood has been listed for $645,000 (roughly $1,843 per square foot). By comparison, “the typical American home is 2,687 square feet and costs $244,054” (Warren, 2020), or roughly $91 per square foot. That is a gap of about $1,752 per square foot between the two properties.
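The per-square-foot figures above follow from simple division. A minimal sketch of the arithmetic, using only the prices and sizes quoted in the paragraph (the variable names are illustrative):

```python
# Price per square foot for the two listings quoted above.
soho_price, soho_sqft = 645_000, 350
typical_price, typical_sqft = 244_054, 2_687

soho_ppsf = soho_price / soho_sqft            # SoHo apartment
typical_ppsf = typical_price / typical_sqft   # typical American home
gap = soho_ppsf - typical_ppsf                # difference per square foot

print(f"SoHo apartment:  ${soho_ppsf:,.0f} per sq ft")
print(f"Typical US home: ${typical_ppsf:,.0f} per sq ft")
print(f"Gap:             ${gap:,.0f} per sq ft")
```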

This article will look at:

The change in the number of condominiums being built
The changes in Market Valuation of condominiums across NYC

The dataset being used:

The data used in this investigation is freely available on Kaggle.com and was collated from the City of New York’s Department of Finance (DOF).

(https://www.kaggle.com/jinbonnie/condominium-comparable-rental-income-in-nyc)

Challenges:

Initial challenges identified in the preliminary analysis of the dataset:

Missing data: using the missingno library, we can identify missing values in each column of the dataset. The ‘Year Built’ column is missing 73 values, which may in turn cause difficulties when trying to categorize condominiums by age. Additionally, ‘Full Market Value’ is missing 2 values and ‘Estimated Expense’ is missing 1 value.

import missingno as msno
msno.bar(df1, color="dodgerblue", sort="ascending", figsize=(10,5), fontsize=12)
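The bar chart gives a visual overview; the exact per-column counts can also be read with plain pandas. The sketch below assumes a small stand-in frame called `toy` with invented values (only the column names come from the article), so the counts it prints are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df1: column names follow the article, values are invented.
toy = pd.DataFrame({
    "Year Built": [1925, np.nan, 2006, np.nan],
    "Full Market Value": [1_200_000, 950_000, np.nan, 3_400_000],
    "Estimated Expense": [210_000, 180_000, 95_000, 400_000],
})

# Count nulls per column and keep only the columns that have any.
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```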

Solution: remove the rows containing null values. The resulting data frame had 21997 rows and 12 columns. However, problems remained when attempting to visualize the condominiums by the date they were built. Upon further investigation, it was discovered that 37 rows had the build year listed as 0. After deleting these rows, the new data frame had 21962 rows and 12 columns.
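The two cleaning steps described above could look roughly like this. It is a sketch on an invented five-row frame (`toy` and `cleaned` are hypothetical names), not the exact code used for the article.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df1: column names follow the article, values are invented.
toy = pd.DataFrame({
    "Year Built": [1925, np.nan, 0, 2006, 2014],
    "Full Market Value": [1_200_000, 950_000, 700_000, np.nan, 3_400_000],
})

# Step 1: drop every row that contains a null in any column.
cleaned = toy.dropna()

# Step 2: drop rows where the build year is recorded as 0.
cleaned = cleaned[cleaned["Year Built"] != 0].reset_index(drop=True)

print(cleaned.shape)
```

Checking `shape` after each step is a quick way to confirm how many rows each filter removed.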

Using the neighborhoods as unique identifiers: there are hundreds of neighborhoods within NYC, and they are not officially designated. The 180 unique values within the ‘Neighborhood’ column may not be universally recognized and, consequently, the analysis performed later may not generalize to properties outside of the DOF data. A further issue with regard to location is that, although the street address of each condominium is listed, there is no ZIP code value. This means that the condominiums cannot be precisely plotted using a mapping package (for instance folium.Map). However, the DOF Data Dictionary indicates that the first number of the ‘Boro-Block-Lot’ column corresponds to a borough: 1=Manhattan, 2=The Bronx, 3=Brooklyn, 4=Queens, 5=Staten Island. Therefore, as part of the preliminary analysis, the condominiums can be visualized from a general borough point of view.

Solution: create a new column taking the first number of the string of characters within the Boro-Block-Lot column in order to characterise the condominiums by location.
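One way to derive such a column with pandas string methods is sketched below. The new column names (`Borough Code`, `Borough`) and the second and third identifiers are made up for illustration; the code-to-borough mapping is the one given in the DOF Data Dictionary.

```python
import pandas as pd

# Borough codes as described in the DOF Data Dictionary.
BOROUGHS = {
    "1": "Manhattan",
    "2": "The Bronx",
    "3": "Brooklyn",
    "4": "Queens",
    "5": "Staten Island",
}

# Toy stand-in: only the first identifier appears in the article, the rest are invented.
toy = pd.DataFrame({"Boro-Block-Lot": ["1-00016-7521", "3-00045-7501", "4-00420-7502"]})

# Take the first character of the identifier and map it to a borough name.
toy["Borough Code"] = toy["Boro-Block-Lot"].astype(str).str[0]
toy["Borough"] = toy["Borough Code"].map(BOROUGHS)

print(toy)
```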

The Analysis

1. When were the condominiums built?

NYC has an incredibly historic property market, with the oldest structure in the city having been built in 1653. Within this dataset, the earliest build date is 1825, corresponding to eight buildings in the Cobble Hill neighborhood.

# histogram for condos by year built
import matplotlib.pyplot as plt

num_bins = 200
data_value=df1["Year Built"]
plt.hist(data_value, num_bins, facecolor='navy', alpha=0.5)
plt.xlabel("Year Built")
plt.ylabel("Count")
plt.title("Condominiums by Year Built")
plt.show()

The histogram shows the distribution of build dates, and it is clear that a huge number of buildings were built after 2000. Therefore, the next section takes a deeper look at the 2000 to 2019 period.

# violin plot of ages of buildings
import seaborn as sns

sns.set_style("dark")  # set_style applies the theme globally; axes_style only returns a dict
plt.figure(figsize=(10,6))
plt.title("Ages Frequency")
sns.violinplot(y=df1["age"])
plt.show()

This report has investigated the clustering of Condominium buildings with regard to the age of the building and the Total Number of Units in the Building.

The deep-dive…

The highest count of new buildings occurred in 2006, when 1433 were built. The sharp drop starting in 2007 is a clear reflection of historic events, namely the Great Recession that began in 2007. While the recession officially ended in 2009, its dampening economic effects were visible within the US for a much longer period. Therefore, the continued slump in new builds reflects the slow recovery of the economy in NYC and wider America.
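A count like this can be obtained by filtering to the 2000 to 2019 window and tallying rows per build year. The sketch below uses an invented `toy` frame, and it assumes each row represents one building to be counted; the real dataset may need de-duplication first.

```python
import pandas as pd

# Toy stand-in: build years are invented for illustration.
toy = pd.DataFrame({"Year Built": [1998, 2005, 2006, 2006, 2006, 2008, 2012, 2021]})

# Keep only the deep-dive window (between() is inclusive on both ends).
recent = toy[toy["Year Built"].between(2000, 2019)]

# Number of rows per build year, in chronological order.
builds_per_year = recent["Year Built"].value_counts().sort_index()

# Year with the most new builds.
peak_year = builds_per_year.idxmax()
print(peak_year, builds_per_year.loc[peak_year])
```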

Investigating by borough

As stated earlier, the DOF categorized the condominiums by borough (1=Manhattan, 2=The Bronx, 3=Brooklyn, 4=Queens, 5=Staten Island) within the Boro-Block-Lot column. Therefore, a new column was created by separating the first identifying character from the string (from 1–00016–7521, “1” was taken). This is a more appropriate method of analyzing condominium location than using the loosely defined ‘Neighborhood’ column, which contained 180 unique values.

The initially slow growth in all boroughs except Brooklyn from 2000 to 2002 is likely due to the effects of the Dot Com Bubble and the resulting crash of the Nasdaq index, a 76.81% fall (Investopedia, 2020). What is interesting is that the number of condominiums built within Brooklyn was significantly larger than in any other borough, reaching 746 at its peak in 2006. This reflects the gentrification of Brooklyn, which occurred at the start of the century. Brooklyn is no longer the manufacturing hub it once was; rather, it is known for its chic hipster culture, thriving restaurant scene and soaring property prices (Hooton, 2020). The graph below indicates that there was a clear drop in the number of new buildings from 2007–2009 across all boroughs, followed by a continued drop in construction that reflects the lingering effects of the global financial crisis on the American economy. Staten Island is a clear outlier because only 22 condominiums were recorded as being built there, across 2005, 2010 and 2014, within the dataset.
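The plotting code that follows relies on one series of yearly build counts per borough (`manhattan_builds`, `brooklyn_builds` and so on), which the article does not show being created. A plausible way to build them is a group-by on borough and build year. This is an assumption rather than the original code, and the `Borough` column name and toy values are hypothetical.

```python
import pandas as pd

# Toy stand-in: a derived 'Borough' column plus build years, all invented.
toy = pd.DataFrame({
    "Borough": ["Manhattan", "Brooklyn", "Brooklyn", "Queens", "Brooklyn", "Manhattan"],
    "Year Built": [2005, 2006, 2006, 2006, 2007, 2006],
})

# Rows per (borough, build year); the result is a Series with a two-level index.
builds = toy.groupby(["Borough", "Year Built"]).size()

# Selecting one borough leaves a Series indexed by year, ready to plot.
brooklyn_builds = builds.loc["Brooklyn"]
manhattan_builds = builds.loc["Manhattan"]

print(brooklyn_builds)
```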

# combining all the plots for new builds
# visualisation
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(x=manhattan_builds.index, y=manhattan_builds.values,
                        mode='lines',
                         name='Manhattan',
                        ))
fig.add_trace(go.Scatter(x=bronx_builds.index, y= bronx_builds.values,
                        mode='lines',
                         name='Bronx',))
fig.add_trace(go.Scatter(x=brooklyn_builds.index, y=brooklyn_builds.values,
                        mode='lines',
                        name='Brooklyn',))
fig.add_trace(go.Scatter(x=queens_builds.index, y= queens_builds.values,
                        mode='lines',
                        name='Queens',))
fig.add_trace(go.Scatter(x=staten_builds.index, y= staten_builds.values,
                        mode='lines',
                        name='Staten Island'))
fig.update_layout(
    template='gridon',
    title='Condominiums by Year Built by Borough',
    xaxis_title='Year',
    yaxis_title='New Builds',
    xaxis_showgrid=False,
    yaxis_showgrid=False,
    legend=dict(y=-.2, orientation='h'),
    shapes=[
        # shaded band: the Great Recession
        dict(
            type="rect",
            x0=2007,
            y0=0,
            x1=2009,
            y1=brooklyn_builds.values.max()*1.2,
            fillcolor="LightSalmon",
            opacity=0.5,
            layer="below",
            line_width=0,
        ),
        # shaded band: the Dot-Com Bubble
        dict(
            type="rect",
            x0=2000,
            y0=0,
            x1=2002,
            y1=brooklyn_builds.max()*1,
            fillcolor="LightSalmon",
            opacity=0.5,
            layer="below",
            line_width=0,
        ),
        # shaded band: the Vietnam War
        dict(
            type="rect",
            x0=1955,
            y0=0,
            x1=1975,
            y1=brooklyn_builds.max()*1,
            fillcolor="LightSalmon",
            opacity=0.5,
            layer="below",
            line_width=0,
        ),
    ],
    annotations=[
        dict(text="The Great Recession", x=2007, y=brooklyn_builds.values.max()*1.2),
        dict(text="  Dot-Com Bubble", x=2001, y=brooklyn_builds.values.max()*1),
        dict(text="  Vietnam War", x=1955, y=brooklyn_builds.values.max()*1),
    ],
)
fig.show()

2. Change in Market Valuations

Manhattan has consistently been one of the most expensive property markets in the world, with a super-prime market that includes ultra-luxury developments such as Billionaires’ Row. It has been identified that there were “more houses bought north of $25 million on Manhattan’s 57th Street in the last five years than any other road in the world” (Sidders, 2020). The graph below clearly shows that condominium market valuations are much higher on average in central Manhattan than in any other borough within NYC, peaking at $32.41M in 2018. Not only are the properties located in Manhattan likely to be valued higher due to the prime location, they are also the largest developments (within the 2000–2019 deep dive), with some condominiums containing as many as 1432 units.
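The per-borough averages plotted next (`manhattan_mean_value` and the others) are not derived anywhere in the article. One hedged guess at how they might be computed is a mean of ‘Full Market Value’ grouped by borough and ‘Report Year’. The `Borough` column name and all values below are invented for illustration.

```python
import pandas as pd

# Toy stand-in: 'Report Year' and 'Full Market Value' follow the article's column names.
toy = pd.DataFrame({
    "Borough": ["Manhattan", "Manhattan", "Manhattan", "Queens", "Queens"],
    "Report Year": [2017, 2017, 2018, 2017, 2018],
    "Full Market Value": [10_000_000, 20_000_000, 30_000_000, 2_000_000, 4_000_000],
})

# Average market value per (borough, report year).
mean_value = toy.groupby(["Borough", "Report Year"])["Full Market Value"].mean()

# One Series per borough, indexed by report year.
manhattan_mean_value = mean_value.loc["Manhattan"]
queens_mean_value = mean_value.loc["Queens"]

print(manhattan_mean_value)
```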

#combining all the plots for market value!
# visualisation
fig = go.Figure()
fig.add_trace(go.Scatter(x=manhattan_mean_value.index, y=manhattan_mean_value.values,
                        mode='lines',
                         name='Manhattan',
                        ))
fig.add_trace(go.Scatter(x=bronx_mean_value.index, y= bronx_mean_value.values,
                        mode='lines',
                         name='Bronx',))
fig.add_trace(go.Scatter(x=brooklyn_mean_value.index, y=brooklyn_mean_value.values,
                        mode='lines',
                        name='Brooklyn',))
fig.add_trace(go.Scatter(x=queens_mean_value.index, y= queens_mean_value.values,
                        mode='lines',
                        name='Queens',))
fig.add_trace(go.Scatter(x=staten_mean_value.index, y= staten_mean_value.values,
                        mode='lines',
                        name='Staten Island',))
fig.update_layout(
    template='gridon',
    title='Average Market Value by Borough',
    xaxis_title='Year',
    yaxis_title='Average Market Value $',
    xaxis_showgrid=False,
    yaxis_showgrid=False,
    legend=dict(y=-.2, orientation='h'))
fig.show()

There was a significant rise in market value within the Staten Island borough, with one condominium rising by $1,840,000 from 2017 to 2019 and another rising $6,033,000 from 2016 to 2019.

The gentrification of Brooklyn

Brooklyn experienced huge gentrification, and the value of its buildings reflected this. As a result of Brooklyn’s growing luxury market, several neighborhoods within the area became accessible only to the very wealthy. Brooklyn had once been the cheaper alternative to Manhattan; as it became less affordable, the borough of Queens began to gain in popularity, resulting in a huge spike in the market valuation of its condominiums.

Conclusions

The number of new builds per year appears to be a suitable proxy for economic performance. Periods of financial crisis, on the whole, coincided with fewer new builds across NYC.

Further insights indicate that a huge spike in the average square footage of developments being built occurred around 1933, which corresponds to the construction of Parkchester, an entirely new neighborhood in the Bronx (Conde, 2020).

#combining all the plots for new builds by borough!
# visualisation
fig = go.Figure()
fig.add_trace(go.Scatter(x=manhattan_builds.index, y=manhattan_builds.values,
                        mode='lines',
                         name='Manhattan',
                        ))
fig.add_trace(go.Scatter(x=bronx_builds.index, y= bronx_builds.values,
                        mode='lines',
                         name='Bronx',))
fig.add_trace(go.Scatter(x=brooklyn_builds.index, y=brooklyn_builds.values,
                        mode='lines',
                        name='Brooklyn',))
fig.add_trace(go.Scatter(x=queens_builds.index, y= queens_builds.values,
                        mode='lines',
                        name='Queens',))
fig.add_trace(go.Scatter(x=staten_builds.index, y= staten_builds.values,
                        mode='lines',
                        name='Staten Island'))
fig.update_layout(
    template='gridon',
    title='Condominiums by Average SqFt by Borough',
    xaxis_title='Year',
    yaxis_title='Average SqFt',
    xaxis_showgrid=False,
    yaxis_showgrid=False,
    #legend=dict(y=-.2, orientation='h'),
)

The market value of condominiums was only recorded from 2012 to 2019, as these were the years reported by the DOF. There is a clear increase in average market value across NYC as an aggregate.

# market value as an aggregate
import pandas as pd
import plotly.express as px

yvalue = data2000.groupby('Report Year')['Full Market Value'].mean()
fig = px.line(yvalue, x=yvalue.index, y=yvalue.values)
fig.update_layout(
    template='gridon',
    title='Average Market Value across NYC',
    xaxis_title='Report Year',
    yaxis_title='Full Market Value $',
    xaxis_showgrid=False,
    yaxis_showgrid=False
)
fig.show()

pd.set_option("display.precision", 0)
# histogram for condos by year built
num_bins = 20
data_value=data2000["Year Built"]
plt.hist(data_value, num_bins, facecolor='navy', alpha=0.5)
plt.xlabel("Year Built")
plt.ylabel("Count")
plt.title("Condominiums by Year Built")
plt.show()

Further analysis will look at logistic regression and K-means clustering on this dataset. Stay tuned!

You can find my code on my GitHub:

https://github.com/rorybain96/NYC-Condo-Analysis/tree/main

Written by Cd