
Executive Summary
The Boston Housing Market dataset is ubiquitous but imperfect, with problems such as its small size, inconsistent definitions and incorrect coordinates. However, it is still a rich dataset containing informative geographical information, powerful socioeconomic indicators, and continuous levels of nitrogen oxides (NOx). This project explores the effect of developing low income neighbourhoods on NOx levels. The project won first place in the 24-hour AI Hack 2021. The links below are for those interested in delving further into the analysis.
Aims and Objectives
The aim of this project is to explore novel avenues in the Boston Housing Dataset. Typical projects with the dataset focus on regression or clustering. While both are interesting, our goal was to answer a research question that could lead directly to policy recommendations.
This is done through 3 objectives:
- Verify that the dataset is informative enough to answer the research question: An important assumption our question makes is that the data is informative enough to separate towns by income. Success Criteria: find clusters that are comparable to external literature.
- Find the best model for predicting NOx levels: The project involves finding a regression model to predict NOx levels, as well as choosing the best regressors for this. Success Criteria: find models that give high accuracy and normally distributed residuals.
- Simulate ‘development’ in low-income towns: What does it mean for a town to have been ‘developed’? Are any parameters constrained, for example by geography? Are there socioeconomic parameters that could be changed by policy recommendations? Success Criteria: find a reasonable way to simulate ‘development’ in a low income town.
Data processing and feature extraction
Our first step was to explore the data visually, using the pandas_profiling library. This enabled us to get a quick understanding of the data in terms of continuous, ordinal and categorical variables.
We concluded that there were two types of features:
- Geographically informative data: this includes DIS (distance from employment centres) and RAD (road index) for example.
- Socioeconomic indicators: this includes AGE (proportion of owner-occupied units built prior to 1940) and CMEDV (corrected median house values) for example.
We noticed that some variables, such as CHAS, were not informative, and therefore excluded them from any analysis.
We plotted each dwelling on a folium map, where we realised that the latitudes and longitudes were erroneous. These were fixed using Google’s geocoder API (the methodology for this is described here).
Main aspects of Methodology
Objective 1: Verifying that our dataset is informative enough
To verify that our dataset contains sufficient information to answer the research question, we had to find a way to cluster towns by income, and compare our results with the literature. Since we cared about 3 clusters (low, medium and high income towns), we chose the simple K-means algorithm as a starting point for clustering towns. The outcome was consistent enough with the literature (see page 11 of our report), so we didn’t feel a need to look for more complex clustering methods. The dwellings are shown in Figure 1 below.

We selected the following 4 predictor variables as our features for K-means, based on our understanding of how informative they are for clustering, and also our reading of the correlation coefficients.
- CMEDV – numeric vector of corrected median values of owner-occupied homes in 1000 USDs
- INDUS – numeric vector of proportions of non-retail business acres per town (constant for all Boston tracts)
- AGE – numeric vector of proportions of owner-occupied units built prior to 1940
- LSTAT – numeric vector of percentage values of lower status population
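The clustering step can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual code: the cluster centres, the standardisation step, and the choice to name clusters by their mean CMEDV are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the four clustering features
# (columns: CMEDV, INDUS, AGE, LSTAT); NOT the real Boston data.
centres = np.array([
    [15.0, 18.0, 90.0, 20.0],   # "low income"-like tracts
    [22.0, 10.0, 65.0, 12.0],   # "medium income"-like tracts
    [35.0,  4.0, 35.0,  5.0],   # "high income"-like tracts
])
X = np.vstack([
    c + rng.normal(scale=[2.0, 1.5, 5.0, 1.5], size=(100, 4)) for c in centres
])

# Standardise so that no single feature dominates the Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Assumption: order clusters by mean CMEDV to name them low / medium / high.
order = np.argsort([X[labels == k, 0].mean() for k in range(3)])
names = {int(cluster): name for cluster, name in zip(order, ["low", "medium", "high"])}
income_class = np.array([names[int(k)] for k in labels])
```

Here `income_class` holds one label per row, which is the kind of grouping the later steps rely on when they refer to low and high income towns.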
Before any scaling, LSTAT was highly skewed and contained many outliers. We wrote an optimiser that determined the ideal transformation (and normalisation) function for the data, based on the kurtosis and skewness of the post-transformation distributions. Figure 2 below shows the distributions and boxplots for our selected features after the transformation was conducted.
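One way such an optimiser could work is sketched below. The candidate transformations, the scoring rule (sum of absolute skewness and absolute excess kurtosis) and the synthetic input are assumptions for illustration; the original optimiser may have used a different candidate set or objective.

```python
import numpy as np

def skewness(x):
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

def excess_kurtosis(x):
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

# Hypothetical candidate set; all candidates assume strictly positive inputs.
CANDIDATES = {
    "identity": lambda x: x,
    "sqrt": np.sqrt,
    "log": np.log,
}

def best_transform(x):
    """Return (name, standardised values) for the candidate whose output
    has the smallest |skewness| + |excess kurtosis|."""
    scores = {}
    for name, fn in CANDIDATES.items():
        t = fn(x)
        scores[name] = abs(skewness(t)) + abs(excess_kurtosis(t))
    name = min(scores, key=scores.get)
    t = CANDIDATES[name](x)
    return name, (t - t.mean()) / t.std()

rng = np.random.default_rng(0)
# Synthetic right-skewed, positive data standing in for LSTAT.
lstat_like = rng.lognormal(mean=2.3, sigma=0.6, size=2000)
name, transformed = best_transform(lstat_like)
```

The returned values are standardised to zero mean and unit variance, so the transformation and normalisation happen in one step.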

Objective 2: creating a regression model to predict NOx levels accurately
Since the NOx outputs are continuous, a regression model was deemed appropriate. In addition to the 4 predictors used for clustering, we chose an extra one, CRIM (an indicator of crime in a given town), given its correlation with the NOx emissions. Our target variable was, of course, the NOx emissions.
We started with a simple regression model for a benchmark, and then used different models to improve on this. Our methodology is given in the flowchart below.

The best model turned out to be SVR. This is expected since support vector machines can: a) capture non-linear behaviour, but also b) capture information from ordinal and categorical variables. Since one of our predictors, RAD, is actually an ordinal index, normal regression models that rely on ‘distance’ are not appropriate, whereas the SVR can capture this information well.
In addition to SVR and standard linear regression, Ridge, Lasso and neural network based regression were also used. The latter two performed fairly poorly. In the case of Lasso regression this is expected, but the outcome of the neural network was rather surprising. However, given time constraints during the competition, the neural network was not optimised further, since the R² score (and model simplicity) achieved by the SVR was deemed sufficient.
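A benchmark-then-compare loop of this kind might look like the sketch below. It runs on synthetic data with a deliberately non-linear target, and the hyperparameters, the 5-fold cross-validation and the scaling pipeline are assumptions rather than the settings used in the competition. The neural network is left out to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in for five regressors and a non-linear target.
X = rng.uniform(-1.0, 1.0, size=(400, 5))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=400)

# Assumed hyperparameters, chosen only for illustration.
models = {
    "Linear": LinearRegression(),      # benchmark
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "SVR": SVR(kernel="rbf", C=10.0, epsilon=0.01),
}

scores = {}
for name, model in models.items():
    # Scale inside the pipeline so each fold is standardised on its own training data.
    pipeline = make_pipeline(StandardScaler(), model)
    scores[name] = float(cross_val_score(pipeline, X, y, cv=5, scoring="r2").mean())
```

On data like this, where the target depends on a squared term, a kernel method has an obvious advantage over the linear models; how large the gap is on the real dataset depends on the features and tuning.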
A summary of the top 3 models is provided in Table 1 below.

Figure 4 below shows the predicted outputs of the SVR plotted against the true data with respect to 4 of the 5 regressors (the ordinal predictor RAD is omitted).

In order to verify that our regression model worked correctly, we decided to plot the residuals on a QQ-plot to verify our assumption of normality. This is shown in Figure 5 below.
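The quantities behind a normal QQ-plot can be computed as in the sketch below. The residuals here are synthetic, and the plotting positions `(i - 0.5) / n` are one common convention assumed for the example; plotting libraries may use slightly different ones.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical, sample) quantiles for a normal QQ-plot."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    # Plotting positions (i - 0.5) / n keep the probabilities inside (0, 1).
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    # Standardise the residuals so both axes are on the same scale.
    sample = (r - r.mean()) / r.std(ddof=1)
    return theoretical, sample

rng = np.random.default_rng(0)
residuals = rng.normal(scale=0.03, size=300)   # synthetic residuals
theoretical, sample = qq_points(residuals)

# A correlation close to 1 means the points lie near a straight line.
qq_corr = float(np.corrcoef(theoretical, sample)[0, 1])
```

Plotting `sample` against `theoretical` gives the QQ-plot itself; `qq_corr` is a simple numeric companion to the visual check.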

We also used the method of partial dependence (for more information on this, please refer to page 14 of our report) to measure the impact of varying predictors on the NOx levels, while keeping everything else constant. We were only interested in measuring the difference when changing socio-economic indicators (since these are changeable through policy, whereas geographical parameters are fixed). As a result, we only examined the effects of AGE and CRIM. The outcomes are shown in Figure 6 below.
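The idea behind partial dependence can be written out by hand, as in the sketch below: force one feature to a grid of values for every row, predict, and average. The `predict` function here is a hypothetical stand-in for a fitted model, and the data is synthetic.

```python
import numpy as np

def partial_dependence_1d(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value   # vary one feature, keep the rest as observed
        averages.append(predict(X_mod).mean())
    return np.array(averages)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))   # synthetic predictors

# Hypothetical fitted model: any function mapping X to predictions works here.
def predict(X):
    return 0.4 + 0.2 * X[:, 0] + 0.1 * X[:, 1] * X[:, 2]

grid = np.linspace(0.0, 1.0, 5)
pd_feature0 = partial_dependence_1d(predict, X, feature=0, grid=grid)
```

Because the toy model is additive in the first feature, the resulting curve is a straight line with slope 0.2; with a real SVR the curve would generally be non-linear.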

Objective 3: simulating development of low income towns
In order to answer our research question, "How will developing low income towns in Boston affect NOx levels?" we needed to define a measure for development. For this, we make use of the fact that the features we selected can be split into geographically and non-geographically constrained.
In our case, we note that AGE and CRIM are the only features that are not geographically constrained. This means that when it comes to policy recommendations, we could recommend renovating houses, or pushing for policies that decrease crime. With our other predictors, namely DIS, INDUS and RAD, we couldn’t possibly make policy recommendations without causing radical change. For example:
- Changing DIS or INDUS means moving / shutting down employment centres and industrial sites, which would have significant economic ramifications
- Changing RAD indicates a complete change in road systems of Boston, which would once again have significant impacts
As such, we refer to these as geographically constrained features, since changing them would require complete re-planning of the city.
Having split our data into these two categories, we can then simulate the development of low income neighbourhoods by replacing their non-geographically constrained features with those of bootstrapped high income neighbourhoods.
This means that for every dwelling in our low income towns, we replace the AGE and CRIM predictors with samples drawn from a bootstrapped distribution based on the high income towns. We then feed our new ‘developed’ datapoints into our NOx regression model to predict the new NOx values.
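A minimal sketch of this replacement step is given below, on synthetic arrays. The column order, the value ranges, and the decision to resample AGE and CRIM jointly from the same high income row (rather than independently) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins; assumed column order: AGE, CRIM, DIS, INDUS, RAD.
low = np.column_stack([
    rng.uniform(70, 100, 150),   # AGE
    rng.uniform(5, 20, 150),     # CRIM
    rng.uniform(1, 3, 150),      # DIS
    rng.uniform(15, 25, 150),    # INDUS
    rng.integers(4, 25, 150),    # RAD
])
high = np.column_stack([
    rng.uniform(10, 50, 80),     # AGE
    rng.uniform(0, 1, 80),       # CRIM
    rng.uniform(4, 10, 80),      # DIS
    rng.uniform(1, 5, 80),       # INDUS
    rng.integers(1, 8, 80),      # RAD
])

CHANGEABLE = [0, 1]   # AGE and CRIM, the non-geographically constrained features

def simulate_development(low, high, changeable, rng):
    """Replace the changeable columns of `low` with rows resampled
    (with replacement) from `high`; all other columns stay untouched."""
    idx = rng.integers(0, len(high), size=len(low))   # bootstrap indices
    developed = low.copy()
    developed[:, changeable] = high[idx][:, changeable]
    return developed

developed = simulate_development(low, high, CHANGEABLE, rng)
```

The `developed` array keeps each dwelling's DIS, INDUS and RAD values and would then be passed to the fitted regression model to predict the new NOx values.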
The outcomes of this are shown in Figure 7 below:

The disparity is clear: the results of this analysis suggest that there is an incentive to uplift low income towns because it would potentially decrease NOx levels. This provides even non-humanitarian reasons to pursue this initiative. Figure 8 below shows the effect that uplifting low income neighbourhoods by bootstrapping from medium and high income areas has on the distributions. It would appear that the peak of the distribution remains the same, with little change, but that the data is more dispersed towards the left, leading to an overall positive skew.

Discussion and Policy Recommendations
The results for the partial dependence analysis (see Figure 6) are both interesting and confusing. The impact of AGE is mostly logical, i.e. an increase in house AGE would indicate older infrastructure for heating, and thus higher NOx emissions. The one for CRIM however is confusing, because it seems to suggest that, all other things equal, increasing crime will increase the NOx levels, a notion that is hard to believe without checking for other dependencies.
In terms of other recommendations: INDUS explains a substantial share of NOx. A recommendation for governments could be to encourage industries to move out of cities, and to build in more rural areas (as opposed to the city centre), since this affects both congestion and overall pollution density. However, more research needs to be done in determining the socioeconomic impact that this will have on the low income neighbourhoods, as many residents may lose their jobs as a result.
However, it’s worth noting that there are problems with the dataset that impact the effectiveness of this study. For example, the geographical parameter CHAS has no effect on anything, while intuitively one would expect it to, and the literature seems to suggest so too. Another interesting thing to note is that the dataset is quite thin: many of the affluent neighbourhoods fall in the outskirts of the city. Though this appears to agree with the literature, the dataset has imbalanced classes.
It’s interesting to look at some probabilities based on the models and existing online thresholds. Based on online standards, California’s maximum NOx level (in parts per 10 million) is 0.3, and the federal government sets the limit at 0.53. If California’s limit is taken as the threshold for low NOx, and the federal limit as the threshold for high NOx, then 52% of the dataset has NOx levels that are too high, and none of the values are actually low. Another interesting insight about the disparity: given a low income neighbourhood, the probability of having a high NOx rating is 96.6%. The table below summarises this information.
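Probabilities of this kind can be computed with a few lines, as in the sketch below. The two thresholds are the ones quoted in the text; the tiny arrays are made up for illustration, and treating values strictly below 0.3 as low and strictly above 0.53 as high is an assumption about how the boundaries were handled.

```python
import numpy as np

LOW_THRESHOLD = 0.30    # California limit quoted in the text
HIGH_THRESHOLD = 0.53   # federal limit quoted in the text

def nox_rating(nox):
    """Label each value as 'low', 'medium' or 'high' relative to the two limits."""
    nox = np.asarray(nox, dtype=float)
    return np.where(nox < LOW_THRESHOLD, "low",
                    np.where(nox > HIGH_THRESHOLD, "high", "medium"))

# Tiny made-up example, NOT the real dataset.
nox = np.array([0.45, 0.60, 0.70, 0.50, 0.58, 0.41])
income = np.array(["high", "low", "low", "medium", "low", "high"])

rating = nox_rating(nox)

# Share of all rows rated high, and the same share within low income rows.
p_high = float(np.mean(rating == "high"))
p_high_given_low_income = float(np.mean(rating[income == "low"] == "high"))
```

The conditional probability is simply the share of high ratings among the rows labelled as low income, which is how a figure such as the 96.6% above would be obtained from the real data.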

Concluding Remarks and Further Work
This project’s goal was to explore how developing low income neighbourhoods affects NOx levels in order to provide insight into policy making. This goal was broken down into three logical steps: 1) determining if the dataset is sufficient for the task at hand, 2) finding a model for predicting NOx levels across any data point, 3) determining the impact of developing a low income neighbourhood while keeping the geographical constraints constant. In order to verify the dataset’s sufficiency, K-means clustering was used to find 3 distinct clusters (corresponding to low, medium and high income neighbourhoods). This outcome was verified against past data showing similar distributions, meaning that the data is good enough to reasonably cluster neighbourhoods into three classes.
The second aspect of the project applied regression techniques to predict NOx emissions using 5 regressors, namely AGE, DIS, INDUS, CRIM, and RAD. This aspect found SVR to be the best predictor, giving an accuracy of 88%. The normality of the residuals was verified through QQ plots, which then allowed for a bootstrapped sample of high income neighbourhood data points to be created. Keeping geography fixed (i.e. INDUS, DIS and RAD), the improvable features of the low income neighbourhood dataset (i.e. AGE and CRIM) were replaced with those from the high income neighbourhoods. It was found that this caused an overall decrease in the NOx levels. The statistical effect is that the distribution of NOx in the low income neighbourhoods doesn’t change its peak, but has its surroundings dispersed to the left, causing an overall positive skew as development increases. Based on the research, it appears that CRIM and AGE are strong humanly-changeable indicators of NOx pollution. The evidence suggests that improving low income neighbourhoods correlates with lower NOx values.
Future work would do well to consider the following: 1) causality: explore whether AGE / CRIM can be deemed to cause changes in NOx levels, 2) data augmentation: this work did much in terms of augmenting the longitudes and latitudes, and creating synthetic data for ‘developed’ low income neighbourhoods; future methods could look to using neural network driven data augmentation techniques (such as GANs).
I hope that you enjoyed reading this article (and hopefully learnt something new!), and encourage you to always aspire to explore novel avenues in the projects that you work on, even if it seems like everything has already been explored.
All images created by author unless indicated otherwise.