Predicting Singapore HDB Resale Price: EDA and Modeling

Conduct EDA and try out different ML models for resale price predictions

Photo by Mike Enerio on Unsplash

In the previous post, we performed data wrangling on the Singapore HDB resale price dataset and identified some features that should be useful for building the model later on. Before moving on to modeling, let’s explore the dataset a bit more to see if there are any interesting patterns we can find.

For this project, I used SAS JMP software for data visualization. JMP is quite convenient for simple data visualization tasks and has a user-friendly interface. You may want to refer to their official website for more details.

What is the overall trend for HDB resale price over the years?

From the website, we can obtain the HDB resale price data ranging from 1990 to 2018. Let’s look at the overall trend in the median resale price:

Singapore HDB Median price from 1990 to 2018 (Image by author)

From this plot, we can see 2 peaks in the HDB resale price. The first peak came in 1997, followed by a significant drop that was mainly driven by the Asian Financial Crisis. The HDB resale price then started to increase again from 2006 and reached a new peak in 2013. After that, the price of HDB resale flats dropped again, which coincided with a series of cooling measures in the public housing market, such as the ABSD framework. From 2014 until now, the overall HDB resale price has been quite stable, judging by the median value.
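The plot above was made in JMP, but the same yearly median can be computed with pandas. The following is only a minimal sketch on a few made-up rows: the column names `month` and `resale_price` are assumptions about how the resale table is laid out, and the prices are illustrative, not real transactions.

```python
import pandas as pd

# Made-up rows for illustration only; `month` and `resale_price` are
# assumed column names, not necessarily those of the actual dataset.
sales = pd.DataFrame({
    "month": ["1997-03", "1997-08", "1997-11",
              "2006-01", "2006-07",
              "2013-05", "2013-09", "2013-12"],
    "resale_price": [300_000, 320_000, 340_000,
                     250_000, 270_000,
                     450_000, 470_000, 520_000],
})

# Extract the transaction year from the "YYYY-MM" string
sales["year"] = pd.to_datetime(sales["month"], format="%Y-%m").dt.year

# One median resale price per year
median_by_year = sales.groupby("year")["resale_price"].median()
print(median_by_year)
```

The median is used rather than the mean because it is less sensitive to a handful of unusually expensive transactions in a given year.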

From this analysis, we get a rough idea of the overall trend in HDB resale prices since 1990. We can observe that Singapore HDB prices appear to be strongly affected by government policies and overall economic trends.


Do different town areas have similar price trends?

From the first plot, we learned that the median HDB flat price from 2014 until now looks stable. Does this apply to all the town areas in Singapore? I had my doubts, as I had noticed some news that HDB prices in central areas reached a historical high in recent years.

4-Room HDB flat median resale price partitioned by town area (Image by author)

This plot shows the median 4-Room HDB flat price partitioned by town area. After 2013, the Central Area HDB price clearly goes up abruptly, with an even steeper slope than in previous years. For other central towns like Queenstown, Bukit Timah, and Bukit Merah, the resale price also stays on the higher side. For town areas that are far away from the CBD (such as Choa Chu Kang, Sembawang, etc.), the resale price has kept decreasing since 2013.
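A per-town comparison like this one can be sketched with a pandas pivot table. The rows below are made up purely to show the mechanics (the column names `year`, `town`, `flat_type`, and `resale_price` are assumptions), so the numbers should not be read as real prices.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumptions.
sales = pd.DataFrame({
    "year": [2013, 2013, 2013, 2013, 2018, 2018, 2018, 2018],
    "town": ["CENTRAL AREA", "CENTRAL AREA", "SEMBAWANG", "SEMBAWANG"] * 2,
    "flat_type": ["4 ROOM"] * 8,
    "resale_price": [600_000, 640_000, 420_000, 440_000,
                     800_000, 840_000, 360_000, 380_000],
})

# Keep one flat type so that towns are compared like for like
four_room = sales[sales["flat_type"] == "4 ROOM"]

# Rows are years, columns are towns, cells are median prices
median_by_town = four_room.pivot_table(
    index="year", columns="town", values="resale_price", aggfunc="median"
)

# Change in median price between 2013 and 2018 for each town
change = median_by_town.loc[2018] - median_by_town.loc[2013]
print(change.sort_values(ascending=False))
```

Sorting the change makes it easy to see which towns moved against the overall trend.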

From this graph, we can see that even with all the cooling measures, the central area HDB resale price still goes up. People are willing to spend more money on an HDB flat in a central area to enjoy the convenience it brings. Moreover, the space available for building new HDB flats in the central area is very limited, which also makes the resale market there much hotter than in other regions. For the towns that are far away from the city, there is usually more space available for new HDB flats, so people have a higher chance of getting a BTO flat instead of buying a resale HDB. As a result, the resale price tends to drop after the cooling measures take effect.


Travel time to CBD matters

From the previous analysis, we can see that the central area has a higher HDB resale price. In general, central areas mean easier access to shopping malls and other facilities, more convenient transportation, etc. How does the travel time from different locations to the central area affect the HDB resale price?

4-Room HDB flats from 2010 to 2018: median resale price vs. travel time to Raffles Place MRT (Image by author)

From this plot, we can see a good linear correlation between HDB resale price and travel time to Raffles Place MRT, which matches our expectations as well. If we only consider the travel time to the CBD area, are there any under-valued town locations that buyers can pay more attention to?

Median resale price vs. travel time to Raffles Place MRT (Image by author)

This graph gives us some insight into the relationship between town location and resale price. HDB flats in the Central Area, Queenstown, and Bukit Timah generally have a higher resale price than flats in other towns with a similar travel time. In other words, these flats are potentially over-priced if we only consider their relative distance to the downtown area. On the other hand, HDB flats in areas such as Geylang/Kallang, Serangoon, and Ang Mo Kio may be a good deal for potential buyers who want to stay closer to the CBD on a limited budget. These town areas are more cost-effective, since they have a lower price compared with other towns with a similar travel time to downtown (the difference could be as much as ~200K).
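One way to make "over-priced" and "under-valued" concrete is to fit a straight line of median price against travel time and look at how far each town sits from that line. This is a sketch of that idea rather than the method used for the plot, and the towns, travel times, and prices below are hypothetical.

```python
import numpy as np

# Hypothetical towns and values, for illustration only
towns = ["Town A", "Town B", "Town C", "Town D", "Town E"]
travel_min = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
median_price = np.array([600_000, 480_000, 500_000, 400_000, 370_000],
                        dtype=float)

# Least-squares line: price = slope * travel_time + intercept
slope, intercept = np.polyfit(travel_min, median_price, deg=1)

# Price the line would predict from travel time alone
expected = slope * travel_min + intercept

# Negative residual: cheaper than travel time alone would suggest
residual = median_price - expected

for town, r in sorted(zip(towns, residual), key=lambda pair: pair[1]):
    print(f"{town}: {r:+,.0f}")
```

Towns with the most negative residuals are the candidates for a good deal under this single-factor view; towns with large positive residuals are priced above what travel time alone explains.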


Choosing a flat-type for modeling

Last but not least, I plotted the HDB resale price vs. year, partitioned by flat type.

HDB Resale Price vs. year, partitioned by flat type (Image by author)

Not surprisingly this time, we can see that the general resale price trends for different flat types are similar. Since we have enough data samples (>850K rows of data in total), I decided to use only the 4-Room flat type for modeling, as it is the most popular flat type in Singapore. In addition, I only used the data samples from the year 2005 onwards, since more recent data should be more representative and relevant for predicting future HDB resale prices.
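The selection described above can be sketched in pandas as follows. The rows are made up and the column names `year`, `flat_type`, and `resale_price` are assumptions; the point is only to show how to pick the most common flat type and restrict the years.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumptions.
sales = pd.DataFrame({
    "year": [2003, 2004, 2006, 2010, 2012, 2015, 2018],
    "flat_type": ["4 ROOM", "3 ROOM", "4 ROOM", "5 ROOM",
                  "4 ROOM", "3 ROOM", "4 ROOM"],
    "resale_price": [280_000, 200_000, 300_000, 450_000,
                     430_000, 310_000, 420_000],
})

# The flat type with the most transactions
most_common = sales["flat_type"].value_counts().idxmax()

# Keep only that flat type, and only transactions from 2005 onwards
subset = sales[(sales["flat_type"] == most_common) & (sales["year"] >= 2005)]
print(most_common, len(subset))
```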


Time for modeling!

Finally, we have reached the exciting part: modeling! We will try out several models, compare their accuracy, and, more importantly, discuss why certain models perform better.

As mentioned before, I selected 8 features for the final modeling. The 4-Room HDB resale transaction data since 2005 is used as input data for model training and testing.

Final Dataset to be used for the resale price prediction model (Image by author)

Linear Regression Model

I started with a linear regression model. The main advantages of a linear regression model are that it has few parameters to tune and that the results are relatively easy to interpret. For this project, I used the sklearn library, which is one of the most popular toolkits for data science projects nowadays.

The code for building this model is quite simple, so I will not go into too much detail here. All the source code will be available at this link.

Linear Regression Model (Image by author)

One thing to highlight here is that the score function of sklearn's LinearRegression returns R² (the coefficient of determination). This lets us express the test result as a percentage.

_R² is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse)._
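To make the definition concrete, here is a small worked example that computes R² by hand and checks it against sklearn's `r2_score`. The four values are arbitrary toy numbers, not HDB prices.

```python
import numpy as np
from sklearn.metrics import r2_score

# Arbitrary toy values, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# u: residual sum of squares, v: total sum of squares
u = ((y_true - y_pred) ** 2).sum()
v = ((y_true - y_true.mean()) ** 2).sum()
r2_manual = 1 - u / v

print(r2_manual, r2_score(y_true, y_pred))

# A model that is far off everywhere scores below zero
bad_pred = np.full_like(y_true, 100.0)
print(r2_score(y_true, bad_pred))
```

Here u = 1.5 and v = 29.1875, so R² is about 0.949. A score of 0 corresponds to always predicting the mean of `y_true`, which is why predictions worse than that give a negative value.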

With the linear regression model, I got an R² score of ~79.7% on the test dataset (which I will loosely call accuracy from here on) without any additional parameter tuning. A pretty good start!
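For readers who want to see the overall workflow, the following is a minimal sketch of a train/test split, a fit, and an R² score with sklearn. It runs on synthetic data with two invented features (`floor_area` and `travel_min`), so it is not the code or the dataset behind the result above, and its score will differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: price rises with floor area
# and falls with travel time, plus random noise.
rng = np.random.default_rng(0)
n = 500
floor_area = rng.uniform(80, 110, n)   # square metres (assumed unit)
travel_min = rng.uniform(10, 60, n)    # minutes to the CBD (assumed)
X = np.column_stack([floor_area, travel_min])
y = 4_000 * floor_area - 3_000 * travel_min + rng.normal(0, 20_000, n)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# score() returns R² on the data passed in
test_r2 = model.score(X_test, y_test)
print(f"Test R^2: {test_r2:.3f}")
print("Coefficients:", model.coef_)
```

The fitted coefficients are what make a linear model easy to interpret: each one is the estimated price change per unit of that feature, holding the others fixed.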

Linear Regression Model: predicted value vs. actual value (Image by author)

This graph plots the predicted values from the linear regression model against the true values. We can see that they generally follow a linear trend. We can also notice that the point density in the range of 30K to 55K is higher. In a sense, the resale price dataset is imbalanced, with more transactions falling in the 30K to 55K resale price range.


Gradient Boosting Regression Model

While 79.7% accuracy is a good start, are we able to achieve higher accuracy? The linear regression model does not seem to have much room for further improvement. Given that our dataset is imbalanced in nature, and after some further research, I recalled that a gradient boosting regression model could be a good choice in our case.

The gradient boosting tree algorithm is an ensemble learning method that builds trees one at a time. Each new tree is fitted to the errors left by the trees before it (the residuals or, more generally, the negative gradient of the loss), so the data samples that are still poorly predicted have the most influence on the next tree. This can help with imbalanced datasets like ours, because the algorithm keeps working on the samples that are harder to predict.

Similarly, sklearn provides a ready-made implementation, GradientBoostingRegressor, that we can use directly. The main parameters to tune are as follows:

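n_estimators – The number of boosting stages (trees) to perform. Setting it very high increases training time and, especially together with a large learning rate, can overfit the model.

max_depth – The maximum depth of each individual regression tree. Setting it too high will cause overfitting issues.

learning_rate – Shrinks the contribution of each tree. There is a trade-off between learning_rate and n_estimators: a smaller rate usually needs more trees.

loss – The loss function to be optimized. ‘ls’ refers to least squares regression, the same squared-error objective that our previous linear regression model minimizes. (In scikit-learn 1.0 and later, this option is named ‘squared_error’.)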
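The parameters above map onto sklearn's `GradientBoostingRegressor` as in the sketch below. The data is synthetic and the hyperparameter values are illustrative placeholders, not the tuned values used for this project. The `loss` argument is left at its default, which is the least-squares loss, so that the snippet does not depend on which name a given sklearn version expects.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic non-linear data for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(600, 3))
y = (10 * np.sin(np.pi * X[:, 0])
     + 5 * X[:, 1] ** 2
     + rng.normal(0, 0.5, 600))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative values, not the settings used in the article
gbr = GradientBoostingRegressor(
    n_estimators=200,    # number of boosting stages (trees)
    max_depth=3,         # depth of each individual tree
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
    random_state=0,      # make the run repeatable
)
gbr.fit(X_train, y_train)

# As with LinearRegression, score() returns R² on the given data
test_r2 = gbr.score(X_test, y_test)
print(f"Test R^2: {test_r2:.3f}")
```

In practice these values would be chosen by checking the score on held-out data, for example with a grid search, rather than being fixed up front.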

Gradient Boosting Regression Model (Image by author)

After training the model, we use the same test dataset to check the accuracy, and we reach 95.5%. Pretty amazing! I have plotted the same correlation plot as for the linear regression model, and the result is clearly a significant improvement.

Gradient Boosting Model: predicted value vs. actual value (Image by author)

The feature_importances_ attribute of the fitted model indicates how much weight each feature carries in the model's predictions (the values sum to 1). We can see that floor_area, distance to the closest shopping mall and MRT station, and remaining lease have the largest impact on the HDB resale price (≥ 15% weightage each). The travel time to the CBD, storey level, and town location have ~10% impact each on the final resale price. The least important feature turns out to be flat_model, which accounts for about 5% of the total importance.
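As a sketch of how such weightings are read off a fitted model, the example below trains a `GradientBoostingRegressor` on synthetic data and prints its `feature_importances_`. The three feature names are placeholders: the target depends strongly on the first, weakly on the second, and not at all on the third.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data for illustration only; names are placeholders.
rng = np.random.default_rng(0)
feature_names = ["floor_area", "travel_min", "noise"]
X = rng.uniform(0, 1, size=(500, 3))
# The third column is deliberately unused in the target
y = 10 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 0.1, 500)

gbr = GradientBoostingRegressor(random_state=0).fit(X, y)

# One importance per feature, normalised so that they sum to 1
importances = dict(zip(feature_names, gbr.feature_importances_))
for name, weight in sorted(importances.items(),
                           key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {weight:.3f}")
```

These importances measure how much each feature reduced the training loss across all tree splits. They rank features but do not say whether a feature pushes the price up or down.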


Linear Regression vs. Gradient Boosting Regression

We see that the gradient boosting model gives a much better result than the linear regression model, with an accuracy improvement of ~15 percentage points. Why did the gradient boosting regression model outperform the linear regression model by so much?

This suggests that the given features have non-linear effects on price that are not captured by the simple linear regression model. The gradient boosting tree model uses decision trees as weak learners. At each iteration, it calculates the loss of the current ensemble and adds a new decision tree fitted to reduce the remaining residuals, so the prediction accuracy improves step by step. This is something that a pure linear regression model cannot achieve.
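The iteration described above can be written out by hand in a few lines. This is a simplified sketch of boosting with a squared-error loss on a synthetic, clearly non-linear target, not sklearn's exact implementation: each round fits a small tree to the current residuals and adds a shrunken copy of its prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear target for illustration only: y = x^2 + noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.3, 400)


def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((a - b) ** 2))


# Baseline: a straight line cannot follow the curve
linear_pred = LinearRegression().fit(X, y).predict(X)

# Hand-rolled boosting: start from the mean, then repeatedly
# fit a shallow tree to whatever error is left over.
learning_rate = 0.1
pred = np.full_like(y, y.mean())
for _ in range(100):
    residual = y - pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)
    pred += learning_rate * tree.predict(X)

# Both errors are measured on the training data
print(f"Linear MSE:  {mse(y, linear_pred):.3f}")
print(f"Boosted MSE: {mse(y, pred):.3f}")
```

Because each shallow tree splits the input range into pieces, the sum of many trees can follow a curve that no single straight line can, which is the kind of gap observed between the two models above.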

Although the GBR can achieve higher prediction accuracy, it has the downside of being harder to interpret than the linear regression model. We can only get a feel for the weightage of different features from the feature_importances_ values mentioned above.


Conclusions and Recommendations

In this project, we have performed a comprehensive study of the Singapore HDB resale price dataset. We have gone through the dataset pre-processing, feature engineering, data exploration and visualisation, and modeling stages of a data science project. The key factors that affect HDB resale prices have been identified. Finally, a gradient boosting model for resale price prediction has been constructed, with a prediction accuracy > 95%.

Although we have reached a high prediction accuracy, it should also be noted that the real HDB resale price is much more complex than what we have captured in the model. Many other factors, such as overall economic conditions, government regulation, and the proportion of young working adults in the respective towns, may play a role in the final resale price. These factors can be added in future work for further improvement.

When people are actually considering buying an HDB flat, their priorities also vary case by case. For example, a couple who work at Micron (which is located in the north of Singapore) may simply choose to buy an HDB flat in Woodlands or Yishun, close to their workplace. For them, the travel time from home to the CBD may not be a high priority, since most of the time they will stay in the neighborhoods in the north.

We have also identified several town locations, such as Kallang, Ang Mo Kio, and Serangoon, which might be good places to consider for people who are looking for a location near the CBD at a lower price.

Thank you again for following along with my posts! Please feel free to leave your comments and suggestions below. In case you missed my previous post on the data preparation part, you may check it out below!

https://medium.com/@tianjie1112/predicting-singapore-hdb-resale-price-data-preparation-be39152b8c69

