Predicting house value using regression analysis

Bhavesh Patel
Towards Data Science
5 min read · Apr 17, 2017


Machine learning model

INTRODUCTION

Regression analysis is a basic method used in the statistical analysis of data. It is a statistical method for estimating the relationships among variables. One needs to identify the dependent variable, which varies based on the value of one or more independent variables. For example, the value of a house (dependent variable) varies based on the square footage of the house (independent variable). Regression analysis is a very useful tool in predictive analytics.

E(Y | X) = f(X, β)

STATISTICS BEHIND THE ANALYSIS

It is easy to understand with a graph (source: Wikipedia)

Y = f(X) = 𝛽0 + 𝛽1 * X

𝛽0 is the intercept of the line

𝛽1 is the slope of the line

The linear regression algorithm is used to predict the relationship (a line) among data points. There can be many different (linear or nonlinear) ways to define the relationship. In the linear model, the relationship is based on the intercept and the slope. To find the optimal relationship, we need to train the model with the data.
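To make the intercept and slope concrete, here is a minimal sketch of fitting 𝛽0 and 𝛽1 by ordinary least squares in plain Python. The square-footage and price values are made up for illustration and are not from the article's dataset.

```python
# Toy data: square feet (x) and price (y); values are hypothetical
xs = [1000, 1500, 2000, 2500, 3000]
ys = [200000, 280000, 370000, 450000, 540000]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# beta1 = covariance(x, y) / variance(x); beta0 makes the line pass
# through the point of means (mean_x, mean_y)
beta1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
beta0 = mean_y - beta1 * mean_x

def predict(x):
    return beta0 + beta1 * x
```

With these toy numbers the fitted slope is 170 dollars per square foot, and the line passes through the mean point, so `predict(2000)` returns the mean price.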

Before applying the linear regression model, we should determine whether there is a relationship between the variables of interest. A scatterplot is a good starting point for judging the strength of the relationship between two variables. The correlation coefficient is a valuable measure of association between variables. Its value varies between -1 (perfect negative relationship) and 1 (perfect positive relationship), with values near 0 indicating no linear relationship.
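The correlation coefficient mentioned above (Pearson's r) can be computed directly from its definition. A minimal sketch on toy data, where the variables are perfectly linearly related, so r comes out as exactly 1:

```python
import math

# Toy data with an exact linear relationship (y = 2x), so r == 1.0
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Pearson's r: covariance divided by the product of standard deviations
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
r = cov / math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                    * sum((y - mean_y) ** 2 for y in ys))
```

Flipping `ys` to a decreasing sequence would give r = -1, a perfect negative relationship.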

Once we determine that there is a relationship between the variables, the next step is to identify the best-fitting line. The most common method is least squares, which minimizes the Residual Sum of Squares (RSS). For each observed data point (actual value), this method takes its vertical distance from the proposed best-fitting line (its difference from the predicted value), squares that difference, and adds all of the squared differences together.

The MSE (Mean Squared Error) measures the quality of the estimator by dividing the RSS by the total number of observed data points. It is always a non-negative number, and values closer to zero represent a smaller error. The RMSE (Root Mean Squared Error) is the square root of the MSE. The RMSE measures the average deviation of the estimates from the observed values. Because it is in the same units as the dependent variable, it is easier to interpret than the MSE, which can be a very large number.

RMSE = √MSE
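The chain RSS → MSE → RMSE is easy to see in code. A short sketch with made-up actual and predicted values:

```python
import math

# Toy actual vs. predicted values (hypothetical, for illustration)
actual    = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 320.0]

# RSS: square each residual (actual - predicted) and sum them
residuals = [a - p for a, p in zip(actual, predicted)]
rss = sum(r ** 2 for r in residuals)    # (-10)^2 + 10^2 + (-20)^2 = 600

mse = rss / len(actual)                 # 600 / 3 = 200
rmse = math.sqrt(mse)                   # ~14.14, same units as the data
```

Note how the RMSE (about 14) is on the same scale as the data itself, while the MSE (200) is not.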

Each additional variable adds another dimension to the model, and each gets its own coefficient.

Y = f(X) = 𝛽0 + 𝛽1 * X1 + 𝛽2 * X2 + 𝛽3 * X3
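A multi-variable model like the one above can be fit with least squares as well. A hedged sketch using NumPy (the feature values are hypothetical, standing in for square feet, bedrooms, and bathrooms):

```python
import numpy as np

# Hypothetical rows of three features: [sqft, bedrooms, bathrooms]
X = np.array([
    [1400, 3, 1.0],
    [1800, 3, 2.0],
    [2400, 4, 2.5],
    [3000, 4, 3.0],
    [3500, 5, 3.5],
], dtype=float)
y = np.array([300000, 380000, 480000, 560000, 650000], dtype=float)

# Prepend a column of ones so the intercept beta0 is estimated too
X1 = np.column_stack([np.ones(len(X)), X])

# Solve for [beta0, beta1, beta2, beta3] minimizing the RSS
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
predictions = X1 @ beta
```

By construction, the least-squares fit can never have a larger RSS than simply predicting the mean price for every house.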

TOOLS USED

  1. Python
  2. GraphLab
  3. SFrame (similar to a Pandas DataFrame)

DATA LOADING

House sales data from the Seattle, Washington area is used. It contains the following columns and around 21,000 rows.

id : date : price : bedrooms : bathrooms : sqft_living : sqft_lot : floors : waterfront : view : condition : grade : sqft_above : sqft_basement : yr_built : yr_renovated : zipcode : lat : long : sqft_living15 : sqft_lot15

> homesales = graphlab.SFrame('home_data.gl')
> homesales

We need to understand if there is a relationship between two variables. Let’s pick the housing price and square feet of living.

> homesales.show(view="Scatter Plot", x="sqft_living", y="price")

We can observe that there is a relationship between square feet of living area and housing price.

Let's observe the data using a box plot with error lines to understand price by zip code.

> homesales.show(view='BoxWhisker Plot', x='zipcode', y='price')

It is always a good idea to explore and understand the surrounding data. GraphLab has a nice way to show the data statistics.

> my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

PREDICTIVE ANALYTICS

The first step is to split the data into a training set and a test set. Let's use 80% of the data for training and the remaining 20% for testing.

> train_data, test_data = homesales.random_split(0.8, seed=0)
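GraphLab's `random_split` assigns each row to the training set with the given probability, so the split is approximately (not exactly) 80/20. A hedged standard-library sketch of that behavior, in case GraphLab is unavailable; `rows` here stands in for the dataset:

```python
import random

def random_split(rows, fraction, seed=0):
    """Assign each row to train with probability `fraction`,
    reproducibly via the seed, mimicking SFrame.random_split."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < fraction else test).append(row)
    return train, test

rows = list(range(1000))           # placeholder for ~21,000 house rows
train_data, test_data = random_split(rows, 0.8, seed=0)
```

Fixing the seed makes the split reproducible, which matters when you want to compare models on the same held-out test data.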

Let's build a regression model with one variable, square feet of living area, and store the results. The dependent variable, price, is what the model will predict.

> sqft_model = graphlab.linear_regression.create(train_data, target='price', features=['sqft_living'], validation_set=None)

We can plot the model's predicted values along with the actual values using matplotlib.

> import matplotlib.pyplot as plt
> plt.plot(test_data['sqft_living'], test_data['price'], '.', test_data['sqft_living'], sqft_model.predict(test_data), '-')

The blue dots represent the test data, displaying the relationship between house price and square feet of living area. The green line shows the predicted home price (dependent variable) for a given square footage, using the "sqft_model" linear regression model we built.

RESULTS

Let’s pick a house and predict its value using the “sqft_model”.

> house2 = homesales[homesales['id'] == '5309101200']
> house2

Now let’s predict the house value using the “sqft_model”.

> print sqft_model.predict(house2)

[629584.8197281545]

The model predicted a value of $629,584, which is very close to the actual value of $620,000.

While our model worked reasonably well, there is still a gap between the predicted values and the actual values.

> print sqft_model.evaluate(test_data)

{'max_error': 4143550.8825285938, 'rmse': 255191.02870527358}

The "max_error" is due to an outlier, which is displayed in the upper right corner of the matplotlib visualization. The model's error, measured by RMSE, is $255,191.
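The two numbers that `evaluate` reports are straightforward to compute from predictions. A hedged sketch with toy values (not the article's data) showing what "max_error" and "rmse" mean:

```python
import math

# Toy actual vs. predicted house prices (hypothetical values)
actual    = [620000.0, 450000.0, 300000.0]
predicted = [629584.0, 440000.0, 350000.0]

# max_error: the single worst absolute prediction error
errors = [abs(a - p) for a, p in zip(actual, predicted)]
max_error = max(errors)

# rmse: square root of the mean squared error across all points
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
```

A max_error far larger than the RMSE, as in the article's output, is a hint that a few outliers are being predicted much worse than the typical house.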
