Machine Learning: Predicting House Prices with Regression

Pritha Saha
Towards Data Science
3 min read · Oct 19, 2019


Running algorithms to get the most accurate results

This article is the last in my series on the Housing dataset. For the uninitiated, I covered EDA and feature engineering in the previous two articles.

Summarising the work so far: the first article covered the admittedly mundane work of data munging and EDA, and the second the meticulous re-engineering of features. We explored all the variables and decided what to keep and what to drop based on each variable's relevance to the target. We are finally down to 64 carefully chosen features with which to train models and predict the final house prices!

To start with, we split the data set into train and test in an 80:20 ratio.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .20, random_state = 42)

Next, we train a Random Forest Regressor on the training set, using Randomized Search CV to find the best hyperparameters.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestRegressor(random_state = 42)

# Hyperparameter tuning using RandomizedSearchCV
random_grid = {
'n_estimators': [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)],
'max_features': ['auto', 'sqrt', 'log2'],
'max_depth' : [6,7,8,9,10],
'min_samples_split' : [2, 5, 10],
'min_samples_leaf' : [1, 2, 4]
}
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 5, verbose=2, random_state=42, n_jobs = -1)

Lastly, we fit the search on the training set and inspect the best parameters and score it finds.

rf_random.fit(X_train, y_train)
print(rf_random.best_params_)
print(rf_random.best_score_)

The best score here is the mean cross-validated R² of the best estimator, and it comes to 0.87. Performing a grid search gives a marginal increase to 0.88.

from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning using GridSearchCV
param_grid = {
'n_estimators': [int(x) for x in np.linspace(start = 600, stop = 2000, num = 10)],
'max_features': ['auto', 'sqrt', 'log2'],
'max_depth' : [7,8,9,10],
'min_samples_split' : [2, 5],
'min_samples_leaf' : [1, 2]
}
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid,
cv = 3, n_jobs = -1, verbose = 2)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
print(grid_search.best_score_)

Lasso regression with randomized search gives a lower best score of 0.85.
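The Lasso run isn't shown in the snippets above, so here is a minimal sketch of it, assuming the same train/test split; the alpha grid is an illustrative assumption, not the exact values searched.

# Lasso with randomized search (sketch; the alpha values are assumptions)
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

params_lasso = {
    'alpha': [0.0001, 0.001, 0.01, 0.1, 1]
}
lasso = Lasso(max_iter = 10000)
lasso_random = RandomizedSearchCV(estimator = lasso, param_distributions = params_lasso,
                                  n_iter = 5, cv = 5, n_jobs = -1, random_state = 42, verbose = 2)
lasso_random.fit(X_train, y_train)
print(lasso_random.best_params_)
print(lasso_random.best_score_)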

XGBoost regression with randomized search CV gives a higher score of 0.9.
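Again a sketch, since the article doesn't list the exact grid; the hyperparameter ranges below are illustrative assumptions, not the grid behind the 0.9 score.

# XGBoost with randomized search (sketch; ranges are assumptions)
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV

params_xgb = {
    'n_estimators': [500, 1000, 1500],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 0.8, 1.0]
}
xgb = XGBRegressor(random_state = 42)
xgb_random = RandomizedSearchCV(estimator = xgb, param_distributions = params_xgb,
                                n_iter = 50, cv = 5, n_jobs = -1, random_state = 42, verbose = 2)
xgb_random.fit(X_train, y_train)
print(xgb_random.best_params_)
print(xgb_random.best_score_)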

However, I got the best score with Ridge regression.

from sklearn.linear_model import Ridge

# Ridge Regressor
params_ridge = {
'alpha':[0.25,0.5,1],
'solver':['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
}
ridge = Ridge()
ridge_random = RandomizedSearchCV(estimator = ridge, param_distributions = params_ridge,
n_iter=50, cv=5, n_jobs=-1,random_state=42, verbose=2)
ridge_random.fit(X_train, y_train)
print(ridge_random.best_params_)
print(ridge_random.best_score_)
ridge_grid = GridSearchCV(estimator = ridge, param_grid = params_ridge, cv = 5, n_jobs = -1, verbose = 2)
ridge_grid.fit(X_train, y_train)
print(ridge_grid.best_params_)
print(ridge_grid.best_score_)

Both random and grid search give me a best score of 0.92.

Hence we proceed with the best estimator and predict on the test set.

import pandas as pd

model_ridge = ridge_random.best_estimator_
# the target was modelled on the log scale, so invert the transform for the submission
y_pred_ridge = np.exp(model_ridge.predict(X_test))
output_ridge = pd.DataFrame({'Id': test['Id'], 'SalePrice': y_pred_ridge})
output_ridge.to_csv('prediction_ridge.csv', index=False)

This gave me a score of 0.12460 on Kaggle (the competition is scored on RMSE of the log sale price, so lower is better)!
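That leaderboard metric can also be approximated locally on the 20% held-out split before submitting; a sketch below, assuming y_test, like the training target, is already on the log scale.

# Local estimate of the leaderboard metric (sketch; assumes y_test is log SalePrice)
from sklearn.metrics import mean_squared_error

y_pred_log = model_ridge.predict(X_test)
rmse_log = np.sqrt(mean_squared_error(y_test, y_pred_log))
print(rmse_log)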

For the complete code, refer to the link below: https://github.com/pritha21/Kaggle/blob/master/House_Prices.ipynb

You might need to view it using https://nbviewer.jupyter.org/

Any suggestions to help me improve my score are always welcome!

