
After more than three years of studying core concepts in Data Science, I’ve found that developing the skills to design, implement, and evaluate Machine Learning and predictive models has become the norm.
Yet the field’s applications continue to grow at an exponential rate, and the demand for professionals who can integrate and deploy machine learning models into various software architectures is beginning to soar.
Interestingly, throughout my academic experience thus far, I had little to no exposure to deploying such models. Universities and colleges over-emphasize the theory of Machine Learning without showing the practice behind it.
Let’s face it, many Data Science professionals today are full-stack.
What do I mean by full-stack? Essentially, you’re now expected to have some skill in every step of the Data Science development cycle, from requirements and business analysis all the way to scaling and deploying models into a company’s existing infrastructure.
The skill set of a Data Scientist is changing, and it now includes:
- Mathematics: Calculus, Linear Algebra, Optimization, Statistics/Probability.
- Computer Science: Algorithms, Data Structures, Programming.
- ML/AI: Neural Networks, Classification, Regression, Clustering, Tuning.
- And most recently, Software Engineering Practices: Version control, Documentation, Testing.
Not to mention domain-specific knowledge.
It’s because of these skills that we’re now deemed "Jacks of all Trades … Masters of None", or in other words, "Unicorns" (not literally).
However, this raises an interesting question …
How are graduates expected to keep up with such growing demands if these practical concepts are withheld from their studies?
The answer: projects … lots of projects.
In reality, we can’t change the education system, so we must take the initiative and expand our skills elsewhere. If you’re interested in following a path into STEM, know this …
You must be fully committed to a life of learning.
And in many cases, learning outside your classes.
In this article, I’ll be covering a recent personal project of mine, which deploys a multiple linear regression model that predicts house prices as a web application using Python’s Flask framework.
So buckle up and let’s begin!
Firstly, we need to cover some key metrics and goals of this application.
The goal? Predicting house prices in certain metropolitan areas of the US.
We also need to understand some of the key variables/metrics to observe and study. Using a dataset sourced from Kaggle:
USA Housing: https://www.kaggle.com/vedavyasv/usa-housing
The variables to observe are:
- Average Area Income
- Average Area House Age
- Average Area Number of Rooms
- Average Area Number of Bedrooms
- Area Population
- Price
- Address
First, "Address" will not be included in our analysis, since we are basing our predictions on numerical variables.
The point of the application is to predict the price of houses, hence our response variable will be "Price". As for area income, house age, number of rooms, number of bedrooms, and area population, each of these could influence the price tag of a house dramatically.
Let’s now conduct some exploratory analysis into the dataset.
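For reference, here’s a minimal sketch of how the dataset might be loaded, assuming the Kaggle CSV is saved locally as USA_Housing.csv:

```python
import pandas as pd

# Load the Kaggle dataset (file name assumed).
df = pd.read_csv("USA_Housing.csv")
print(df.head())   # preview the first few rows
print(df.shape)    # number of samples and columns
```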

Looking at the correlation heat map above, we can see that the strongest positive correlations among the six numeric variables occur with the response variable "Price". Interestingly, there also seems to be a positive correlation between "Avg. Area Number of Bedrooms" and "Avg. Area Number of Rooms". The largest correlation is between "Avg. Area Income" and "Price", which makes sense considering that the quality and value of houses can depend on the average income of the area.
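For those curious, a heat map like this can be produced with seaborn; a quick sketch, assuming df is the DataFrame loaded above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numeric variables only.
corr = df.drop(columns=["Address"]).corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```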
Let’s now look at some other observations:

We can see that the majority of the variables are normally distributed; however, "Avg. Area Number of Bedrooms" follows a more irregular, multi-modal distribution. As for the scatterplots on the right-hand side, all variables show a positive linear trend against "Price", with the exception of "Avg. Area Number of Bedrooms". Hence, the number of bedrooms may not contribute positively to the predictions.
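Plots like these can be generated with seaborn’s pairplot; a rough sketch, assuming df as above (the original figure may have been drawn differently):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Per-variable distributions on the diagonal,
# pairwise scatter plots everywhere else.
sns.pairplot(df.drop(columns=["Address"]))
plt.show()
```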
Let’s also look at some sample statistics:
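These come straight from pandas; a one-liner, assuming df as above:

```python
# Count, mean, std, min/max, and quartiles for each numeric variable.
print(df.describe())
```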

We have a sufficiently large number of samples, with the same count for every variable. This could be one reason why our variables appear normally distributed, tying in with the Central Limit Theorem.
Also, the middle 50% of:
- "Average Area Income" lies between $61,480.56 and $75,783.34.
- "Average Area House Age" lies between 5.32 and 6.65 years.
- "Average Area Number of Rooms" lies between 6.30 and 7.67 rooms.
- "Average Area Number of Bedrooms" lies between 3.14 and 4.49 bedrooms.
- "Area Population" lies between 29,403.93 and 42,861.29 people.
- "Price" lies between $997,577 and $1,417,210.
Finally, let’s check for null values:
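A quick sketch of the check, assuming df as above:

```python
# Count missing values in each column.
print(df.isnull().sum())
```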

Thankfully, we don’t have any.
Thus, we can conclude that all variables are normally distributed and correlated with "Price", with the exception of the number of bedrooms, and that the dataset is devoid of null values.
Now that we have some understanding of the data, let’s begin wrangling it to obtain the right predictor and response variables. We’ll also convert their data types to integers:
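A sketch of what this wrangling might look like, using the column names as they appear in the Kaggle dataset:

```python
# Predictor variables (everything numeric except the response).
predictors = [
    "Avg. Area Income",
    "Avg. Area House Age",
    "Avg. Area Number of Rooms",
    "Avg. Area Number of Bedrooms",
    "Area Population",
]
X = df[predictors].astype(int)   # cast predictors to integers
y = df["Price"].astype(int)      # response variable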
Shortly after, we’ll split the data into a training and testing set and begin model fitting:
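Roughly, this step looks like the following; the split ratio and random seed shown here are illustrative assumptions:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hold out a portion of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the multiple linear regression model.
model = LinearRegression()
model.fit(X_train, y_train)
```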
Let’s have a look at the model’s performance:
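Continuing from the snippet above, the scores can be computed directly from the fitted model:

```python
# R² on the training and test sets.
print("Train R²:", model.score(X_train, y_train))
print("Test R²:", model.score(X_test, y_test))
```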
The model showed an R²-score of 89.396% on the training set and 89.028% on the test set. This indicates that our model captures strong linear relationships between our response and predictor variables. Let’s also have a look at the RMSE, as well as a visualization of our predictions:
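A sketch of computing the RMSE and plotting predictions against true values:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error

# Root mean squared error on the test set.
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)

# Predicted vs. true prices; a tight diagonal means a good fit.
plt.scatter(y_test, y_pred, alpha=0.3)
plt.xlabel("True Price")
plt.ylabel("Predicted Price")
plt.show()
```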
The RMSE of our model was 114,729.221, indicating that our residuals are spread out from the line of best fit to a significant extent. Keep in mind, however, that RMSE is scale-dependent: there is no specific threshold for what counts as good or bad, but as a rule of thumb … smaller is better.


As for the plots, the predictions vs. the true values show a strong, positive, linear relationship with a clear sign of correlation, consistent with the R² score earlier.
Overall, a decent model to say the least, but we could improve it in the long run. For now, let’s begin deploying this in Flask, the focal point of this project.
First, we need to save our model. Using the pickle library, we can serialize our regression model with the following script:
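In essence, the script is just a few lines of pickle’s standard usage:

```python
import pickle

# Serialize the trained model to disk so the Flask app can load it later.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```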
The basic mechanics of Flask took me a few hours to understand, particularly routing, setting up my environment, and creating my working directory tree. But after finally grasping these concepts, I arrived at the following program:
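Here is a sketch of what app.py might look like, reconstructed from the description below; the exact input parsing and output formatting are illustrative:

```python
import pickle
import numpy as np
from flask import Flask, render_template, request

app = Flask(__name__)

# Load the regression model saved earlier with pickle.
model = pickle.load(open("model.pkl", "rb"))

@app.route("/")
def home():
    # Render the home page containing the input form.
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    # Collect the user's inputs from the form and predict a price.
    features = [float(x) for x in request.form.values()]
    prediction = model.predict(np.array(features).reshape(1, -1))[0]
    return render_template(
        "index.html",
        prediction_text="Predicted house price: ${:,.2f}".format(prediction),
    )

if __name__ == "__main__":
    # Debug mode reports errors and reloads automatically during development.
    app.run(debug=True)
```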
The code above controls the routing and template use of our website. In particular, predict() is responsible for returning the predicted house price given the user’s inputs, which are collected using request.form.values(). Finally, we run the app with debug set to True so that we can watch the website’s development and environment status on our local device.
I also used the following directory tree to complete this project:
/project
    app.py
    model.py
    model.pkl
    /templates
        index.html        # home page
    /static
        /img              # all images stored in this folder
        /styles           # all CSS styles are stored here
Finally, the basic structure of our HTML document: a simple page with a form that collects the five predictor values and posts them to the /predict route via the "Predict" button.
When we run the program from the command prompt (in my case Anaconda, since my program operates in an Anaconda virtual environment), we get the following web page:

Pretty terrible huh? Let’s improve the design with the following CSS styling:
Please excuse my ineptitude in CSS and HTML; I really need to find more opportunities to use them.
Behold, the final product:

And by pressing "Predict", we get a predicted price of …

Now look at all those front-end devs … rolling in their graves right now…
I could’ve added a more impressive background image, but you know … copyright is an issue.
Funny enough, the CSS styling took longer than developing the app and model combined.
This project, although relatively simple in the grand scheme of things, was the most insightful of all the projects I’ve completed. Why, you may ask? Because the feeling of building an ML application from the ground up, instead of purely presenting and reporting your discoveries, was pure ecstasy. Loved every minute of it 🙂
Perhaps in the near future I might even deploy this on Heroku or AWS! But let’s focus on the design for now and not get ahead of ourselves. Learning is a step-by-step process; we can’t afford to skip steps along the way.
There you have it! A relatively simple application that combines the best of theory and practice, and one that gives an insight into some of the technologies and frameworks used to develop these applications. This one measly website required three different languages, multiple frameworks … and a few brain cells. But it was all worth it in the end, a learning experience like no other.
Hopefully this article has inspired you to develop more complex applications and ML models in the near future!
Remember, learning is not confined to just your studies, explore and conquer those skills elsewhere!