
Data visualization is an integral part of exploratory data analysis. A well-designed visualization provides much more insight than plain numbers do. Thus, there are several data visualization libraries in the data science ecosystem.
In this article, we will explore and gain insight into the Melbourne housing dataset available on Kaggle.
Altair is a statistical visualization library for Python. Its syntax is clean and easy to understand as we will see in the examples. It is also very simple to create interactive visualizations with Altair.
Altair is highly flexible in terms of data transformations. We can apply many different kinds of transformations while creating a visualization. This makes the library even more efficient for exploratory data analysis.
We will be using Pandas for reading the dataset and some data manipulation tasks. Let’s start by importing the libraries and reading the dataset.
import numpy as np
import pandas as pd
import altair as alt
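# Dates in this file are written day-first (e.g. 3/12/2016), so parse them accordingly
melb = pd.read_csv("/content/melb_data.csv", parse_dates=['Date'], dayfirst=True).sample(n=5000)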
melb.shape
(5000, 21)
melb.columns
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
'Longtitude', 'Regionname', 'Propertycount'], dtype='object')
The dataset contains 21 features (i.e. columns) about 13580 houses in Melbourne. However, I have created a random sample of 5000 houses because, by default, Altair accepts datasets with at most 5000 rows. It is possible to increase this limit, but that is beyond the scope of this article.
The target variable is the price column. The other columns are supposed to provide valuable insight into the price of houses. We can start by checking the distribution of house prices.
One way is to create a histogram. It divides the value range of a continuous variable into discrete bins and counts the number of observations (i.e. rows) in each bin.
alt.Chart(melb).mark_bar().encode(
alt.X('Price:Q', bin=True), y='count()'
).properties(
height=300, width=500,
title='Distribution of House Prices in Melbourne'
)

Let’s elaborate on the syntax. The object that stores the data is passed to the top-level Chart object. The next step is to define the type of plot.
The encode function specifies what to plot in the given dataframe. Thus, anything we write in the encode function must be linked to the dataframe. Finally, we specify certain properties of the plot using the properties function.
The mark_bar function can be used to create a histogram with a data transformation step. Inside the encode function, we divide the value range of the price variable into discrete bins and count the number of data points in each bin.
Most of the houses cost less than 2 million. We also see some outliers that cost more than 4 million.
Location is a significant factor in determining the price of a house. We can visualize the effect of distance to the central business district (CBD) on house prices in Melbourne. Since both are continuous variables, a scatter plot is a good choice for this task.
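To make the binning step concrete, the sketch below reproduces the same idea outside Altair with NumPy. The prices are synthetic and the choice of 10 equal-width bins is an assumption for illustration; Altair picks its own rounded bin edges, so the exact boundaries in the chart will differ.

```python
import numpy as np
import pandas as pd

# Synthetic prices for illustration only (not the Melbourne data)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=13.7, sigma=0.5, size=1000), name="Price")

# Split the value range into 10 equal-width bins and count the rows in each bin
counts, edges = np.histogram(prices, bins=10)

hist = pd.DataFrame({
    "bin_start": edges[:-1],
    "bin_end": edges[1:],
    "count": counts,
})
print(hist)
```

Each row of hist corresponds to one bar of a histogram: the bin boundaries give the position on the x-axis and the count gives the bar height.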
alt.Chart(melb).mark_circle(clip=True).encode(
alt.X('Distance'),
alt.Y('Price', scale=alt.Scale(domain=(0,5000000)))
).properties(
height=400, width=600,
title='House Price vs Distance'
)
There are a couple of points I want to emphasize here. I have used the scale property to limit the y-axis to 5 million, which excludes the outliers with higher prices. In order to use the scale property, we specify the column with the Y encoding ( alt.Y('Price') ) instead of passing a string ( y='Price' ).
The other point is the clip parameter of the mark_circle function. If we do not set it to True, the data points that fall outside the specified axis limits will still be drawn, spilling beyond the plot area. Believe me, it does not look good.
The scatter plot generated by the code above is:

In general, distance to CBD has a negative effect on the house prices in Melbourne.
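A visual impression like this can be backed up with a number. The sketch below computes the Pearson correlation between distance and price with Pandas. The data is synthetic and built to have a negative relationship, so the result says nothing about Melbourne; on the real dataframe the same call would be melb['Distance'].corr(melb['Price']).

```python
import numpy as np
import pandas as pd

# Synthetic data with a built-in negative relationship (illustration only)
rng = np.random.default_rng(42)
distance = rng.uniform(1, 40, size=500)
price = 1_500_000 - 20_000 * distance + rng.normal(0, 200_000, size=500)
demo = pd.DataFrame({"Distance": distance, "Price": price})

# Pearson correlation: values near -1 indicate a strong negative linear relationship
corr = demo["Distance"].corr(demo["Price"])
print(round(corr, 2))
```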
There are several regions in Melbourne. Let’s first check the proportion of each region in the dataset.
melb.Regionname.value_counts(normalize=True)
Southern Metropolitan 0.3530
Northern Metropolitan 0.2830
Western Metropolitan 0.2106
Eastern Metropolitan 0.1096
South-Eastern Metropolitan 0.0338
Eastern Victoria 0.0042
Northern Victoria 0.0034
Western Victoria 0.0024
melb.Regionname.value_counts(normalize=True).values[:4].sum()
0.96
96 percent of the houses are located in the top 4 regions. We can compare the house prices in these regions.
One option is to use a box plot which provides an overview of the distribution of a variable. It shows how values are spread out by means of quartiles and outliers.
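# melb_sub is not defined earlier in the article; it is assumed here to be the
# houses in the four largest regions that cost less than 5 million
top_regions = melb.Regionname.value_counts().index[:4]
melb_sub = melb[melb.Regionname.isin(top_regions) & (melb.Price < 5000000)]
alt.Chart(melb_sub).mark_boxplot(size=80, clip=True).encode(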
alt.X('Regionname'),
alt.Y('Price', scale=alt.Scale(domain=(0,5000000)))
).properties(
height=400, width=600,
title='House Prices in Different Regions'
)
I have only included the houses that cost less than 5 million. The size parameter of the mark_boxplot function adjusts the size of boxes.
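To see what a box plot summarizes, the sketch below computes the quartiles and flags outliers by hand with Pandas. The prices are made up, and the 1.5 × IQR rule for whiskers is an assumption (it is the common convention, and as far as I know also Altair's default).

```python
import pandas as pd

# Made-up prices for illustration only
prices = pd.Series([450_000, 620_000, 700_000, 810_000, 900_000,
                    1_050_000, 1_200_000, 1_400_000, 3_900_000])

# The box spans the first and third quartiles; the line inside is the median
q1, median, q3 = prices.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]
print(q1, median, q3)
print(outliers.tolist())
```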

Let’s spend some time customizing this visualization to make it look better. We will make the following changes using the configure_title and configure_axis functions.
- Adjust the font size of title
- Adjust the font size of axis ticks and labels
- Adjust the rotation of axis ticks
- Remove the title of x axis because it is already described in the title of the plot
alt.Chart(melb_sub).mark_boxplot(size=80, clip=True).encode(
alt.X('Regionname', title=""),
alt.Y('Price', scale=alt.Scale(domain=(0,5000000)))
).properties(
height=400, width=600,
title='House Prices in Different Regions'
).configure_axis(
labelFontSize=14,
titleFontSize=16,
labelAngle=0
).configure_title(
fontSize=20
)

I think it looks better and clearer now.
The houses in the Southern Metropolitan region are more expensive in general. The prices are more spread out in that region as well. In all regions, we observe outliers with very high prices compared to the overall price range.
We may also want to check the trend in house prices within a year. For instance, the prices might tend to increase towards a particular month or season.
You can change the frequency but I will plot the average weekly house prices. The first step is to extract the week number from the date using the dt accessor of Pandas. Then, we calculate the average price in each week by using the groupby function.
melb['Week'] = melb.Date.dt.isocalendar().week
weekly = melb[['Week','Price']].groupby('Week',
as_index=False).mean().round(1)
We can now create a line plot.
alt.Chart(weekly).mark_line().encode(
alt.X('Week'),
alt.Y('Price', scale=alt.Scale(zero=False))
).properties(
height=300, width=600,
title='Average Weekly House Prices'
)

We do not observe a specific trend but there are some peak values. It may require further investigation to make sense of such peaks.
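One possible first step in such an investigation is to check how many sales stand behind each weekly average, since a peak based on one or two sales is not very informative. The sketch below uses a tiny made-up dataframe (demo) and named aggregation in Pandas; the column names avg_price and n_sales are my own choices.

```python
import pandas as pd

# Made-up sales for illustration only
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2017-03-04", "2017-03-05", "2017-03-11",
                            "2017-03-18", "2017-03-18", "2017-03-19"]),
    "Price": [900_000, 1_100_000, 2_500_000, 800_000, 950_000, 1_050_000],
})
demo["Week"] = demo["Date"].dt.isocalendar().week

# Average price and number of sales per week, side by side
summary = (
    demo.groupby("Week", as_index=False)
        .agg(avg_price=("Price", "mean"), n_sales=("Price", "count"))
)
print(summary)
```

In this toy example, the week with the highest average price contains a single sale, which is exactly the kind of peak that should be read with caution.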
Conclusion
We have created several different types of visualizations to explore the Melbourne housing dataset.
This article can also be considered a practical guide to the Altair library. The types of visualizations in this article can be applied to pretty much any dataset.
If you’d like to learn more about Altair, here is a list of articles that I previously wrote:
- Part 1: Introduction
- Part 2: Filtering and transforming data
- Part 3: Interactive plots and dynamic filtering
- Part 4: Customizing visualizations
Thank you for reading. Please let me know if you have any feedback.