Plotly Experiments — Scatterplots

Naren Santhanam
Towards Data Science
5 min read · Mar 13, 2019


Let me start this post with a somewhat unpopular opinion: Data visualization in Python is an absolute mess. Unlike R, where ggplot pretty much rules the roost when it comes to graphing, Python has too many options to choose from. This is best summarized by this picture:

Courtesy: Jake VanderPlas (@jakevdp on Twitter)

Undoubtedly, matplotlib is the most popular, owing to the range of plots available and its object-oriented approach to plotting. Some packages, such as Seaborn, build on matplotlib to make plotting easier. The major disadvantages of matplotlib, though, are its steep learning curve and lack of interactivity.

There is another genre of plotting packages that are JavaScript-based, such as Bokeh and Plotly. I have been playing around with the Plotly package recently, and it is certainly one of my favorite data visualization packages in Python. In fact, it is the second most downloaded visualization package, after matplotlib:

Plotly has a wide variety of plots and offers users a high degree of control over the parameters that customize them. As I learn more about this package, I would like to walk through some of my experiments here, both as practice for me and as a tutorial for anyone who wants to learn.

Notes about Plotly

Plotly charts have two major components: data and layout.

Data: this represents the data that we are trying to plot. It also tells Plotly’s plotting function which types of plots need to be drawn. It is basically a list of the plots that should be part of the chart. Each plot within the chart is referred to as a ‘trace’.

Layout: this represents everything in the chart that is not data, such as the background, grids, axes, titles, fonts, etc. We can even add shapes on top of the chart, as well as annotations that highlight certain points for the user.

The data and layout are then passed to a “figure” object, which is in turn passed to the plot function in Plotly.

How objects are structured for a Plotly graph

Scatterplots

In Plotly, the Scatter class is used for scatterplots, line plots and bubble charts. We will explore only scatterplots here. Scatterplots are a good way to examine the relationship between two variables, usually both continuous. They can show us whether or not there is a clear correlation between the two variables.

For this experiment, I will be using the King County House Sales dataset on Kaggle. Intuitively, house prices do depend on how big the house is, how many bathrooms there are, how old the house is, etc. Let us examine these relationships through a series of scatterplots.

The dataset contains a good mix of categorical and continuous attributes. The price of the house is the target variable, and in this post we will see how these attributes affect it.

Plotly has a Scatter trace type and also a Scattergl trace type, which renders with WebGL and gives better performance when a large number of data points is involved. I will be using Scattergl for this post.

Is there a relationship between the living room area and the price?

That looks like a nice plot showing some relation between the living room area and the price. I think the relation would be better demonstrated if we plot the log(price) instead.

Usually, scatterplots showing a linear relationship are accompanied by the “line of best fit”. If you are familiar with the Seaborn visualization package, you are probably aware that it gives an easy way to plot a line of best fit, as shown below:
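In Seaborn, a call to regplot is one way to get both the points and the fitted line. The sketch below uses synthetic data and assumed column names (sqft_living, log_price), so it illustrates the idea rather than reproducing the original chart.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with a roughly linear relationship (illustration only).
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 5000, size=200)
houses = pd.DataFrame({
    "sqft_living": sqft,
    "log_price": 12 + 0.0004 * sqft + rng.normal(0, 0.3, size=200),
})

# regplot draws the scatter points and a fitted regression line in one call.
ax = sns.regplot(
    x="sqft_living",
    y="log_price",
    data=houses,
    scatter_kws={"alpha": 0.4},
    line_kws={"color": "red"},
)
ax.set_title("log(price) vs. living area")
```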

How do we do this in Plotly? By adding another ‘trace’ to the data component, as shown below:

Let us now look at a variation on the scatterplot. How can we show categories in the scatterplot through color? For example, does the data differ based on the number of floors, bedrooms and bathrooms in the house? We can examine that by passing the color parameter to the marker in the Scatter trace, as shown below:

Note that this gives a continuous “colorscale” rather than a legend with each “grade” shown in its own color. Plotly assigns separate colors only when there are multiple traces, so we create one trace per grade. We can accomplish it this way:

We can clearly see that in addition to the living room area, the grade of the house (assigned by King County) also affects the price of the house. But what if we want to see the impact of grade, condition and other variables in this chart at the same time? The answer: subplots.

Let us draw a stacked subplot of the chart shown above, but with number of bedrooms, number of bathrooms, condition, grade and waterfront as parameters.

In Plotly, we make subplots by adding the traces to the figure and specifying their position in the plot. Let’s see how.

The subplots functionality is available under “plotly.tools” (in Plotly 4 and later, make_subplots lives in “plotly.subplots”).


Along with the living room area, grade looks like a very clear distinguishing factor in determining price. The other variables seem to have some impact, but we may have to conduct a regression analysis to confirm that.

I hope this gave you some insight into how to use scatterplots in Plotly. I will practice column / bar charts in a subsequent post.

The code used for this post is available on GitHub.
