
Introduction
One of the projects in my Flatiron Data Science program was to take a popular housing sales data set for King County, WA, and use it to gather insights and create a linear regression model. While there was plenty of data you would expect to find, such as prices, bedrooms, bathrooms, and so on, for the EDA portion of the project, I found this data to be boring and uninspiring. I wanted to do something unique with this project. Houses may get some people excited, and props to y’all out there, but for the projects I work on, I really try to find or create some aspect of it that I’d look forward to talking to people about. Housing prices just aren’t it for me.
I needed to find something cool.
Well, what’s cooler than maps? Ok, don’t answer that, I know a lot of things can be, but you can’t deny that maps are cool. They serve as an ultimate visualization, distilling information into an immediately recognizable form that we can use to determine relationships between objects in our space. This is why maps are a powerful tool.
So while there are some nice mapping libraries you can use with Python, I thought this would be a good opportunity to learn about QGIS and how I can use it to enhance my dataset. QGIS is a free and open-source geographic information system I learned about while looking for free alternatives to the popular, enterprise-level software ArcGIS.
Ideas
There were 3 things I wanted to add to my dataset that I thought would make things more interesting, unique, and personal, the first of which we’ll go through in this article.
- Labeled waterfront properties
- Housing locations relative to the UW hospital by commute time
- Housing locations relative to select breweries by distance
Why do I want to add these? Well, the labeled waterfront properties would give me a useful feature that others would typically drop due to the prevalence of null values. Drive time to the hospital is cool because I get to learn how to use Network Analysis tools within QGIS. Also, having lived in Seattle before, I relate to this feature on a personal level: my wife would commute to the hospital for work, so we could use this feature to see whether proximity to the hospital had an effect on housing prices. Lastly, I love food and drink, and Seattle was amazing for that. I decided to use QGIS to geocode the addresses of some of my favorite breweries and return latitude and longitude coordinates. I could then plot them on a map and use the distance from each house to each brewery in my final model! Relating this project to myself makes me much more motivated to work on it, and I’m happy to talk to people about my favorite Seattle breweries! If you’re interested in seeing the full project, feel free to check out my repo on GitHub. Let’s get started on the first feature.
Waterfront Properties
The dataset included a column that was supposed to identify waterfront properties, but the vast majority of that data was missing as shown below. Yellow dots represent waterfront homes as labeled by the dataset.

As you can see, there are several homes that are on the waterfront that are incorrectly labeled. I thought that accurate waterfront labels would be a valuable feature as I would assume houses with a beautiful waterfront view would be more expensive. I was able to label waterfront homes by transforming a shape file and calculating distance using vector analysis tools within QGIS.
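If you want to put a number on how sparse a label like this is before deciding to rebuild it, a quick pandas count does the job. This is only a minimal sketch on a made-up five-row frame; the column names `id` and `waterfront` are assumptions, so swap in whatever your dataset uses.

```python
import pandas as pd

# Toy stand-in for the housing data; the real column names may differ.
houses = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "waterfront": [1.0, None, 0.0, None, None],
})

# Count labeled vs. missing values, keeping NaN as its own bucket.
label_counts = houses["waterfront"].value_counts(dropna=False)

# Share of rows with no waterfront label at all.
missing_share = houses["waterfront"].isna().mean()

print(label_counts)
print(f"Missing waterfront labels: {missing_share:.0%}")
```

If the missing share is high, dropping the column is the easy route; rebuilding it from geography is the route this walkthrough takes.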
Here’s a quick outline, followed by a walkthrough of the process:
- Install QGIS
- Import data
- Acquire shape file
- Convert polygon (shape file) to points
- Clean noise from polygon
- Calculate distance from homes to shoreline
- Join distance data to main dataset
- Export as CSV
- Process in Python
Walkthrough
Install QGIS
QGIS can be painlessly installed from their website here. At the time of writing I’m using version 3.16.1 so expect things to look slightly different in later versions.
Ground Yourself!
When you first open QGIS you’ll be faced with a blindingly white background with tons of tiny buttons on the top and sides. If you’re anything like me you might stare blankly at this screen for a while. Similar to how I felt when opening up Adobe Illustrator or VS Code for the first time, I was just trying to look around and grasp for anything that looked familiar before ending up disappointed and discouraged. This is about the time I head to Google, YouTube, or in this case, Medium!
Don’t worry y’all, let’s get our bearings straight here and start simple.
First, we’ll just get a basic map up on the screen so we have something to orient ourselves.

On the left side, you’ll see a column of options. If you simply expand "XYZ Tiles" and double-click on "OpenStreetMap" we’ll be able to import the open-source map into our project.
There we go, that feels better, doesn’t it? Feel free to zoom in and out and scroll around just to get familiar with navigating a map in QGIS; you’ll find it’s pretty intuitive. When you’re done, go ahead and get the general area you’re looking to view into the frame. In this case, I’m looking at northwest Washington, or the Seattle area specifically.
Next, we’ll add our data

(Note: I imagine most of you will be working with CSVs, so this will be applicable. If not, feel free to choose a more appropriate file type. Aside from CSVs, I think vectors, rasters, and meshes would be the most common.)
On the top row, you’ll find a "Layer" menu. From there, navigate to Layer > Add Layer > Add Delimited Text Layer. This will bring up another dialogue box where you pick the columns that hold your coordinates: longitude goes in the X field and latitude goes in the Y field.

When you’re done, click "Add" and you’ll see a satisfying smattering of dots to represent whatever happens to be at those coordinates! In this case, they’re houses. As I understand it, the system randomly generates a color, so if you don’t like the one it chooses for you, you can right-click the layer name in the bottom left corner of the screen and click "Styles" to expand a color picker window.
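If the dots land somewhere unexpected (the middle of the ocean, say), the latitude and longitude fields are probably swapped or a few rows have bad values. A rough check in pandas before importing can catch that. This is a sketch under assumptions: the column names `lat` and `long` and the bounding box around Seattle are placeholders I made up for illustration.

```python
import pandas as pd

# Toy rows; `lat` and `long` are assumed column names.
houses = pd.DataFrame({
    "id": [1, 2, 3],
    "lat": [47.61, 47.55, -122.30],   # last row has lat/long swapped
    "long": [-122.33, -122.20, 47.70],
})

# Rough bounding box around the Seattle area (assumed values, adjust to taste).
LAT_RANGE = (47.0, 48.0)
LONG_RANGE = (-123.0, -121.0)

# A row passes only if both coordinates fall inside the box (bounds inclusive).
in_box = (
    houses["lat"].between(*LAT_RANGE)
    & houses["long"].between(*LONG_RANGE)
)

# Anything outside the box deserves a second look before it goes on the map.
suspect_rows = houses[~in_box]
print(suspect_rows)
```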

Acquire shape file
Remember, our goal here is to identify which houses are along the water. We’ve got our houses plotted on our nice map, but now how are we supposed to find the ones that are on the waterfront? Well, I had a bunch of terrible ideas but finally decided that if I know where the water is, I can find out which houses are close to it.
I know that sounds too obvious to be of any worth, and honestly, I felt silly just typing it out and saying it in my head. It makes perfect sense though: in order to get information on how one thing relates to another thing, you need to know information about both things, right? So here’s where we find water!
The shapefile I used is freely available from the United States Geological Survey (USGS), which is my first stop when searching for any GIS-related datasets.
Once you find the shape file that works for your data, you can simply drag the file from your folder onto the map and it will establish itself as a layer!

Convert polygon to points
So now we’re getting into the cool stuff! Shape files are fine and all, but a polygon is not exactly what we need right now for this problem. We don’t really care about where all of the water is; the only part that we care about is the edge, or the line that forms where the water meets the land. This is how we determine what the shoreline is, and it’s the basis of how we’ll classify our waterfront properties.
So what we need to do is take the polygon representation of our shape file, and turn its lines into several points. The more points you create out of the line, the more accurate the measurement will be. This may sound weird but it’ll make sense in a bit I promise.
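To see why point density matters, here’s a small sketch in plain Python. It is an illustration of the idea rather than what QGIS does internally: the `densify` and `nearest_distance` helpers are hypothetical names, the "lake" is a toy rectangle, and the units are flat made-up units instead of real coordinates.

```python
import math

def densify(ring, spacing):
    """Return points along a closed ring, at most `spacing` apart on each edge."""
    points = []
    # Pair each vertex with the next one, wrapping around to close the ring.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        edge_length = math.hypot(x2 - x1, y2 - y1)
        # Number of equal steps needed so no step exceeds `spacing`.
        steps = max(1, math.ceil(edge_length / spacing))
        for i in range(steps):
            t = i / steps
            points.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
    return points

def nearest_distance(target, points):
    """Straight-line distance from `target` to the closest point."""
    tx, ty = target
    return min(math.hypot(px - tx, py - ty) for px, py in points)

# A 4 x 2 rectangle standing in for a lake outline.
lake = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
house = (2.0, -1.0)  # sits 1 unit south of the middle of the bottom shore

corners_only = densify(lake, spacing=10.0)  # just the 4 corners
dense = densify(lake, spacing=0.5)          # 24 points around the shore

print(nearest_distance(house, corners_only))  # overestimates the true distance
print(nearest_distance(house, dense))         # 1.0, the true distance
```

With only the corners available, the house looks more than twice as far from the water as it really is. With a point every half unit, the measurement matches the true distance.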
In order to do this, there’s a processing tool called "Convert polygon/line vertices to points". You’ll find this tool on the right-hand side of the window within the "Processing Toolbox". The easiest way to find it is to just start typing "convert" in the search box and let it filter the list for you. The toolbox has plenty of other useful tools as well, so feel free to poke and prod around. Your situation may not be exactly like mine, so a different tool may be more appropriate for your use case.

All you have to do is double-click the tool and it will bring up a dialogue box. Here, just make sure the layer you want to convert is selected and hit "Run".

And there you go! I know it looks a bit messy. At any point in time, you can toggle the checkboxes on the bottom left-hand side and remove some of these layers from view if things get too cluttered for you.

Calculate distance
Now thanks to our data we know where each house is, and thanks to QGIS and our shapefile from USGS, we know where the shoreline is! Here we’ll use the same "Processing Toolbox" and search for "distance". The tool we’re looking for is under "Vector analysis" and is called "Distance to nearest hub (line to hub)". There are a few more parameters to set this time. I used our housing locations as the source layer and the output of the last step (points) as the destination layer. I then selected ID as the name attribute and "Miles" as my unit of measurement. Hit "Run", and soon you’ll see a whole bunch of lines spanning out from each house to the closest point of water! This is why it was important to transform our polygon into vertices. The tool measures from point to point, so it wouldn’t know which part of a polygon to measure to. Breaking the shape down into many vertices at small intervals gives each house a set of candidate points, and the tool connects it to the closest one.
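For anyone curious what a nearest-hub calculation boils down to, here’s a minimal sketch in plain Python. It uses the haversine formula as a stand-in; I’m not claiming this is how QGIS computes the distance internally. The function names, IDs, and coordinates are all made up for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/long points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def nearest_hub(house, hubs):
    """Return (hub_id, distance_miles) for the hub closest to `house`."""
    _, lat, lon = house
    return min(
        (
            (hub_id, haversine_miles(lat, lon, hub_lat, hub_lon))
            for hub_id, hub_lat, hub_lon in hubs
        ),
        key=lambda pair: pair[1],
    )

# Made-up coordinates in the form (id, lat, long).
house = ("house_1", 47.6800, -122.3300)
shoreline_points = [
    ("pt_a", 47.6800, -122.3400),
    ("pt_b", 47.7000, -122.3300),
    ("pt_c", 47.6810, -122.3300),
]

hub_id, miles = nearest_hub(house, shoreline_points)
print(hub_id, round(miles, 3))
```

Looping over every shoreline point for every house is slow on a large dataset, which is one more reason to let QGIS handle this step.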

Again, don’t forget to change the color with "Style" if you’re not a fan of the default color. Ok! This is good right!? Well, mostly…

Cleaning
If you look you’ll notice that our shape file had a bunch of small little lakes, ponds, and puddles in the data. That’s not necessarily wrong, it’s just not exactly applicable for our use case, since you can’t really list your house for sale as a "waterfront property" just because it’s next to a puddle, I don’t care how you spin it!
So now this step involves a bit of cleaning. Look, I hear you, "Cleaning?! I just got done cleaning this entire data set! Now I have to clean this too?!!"
Short answer? Yup.
Less short answer? It depends on your data and what you need.
Chill though, it won’t be that bad I promise.
There are a few steps here but let me break it down.
- First, we want to make sure we have the correct layer selected. What we’re doing is removing the points that represent the borders of the small ponds or lakes, so in this case, we’ll select the "Points" layer, as that was the output of our "Convert polygon/line vertices to points" function. You’ll know the layer is selected when it’s underlined. I want to emphasize that the checkbox only toggles what is displayed on screen and is different from selecting an active layer to edit.
- Next click on the "Edit" pencil in the toolbar towards the top.
- Then click the "Select features by…" icon below that on the left. There’s a drop-down menu here that you can work with if you want to choose a different selection method but we’ll just start with this default for now.

Now if you did everything right, you’ll be able to click and drag to highlight groups of points on the map. When points are selected you can then delete them with the "Delete Selected" button that will light up.

To select multiple groups at a time you can hold down the "Shift" key while clicking and dragging, which lets you delete points in larger batches. The current view will look off at this stage because the distance lines still reference the old points that we’ve deleted. Once we clean up all the points we want to get rid of, we’ll run the same "Distance to nearest hub (line to hub)" function we ran earlier to recalculate the distances from each house to our relevant water source.

This is an example of how the distances are recalculated after only removing a small portion of the unnecessary points. Both distance layers are displayed to show the difference between the two; however, once you’ve cleaned all the points, you’ll no longer need the initial distance layer and can either uncheck it so it doesn’t display or right-click and select "Remove Layer" to clean up your project file. Also, you can click on "Pan Map" (the hand icon in the second row on the top toolbar) to go back to navigating around the map like before instead of selecting points.
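We removed these points by hand, but if your points carry an attribute describing the water body they came from, a filter is a possible alternative. This is a sketch of that different approach, not the method used above. It assumes you’ve exported the points layer to a table with hypothetical `water_body` and `area_sq_mi` columns, and the values below are toy numbers.

```python
import pandas as pd

# Hypothetical export of the points layer: one row per shoreline vertex,
# with the name and area of the water body it came from (assumed columns).
shore_points = pd.DataFrame({
    "point_id": [1, 2, 3, 4, 5],
    "water_body": ["Big Lake", "Big Lake", "Unnamed pond", "Green Lake", "Unnamed pond"],
    "area_sq_mi": [30.0, 30.0, 0.002, 0.4, 0.002],
})

MIN_AREA_SQ_MI = 1.0          # assumed cutoff, tune for your data
ALWAYS_KEEP = {"Green Lake"}  # small, but worth keeping regardless of size

# Keep a point if its water body is big enough or explicitly allow-listed.
keep = (
    (shore_points["area_sq_mi"] >= MIN_AREA_SQ_MI)
    | shore_points["water_body"].isin(ALWAYS_KEEP)
)
cleaned = shore_points[keep].reset_index(drop=True)
print(cleaned)
```

The allow-list matters because size alone is a blunt rule: a small lake can still be a desirable one.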
Great job getting through that cleaning!
Joining layers
Now at this point, we actually have all the data we wanted to create in the first place! You may not be able to see it right now but it’s there, and we’ve visualized it with lines on the map! The goal at this point is to join our new distance data with our main housing data and export it so we can manipulate it as needed with Python later. To do this we’ll want to double-click our data layer and select "Joins" from the column of tabs on the left and click on the small green plus sign icon to add a join. In the pop-up window, make sure the distance layer is selected as your join layer, and I used the ID as my "Join" and "Target" fields. Since in this case, we only want to add the distance to our main dataset, I selected the "Joined Fields" box to select only the data we want to include in our join.
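If you’d rather export the distance layer on its own and do the join in Python, the same operation is a left merge on the shared ID. This is a minimal sketch with made-up rows; `id`, `price`, `dist_to_shore_mi`, and `hub_name` are assumed column names, and the ones QGIS gives you will likely differ.

```python
import pandas as pd

# Toy versions of the two tables being joined.
houses = pd.DataFrame({
    "id": [101, 102, 103],
    "price": [450000, 720000, 385000],
})
distances = pd.DataFrame({
    "id": [101, 102, 103],
    "hub_name": ["pt_c", "pt_a", "pt_b"],
    "dist_to_shore_mi": [0.12, 1.80, 0.31],
})

# Bring over only the distance column, mirroring the "Joined Fields" choice.
# validate= raises an error if either side has duplicate IDs.
joined = houses.merge(
    distances[["id", "dist_to_shore_mi"]],
    on="id",
    how="left",
    validate="one_to_one",
)
print(joined)
```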

Export
Lastly, all you need to do now is right-click on your data layer that now has the distance information you just created, and navigate to Export > Save Features As. Here you can select your file type as CSV (or whatever file type you prefer to work with), add a name, and hit "OK"!
Celebrate!
Congratulations! Go ahead and enjoy a celebratory beverage, do a dance, or do both! Now we have a CSV that includes all of our original data, combined with a column that represents the distance in miles from that house to the closest shoreline. This is just the first feature I created for my project. Depending on your use case, you can go ahead and start working with this file elsewhere or stick around in QGIS to create more features!
Next Steps
After I had created and exported all of my features, I was able to go back into my text editor and read in this CSV as a pandas DataFrame to work with it further. I’m not including the Python portion in the walkthrough here since there’s an abundance of fantastic resources out there already, and the point of this article is to show how you can use another platform alongside your code to make your analysis better! I ended up labeling houses as "Waterfront Homes" if the house was within 0.25 miles of the shore. This accounts for the strip of land that no homes can be built on, street separations, and/or popular boardwalks. I also decided to keep Green Lake because I know it’s a desirable location that can affect housing prices, as well as Lake Sammamish since it is a considerable body of water. Here’s a picture showing the outcome of the new data plotted on a map. This also shows my work with the breweries and the hospital that I’ll include in later parts of this series!
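For a sense of how little Python the labeling step needs, here’s a minimal sketch of the 0.25 mile cutoff. The `dist_to_shore_mi` column name is an assumption, the rows are made up, and treating a house at exactly 0.25 miles as waterfront is my own reading of "within".

```python
import pandas as pd

WATERFRONT_CUTOFF_MILES = 0.25

# In practice this frame would come from pd.read_csv on the exported file;
# a toy frame keeps the example self-contained.
houses = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "dist_to_shore_mi": [0.12, 1.80, 0.31, 0.25],
})

# 1 = waterfront, 0 = not; the cutoff itself counts as waterfront here.
houses["waterfront"] = (
    houses["dist_to_shore_mi"] <= WATERFRONT_CUTOFF_MILES
).astype(int)
print(houses)
```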

Recap
If you’ve made it this far, I hope this walkthrough has been helpful for you. We went through a specific problem, but these tools can be used to tackle a wide variety of issues, and we covered a lot of material to help get you started with your next project. We learned how to load data into a new software platform (QGIS) and learned about a useful resource for gathering GIS data (USGS). We also went through how to transform these files from polygons to points (Vector Tools) so we can use them with our existing data, and how to delete points (Select/Delete Selected) that don’t pertain to our use case. Finally, we learned how to use another tool to create a brand new feature (Distance to nearest hub) and extract it (Join/Export) into a digestible format to strengthen our analysis and give it more depth.
In part 2 of this project, I’ll outline how to use Network Analysis within QGIS to determine commute times between each house and the UW hospital to better determine where we would want to buy a house!
This is my first project using QGIS, so if you’re a veteran, I’d be interested to find out what other solutions you might have to tackle this problem! Any discussion is welcome, and I hope this enables other fledgling GIS analysts to pick up this valuable tool and include it within their Data Science toolbox.