It was nearing the end of November. Turkeys, politics, and Covid-19 were the only prevalent topics of conversation. I found myself filled with way more than my usual amount of anxiety as all around me, everything seemed to fall somewhere in between chaotic and catastrophic. And yet, one of the most pressing matters on my mind was: Who would be my teammates for my Flatiron Data Science Phase Two project?
Now, if this seems a bit overdramatic to you, allow me to quickly explain. You see, when it comes to schoolwork, I’m a lone wolf. (Okay, in truth I’m not a wolf at all; rather, more like a lone llama or a flamingo. Something significantly less fierce and notably more awkward than a wolf. But I digress.) I’m a friendly person, social enough, I’m just used to working on my own. And so far, flying solo had produced excellent results for me. However, I knew that my future career in data science would depend on my ability to work well with others, to trust my coworkers, and to be comfortable with the idea of speaking my mind and contributing my thoughts and input. I was encouraged to embrace group work and to learn from the experience.
So I held my breath, waiting impatiently for my instructor to announce who my group would be. As she went over the criteria for the project and what would be required of us, I tapped my fingernails impatiently on my makeshift desk and scrolled again and again through the list of my cohort-mates. The thing about this program is that we do not typically see each other’s work, so I actually have no idea what kind of student anyone else may be; how strong of a coder someone is; or how helpful, how patient, they would be as a team member. I was nervous because it was all so unknown, and I was terrified to put my trust in the unknown with this important steppingstone to my future career. My only real desire was that I be placed with people who I’d had some sort of interaction with before. Not a group of strangers. Not a group of names I didn’t recognize. I was only seeking familiarity at this point, and nothing more.
I got lucky. Really, really lucky. When the groups were announced, I smiled with relief. I had been placed with two friendly people whom I didn’t know well but had spoken to previously over Slack and Zoom: one who had made me feel better about an accidental mishap with Slack at the onset of the program, regarding an inappropriately placed gif of two cats kissing each other good morning, and another with whom I’d previously worked on some combinatorics and bonded over blog posts. As if that weren’t enough, I quickly learned that they both shared my enthusiasm and dedication to the work, and that they were both as excited as I was to take on this next challenge.
The moment that the groups were assigned, my teammates and I began messaging each other about scheduling and planning out the project. As soon as our study group ended that night, we met on Zoom to begin delegating tasks and discussing our ideas. Our goal for this project was to use linear regression to predict housing prices in King County, Washington, based on the dataset we had been given, which contained over 20,000 houses. When we first discussed possible features that could be used to predict our target, we each contributed our own unique ideas about direction.
"Well, I’m a mom, so I immediately think ‘schools’," I said with a smile. "School district was the primary criterion that we used to narrow down where we would live when we were in the market for a house."
"Definitely!" agreed my teammate. "Location is important. Maybe being close to transit stations or parks could play a role, too."
"Yeah," said my other teammate, "and this is going to sound weird, but I’ve heard that there’s a connection between house prices and distance to Scientology churches."
We all laughed. Scientology churches? As strange as it sounded, we decided to give it a shot and see if there was any correlation there. We also had the idea of investigating distances to a Starbucks while we were at it, just as another location-based feature.
Obtaining and preparing the data drew on many of the same techniques we used in our Phase One projects: importing CSVs, web scraping, API calls, and the like. The difference this time was that we had to take our skills further, going well beyond EDA to fitting actual models. The downside was that, although we collected a lot of data between the three of us, only a portion of it ended up being useful enough to be retained in our final model.

For our features, we kept square footage of living space and building grade from the original dataset we were provided, and obtained the rest of the data on our own. Some relationships were straightforward to establish, such as the strong one between top schools and house prices; for other features, a connection was difficult or impossible to find. Some features that didn’t make it into our final model were parks, transit, and Starbucks locations. We weren’t able to find significant correlations between home prices and proximity to these places.

However, we were able to find a significant correlation between house prices and proximity to one of the top ten coffee shops in King County (none of which were Starbucks). This project involved a lot of narrowing down our scope and pinpointing the data that would illustrate a linear relationship. The process was tedious at times, and it was a monumental letdown when the data did not show what we hoped it would.
For instance, with the coffee shops, the steps I had to take to obtain the data and begin to analyze the relationship were:
- Make API calls to find all the Starbucks locations in King County (there were many),
- Convert all of these locations from their JSON format into a DataFrame,
- Create and run a function to find the distance (using the haversine formula) between each house’s coordinates and every Starbucks location in the county,
- Run a function to, for each house, pick the closest Starbucks location by the distances calculated above,
- Plot the relationship between price and distance to a Starbucks,
- And run a .corr() on the data to see if there was any notable correlation.
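The distance and correlation steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the function names, the `lat`/`lon`/`price` keys, and the handful of sample rows are all made up for the example, and the API calls and plotting are left out.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_distance_km(house, places):
    """Distance from one house to whichever place in the list is closest."""
    return min(
        haversine_km(house["lat"], house["lon"], p["lat"], p["lon"])
        for p in places
    )


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical rows, invented purely for illustration (not real data).
houses = [
    {"price": 900_000, "lat": 47.61, "lon": -122.33},
    {"price": 650_000, "lat": 47.55, "lon": -122.20},
    {"price": 400_000, "lat": 47.40, "lon": -122.05},
]
places = [
    {"name": "Place A", "lat": 47.62, "lon": -122.32},
    {"name": "Place B", "lat": 47.60, "lon": -122.30},
]

# For each house, keep only the distance to the closest place.
distances = [nearest_distance_km(h, places) for h in houses]
prices = [h["price"] for h in houses]

# A negative value means prices tend to fall as distance grows.
r = pearson(distances, prices)
print(f"correlation between distance and price: {r:.2f}")
```

With the data in a pandas DataFrame, the last step is what `.corr()` does, since it computes the Pearson correlation by default. On a dataset of 20,000 houses, a vectorized version of the distance calculation would likely be much faster than these plain loops.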
Each feature we explored needed to be put to the test with these same or similar steps, and this was a time-consuming process with very little reward at the onset. It was disappointing to go through all of that only to conclude there was no correlation to be found. But persistence is key, and we were determined to find some unique features to predict our target. So my next move here was to investigate the distance from each house to the top 50 coffee shops in King County (as per Yelp rating) to see if there could possibly be a connection there instead.
I again went through all of these steps only to find that, once again, there was no visible connection. But then, a magical thing happened. On my final attempt at this coffee shop idea, I narrowed my scope to only Yelp’s top 10 coffee shops in King County, and I was able to find a significant correlation between this feature and house prices! And the correlation was pretty strong at -0.48. (Negative because as distance to a highly-rated coffee shop decreases, house price increases. In other words, the closer you are to a great coffee shop in King County, the higher the price of your house tends to be.)

Despite all the many setbacks, it was comforting to know that as I was working tirelessly on this, my other team members were putting in the same effort towards investigating their own features for our model. While Dana was web-scraping for transit locations and investigating distance to parks, Matt was taking the Scientology idea and running with it. Between the three of us, we were able to obtain, prepare, and narrow our data enough to create a model with 76% accuracy at predicting home prices in King County.


The features that made it to our final model were: square footage of living space, proximity to top schools, building grade, proximity to Scientology churches, interaction between schools and Scientology churches, and proximity to highly ranked coffee shops. As can be seen above, all of these features boasted a p-value of less than our alpha (significance level) of 0.05. Yes, even proximity to Scientology churches made the cut! By adding this feature to our model, we were able to increase our R-squared value by 3.5%. We were able to reject our null hypothesis that there was no relationship between these features and our target variable, house prices. (For our entire project, including separate Jupyter notebooks detailing every facet of the CRISP-DM method, please see our repo.)
This project led to a major turning point in my perception, as I now realize how beneficial it can be to work as a team. To know that we were all advancing towards a common goal, to learn to trust each other and believe that the job will get done, and to be able to communicate regularly about our progress and our expectations. I genuinely looked forward to our daily Slack chats and hearing all of the many ideas and breakthroughs my teammates had. Seeing this project come together before our eyes, and watching it take on new forms and new directions, was the total opposite of what I thought it all would be. I feared more stress from a group project; what I found instead was a novel sense of calm and enjoyment. I had my specific tasks and knew that I could trust my team to do theirs, instead of losing sleep over having to do everything all by myself. I think I’ve been converted to a pack animal. I’m a camel now. A reindeer. A yak.
Now that we are moving on to the next phase of our program, I have this uplifting sense of feeling less alone in all of this. It can be difficult, with virtual classes and the isolation of this current pandemic, to feel any semblance of companionship or community in these rigorous online courses. I value the fact that we have been able to come together on this and work collectively to create a successful project, despite all of these many challenges. I can honestly say that I will miss our regular chats, seeing my teammates’ cool office spaces and pets, and pretending to be King County Developers together. I’ve gained the knowledge and confidence that I can excel as part of a team, and I’ll be ready and waiting for the next collaborative endeavor!