You can’t just Google everything

And other things I wish I knew before I started my latest Data Science project

Zachary Ang
Towards Data Science


Photo by Chris Ried on Unsplash

I got started on a new data science project a few weeks ago and, expectedly, it’s been h̶o̶r̶r̶i̶f̶i̶c̶ absolutely enlightening. I’ve made a few tentative explorations into data science and machine learning projects before, but I thought it was time to do something bigger. Where would I start?

You can’t just Google everything

I made the mistake of looking for inspiration where, to be fair, most people would — the front page of Google. As it turns out, there are way too many resources out there that you could get lost in. Some tips are a little underwhelming. Take this one from a Quora post I found:

I fell asleep when I saw “cricket”

That didn’t sound right. Why should I be looking at data that merely “mimicked” real-world applications of data science? I wanted to build a real portfolio, solve real problems and make a real difference to businesses and people. I wanted to analyse something I could be passionate about. These suggested datasets are great, no doubt, and they’d probably be a solid test of your data exploration, machine learning or visualisation skills. But what would it all count for, ultimately?

I wanted to build a real portfolio, solve real problems and make a real difference to businesses and people.

Instead, one of the most useful resources I’ve turned to is the writers on Towards Data Science. Michael Galarnyk wrote a great piece on building a data science portfolio, and Jeremie Harris wrote a “don’t-do” list for aspiring data scientists. This part spoke to me most:

When in doubt, here are some projects that hurt you more than they help you:

* Survival classification on the Titanic dataset.

* Hand-written digit classification on the MNIST dataset.

* Flower species classification using the iris dataset.

I have mixed feelings about Kaggle and its competitions. I think data scientists should be able to wrangle data from the most unlikely of places, and I don’t really like the idea of being ‘spoon-fed’ the data. That being said…

Collecting your own data is painful. Really.

So I finally decided to collect my own data. The project was going to be about predicting a movie’s iTunes list price from its characteristics, e.g. box office performance, critic ratings, cast and plot. Models for predicting a movie’s success were not new, but some questions hadn’t been asked, let alone answered.

  • After a movie is released, how much could it list for on a Video on Demand (VOD) platform?
  • Can I forecast this price in advance so I can decide if I want to pay for it?
  • Or could another VOD provider use this information to compete with iTunes on price?
  • Can a Subscription Video on Demand provider like Netflix use this information to optimise their pricing strategy?

Much better than looking at cricket data.
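
To make the task concrete: this can be framed as a plain supervised regression, with movie characteristics as features and the iTunes list price as the target. Here’s a toy sketch in Python; the column names, numbers and choice of model are all illustrative, not the project’s actual pipeline:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data: three made-up movies with made-up numbers.
movies = pd.DataFrame({
    "box_office_usd": [825e6, 2.9e6, 404e6],   # feature: box office takings
    "critic_rating":  [8.7, 6.1, 7.4],         # feature: average critic score
    "itunes_price":   [14.99, 4.99, 9.99],     # target: iTunes list price
})

X = movies[["box_office_usd", "critic_rating"]]
y = movies["itunes_price"]

# Any regressor would do here; a random forest is just one reasonable default.
model = RandomForestRegressor(random_state=0)
model.fit(X, y)

print(model.predict(X.head(1)))  # predicted list price for the first movie
```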

But very few datasets provided the connection between movies and their list prices on iTunes or other VOD platforms.

Eventually, I turned to open source movie databases such as The Movie Database and the Open Movie Database, which I combined with the iTunes data for my analysis. This was not without its challenges, though. You can look at my script and exploratory analysis on GitHub, but stay tuned for a blog post on my whole data collection process soon.
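
To give a flavour of what those lookups involve, here’s a minimal sketch using Python’s requests library against the public iTunes Search API and the Open Movie Database. The OMDb key is a placeholder, and the response fields are my reading of each API’s documentation, so treat the details as assumptions:

```python
import requests

def itunes_price(title):
    """Look up a movie on the iTunes Search API and return its list price."""
    resp = requests.get(
        "https://itunes.apple.com/search",
        params={"term": title, "media": "movie", "limit": 1},
    )
    results = resp.json().get("results", [])
    return results[0].get("trackPrice") if results else None

def omdb_metadata(title, api_key="YOUR_OMDB_KEY"):
    """Fetch a movie's metadata (ratings, box office, etc.) from OMDb."""
    resp = requests.get(
        "https://www.omdbapi.com/",
        params={"apikey": api_key, "t": title},
    )
    return resp.json()

movie = "Inception"
print(itunes_price(movie))                     # e.g. a store list price
print(omdb_metadata(movie).get("BoxOffice"))   # e.g. a box office figure
```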

Everyone who has valuable data uses an API

I’m no expert in internet protocols, server-side scripting or database architecture and engineering. But I don’t think I’m far off the mark with this one.

I relied heavily on APIs throughout my data collection process, though I also used web scraping techniques to gather some of the data. Using an API, I found myself making far fewer queries to the website or server, since it gave me a way to make a structured request for exactly the data I needed. Nothing superfluous or noisy.
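
Here’s a rough illustration of the difference. The first half uses The Movie Database’s real search endpoint (with a placeholder key); the second half scrapes a made-up page with a made-up selector, purely to show how much more guesswork parsing markup involves:

```python
import requests
from bs4 import BeautifulSoup

# One structured API call returns exactly the fields asked for, as JSON.
data = requests.get(
    "https://api.themoviedb.org/3/search/movie",
    params={"api_key": "YOUR_TMDB_KEY", "query": "Inception"},
).json()
titles = [m["title"] for m in data.get("results", [])]

# Scraping the same information means fetching a whole page and digging
# through its markup (the URL and CSS selector here are hypothetical).
html = requests.get("https://example.com/movies/inception").text
soup = BeautifulSoup(html, "html.parser")
title_tag = soup.select_one("h1.movie-title")
title = title_tag.get_text(strip=True) if title_tag else None
```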

It’s important for organisations to understand why an API layer will benefit them:

  • You can protect the servers that host your information.
  • You’ll protect the data that powers your business.
  • It helps your partners and collaborators get the data and information they need, easily and safely.
  • You can manage your server resources by tweaking rate limits (a client-side sketch of honouring these follows below).
  • You stay in ultimate control of how data flows in and out of your servers.
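
On that rate-limit point, here’s a minimal sketch of the client side, assuming the server signals throttling with an HTTP 429 status and a Retry-After header; that’s a common convention, but not one every API follows:

```python
import time
import requests

def get_with_backoff(url, params=None, max_retries=5):
    """GET a URL, waiting and retrying whenever the server says to slow down."""
    for attempt in range(max_retries):
        resp = requests.get(url, params=params)
        if resp.status_code != 429:  # 429 = Too Many Requests
            return resp
        # Honour the server's suggested wait, else fall back to exponential backoff.
        wait = int(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError("Rate limit was never lifted after retries")
```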

The analogy of an API as a bartender is particularly apt. Instead of opening up your data ‘buffet’ style, where people tend to adopt a grab-first, ask-questions-later mentality, it’s much better to have someone behind the counter controlling how much food gets consumed. Pretty neat stuff.

Thanks for reading my first blog post. I’ll be posting more updates as my project progresses and thoughts bubble to the surface.
