My First Data Science Project

Published in Towards Data Science, Nov 18, 2019

Why I am writing this post

As someone who has been in the data science field for a while, I have received quite a few questions from enthusiastic graduating students asking “how to become a data scientist” or “what do data scientists do on a daily basis”. On the other hand, I occasionally hear people joke that the statement “Data Scientist is the Sexiest Job of the 21st Century” is no longer true: most of a data scientist’s time can be spent on cumbersome tasks like running SQL to pull data and building dashboards for reporting, and (more frustratingly) you need to maintain them! They ask what my experience as a Data Scientist has been, how I feel about this phenomenon, and whether Data Science is still “sexy” 7 years after the original statement was made. Many similar questions revolve around this theme, so I think it could be helpful to share my personal view through a story about my first data science project. It could serve as a point of reference, and hopefully as inspiration, for those who still aspire to join the field, those who have just started their first Data Science job, and even those who have been in the field for a long time.

Note: this is not about my actual first project as a data scientist. As you know, when getting started with any job, one needs to do trivial tasks at the beginning to understand the basics. This is a project I did in my third month after starting my data science career at TrueCar, Inc. back in 2013.

The story

After graduate school, I joined TrueCar to start my first Data Scientist job. For those who don’t know, TrueCar is an online car buying platform that provides transparent pricing information (e.g. recent vehicle transaction prices), so customers don’t overpay for a car. Back then, as a fresh graduate, I was fascinated by all the projects the data science team was working on. They covered literally every business aspect a data-focused technology company runs on: building pricing models to support core products, optimizing the website with A/B testing, etc. After the first two-month onboarding period, I was assigned to work in the search engine marketing area.

Search engine marketing (SEM) is a game invented by search engine giants like Google. An over-simplified view of the game is: companies decide the bid price for a keyword (e.g. “new toyota corolla” is a keyword three words long). If a user searches for the exact keyword and the company’s bid is higher than its competitors’, the company wins the slot and can put its product information right on top of all the search results the user sees. If the user clicks the product link, the company pays the search engine giant, usually the amount of the second-highest bid. TrueCar had been using SEM for several years to help with customer acquisition, and there were already hundreds of thousands of keywords in the system to maintain, organized into different campaigns. My assigned project was to report each SEM campaign’s performance and automate the reporting using Google’s AdWords API.
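
The over-simplified bidding game described above can be sketched in a few lines of Python. This is only an illustration of the "highest bid wins, pays the second-highest bid" rule; the function name `run_auction` and the advertiser names are hypothetical, and real search ad auctions also weigh factors such as ad quality that are ignored here.

```python
def run_auction(bids):
    """Pick the winner of one keyword slot under a second-price rule.

    bids: dict mapping advertiser name -> bid per click (hypothetical data).
    Returns (winner, price_per_click), or None if nobody bids.
    """
    if not bids:
        return None
    # Rank advertisers from highest to lowest bid.
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    # The winner pays the second-highest bid. With a single bidder we
    # assume (for this sketch only) that they pay their own bid.
    price = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return winner, price


# Hypothetical bids on the keyword "new toyota corolla".
result = run_auction({"advertiser_a": 2.00, "advertiser_b": 1.50, "advertiser_c": 1.00})
print(result)
```

The price is only charged when a user actually clicks the ad, which is why the cost of a keyword depends on both the bid and the click volume.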

I was very happy with the project as a way to practice my Python skills and learn more about the AdWords API, although it did not look “sexy” at all. Anyway, I started to play around with AdWords, wrote some code, soon collected a lot of keyword performance data, and used it to compile a report.

A bit of curiosity

After generating the first campaign performance report, I found that “cost per conversion” (CPC) was the metric everyone cared about. This is not surprising, as the metric measures how much we pay for one converted customer: the lower the better. I also did some research on the SEM industry and realized that the “keyword” is a critical aspect of search engine marketing: keywords are not created equal, and two keywords can have dramatically different performance in terms of cost per conversion. Back then, two questions came to the top of my mind: 1. Which keywords perform better (have a lower CPC)? 2. Do we already have all the better-performing keywords in the system?

Some science, some algorithm

It was impossible to manually look through all the keywords in the system and get any meaningful insight. Imagine there could be keywords like “i really want to buy a hyundai sonata” (yes, I just searched for this on Google, so it is a keyword other companies can bid on, and it has had at least one search). So I needed to be more strategic and add some “science” to the flow. The algorithm, or process, I developed was as follows:

Step 1. For each existing keyword in all campaigns, generate its 1-gram and 2-gram sub-keywords. For example, the keyword “new toyota corolla” generates the following 5 sub-keywords: “new”, “toyota”, “corolla”, “new toyota”, “toyota corolla”.
Step 2. Assign each sub-keyword the performance of the keyword it came from, as if it were the keyword itself. For example, if “new toyota corolla” has $100 of SEM spending, resulting in 10 clicks and 2 conversions, then all 5 sub-keywords are credited with the same performance.
Step 3. Group by sub-keyword and sum the cost and conversions across keywords; dividing total cost by total conversions gives each sub-keyword’s average CPC, with the number of clicks kept as a reference for relative size.
Step 4. Sort all sub-keywords by CPC, with the lowest CPC on top.
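
A minimal Python sketch of the four steps above, not my original code. The function names, the two extra sample rows, and their numbers are hypothetical (only the “new toyota corolla” figures come from the example in Step 2). It also makes two assumptions the steps do not spell out: a sub-keyword that appears twice in the same keyword is credited once, and sub-keywords with zero conversions are skipped because their CPC is undefined.

```python
from collections import defaultdict


def sub_keywords(keyword, max_n=2):
    """Step 1: return all 1-gram .. max_n-gram sub-keywords of a keyword."""
    words = keyword.split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


def rank_sub_keywords(rows, max_n=2):
    """Steps 2-4: credit, aggregate, and sort sub-keywords by CPC.

    rows: iterable of (keyword, cost, clicks, conversions) tuples.
    Returns a list of (sub_keyword, cpc, clicks), lowest CPC first.
    """
    totals = defaultdict(lambda: {"cost": 0.0, "clicks": 0, "conversions": 0})
    for keyword, cost, clicks, conversions in rows:
        # Step 2: every sub-keyword inherits the full keyword performance.
        # Assumption: set() credits a repeated sub-keyword only once.
        for gram in set(sub_keywords(keyword, max_n)):
            totals[gram]["cost"] += cost
            totals[gram]["clicks"] += clicks
            totals[gram]["conversions"] += conversions

    # Step 3: CPC here means cost per conversion, as defined in this post.
    ranked = []
    for gram, t in totals.items():
        if t["conversions"] > 0:  # assumption: skip undefined CPC
            ranked.append((gram, t["cost"] / t["conversions"], t["clicks"]))

    # Step 4: lowest CPC on top.
    ranked.sort(key=lambda row: row[1])
    return ranked


# The first row is the example from Step 2; the other two are made up.
rows = [
    ("new toyota corolla", 100.0, 10, 2),
    ("best price toyota corolla", 30.0, 8, 3),
    ("toyota corolla lease", 60.0, 5, 1),
]
ranked = rank_sub_keywords(rows)
for gram, cpc, clicks in ranked[:5]:
    print(f"{gram:15s} CPC=${cpc:6.2f} clicks={clicks}")
```

Keeping the click count next to each CPC matters: a sub-keyword with a low CPC but very few clicks is a hint worth testing, not yet a reliable result.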

This is certainly not the best algorithm: quite a few of the sub-keywords have no meaning at all, like “buy a”, “the”, etc. But a few interesting sub-keywords also surfaced at the top of the list, and one of them was “best price”. This sub-keyword had a surprisingly low CPC but had received only a few hundred clicks. Um …, what does this mean?

Plausible business explanation

Why does “best price” perform better? Once some business knowledge is blended in, everything starts to make sense: people who convert well on TrueCar usually have a strong intention to “find a good deal”, and “best price” is a common way to express that interest in a search query. From this perspective, it makes sense that “best price” related keywords have a lower CPC, since the intention to convert (e.g. to use the TrueCar product) is higher. On the other hand, my analysis showed we were receiving only a limited number of clicks, contributed by very few keywords, which suggested an opportunity to expand further!

Soon I summarized the finding in a few slides to show this “hidden gem”. People were a bit suspicious of my finding given the very limited sample for this specific keyword. However, no one wants to refuse an opportunity to get more sales at half the cost, so I got a green light to test the idea! To skip the technical jargon and details: I created a set of keywords around “best price”, uploaded everything into AdWords, and started a new campaign. After one week, it had received a good number of clicks with a substantially lower CPC, and everyone was very happy. After that, it was purely execution: using various ways to generate many more keywords based on the “best price” component, we scaled up to multiple campaigns, and they all performed consistently, with a much lower CPC than peer campaigns. This resulted in a substantial number of cheap conversions for the company, and luckily, my first data science project had a happy ending :)

Looking back

Fast forward to today: many things have changed in the past 6 years. Nowadays, the SEM industry is much more intelligent, with most keyword selection automated by various tools. Just today, when I searched for “best price toyota corolla”, four SEM results were returned, and TrueCar was not one of them. So I know my insight on this specific keyword is already outdated. On the data science side, this project could not be considered glorious: the algorithm was rudimentary (simple n-gram feature extraction), the statistics lacked some rigor (a very simple way to calculate the CPC confidence interval), and the engineering process was quite hacky (definitely not my best Python code). However, I still feel proud of this project, because it gave me a taste of the value of data, and showed me that, as a Data Scientist, I could unlock it and enjoy the success.

My thoughts on Data Science

Now you know the story of my first data science project. For sure, this is just one of thousands of ways one can start as a Data Scientist. Hopefully, this story sheds light on three elements that (I think) a “data scientist” should possess:

good data analysis and engineering skills: without them you may get stuck at many steps along the way (e.g. extracting n-grams, automating SEM campaign launches, etc.)
reasonable business sense: data helps to unveil hidden business knowledge, so you had better know the basics of the business first
(most importantly) your curiosity: this is what keeps you happy and can transform “cumbersome” tasks into interesting projects

To end the story, I would like to quote an earlier statement of mine about the “Data Scientist”. I made it around five years ago when my (then) data science team was interviewed by a local newspaper (link), and I think it still holds today.

“Being a data scientist requires a hybrid of statistics, programming, and business sense. It is a ‘scientist’ who can explore insights (via programming & statistics) from the data for company’s business needs.”

Thanks for reading, and feel free to share this story with those who may find it useful. The article was first published on LinkedIn (https://www.linkedin.com/pulse/my-first-data-science-project-pan-wu) and re-posted on Medium.