Photo by Nikko Macaspac on Unsplash

This will make you understand how hard Data Science really is

Thijs Essens
Towards Data Science
8 min read · Jun 30, 2020


tl;dr: it’s very hard

Every day, people try to live up to computer science savants and break into the field of Data Science.

How hard is it? What does it take? Where do I start?

In this blog I’ll summarize the 3 hardest challenges I faced doing my first Data Science project in this Kaggle Notebook:

  1. You know nothing
  2. Data preparation is critical and time-consuming
  3. Interpret your results

Opinions are my own.

Before getting into any details, there is an essential part that people seem to gloss over in their explanations, or that is simply buried in their small code snippets. In order to use any of the advanced libraries, you have to import them into your workspace. It is best to collect those imports at the top of your notebook.

My example below:
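Mine looks roughly like this (a sketch covering the libraries used later in this post; trim it to what you actually need):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import r2_score, mean_squared_error
from scipy import stats
from scipy.special import inv_boxcox
from distutils.util import strtobool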

You know nothing

For my first Data Science project, I created a short blog on starting an Airbnb in Amsterdam. I only used basic data analysis methods and regression models.

Regression models are probably the most basic parts of data science. A typical linear regression function will look like this:
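(A minimal sketch with Sklearn, assuming X holds your feature columns and y your target, e.g. the listing price.)

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hold out 30% of the data to evaluate on later
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

lm_model = LinearRegression()
lm_model.fit(X_train, y_train)
y_pred = lm_model.predict(X_test)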

This will return several outputs that you will then use to evaluate your model performance.

There are more than 40 techniques used by data scientists. This means I only used 2.5% of all the models out there.

I’ll be generous. Given my statistics course in University 8 years ago, going into this project I knew about 10% of that regression model already. That means I knew 0.25% of the entire body of knowledge that I know is out there. Then add a very large amount of things I don’t know I don’t know.

My knowledge universe in Data Science looks something like this:

Image by Author

As if that isn’t bad enough, you will find articles like these, describing exactly all of your shortcomings.

This project took me about 4 weeks, and let’s say that’s a pretty average rate for learning new data science models. At that pace it will take me on the order of 800 weeks to learn all the models I have heard of so far, and probably another 5 times that to learn (probably not even close to) everything in the data science field.

Between 15 and 75 years of learning ahead.

Me after my learning:

Photo by Donald Teel on Unsplash

Data prep

I’ve worked with data for over 5 years at Google.

Too bad all my previous experience is in SQL and Data Scientists are big fans of Pandas. They’re just such animal lovers.

The challenge here is two-fold: 1) Knowing what to do, 2) Knowing how to do them.

Even with the help described below, the data preparation part takes about 80% of your time or more.

Knowing what to do

The ways to manipulate your data to be ready for ingestion in your models are endless. It goes into the deep underbelly of Statistics and you will need to understand this thoroughly if you want to be a great Data Scientist.

Be prepared to run through these steps many times. I’ll give a couple of examples that have worked for me for each step.

Clean data quality issues

Your data sample size permitting, you should probably get rid of any NaN values in your data; they cannot be ingested by the regression model. To find the ratio of NaN values per column, use this:

np.sum(df.isnull())/df.shape[0]

To drop all rows with NaN values use this:

df = df.dropna()  # dropna() returns a copy, so reassign it (or pass inplace=True)

Another data quality problem I ran into was having True and False data as strings instead of a Boolean data type. I solved it using this distutils.util.strtobool function:
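(A sketch of the idea; the column names here are just examples, and Airbnb listings typically store these values as 't'/'f' strings, which strtobool understands. It assumes NaN values were already dropped.)

from distutils.util import strtobool

# Example true/false string columns; replace with your own
bool_cols = ['host_is_superhost', 'instant_bookable']
for col in bool_cols:
    df[col] = df[col].apply(lambda x: bool(strtobool(str(x))))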

Please don’t assume I actually knew how to use Lambda functions before starting this project.

It took me a lot of reading to understand them a little bit. I really liked this article on “What are lambda functions in python and why you should start using them right now”.

Finally, my price data was a string with $ signs and commas. I couldn’t find an elegant way to code the solution, so I botched my way into this:
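(Roughly like this, assuming the column is called price.)

# Strip the $ signs and thousands separators, then cast to float
df['price'] = (
    df['price']
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(float)
)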

Cut outliers

Check if there are any outliers in the first place; a boxplot can be very helpful:

sns.boxplot(x=df['price'])

Get fancy and use a modified z value (answer from StackOverflow) to cut your outliers:
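(A sketch of the median-absolute-deviation version; the 3.5 cutoff is a common rule of thumb, not gospel.)

def modified_z_score(series):
    # Modified z-score based on the median and the median absolute deviation (MAD)
    median = series.median()
    mad = (series - median).abs().median()
    return 0.6745 * (series - median) / mad

# Keep only rows whose price is not an extreme outlier
df = df[modified_z_score(df['price']).abs() < 3.5]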

Or enforce hardcoded conditions to your liking, like so:

df = df[df['price'] < 300]

Normalization & combine variables

Not to be confused with a normal distribution.

Normalization is the process of scaling individual columns to the same order of magnitude, e.g. 0 to 1.

You only have to consider this step if you want to combine certain variables, or if you know it otherwise affects your model.

There is a preprocessing library available as part of Sklearn. It can be instrumental in delivering on some of these data preparation aspects.

I found it hard to normalize certain columns and then neatly put them back into my DataFrame. Therefore I’ll share my own full example of combining beds/bedrooms/accommodates variables below:
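(A sketch of the idea, assuming the columns beds, bedrooms, and accommodates get combined into a single size feature.)

from sklearn.preprocessing import MinMaxScaler

size_cols = ['beds', 'bedrooms', 'accommodates']

# Scale each column to the 0-1 range, writing the result straight back into the DataFrame
scaler = MinMaxScaler()
df[size_cols] = scaler.fit_transform(df[size_cols])

# Combine the three scaled columns into one 'size' feature and drop the originals
df['size'] = df[size_cols].mean(axis=1)
df = df.drop(columns=size_cols)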

Create normal distribution for your variables

It’s useful to look at all the data input variables individually and decide how to transform them to better fit a normal distribution.

I’ve used this for loop:
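(Something along these lines, plotting each numeric column’s distribution.)

import matplotlib.pyplot as plt
import seaborn as sns

# Eyeball each numeric column to decide which ones need transforming
for col in df.select_dtypes(include='number').columns:
    sns.histplot(df[col], kde=True)
    plt.title(col)
    plt.show()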

Once you know which ones you want to transform, consider using a Box-Cox transformation like this:
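(A sketch with a placeholder list of skewed columns; Box-Cox requires strictly positive values.)

from scipy import stats

skewed_cols = ['price', 'size']  # placeholder list of columns to transform
lambda_list = []

for col in skewed_cols:
    # boxcox returns the transformed values and the fitted lambda
    df[col], fitted_lambda = stats.boxcox(df[col])
    lambda_list.append(fitted_lambda)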

Note that it is important to also keep the list of lambdas used by the Box-Cox transformation. You’ll need those to invert your coefficients when you are interpreting the results.

Standardization

The StandardScaler assumes your data is already normally distributed (see box cox transformation example above) within each feature and will scale them such that the distribution is now centered around 0, with a standard deviation of 1.

My full example below:
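(A sketch, standardizing the numeric feature columns while leaving the target alone.)

from sklearn.preprocessing import StandardScaler

# Everything except the target ('price') gets centred at 0 with a standard deviation of 1
feature_cols = df.drop(columns=['price']).select_dtypes(include='number').columns
scaler = StandardScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])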

Create dummy variables

Your categorical variables can contain some of the most valuable information in your dataset. For a machine to understand them, you need to translate every single unique value to one column with 0 or 1.

1 if it is that value, 0 if it isn’t.

There are several ways of doing this. I’ve used the following method:
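(pd.get_dummies does the heavy lifting; this sketch assumes the categorical columns are the remaining object-typed ones.)

import pandas as pd

# One-hot encode every object-typed column; drop_first avoids one redundant dummy per category
cat_cols = df.select_dtypes(include='object').columns
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)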

Multicollinearity

You should check whether some of your input variables are correlated with each other before you put them into your model.

You can check multicollinearity like my example below and try to take out some variables you don’t believe add any extra value.
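One common way to eyeball it is a correlation heatmap; which variables to drop afterwards is a judgment call:

import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations between the numeric input variables
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.show()

# Then drop one variable out of any highly correlated pair that you
# don't believe adds extra value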

This will hopefully prevent some eigenvalue and eigenvector issues.

Getting to the bottom of eigenvalues and eigenvectors might cost you another 15 years. Proceed with caution.

You could go down the rabbit hole of dimensionality reduction through Principal Components Analysis (which is more commonly applied to image recognition).

Additionally, this article on Feature Selection was very eye-opening.

I frankly didn’t have time for all of these advanced methods during this project, but I want to understand all of the above better before applying data science to my work projects.

Knowing how to do them

As a Dutch person, I am very proud of the origins of Python. It’s an incredibly powerful and accessible language.

The basics should take around 6–8 weeks on average according to this post.

Therefore, I haven’t even learned the basics yet with my 4 weeks of experience. On top of that, I’m trying to learn a couple of specialized libraries as well, like Seaborn, Sklearn, and Pandas.

“They are foundational, regardless of what project you do.”

Thank you to all the Stack Overflow and self-publishing engineers out there; you saved many hours of my life.

In particular, Jim is my new best friend.

Copy code and learn from all those articles; they are your best hope of having working code at the end of the day.

Let me summarize the three most useful articles on the above three libraries:

  1. Seaborn: How to use Python Seaborn for Exploratory Data Analysis
  2. Pandas: A Quick Introduction to the “Pandas” Python Library
  3. Sklearn: A Brief Tour of Scikit-learn (Sklearn)

Spend some time with your new friends:

Photo by chuttersnap on Unsplash

Interpreting your results

In the previous step, you have most likely manipulated your data to an extent that it becomes unrecognizable for interpretation.

As per my project article, the listing price was Box-Cox transformed, and 1 dollar in the output did not resemble a real dollar.

The easy way out for me was to take a measure of the relative strength of the relation with listing price, instead of commenting on the actual value of the coefficients. You can see the results in my visuals.

The more proper solution is to invert your Box-Cox and other transformations back to the original scale, e.g.:

inv_boxcox for the Box-Cox transformation:

>>> from scipy.special import boxcox, inv_boxcox
>>> y = boxcox([1, 4, 10], 2.5)
>>> inv_boxcox(y, 2.5)
array([ 1.,  4., 10.])

Hopefully, you remembered to save that lambda_list from my Box-Cox transformation example, because the 2.5 here is your lambda value for inverting the transformation.

inverse_transform for the StandardScaler:
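(A one-line sketch, reusing the scaler and feature_cols from the standardization example above.)

# Undo the standardization so values are back on their original scale
df[feature_cols] = scaler.inverse_transform(df[feature_cols])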

Finally, you need to get extremely well versed in your evaluation metrics.

There are many evaluation metrics out there, like RMSE and coefficients / p-values.

Most of the machine learning models come with evaluation metrics. The same is true for LinearRegression in Sklearn; see sklearn.metrics.

In my example, I’ve focused on R-squared and RMSE:
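(A sketch, using the test split and predictions from the regression example above.)

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'R-squared: {r2:.3f}, RMSE: {rmse:.2f}')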

One thing that made looking at RMSE very valuable was that it levels the playing field between models.

I could quickly tell that throwing automated machine learning at the problem didn’t improve my model very much. H2O AutoML was reasonably easy to implement:
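(A sketch of the AutoML call, assuming the prepared DataFrame df and price as the target.)

import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Hand the prepared data to H2O and let AutoML try up to 20 models
hf = h2o.H2OFrame(df)
train, test = hf.split_frame(ratios=[0.7], seed=42)

aml = H2OAutoML(max_models=20, seed=42)
aml.train(y='price', training_frame=train)

print(aml.leaderboard)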

This thing was wild! It runs through a number (max=20 for my example) of different models and returns a leaderboard of models.

However, as said, this really didn’t improve my RMSE by much.

“It still all depends on the richness of your input”

Please also be very careful and solicit the help of more knowledgeable people where available.

Use your resources and call a friend:

Photo by Louis Hansel on Unsplash

Conclusion

In this article, I tried to make you understand how hard Data Science really is.

  1. We took a look at the amount of learning I still need to do. Only about 15 to 75 years…
  2. Then we looked at data preparation and figured out it’s a huge headache that will take about 80% of your time.
  3. Finally, we found out that interpreting your results is probably best left to a more knowledgeable person. Call a friend!

Staying true to the growth mindset, let this not deter you.

You are just not a Data Scientist YET.

Start your first project!

