How to become a data scientist?

Taesun Yoo
Towards Data Science
8 min read · Jul 5, 2019


Introduction:

I am pretty sure that many of us have come across the Harvard Business Review article from 2012 that called data scientist the sexiest job of the 21st century. Also, research conducted by the McKinsey Global Institute back in 2013 projected that there would be between 425,000 and 475,000 unfilled data analytics positions in North America by 2018. The take-home message is that a constant stream of analytic talent will be required in every industry where companies collect and use data for competitive advantage.

What exactly is a data scientist?

In an oversimplified description, a data scientist is a professional who can work with a large amount of data and extract analytical insights. They communicate their findings to stakeholders (e.g., senior leadership, management, and clients). Companies then benefit by making better-informed decisions that drive business growth and profitability, depending on the context of the industry.

Why is it so hard to become a data scientist?

Data science is a hybrid of many disciplines. It draws on subject areas such as math (e.g., statistics and calculus), database management, data visualization, programming/software engineering, and domain knowledge. In my opinion, this may be the primary reason why people interested in an entry-level data science career often feel completely lost. Most people don’t know where to start because, depending on their educational background and work experience, they may be missing one area completely or be weak in several.

However, the good news is that you don’t need to worry too much about it. These days, we face the opposite problem: there are simply too many resources to choose from, so you don’t necessarily know which one will work best for you. In this article, I will focus on how to become a data scientist from three perspectives: where to learn, what to learn, and how to learn.

Section 1: Where to learn data science?

Figure 1. Data Science Education Path and Job Placement Rate

Let’s start with where to learn data science. There are three major pathways to a data science education: Massive Open Online Courses (MOOCs), a university degree or certificate, and boot camp training.

Figure 1 illustrates the estimated time commitment vs. job placement success rate for each option. It suggests that a boot camp education can give you an edge in landing a data scientist job more quickly than the other two options.

Table 1 provides more detailed information about each education pathway. Each option has pros and cons with regard to cost, flexibility, and program length. The best tip for making the right decision is to ask yourself what matters most to you. For example, you might have the luxury of time and want to minimize the cost. Or you might want to land a job as soon as possible, even if the initial investment is high.

Table 1. Breakdown Analysis of Data Science Learning Path Comparison

Section 2: What to learn in data science?

There is certainly a lot to learn as a data scientist. Let’s look at the data science learning pathway as five major steps.

Figure 2. Data Science Learning Pathways

Step 1: catching up on the basic math behind statistics, calculus, and linear algebra is a good start. A data scientist needs this to understand the mechanisms behind how different algorithms work. It builds intuition about how to tweak or modify algorithms to solve unique business problems. Knowing statistics also helps you translate findings from designed experiments (e.g., A/B tests) into key business metrics.
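To make the A/B testing point concrete, here is a minimal sketch of how a test result can be turned into a business metric (relative lift in conversion rate) together with a significance check. It assumes a two-sided two-proportion z-test, which is one common choice and not necessarily what a given team uses. The function name and the visitor and conversion counts are hypothetical, chosen only for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Hypothetical experiment: 10,000 visitors per variant.
rate_a, rate_b, z, p_value = two_proportion_z_test(500, 10_000, 560, 10_000)

# Business metric: relative lift of the variant over the control.
relative_lift = (rate_b - rate_a) / rate_a

print(f"control={rate_a:.2%} variant={rate_b:.2%} lift={relative_lift:.1%}")
print(f"z={z:.2f} p={p_value:.3f}")
```

The lift is the number a stakeholder usually cares about, while the p-value tells you how cautious to be before reporting it.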

Step 2: data scientists must be familiar with a toolset for working with data in various environments. That toolset combines SQL, the command line, coding, and cloud tools. Here is a summary of how each tool is used. For extracting and manipulating data from relational databases, SQL is the fundamental language and is used almost everywhere. For general programming (e.g., functions, loops, and iteration), Python is a good choice because it comes with a rich ecosystem of libraries (e.g., for visualization and machine learning). For an additional boost, knowing the command line provides extra benefits, especially for running jobs in cloud environments.

Step 3: this is the best time to pick a language for building your data science foundation. Among commercial software, you can choose between SAS and SPSS. Among open-source platforms, many people choose either R or Python. From here, you can pick up the concepts of data munging/wrangling (e.g., importing data, aggregation, pivoting, and missing value treatment). After this comes the most fun part: getting to know your data through data visualization (e.g., bar charts, histograms, pie charts, heat maps, and map visualizations).
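The wrangling concepts above can be sketched in a few lines of Python. This example deliberately uses only the standard library so that it runs anywhere. In practice a library such as pandas offers the same operations more concisely. The sales data, the column names, and the choice to fill missing values with the mean are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical sales data; in practice this would come from a file or database.
RAW_CSV = """region,quarter,revenue
East,Q1,100
East,Q2,
West,Q1,80
West,Q2,120
East,Q2,140
"""

# 1. Import: parse the CSV text into a list of dictionaries.
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

# 2. Missing value treatment: fill blank revenue with the mean of known values.
known = [float(r["revenue"]) for r in rows if r["revenue"]]
fill_value = mean(known)
for r in rows:
    r["revenue"] = float(r["revenue"]) if r["revenue"] else fill_value

# 3. Aggregation: total revenue per region.
total_by_region = defaultdict(float)
for r in rows:
    total_by_region[r["region"]] += r["revenue"]

# 4. Pivot: regions as rows, quarters as columns, summed revenue as values.
pivot = defaultdict(lambda: defaultdict(float))
for r in rows:
    pivot[r["region"]][r["quarter"]] += r["revenue"]

for region in sorted(pivot):
    print(region, dict(sorted(pivot[region].items())))
```

Mean imputation is only one of several ways to treat missing values. Dropping the row or using a per-group median may be more appropriate depending on the data.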

Step 4: you can choose between the applied machine learning pathway and the big data ecosystem pathway. Note that you can always come back to master the other path later. In my case, I chose to learn applied machine learning first. It covers building a machine learning model end to end (i.e., from data exploration to model deployment). For big data, I cover where to obtain that education (e.g., books and courses) in Section 3.

Step 5: this is the most crucial step for showcasing your potential as a data scientist candidate. Once you are familiar with doing data science, you need a project portfolio. A portfolio is your best opportunity to show what you have done through learning and work experience. Start with data collection (e.g., picking an existing data set or scraping data on your own), come up with a hypothesis, perform exploratory analysis (i.e., extract some interesting insights), build your machine learning model(s), and finally share your findings through a write-up or presentation. In my case, I did both a write-up and a video podcast while working on a capstone project with an assigned mentor. I cannot emphasize enough the importance of having a mentor who works with you one on one. Your mentor is the best person to guide you and to ask for help when you get stuck on project ideas, model tuning, communicating your results, and so on. In fact, some research has reported that having a mentor can boost your career five times more than not having one.

Section 3: How to learn data science?

In this section, you are going to learn how to pick the best resources for becoming a data scientist. I want to make recommendations based on my learning experience.

Figure 3. Suggested Resource on Learning Data Science Education

For SQL, the DAT201x course offered by Microsoft on edX is one of the best choices. The course covers data types, filtering, joins, aggregation (GROUP BY), window functions, and advanced concepts (e.g., stored procedures). It gives you plenty of practice on a well-known sample data warehouse (AdventureWorks). Alternatively, you can use the Mode Analytics platform to practice and improve your SQL skills. The best thing about Mode Analytics is that you don’t need a SQL server or a sample data warehouse installed on your machine. All you need is a free account and an Internet connection.
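If you want to try a few of these SQL concepts before signing up for anything, the sketch below runs a query with filtering, a join, and aggregation against an in-memory SQLite database from Python. Note the assumptions: SQLite's dialect differs from the Transact-SQL taught in DAT201x, the `customers` and `orders` tables are hypothetical (they are not part of AdventureWorks), and window functions and stored procedures are left out.

```python
import sqlite3

# An in-memory database: nothing to install and nothing written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES
    (1, 'Ana', 'Toronto'), (2, 'Ben', 'Ottawa'), (3, 'Cho', 'Toronto');
INSERT INTO orders VALUES
    (1, 1, 50.0), (2, 1, 70.0), (3, 2, 20.0), (4, 3, 200.0), (5, 3, 5.0);
""")

query = """
SELECT c.name, COUNT(*) AS n_orders, SUM(o.amount) AS total
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id   -- join the two tables
WHERE o.amount >= 10                        -- filter rows before grouping
GROUP BY c.name                             -- aggregate per customer
HAVING SUM(o.amount) > 100                  -- filter groups after aggregation
ORDER BY total DESC;
"""

results = conn.execute(query).fetchall()
for name, n_orders, total in results:
    print(name, n_orders, total)

conn.close()
```

The difference between WHERE (applied to rows) and HAVING (applied to groups) is a frequent source of confusion, so it is worth changing the thresholds and watching how the result changes.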

For machine learning, there are two resources that I like to recommend. The first is well known to data science practitioners: Andrew Ng’s Machine Learning course on Coursera. I used this course to understand basic concepts and to pick up tips on tuning my machine learning models. For hands-on coding experience, I highly recommend the book Python Machine Learning (2nd Edition) by Sebastian Raschka. I really think this is the best machine learning book. It covers the basic mechanisms of each algorithm and offers many coding examples and supplemental references (e.g., research articles). The best thing about the book is that the author walks through how to implement each machine learning algorithm line by line, with thorough explanations. This is important because, as many data scientists point out, you should be able to write the code from scratch and know how to implement it. These days, there are many complex problems that you cannot solve directly with existing Python libraries.
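To give a flavor of what implementing an algorithm from scratch looks like, here is a minimal perceptron written in plain Python. This is my own illustrative sketch and not code taken from the book or the course. The function names, the learning rate, the number of epochs, and the toy data set are all assumptions.

```python
def predict(weights, bias, x):
    """Return 1 if the weighted sum is non-negative, else 0."""
    net_input = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if net_input >= 0.0 else 0


def train_perceptron(samples, labels, learning_rate=1.0, epochs=20):
    """Train a perceptron with the classic error-driven update rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            prediction = predict(weights, bias, x)
            # The update is zero when the prediction is already correct.
            update = learning_rate * (target - prediction)
            weights = [w + update * xi for w, xi in zip(weights, x)]
            bias += update
    return weights, bias


# Toy, linearly separable data: the logical AND of two inputs.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]

weights, bias = train_perceptron(X, y)
predictions = [predict(weights, bias, x) for x in X]
print(predictions)
```

Writing the update rule by hand makes it clear why the perceptron only converges on linearly separable data, which is something a library call tends to hide.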

Here is a full list of resources you can reference for learning each building block of the data science education.

1. Math:

· Khan Academy Math Track

· MIT Open Courseware: linear algebra and calculus

· Udacity: Intro and Inferential Statistics

2. Data Science Toolkit:

· SQL

o edX: DAT201x, Querying with Transact-SQL (*)

o Mode Analytics: SQL Tutorial (Intro to Advanced)

o WiseOwl: SQL Tutorial (Intro to Advanced) (*)

· Command Line

o Book: Data Science at the Command Line

· Python Coding

o Udemy: Complete Python Bootcamp

o Book: Learn Python the Hard Way (3rd Edition)

o Book: Automate the Boring Stuff with Python

3. Machine Learning:

· Coursera: Machine Learning by Andrew Ng (*)

· Coursera: Applied Machine Learning (U Michigan)

· Harvard: CS109 — Intro to Data Science (*)

· Book: Python Machine Learning (2nd Edition) by Sebastian Raschka (*)

· Book: Python Machine Learning by Example

· Book: Intro to Machine Learning with Python

4. Big Data:

· Hadoop

o Book: Hadoop: The Definitive Guide

o Udacity: Intro to Hadoop and MapReduce

o IBM: Hadoop Fundamentals Learning Badge

· Spark

o edX: UC Berkeley Spark Courses (CS105, CS120)

o Datacamp: Intro to PySpark, Building Recommendation Engine in PySpark

o Book: Learning PySpark, Advanced Analytics with Spark

Bonus Section: Ask for Help and Networking

Now, I would like to wrap up this article with a few extra tips. In the beginning, as a new data science enthusiast, you won’t necessarily have a mentor to guide your learning. You therefore need a place to ask the data science community for opinions and feedback. The good news is that there are several forums where you can ask for help with your problems. Websites like Stack Overflow and Quora let you post a question and receive replies.

Another tip is related to networking. This applies to anyone who is looking for new opportunities and wants to build connections. In Toronto, there are many local meetups and big conferences related to data science. Try to attend as many events as you can and introduce yourself (e.g., your motivation, objectives, and passion). Also, if you have the opportunity to connect with speakers and event organizers, work to establish meaningful connections with them. One useful tactic I learned from experience is to seek opportunities to present my project portfolio on whatever medium is available, whether that means presenting at a local meetup or giving a video webcast during remote data science office hours. From this experience, I was able to learn from my silly mistakes and improve from one presentation to the next. Being able to deliver an effective presentation and clearly communicate analytical insights adds a lot of value to a data scientist candidate.

Thanks for reading this article. I hope to share more enjoyable and useful information as I gain more experience on my journey to becoming a data scientist.
