How to Teach Yourself Data Science in 2021

Part 1 — SQL, Python, R and Data Visualization

Travis Tang
Towards Data Science

--

Recently, I graduated from Chemical Engineering and got in my first role as a data analyst in a tech company. I documented my journey here from Chemical Engineering to Data Science. Since then, as I spoke to students from my school about the move, many expressed the same interest and the same question…

‘How did you move from engineering to data science?’

That was the exact question I ask myself — how can I make the move? That same thought bugged for and pushed to start pursuing the skills of a data scientist a little over a year ago.

It certainly wasn’t the dearth of information that makes the research difficult. Quite the opposite, the deluge of resources to learn data science makes it hard to weed out the best resources from the average ones.

So many choices, so little time… Photo by Fahrul Azmi on Unsplash

But first, let’s understand…

What is data science?

Ah, this is a difficult question to answer that stumble hiring managers and interviewees alike. Fact is, different companies define data science differently, making the term ambiguous and somewhat elusive. Some say that it is programming, others contend that it’s mathematics, while others say it is about understanding the data. Turns out, they’re all somewhat correct. To me, the definition that I agree most is this —

Data science is the inter-disciplinary field that uses techniques and theories drawn from the fields of mathematics, computer science, domain knowledge. [1]

Data Science as the Intersection of Multiple Disciplines. Illustration by author.

To me, this is how data science looks like in an image. I illustrated the boundaries between each segment of knowledge as blurred to demonstrate my perception that knowledge from each of these fields blend together to form what is known as ‘data science’.

Okay, so how do I learn data science?

In this series of blog posts, I would like to highlight some of the classes that I have taken along the journey, along with their pros and cons. Through that, I hope to help people who were in my shoes in planning their self-learning journey in data science. These posts are:

  • Part 1 — Data Processing with SQL, Python and R (you’re here!)
  • Part 2 — Mathematics, Probability and Statistics
  • Part 3 — Computer Science Fundamentals
  • Part 4 — Machine Learning (read it here!)

In this post, I will highlight how I learnt about the Data Processing knowledge required of a data scientist. In order to process data, one generally needs to learn to

  1. extract data from a database using SQL (Standard Query Language), and
  2. clean, manipulate, analyze data (typically using Python and/or R)
  3. visualize data effectively.

1. Data Extraction with SQL

SQL is the language to communicate with a database where the data lives. If the data is a treasure buried underground, then SQL is the shovel to dig up the raw form of the treasure. More concretely, it allows us extract information from one or a combination of several tables in the database.

Mastering SQL need not be hard. Photo by Caspar Camille Rubin on Unsplash

There are many different ‘flavours’ of SQL, like SQL Server, PostgreSQL, Oracle, MySQL and SQLite. Each of these differs slightly, but the syntax remains largely similar and you need not worry about which flavour of SQL you’re learning.

To learn a language, one first learns the words before combining them into sentences and then paragraphs. The same can be said for SQL.

To learn the very basic concepts (the words or sentences of SQL), I used Datacamp (Introduction to SQL) and Dataquest (SQL Fundamentals). (I will go through the pros and cons of Datacamp and Dataquest later.) In general, these sites go through the essential SQL skills with illuminating exercises and examples. Some concepts covered are:

  • SELECT and WHERE for filtering and selection
  • COUNT, SUM, MAX, GROUP BY, HAVING for aggregating data
  • DISTINCT, COUNT DISTINCT for producing useful distinct lists and distinct aggregates
  • OUTER (e.g. LEFT) and INNER JOIN when/where to use them
  • Strings and time conversions
  • UNION and UNION ALL.

(You might not know these know, but that’s perfectly fine! This is just a list of things that you can expect to learn.)

However, being able to complete these exercises did not sufficiently prepare me as an analyst. I was able to understand words and sentences, but I was nowhere near writing a full paragraph. In particular, some salient intermediate and advanced concepts like sub-queries and window functions are either absent or not extensively covered, though they had been tested in several technical interviews and is essential for my current role as an analyst. These skills include

  • Handling NULL with COALESCE
  • Sub-queries and their impact on the query’s efficiency
  • Temporary tables
  • Self joins
  • Window functions like PARTITION, LEAD, LAG
  • User defined functions
  • Use of indexes in querying to make operations faster.

To learn these skills, I mainly focused on using SQLZoo.net, which is free and provide very challenging exercises for each concept. My favourite feature of SQLZoo is the fact that it has exercises that test the different concepts in one integrated question. For instance, one is provided with the following entity-relationship diagram and is asked to create complex queries based on it.

Example of a Entity Relationship Diagram. Photo by Ottomachin under the Creative Commons License

This is close to what we encounter at work as an analyst — we use different techniques that we’ve learnt to extract information from the same database. The following is the entity-relationship diagram of the SQLZoo question ‘Help Desk’. Given this, you’re asked to show the manager and number of calls received for each hour of the day on 2017–08–12. (Try it yourself here!)

Other resources that I used include Zachary Thomas’ SQL Questions and Leetcode.

2. Data Manipulation with R and Python

To start learning about the programming and the tools needed for data science, one cannot run away from R and/or Python. They are very popular programming languages which are used for data manipulation, visualization and wrangling. The question or R vs Python is an age-old question that deserves another post on its own. My take?

It doesn’t matter whether you pick R or python— once you master one, you can easily pick up the other.

My journey with coding in python and R started with the code-along-with-me sites like CodeAcademy, Datacamp, Dataquest, SoloLearn and Udemy. These sites provide you with the self-paced classes organized by languages or packages. Each breaks concepts down into digestible parts, and gives the user with starter code to fill in the blanks. These sites typically walk you through a simple demonstration, and you will get a chance to practice the concept immediately afterwards by exercises. Some offer project-based exercises afterwards.

Today, I will focus on two of my favourites, Datacamp and Dataquest.

Please keep in mind — down below you’ll find an affiliate link to the courses. That doesn’t mean anything to you, as the price is identical, but I’ll get a small commission if you decide to make a purchase.

DataCamp

DataCamp offers video lectures taught by professionals in the field and fill-in-the-blank exercises. The video lectures are mostly succinct and efficient.

Image by author

One part I love about DataCamp is the up-to-date courses that are organized into career paths in SQL, R and python. This takes away the pain of planning your curriculum — now you only need to follow your path of interest. Some of the paths include:

  • Data Science in Python/R
  • Data Analyst in Python/R/SQL
  • Statistician in R
  • Machine Learning Scientist in Python/R
  • Python/R programmer

Personally, I started my R education with Data Science in R, which provided a rather detailed introduction to the tidyverse in R, which is a collection of incredibly useful data packages to organize, manipulate and visualize data, which most notably includes ggplot2 (for data visualization), dplyr (for data manipulation) and stringr (for string manipulation).

My favorite packages in R. Image by author.

However, I do have my complaint about DataCamp — that is the poor retention of information after completing DataCamp. With the fill-in-the-blank format, it is easy to guess what is needed in the blank without really understanding the concept. When I was a student on the platform, I tried completing as many courses as I could in the shortest possible time. I skimmed through the code and filled in the blanks without understanding the bigger picture. If I could restart my learning on DataCamp again, I would take my time in digesting and understanding the code better as a whole, not just the parts that I was asked to fill.

DataQuest

Image by author

Dataquest is very similar to DataCamp. It focuses on using code-along exercises to illuminate programming concepts. Like Datacamp, it offers a wide variety of courses in R, Python and SQL, though it is somewhat less extensive than those in DataCamp. For instance, However, unlike Datacamp, Dataquest does not offer video lectures.

Some of the tracks offered by Dataquest includes:

  • Data Analyst in R/Python
  • Data Science in Python
  • Data Engineering

DataQuest’s content is generally more difficult than those in DataCamp. There were also fewer ‘fill-in-the-blank’ format exercises. Though it took longer, my knowledge retention on DataQuest was better.

Another great feature about the DataQuest is the monthly call with a mentor who will review your resume and provide technical guidance. While I did not personally get in touch with a mentor, I would have in hindsight, since it would definitely have helped me progress much faster.

3. Data Visualization

Data visualization is the key to present the insights you drew from your data. After learning the technical skills of creating charts using python and R, I learnt the principles of data visualization from a book, Storytelling with Data by Cole Knaflic.

Sending a message with numbers. Photo by Alexander Sinn on Unsplash

This book is platform-agnotistic. In other words, it does not focus on any particular software but teaches the general principles of data visualization with enlightening examples. Some of the key pointers you can expect to learn from this book are:

  • Understand the context
  • Choose an effective visual
  • Eliminate clutter
  • Draw attention where you want it
  • Think like a designer
  • Tell a story

I thought I knew data visualization, until I read this book.

After digesting the book, I was able to create a (somewhat) visually pleasing chart that address the police brutality against blacks. One of the main learning points from the book applied here was to draw attention where you want it. This was done by highlighting the African American line with a bright yellow — reminiscent of the BLM color — while ensuring that the rest of the chart remained in the background with duller shades like white and grey.

Data visualization techniques applied to a chart that highlights police brutality. Image by author.

Next Steps

In this post, I covered the steps I’ve taken in learning programming from scratch. With these courses, you now have the necessary skills to manipulate data! However, there is still a pretty long way to go. In the next posts, I will cover

If you have any questions, feel free to connect with me on LinkedIn. All the best, and good luck!

Other Readings

If you enjoyed this blog post, feel free to read my other articles on Machine Learning:

Translation

References

[1] Dhar, V. (2013). “Data science and prediction”. Communications of the ACM. 56 (12): 64–73. doi:10.1145/2500499. S2CID 6107147. Archived from the original on 9 November 2014. Retrieved 2 September 2015.

--

--