DATA SCIENCE EDUCATION

The Best Way to Start Learning Data Science is to Understand its Context

Learn the context first, then delve into the content

Victor Nazlukhanyan
Towards Data Science
7 min read · Aug 9, 2020


Photo by Dhaval Parmar

Table of Contents

  1. The Importance of Context Knowledge
  2. (Optional) Research Supporting Context-Based Learning
  3. The Context of Data Science
  4. Understand the Concept, not the Calculation
  5. The Context of the Sub-Disciplines
  6. Next Steps

The Importance of Context Knowledge

I made the decision to orient my career path towards data science during my senior year of university. It only took one or two research-binges before I realized the vast depth of the field in front of me. I knew eventually I’d have to understand things like the architecture of a convolutional neural network, the process of numericalization for NLP, or the underpinnings of principal component analysis. However, rather than jumping into the minutiae of these concepts in a void, I’ve always needed to develop a rock-solid contextual foundation of knowledge first. I’ll call this approach context-based learning.

What is context-based learning?

I will loosely define context-based learning as learning a concept by first focusing on its contextual elements. In other words, it means understanding the big picture before delving into the deep theory. It’s important to emphasize “first” in that definition, as learning the context is analogous to building the chassis of a vehicle. Although the chassis is an essential element, it is not a car, and is non-functional on its own. Rather, it is the frame on which the car is built. In the same way, a contextual framework is the frame on which technical content is laid.

(Optional) Research Supporting Context-Based Learning

This style of learning leverages a fact well supported by research in the psychology of learning: humans retain knowledge most effectively by associating it with something they already have a firm grasp of, rather than by memorizing new concepts in a void. In short, we learn by association.

The late educational psychology professor Dr. Barak Rosenshine at the University of Illinois emphasized the importance of these contextual frameworks in education in Principles of Instruction:

“When one’s knowledge on a particular topic is large and well-connected, it is easier to learn new information and prior knowledge is more readily available for use.”

The amount of background knowledge you have is also correlated with how well you comprehend new material. Therefore, to learn most efficiently, one should develop a strong foundation of background knowledge before delving into the details.

The Context of Data Science

So what is the background knowledge, or context, of data science? Well, I always begin context-based learning by asking a lot of questions. Specifically, I try to ask broad, conceptual questions, as opposed to detail-oriented ones.

The following is a handful of questions I asked myself at the beginning of my data science journey, along with the answers I came up with. I want to emphasize that my answers filled the gaps in my own contextual knowledge at the time. In the same way, you should answer these and other questions in a manner that relates directly to your educational and personal background.

How does data science fit into my understanding of other fields?

Data science is an interdisciplinary field that leverages math, programming, business, and domain knowledge to tackle difficult data problems. The overlap between data science and my major (cognitive science with machine learning & neural computation) rests on math (which is necessary for machine learning), programming (which provides computational functionality for the field as a whole), as well as data analysis techniques, such as those used in computational neuroscience. The “science” in data science comes from its use of scientific methodology, such as hypothesis testing and the assessment of statistical significance.

What are the most important elements of data science, and how do they relate to one another?

All data scientists go through a process known as the “data science pipeline”, essentially a step-by-step, end-to-end outline of a data scientist’s workflow. Acronyms like OSEMN (Obtain, Scrub, Explore, Model, iNterpret) make the basic pipeline easy to remember, but pipelines vary in their details. The basic structure is as follows:

  • Data Collection
  • Data Cleaning
  • Exploratory Data Analysis
  • Model Building
  • Visualization / Model Deployment
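To make the stages above a little more concrete, here is a minimal sketch of a toy pipeline in plain Python. Everything in it is an assumption made for illustration: the function names (collect, clean, explore, build_model), the tiny made-up dataset, and the choice of a one-variable least-squares line as the "model". Real pipelines use dedicated libraries and far messier data.

```python
from statistics import mean

def collect():
    """Data collection: hypothetical raw records, as they might arrive from a CSV."""
    return [
        {"hours_studied": "1", "score": "52"},
        {"hours_studied": "2", "score": "61"},
        {"hours_studied": "", "score": "70"},   # a row with a missing value
        {"hours_studied": "4", "score": "79"},
        {"hours_studied": "5", "score": "88"},
    ]

def clean(rows):
    """Data cleaning: drop incomplete rows and convert strings to numbers."""
    return [
        (float(r["hours_studied"]), float(r["score"]))
        for r in rows
        if r["hours_studied"] and r["score"]
    ]

def explore(data):
    """Exploratory data analysis: a few summary statistics."""
    xs, ys = zip(*data)
    return {"n": len(data), "mean_hours": mean(xs), "mean_score": mean(ys)}

def build_model(data):
    """Model building: fit score = slope * hours + intercept by least squares."""
    xs, ys = zip(*data)
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in data)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return lambda hours: slope * hours + intercept

rows = collect()
data = clean(rows)
summary = explore(data)
model = build_model(data)

# "Deployment" here is just calling the fitted model on a new input.
print(summary)
print(model(3))
```

The point is not the code itself but the shape: each stage takes the previous stage's output as its input, which is why the steps are described as a pipeline.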

What is machine learning? And why is machine learning so tied to data science specifically?

Machine learning (ML) is a field that studies algorithms whose behavior is not fully spelled out in advance by the programmer, unlike traditional “closed” algorithms. Instead, ML algorithms “learn” their behavior from data. This reliance on data is what makes ML so integral to data science. ML falls into the “model building” and “model deployment” stages of the data science pipeline.
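A small, hedged illustration of the difference: in the sketch below, one function uses a threshold the programmer typed in, while the other picks its threshold from labeled examples. The function names and the temperature data are hypothetical, and choosing the threshold with the fewest misclassifications is only one very simple way to "learn" a rule.

```python
def is_hot_fixed(temp_c):
    """A 'closed' rule: the threshold is chosen by the programmer."""
    return temp_c >= 30

def learn_threshold(examples):
    """Pick the candidate threshold that misclassifies the fewest examples."""
    candidates = sorted(t for t, _ in examples)

    def errors(threshold):
        return sum((t >= threshold) != label for t, label in examples)

    return min(candidates, key=errors)

# Hypothetical labeled data: (temperature in Celsius, was it labeled "hot"?)
examples = [(18, False), (22, False), (24, False),
            (26, True), (29, True), (33, True)]

threshold = learn_threshold(examples)

def is_hot_learned(temp_c):
    """A 'learned' rule: the threshold came from the data, not the programmer."""
    return temp_c >= threshold

print(threshold)
print(is_hot_fixed(27), is_hot_learned(27))
```

Change the examples and the learned rule changes with them, while the fixed rule stays the same until someone edits the code.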

What are the sub-disciplines of data science?

There are many fields that contribute to data science, but the most fundamental are computer science, statistics, machine learning, and linear algebra. Although business and domain knowledge are also critical, the academic core of data science rests on these four. Furthermore, the sub-disciplines often have prerequisites of their own: calculus, for example, is necessary to understand how machine learning algorithms work.

Understand the Concept, not the Calculation

One important dichotomy I discovered early on during my undergrad math studies was the distinction between calculations and conceptual understanding. For example, in the case of statistics, memorizing how to calculate this

[Image: formula for the chi-square test statistic]

is far less important than understanding the use case of a chi-square test statistic in testing for an association between categorical variables. Or, for calculus, understanding that this

[Image: definite integral of a quadratic function]

describes an area underneath a quadratic curve is far more important than memorizing fancy methods to solve it by hand. (*ahem*)
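As a rough sketch of this idea, both calculations can be handed to a few lines of code, which leaves the conceptual question (what do these numbers mean?) as the part worth studying. The function names, the 2x2 table of counts, and the choice of x² on the interval from 0 to 3 are all assumptions made for illustration; they are not the formulas from the original images.

```python
def chi_square_statistic(observed):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count we would expect in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

def area_under(f, a, b, steps=100_000):
    """Approximate the area under f between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width

# Hypothetical counts for two categorical variables (rows x columns).
table = [[30, 10],
         [20, 40]]
chi2 = chi_square_statistic(table)

# Area under the quadratic curve y = x^2 between x = 0 and x = 3.
area = area_under(lambda x: x ** 2, 0, 3)

print(chi2, area)
```

A large chi-square value suggests the two variables are not independent, and the second number is simply the size of the region under the curve. Knowing that much is enough to use either tool sensibly.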

I actually find building programs to be an incredibly accurate analogy for this. When learning to program, it becomes clear early on that trying to learn every implementation of every function is impossible. A much more efficient strategy is to understand the inputs and outputs so that you can piece together snippets of code to make things work.

Image by the author

Even when you don’t turn to Google or Stack Overflow, courses like fastai abstract the vast majority of the implementation away so that you can build an end-to-end framework of understanding first (in fastai’s case, build an end-to-end model), and only afterwards go back to understand the fundamental details that underlie the abstractions.

In this way, learning the concepts as opposed to the calculations is an application of context-based learning, as the contextual framework is built up so that when you do need to learn the calculations, they are compartmentalized properly.

The Context of the Sub-Disciplines

Following the context-based learning approach, once we have figured out the sub-disciplines of data science, we should dig into their context to understand how they fit in with the overall scope of the field.

Computer Science

Why are all data science projects so coding-heavy?

Modern statistics dates back to the 19th century, yet the application of statistics was confined to small samples as there was no efficient means of organizing large amounts of data and calculating parameters. The computer was that means.

Furthermore, the advent of GPU parallel processing enabled machine learning models to train hundreds of times faster. In essence, incredibly powerful tools for statistics became accessible via the computer, thus the heavy emphasis on coding.

FURTHER Qs: What programming languages are the most important for data science? How much programming do I need for data science?

Statistics

Why is statistics important for data science?

Given that much of data science is essentially computational statistics, statistics lays the groundwork and provides the toolset for rigorous mathematical analysis of data.

FURTHER Qs: Just what the hell is all this talk about Bayes? What specific statistics libraries do data scientists use?

Linear algebra

What is linear algebra and how does it relate to data science?

Linear algebra is, at its simplest, the study of linear equations. Multiple linear equations stacked together can be expressed as a matrix. Matrices, collections of numbers in rows and columns, have essentially the same shape as tabular data (data in a table). Likewise, image data is nothing but a grid of pixel values (i.e. a list of lists of numbers, or of RGB tuples for a color image). This is why a good understanding of linear algebra provides an understanding of the structure of data itself.
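A minimal sketch of that idea, using plain nested lists rather than a numerical library: the same rows-and-columns structure holds a system of equations, a small table, and a tiny image. The specific equations, the table values, the pixel values, and the helper solve_2x2 are all made up for illustration.

```python
# Two linear equations stacked into a coefficient matrix A and a right-hand side b:
#   2x + 1y = 5
#   1x + 3y = 10
A = [[2.0, 1.0],
     [1.0, 3.0]]
b = [5.0, 10.0]

def solve_2x2(A, b):
    """Solve a 2x2 linear system with Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0:
        raise ValueError("the system has no unique solution")
    x = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    y = (A[0][0] * b[1] - b[0] * A[1][0]) / det
    return x, y

x, y = solve_2x2(A, b)

# Tabular data has the same shape: each row is an observation, each column a feature.
table = [[5.1, 3.5],
         [4.9, 3.0]]

# So does a tiny 2x3 grayscale image: each number is a pixel intensity (0-255).
image = [[0, 128, 255],
         [255, 128, 0]]
shape = (len(image), len(image[0]))

print(x, y, shape)
```

Once data is seen as a matrix, operations from linear algebra (solving, transforming, multiplying) apply to tables and images in the same way they apply to equations.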

FURTHER Qs: What is a tensor? How is linear algebra used in deep learning?

Machine Learning & Calculus

What is the link between calculus and machine learning?

A critical component of calculus is the study of optimization. Since the objective of most machine learning algorithms is to minimize an error function, calculus provides the tools to understand how that minimization occurs.

FURTHER Qs: What is gradient descent? What is back-propagation? Why is calculus involved in it?
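As a first taste of the gradient descent question, here is a minimal sketch under simple assumptions: a one-parameter model y = w * x, a made-up dataset generated from y = 2x, mean squared error as the error function, and a hand-picked learning rate. The names error, gradient, and learning_rate are illustrative, not taken from any particular library.

```python
# Hypothetical data generated from y = 2x, so the best weight is 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def error(w):
    """Mean squared error of the model y = w * x on the data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def gradient(w):
    """Derivative of error(w) with respect to w; this is where calculus comes in."""
    return sum(2 * x * (w * x - y) for x, y in data) / len(data)

w = 0.0               # start from an arbitrary guess
learning_rate = 0.05  # step size, chosen by hand for this toy problem

for _ in range(100):
    # Step "downhill": move w against the slope of the error function.
    w -= learning_rate * gradient(w)

print(w, error(w))
```

The derivative tells the algorithm which direction reduces the error, and repeating small steps in that direction moves w toward the value that minimizes it. Back-propagation applies the same idea to models with many parameters.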

Next Steps

Ask yourself conceptual questions. Lots of conceptual questions. These questions will vary from person to person, as their aim should be to fill the gaps in your knowledge of how data science fits into what you already understand.

Get creative. A colleague of mine mentioned that visualization maps really helped her understand the context of AI, machine learning, and deep learning and how they all fit together. Similarly, use maps and flowcharts to understand any topics in data science you’re currently struggling to piece together.

Image by the author

After you’re armed with a strong contextual understanding of data science, go ahead and dig deep into the nuances of various supervised algorithms, the best practices for data preprocessing, or the creation of beautiful dashboard visualizations with Tableau.

Just try to make sure every new concept is put into context along the way.


Machine Learning Engineer. MS Data Science University of San Francisco. BS Cognitive Science w/ Machine Learning & Neural Computation UCSD.