Teaching data science is broken

Ben Stenhaug
Towards Data Science
7 min read · Aug 23, 2017

The way we teach data science is broken. Actually, even that statement strikes me as generous. In most cases, I’m not sure we even try.

I don’t mean anything too grandiose by the term data science. In this context, I use ‘data science’ to mean the ability to take a large amount of data (more than Excel could open comfortably), create graphs, generate summaries, and draw some sort of meaningful conclusion about what the data has to say.

It seems to me that people fall into roughly two cases when it comes to learning data science. One, people who should know data science and don’t. Two, people who want to learn data science and haven’t a clue how.

Case 1: People who should know data science

I should know data science. I majored in statistics at Wisconsin. But I left barely knowing R despite taking many applied courses. For the most part, my applied courses involved interpreting other people’s output from statistical programs. The professor would show us a screenshot, circle a number, and we’d have to tell him that was the percent of variance explained or something like that.

Sometimes we’d use R, but we’d use code someone else wrote and at most change a few numbers or variable names. Or we’d watch the professor type into R over the course of a lecture. But no one ever sat me down and explained what a vector is, how to write a loop, or how to use lapply to replace a loop.
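For readers who have not met these ideas, here is a minimal sketch in R of the three things named above: a vector, a loop, and lapply standing in for the loop. The values and variable names are made up purely for illustration.

```r
# A vector: an ordered collection of values of the same type
x <- c(2, 4, 6, 8)

# A loop that squares each element, filling a pre-allocated result vector
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# The same computation with lapply, which applies a function to each
# element and returns a list; unlist() flattens it back into a vector
squares_list <- lapply(x, function(v) v^2)
squares_apply <- unlist(squares_list)

# Both approaches give the same result
identical(squares_loop, squares_apply)
```

Because arithmetic in R is vectorized, `x^2` alone would also work here; the loop and lapply versions are shown only because they are the concepts the paragraph mentions.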

It’s tempting to conclude that I got an education in statistics but didn’t learn a particular piece of software, so now all I have to do is learn R on my own and I’ll be good. But I don’t think it’s that simple. I implemented an algorithm the other day. To take the task from start to finish on my own, I had to know how to read the data into R, format it in a useful way, summarize it into a table, implement the algorithm, and use the documentation to know exactly what output I was getting.

Figuring out how to do that deep and complex task requires making so many connections that, by comparison, it’s hard to describe interpreting a screenshot of someone else’s output as learning at all.

There’s a missed opportunity here when we teach statistics without teaching data science in a meaningful way. If we teach both subjects concurrently, all of the statistical concepts and data science tools will connect in a way that will allow students to seriously learn. They can achieve a nuanced understanding that will translate to the ultimate goal of independent work. And as a bonus, the ability to use data science to solve statistical problems is what is valuable in the job market.

I sometimes make the claim that students should learn R (or Python or whatever language you want) in 6th grade before they start to dive into statistics. I’ve mostly gotten crazy looks for this suggestion, but I really believe it. Then their statistics education could be based in R. Teaching the mean? Show students how to simulate a vector of numbers. Ask them how they’d summarize those numbers. Gently coach them to the intuition of the mean. Or maybe just show them the mean(x) over and over again for different x’s. Can they figure out the puzzle? Why would we want to take the mean? Can we simulate a bunch of different x vectors and see how the mean changes?
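As a rough sketch of what such a lesson might look like in R, the snippet below simulates a vector, computes its mean by hand and with mean(x), and then simulates many vectors to see how the mean varies. The seed, the sample size of 20, the normal distribution, and its parameters are all arbitrary assumptions chosen for illustration.

```r
# Fix the random seed so the simulation is reproducible (arbitrary choice)
set.seed(42)

# Simulate a vector of 20 numbers from a normal distribution centered at 10
x <- rnorm(20, mean = 10, sd = 2)

# One way students might summarize the numbers: add them up and divide
# by how many there are
manual_mean <- sum(x) / length(x)

# The built-in function does the same thing
builtin_mean <- mean(x)
all.equal(manual_mean, builtin_mean)

# Simulate 1000 different x vectors and record the mean of each one
sample_means <- replicate(1000, mean(rnorm(20, mean = 10, sd = 2)))

# The means cluster around 10 but are not all identical
summary(sample_means)
```

Plotting `sample_means` with `hist()` would be a natural next step, since it lets students see the spread of the means rather than read it from a summary table.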

Imagine what a deep understanding of both statistics and data science a student would have with this education.

Case 2: People who want to learn data science

I’m teaching an intro to Stata course and so I’ve been getting emails from people who want to learn data science but don’t know how. For example, maybe they majored in biology, are working as a researcher at a hospital right now, and want to apply to computational biology graduate programs in the future.

The first obstacle is to figure out what to learn. Should they learn R or Python or Stata or SPSS? It isn’t clear. Different people will have different answers. For someone who is busy and not positive they need to learn data science, their journey might end here.

Even if someone knows exactly what they want to learn, there aren’t great resources. Some people will suggest a book or two and that’s not a terrible idea, but inevitably, the typical student will reach some small gap in knowledge that they just can’t get past. With no one to ask for help, they’ll stumble onto Stack Overflow, which will make it look impossibly complicated, and they’ll spin their wheels for a while and grow frustrated. At some point, that little voice in their head will say “you’re not cut out for this…” and then they’ll tell themselves they don’t really need to learn data science because if they know biology, they can always work with other folks who know data science. That’s sad because it doesn’t have to be like that.

There isn’t great advice either. A lot of people will suggest that students learn by attempting a project. That feels kind of empty to me. If you don’t know anything, starting a project is an impossibly daunting task. I do understand where this advice stems from — educators know learning requires a student to be active in the process, and a project is the best way to get active.

The problem is that learning really comes down to the student needing to be both active and sufficiently guided. Most solutions err in one of these directions. Tell a novice to start a project and it isn’t guided enough. Take a university class where you watch someone type code and it isn’t active enough. Finding the perfect balance is incredibly difficult. I can’t imagine there are many places in the world that manage to strike that balance. I was lucky enough to take the Data Challenge Lab at Stanford last quarter, which is iterating quarter by quarter to get better and better. Bill Behrman leads the charge and Hadley Wickham helps teach it remotely. Courses like this are incredible learning environments, but unfortunately they are few and far between.

I now see this problem of graduate students struggling to cobble together their own data science education everywhere at Stanford. I’m not sure what to do about it. As a start, I wrote a rough one-pager regarding the problem of data science education for graduate students last quarter. I’ll share it below.

I also think that students just need to get started. Most resources go too deep, too fast, which makes it hard to get started. YouTube can be a powerful tool, but I think most R videos go too quickly for the typical novice.

As a response to that, I’m starting a side project where I make very basic R/RStudio/Tidyverse videos that give the basic tools to do data science in R. I’ll post what I make at www.teachingr.com.

One-pager on data science in graduate school

For graduate students in many areas, data science skills are critical to producing high-quality research effectively.

Unfortunately, most students do not have a formal way to gain these skills. Many struggle inefficiently throughout graduate school without ever truly becoming skilled data scientists, others become proficient through sheer will, and presumably others are unsuccessful altogether because of their lack of data science skills.

This is a known problem and there are many efforts to fix it. However, to my eye, most of those efforts are ineffective for a variety of reasons:

· Lecture-based instruction. Students need to do in order to learn

· Don’t provide resources that students can use on their own

· Provide code chunks and don’t force students to type… students don’t understand individual pieces of code

· Don’t teach actual programming logic

· Focus on too many languages

· Make data science and getting started seem harder than it is, and for most students, inaccessible

· Don’t have clear goals that are valuable for everyone

· Courses lack structure and are too brief to be effective

· The instructors aren’t adequately prepared and don’t use high quality pedagogy. Instead they ramble about what comes to mind

· Curse of knowledge: Instructors don’t have empathy and understanding for what it’s like to not know

· Instructors are not prescriptive enough about how to do things

· Instructors are too responsive to attendees — go on a 30-minute tangent about one person’s question

But it’s not that hard! I suspect that a student can make huge progress in 2 to 3 weeks in a well-designed environment where the student is both active and sufficiently guided. Here are a few principles:

· Sell students on the importance of these skills and on engaging deeply. Promise actual results.

· Choose one language (I suggest R) and focus solely on that

· Start with coding logic to get a solid base

· Provide a go-to resource like an introductory book

· Work in teams on activity-based learning

· Ask the students to read/engage with new material outside of class

· Give students high-quality exercises that scaffold from very easy to expert

I suspect, in the Graduate School of Education at least, that the productivity of graduate students is highly correlated with baseline data science skills. I suspect that giving graduate students and postdocs high-quality training in baseline data science could increase research productivity for individual students by up to 2x. This is particularly true in places like the Graduate School of Education, where data science skills are scarcer. Demonstrating this correlation between data science skills and productivity would be interesting and may motivate making data science education early in graduate training a higher priority.
