Rethinking how we teach the tidyverse

An attempt at an "unbiased" perspective from a tidyverse fanboy

Photo by Kelly Sikkema on Unsplash

I recently participated in a relatively popular Stack Overflow "contest" (what would "popular" even mean on Stack Overflow??), where the prompt was to write a more "elegant" dplyr or tidyverse solution to the solution presented.

The problem statement was to perform two regressions: 1) dep ~ cov_a + cont_a + cont_b and 2) dep ~ cov_b + cont_a + cont_b.

This was the original posted code:

map(.x = names(df)[grepl("cov_", names(df))],
    ~ df %>%
     nest() %>%
     mutate(res = map(data, function(y) tidy(lm(dep ~ cont_a + cont_b + !!sym(.x), data = y)))) %>%
     unnest(res))

and this was the sample dataset provided:

set.seed(123)
df <- data.frame(cov_a = rbinom(100, 1, prob = 0.5),
                 cov_b = rbinom(100, 1, prob = 0.5),
                 cont_a  = runif(100),
                 cont_b = runif(100),
                 dep = runif(100))

Pause here if you want and try coming up with a solution for fun!

If you’re not familiar with the tidyverse at all, I suggest you skip ahead to the next section, where I provide a more general overview of what it is. But bear with me, because I want to use this example to illustrate a fundamental issue I noticed with the tidyverse.

I posted an initial solution. Then someone posted 3 solutions. Then 4 more unique tidyverse solutions were suggested. That’s a total of 9 different ways to solve the problem of generating two regression models in a functionalized way (including the original posted solution).
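For readers who paused above, here is a minimal sketch of one possible answer. It is not necessarily any of the nine solutions from the thread. It assumes the purrr, dplyr, and broom packages are installed, and the names covs and results are chosen here purely for illustration.

```r
library(purrr)
library(dplyr)
library(broom)

# Same sample data as in the question
set.seed(123)
df <- data.frame(cov_a = rbinom(100, 1, prob = 0.5),
                 cov_b = rbinom(100, 1, prob = 0.5),
                 cont_a = runif(100),
                 cont_b = runif(100),
                 dep = runif(100))

# Columns whose names start with "cov_" (illustrative helper name)
covs <- grep("^cov_", names(df), value = TRUE)

# Fit one model per covariate and stack the tidy coefficient tables
results <- covs %>%
  set_names() %>%                       # name each element after itself
  map(function(cov) {
    # Builds dep ~ <cov> + cont_a + cont_b
    f <- reformulate(c(cov, "cont_a", "cont_b"), response = "dep")
    tidy(lm(f, data = df))
  }) %>%
  bind_rows(.id = "covariate")          # record which covariate each row came from

results
```

Building the formula with reformulate() avoids the !!sym() unquoting in the original post, at the cost of stepping outside "pure" tidyverse idiom. That is one of several trade-offs a solver could reasonably make.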

What does this mean? Are there 8 solutions too many? Is it a good sign that there are so many ways to solve something? Let’s dive into this.

What is the tidyverse?

One way to view the tidyverse is as the brainchild of Hadley Wickham. The tidyverse is a constantly-evolving set of R packages (such as dplyr, purrr, etc.) that aim to provide a uniform interface that allows for a "tidy" workflow. This notion of "tidyness" builds upon the tidy data principles, to which I’ve included a link. In brief, tidy data has three characteristics:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table
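As a small illustration of the first two characteristics, the sketch below reshapes a made-up table so that each variable gets its own column and each observation its own row. The data frame and its column names are hypothetical, and the example assumes the tidyr package is installed.

```r
library(tidyr)

# Hypothetical data: the variable "exam" is hidden in the column names,
# so each row holds two observations
wide <- data.frame(student = c("ana", "ben"),
                   exam_1 = c(90, 72),
                   exam_2 = c(85, 80))

# Tidy version: one row per student-exam observation,
# one column per variable (student, exam, score)
tidy_scores <- pivot_longer(wide,
                            cols = -student,
                            names_to = "exam",
                            values_to = "score")

tidy_scores
```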

The aspiration behind a "tidy" workflow is more difficult to define. Hadley has currently conceptualized it into four main principles:

  1. To reuse existing data structures
  2. To compose functions with %>%, a.k.a. the "pipe"
  3. To use functional programming
  4. To make code more human-readable

To read more about his thoughts on each principle, check out his manifesto.
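To make the principles a little more concrete, here is a rough sketch that touches all four at once: it starts from an ordinary data frame (the built-in mtcars dataset), chains steps with the pipe, and uses purrr's map functions rather than a loop. The choice of model (mpg against wt, split by cyl) is only an illustration, and the example assumes purrr is installed.

```r
library(purrr)

# One slope per cylinder group, read top to bottom
slopes <- mtcars %>%
  split(.$cyl) %>%                        # list of data frames, one per cyl value
  map(~ lm(mpg ~ wt, data = .x)) %>%      # fit a model to each piece
  map_dbl(~ coef(.x)[["wt"]])             # pull out the wt coefficient

slopes
```

Whether this reads more naturally than a for loop over unique(mtcars$cyl) is exactly the kind of judgment call the rest of this article is about.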

So, is there a debate?

Yes, there are dissidents of the tidyverse out there. I’m personally not one of them, but I think I understand where they are coming from. One popular dissident is Norm Matloff, who wrote a Martin Luther-esque 95 theses on why promoting the tidyverse is regressive. The biggest reason he cites is teachability – that the tidyverse makes R harder to learn and that it stunts basic R knowledge.

I don’t think I fully agree with that.

Matloff cites two primary reasons for the difficulty of teaching the tidyverse.

  1. It is cognitively overloaded
  2. It masks traditional programming skills and is counter-intuitive (I have since gotten the opportunity to speak with Norm, and he’s clarified this second point to mean that the tidyverse philosophy places "an emphasis on functional programming")

Here’s my take on his first point.

Take the Stack Overflow example I mentioned earlier. There were nine different tidyverse-based ways to do something. Because the tidyverse is composed of a series of packages, solving a data wrangling problem is more akin to building a lego house than fixing a clogged sink. In the latter, there’s probably a standard best way to do it, whereas in the former, there is something deeper – there’s a sense of creativity, of art. You’re not just building a house, you’re trying to make it as stable as possible, as pretty as possible. Maybe you want to build it as quickly as possible. There are a lot of different goals you can achieve, and the tidyverse attempts to give users those building blocks.

But I understand, it’s hard to just build a lego house if you’re given an overwhelming number and variety of pieces – you need instructions. Unfortunately, because the tidyverse is constantly evolving, it’s difficult to write the instruction manual. For instance, the *_at() and *_if() functions were superseded by across(). Instead of gather() or the reshape package, we’re now meant to use the pivot_*() functions. It can be hard to keep up. Your code can quickly become outdated in a matter of months – it’ll still work, but it’ll be outmoded.
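To show what "still works, but outmoded" looks like in practice, here is a small sketch of the first change mentioned above, written both ways. The scores data frame is made up for illustration, and the example assumes dplyr 1.0.0 or later is installed.

```r
library(dplyr)

# Hypothetical data: one id column and two numeric columns
scores <- data.frame(id = c("a", "b"),
                     x = c(1.234, 5.678),
                     y = c(9.876, 5.432))

# Older style: a scoped verb, now superseded
old_style <- scores %>%
  mutate_if(is.numeric, round, digits = 1)

# Current style: across() inside the plain verb
new_style <- scores %>%
  mutate(across(where(is.numeric), ~ round(.x, digits = 1)))

# Both approaches should round the same columns to the same values
all.equal(old_style, new_style)
```

A tutorial written around the first form is not wrong, but a learner who later reads current documentation will mostly meet the second.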

Here’s my take on his second point.

Anecdotally, almost all of my friends who studied computer science in undergrad, learning languages like C or Java, complain about R. It definitely is a weird programming language. But, R still maintains a lot of the same logic and syntax as any other programming language.

However, in the tidyverse we have things like %>%. To the uninitiated, that’s a WTF moment.

All of a sudden, things like:

c(1,2,3) %>% mean()

work, even though it’s quite unlike any other type of "traditional" programming. But, Matloff’s point was that this would be difficult for complete coding beginners to grasp. (EDIT: Norm Matloff has clarified his viewpoint on pipes in that "they’re fine (though not beneficial) for functions of one variable, very confusing if there is more than one" and emphasized that functional programming is difficult to grasp (which I agree with), so for beginners to R, it’s not reasonable to make this the "lynchpin of their learning", which I understand).

There’s a body of evidence surrounding something called ambiguity tolerance and language learning. Basically, a healthy dose of ambiguity tolerance is associated with increased competency in learning a second language. I think we should place a little more trust in a student’s tolerance for ambiguity and trust that a complete novice will not be absolutely bewildered by tidyverse syntax. Of course, I’m not saying that students should not learn basic R syntax at all. I think the $ operator, indexing a vector, writing for loops, etc. are very useful skills, and I wholeheartedly support that as a core part of a student’s R education. All I’m saying is that we should tolerate what the tidyverse brings to the table, even when it seems unfamiliar.
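For anyone meeting the pipe for the first time, the sketch below spells out what it does in the one-argument case from earlier, and in the multi-argument cases that Matloff flags as confusing. It assumes the magrittr package is installed (loading dplyr or purrr also makes %>% available), and the variable names are only for illustration.

```r
library(magrittr)

x <- c(1, 2, 3)

# One argument: the left-hand side becomes the first argument
piped  <- x %>% mean()   # same as mean(x)
nested <- mean(x)

# More than one argument: the left-hand side still goes first
y <- c(1, 2, NA)
piped_na <- y %>% mean(na.rm = TRUE)   # same as mean(y, na.rm = TRUE)

# When the value belongs somewhere else, a dot marks the spot
swapped <- "a" %>% gsub("b", ., "abc")  # same as gsub("b", "a", "abc")

piped      # 2
piped_na   # 1.5
swapped    # "aac"
```

The last line is probably the best case for Matloff's concern: the reader has to know both the argument order of gsub() and the placeholder rule to see what is being replaced with what.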

So how do we teach it?

I think what the Stack Overflow problem made me realize was that there’s no one-size-fits-all approach to coding. What does human-readable mean? Do we care about speed? Should we use only the most frequently used functions? These, however, are maybe questions to be asked later in the hierarchy of learning R. Remember that often a beginner’s base need is to get from input to output. I think a supportive way to teach the tidyverse would be to:

  1. Set a standard curriculum that teaches fundamental core R concepts and builds tidyverse vocabulary on top of that to manipulate data (after all, R is a statistical programming language). For instance, using this online textbook.
  2. Emphasize that some of the functions may be outdated, but that students need a base vocabulary to begin with in order to start doing things
  3. Emphasize that there is rarely one perfect solution. Using the tidyverse is like speaking a language: it not only has a function, it is also a way of communicating. Students should be OK with communicating in a rudimentary way at first, then building up their skill to express code more beautifully and fluently

EDIT: Ultimately it’s up to the individual student/learner to decide their educational journey. I have taken a public stance in this Medium article, and I’ve had wonderfully illuminating conversations with Norm Matloff and others about varying viewpoints. I encourage students to mold their learning approach in a way that makes sense to them (whether through the tidyverse or not!)

