
Tabyl – a frequency table for the modern R user

Out with the old, in with the new!

Image created using Canva Image Generator

Anyone who has worked with categorical data has eventually come across the need to calculate the absolute number and proportion of a certain class. This article introduces the tabyl function for creating frequency tables through a series of hands-on examples.

What does tabyl bring to the table (no pun intended :D)?

The tabyl function is a feature of the janitor package in R. It’s a very convenient tool for creating contingency tables, otherwise known as frequency tables or cross-tabulations. Here are some of the benefits of using tabyl:

  1. Easy syntax: tabyl has an easy-to-use syntax. It can take one, two, or three variables, and it automatically returns a data frame that includes counts and proportions.
  2. Flexibility: tabyl can generate one-way (single variable), two-way (two variables), and three-way (three variables) contingency tables. This flexibility makes it suitable for a wide range of applications.
  3. Automatic calculation of proportions: tabyl automatically calculates the proportions (percentages) for one-way contingency tables. For two and three-way tables, the same result can be accomplished in combination with the adorn_percentages function from the same package.
  4. Compatibility with dplyr: The output of tabyl is a data frame, which makes it fully compatible with dplyr functions and the tidyverse ecosystem. This means you can easily pipe (%>%) the output into further data wrangling or visualization functions.
  5. Neat and informative output: tabyl provides neat and informative output, with the name of the tabulated variable shown in the header of the first column, making it easier to interpret the results.

For all these reasons, tabyl is a great choice when you want to create frequency tables in R. It simplifies many steps and integrates well with the tidyverse approach to data analysis.
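
The examples in this post stop at two-way tables, so here is a minimal sketch of the three-way case. The toy data frame below, including its habitat column, is invented for illustration and is not the tidied mushroom data used later. With three variables, tabyl returns a list of two-way tables, one per value of the third variable.

```r
library(janitor)

# Toy data, invented for illustration only (not the mushroom dataset)
toy <- data.frame(
  class   = c("edible", "edible", "poisonous", "poisonous", "edible", "poisonous"),
  odor    = c("none", "almond", "foul", "none", "none", "foul"),
  habitat = c("woods", "woods", "woods", "grasses", "grasses", "grasses")
)

# Three variables: the result is a list of two-way tabyls,
# one list element per value of the third variable (habitat)
three_way <- tabyl(toy, odor, class, habitat)

names(three_way)   # the values of habitat
three_way$woods    # the odor-by-class table for habitat == "woods"
```

Each list element can then be adorned or piped onward like any other two-way tabyl.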

The dataset

Photo by Hans Veth on Unsplash

This post will demonstrate the benefits of the tabyl function from the janitor package using the data on the edibility of different types of mushrooms depending on their odor. Here, I will be using a tidied dataset under the name mushrooms, but you can access the original data on Kaggle. Below is the code used for cleaning the data.

library(tidyverse)
library(janitor)

mushrooms <- read_csv("mushrooms.csv") %>%
  select(class, odor) %>%
  mutate(
    class = case_when(
      class == "p" ~ "poisonous",
      class == "e" ~ "edible"
    ),
    odor = case_when(
      odor == "a" ~ "almond",
      odor == "l" ~ "anise",
      odor == "c" ~ "creosote",
      odor == "y" ~ "fishy",
      odor == "f" ~ "foul",
      odor == "m" ~ "musty",
      odor == "n" ~ "none",
      odor == "p" ~ "pungent",
      odor == "s" ~ "spicy"
    )
  )
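
One thing worth keeping in mind with this kind of recoding: case_when returns NA for any value that none of the conditions match, so a code missing from the mapping silently becomes NA. Below is a small sketch of how one might check for that, using a toy tibble (the code "z" is made up and deliberately left out of the mapping).

```r
library(dplyr)

# Toy codes, invented for illustration; "z" is deliberately not in the mapping
raw <- tibble(odor = c("a", "n", "z"))

cleaned <- raw %>%
  mutate(
    odor_label = case_when(
      odor == "a" ~ "almond",
      odor == "n" ~ "none"
    )
  )

# Rows whose code matched no condition end up with an NA label
unmatched <- cleaned %>% filter(is.na(odor_label))
unmatched
```

If a check like this returns any rows on the real data, the mapping is incomplete.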

If you are unfamiliar with the above syntax, please check out a hands-on guide to using the tidyverse in one of my earlier articles.

Diving into the tidyverse using the Titanic data

The old

In order to better understand which advantages tabyl offers, let’s first make a frequency table using the base R table function.

table(mushrooms$class)

   edible poisonous 
     4208      3916 

table(mushrooms$odor, mushrooms$class)

            edible poisonous
  almond      400         0
  anise       400         0
  creosote      0       192
  fishy         0       576
  foul          0      2160
  musty         0        36
  none       3408       120
  pungent       0       256
  spicy         0       576

Unsurprisingly, it turns out that odor is a great predictor of mushroom edibility, with anything "funny-smelling" probably being poisonous. Thank you evolution! Also, nearly half of the mushrooms in this dataset (3916 of 8124) are poisonous, so it’s always important to be cautious when picking mushrooms on your own.

If we want to be able to use the variable names directly without specifying the $ operator, we would need to use the with command to make the dataset available to the table function.

mush_table <- with(mushrooms, table(odor, class))

Unfortunately, if we want to upgrade to proportions instead of absolute numbers, we cannot use the same function and need another one instead: prop.table.

prop.table(mush_table)

          class
odor            edible   poisonous
  almond   0.049236829 0.000000000
  anise    0.049236829 0.000000000
  creosote 0.000000000 0.023633678
  fishy    0.000000000 0.070901034
  foul     0.000000000 0.265878877
  musty    0.000000000 0.004431315
  none     0.419497784 0.014771049
  pungent  0.000000000 0.031511571
  spicy    0.000000000 0.070901034

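Called without a margin argument, prop.table divides every cell by the grand total, so the proportions above sum to 1 across the whole table. If we want row-wise or column-wise proportions instead, we can specify the margin argument (1 for row-wise and 2 for column-wise).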

prop.table(mush_table, margin = 1)

          class
odor           edible  poisonous
  almond   1.00000000 0.00000000
  anise    1.00000000 0.00000000
  creosote 0.00000000 1.00000000
  fishy    0.00000000 1.00000000
  foul     0.00000000 1.00000000
  musty    0.00000000 1.00000000
  none     0.96598639 0.03401361
  pungent  0.00000000 1.00000000
  spicy    0.00000000 1.00000000
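
A quick way to see what each margin setting divides by is to check which sums come out as 1. The sketch below uses a small toy table (counts invented for illustration) and base R only.

```r
# Toy counts, invented for illustration only
toy_table <- table(
  odor  = c("almond", "none", "none", "foul", "foul", "foul"),
  class = c("edible", "edible", "poisonous", "poisonous", "poisonous", "poisonous")
)

# No margin: every cell is divided by the grand total
sum(prop.table(toy_table))

# margin = 1: each cell is divided by its row total
rowSums(prop.table(toy_table, margin = 1))

# margin = 2: each cell is divided by its column total
colSums(prop.table(toy_table, margin = 2))
```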

All these special functions can feel cumbersome and hard to remember, so a single function which contains all the above functionality would be nice to have.

Additionally, if we check the type of the created object using the class(mush_table) command, we see that it is of class table.

This creates a compatibility problem, since many R users now work in the tidyverse ecosystem, which is centered around applying functions to data.frame type objects and stringing the results together using the pipe (%>%) operator.
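
Base R can convert a table to a data frame, but the result is reshaped into long format (one row per combination, with the counts in a Freq column) rather than keeping the cross-tab layout. A small sketch with toy data invented for illustration:

```r
# Toy counts, invented for illustration only
toy_table <- table(
  odor  = c("almond", "none", "none", "foul"),
  class = c("edible", "edible", "poisonous", "poisonous")
)

class(toy_table)   # a "table" object, not a data frame

# Converting works, but yields one row per odor/class combination
toy_df <- as.data.frame(toy_table)
names(toy_df)
toy_df
```

Getting back to a wide odor-by-class layout from here takes an extra reshaping step.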

The new

Let’s do the same things with the tabyl function.

tabyl(mushrooms, class)

     class    n   percent
    edible 4208 0.5179714
 poisonous 3916 0.4820286

mush_tabyl <- tabyl(mushrooms, odor, class)
mush_tabyl

     odor edible poisonous
   almond    400         0
    anise    400         0
 creosote      0       192
    fishy      0       576
     foul      0      2160
    musty      0        36
     none   3408       120
  pungent      0       256
    spicy      0       576

Compared to the corresponding table output, the tables produced by tabyl are tidier, with the variable name (class in the one-way table, odor in the two-way table) stated explicitly in the header. Moreover, for the one-way table, the proportions are generated automatically alongside the counts.

We can also notice that we didn’t have to use the with function to be able to specify the variable names directly. Additionally, running class(mush_tabyl) tells us that the resulting object is of a data.frame class, which ensures tidyverse compatibility!
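
To make the compatibility point concrete, here is a minimal sketch of piping a tabyl straight into a dplyr verb. The toy data frame is invented for illustration; with the real data, mushrooms would take its place.

```r
library(janitor)
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(
  odor  = c("almond", "none", "none", "foul"),
  class = c("edible", "edible", "poisonous", "poisonous")
)

toy_tabyl <- tabyl(toy, odor, class)
class(toy_tabyl)   # includes "data.frame"

# Because it is a data frame, dplyr verbs apply directly:
# keep only the odors with at least one poisonous mushroom
risky <- toy_tabyl %>%
  filter(poisonous > 0) %>%
  arrange(desc(poisonous))
risky
```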

The adorned janitor

Image created using Canva Image Generator

For additional tabyl functionality, the janitor package also contains a series of adorn functions. To get the percentages, we simply pipe the resulting frequency table to the adorn_percentages function, which calculates row-wise percentages by default.

mush_tabyl %>% adorn_percentages()

     odor    edible  poisonous
   almond 1.0000000 0.00000000
    anise 1.0000000 0.00000000
 creosote 0.0000000 1.00000000
    fishy 0.0000000 1.00000000
     foul 0.0000000 1.00000000
    musty 0.0000000 1.00000000
     none 0.9659864 0.03401361
  pungent 0.0000000 1.00000000
    spicy 0.0000000 1.00000000

If we want the column-wise percentages, we can specify the denominator argument as "col".

mush_tabyl %>% adorn_percentages(denominator = "col")

     odor     edible   poisonous
   almond 0.09505703 0.000000000
    anise 0.09505703 0.000000000
 creosote 0.00000000 0.049029622
    fishy 0.00000000 0.147088866
     foul 0.00000000 0.551583248
    musty 0.00000000 0.009193054
     none 0.80988593 0.030643514
  pungent 0.00000000 0.065372829
    spicy 0.00000000 0.147088866

The tabyl-adorn combo even enables us to easily combine both the number and percentage in the same table cell…

mush_tabyl %>% adorn_percentages() %>% adorn_ns()

     odor           edible         poisonous
   almond 1.0000000  (400) 0.00000000    (0)
    anise 1.0000000  (400) 0.00000000    (0)
 creosote 0.0000000    (0) 1.00000000  (192)
    fishy 0.0000000    (0) 1.00000000  (576)
     foul 0.0000000    (0) 1.00000000 (2160)
    musty 0.0000000    (0) 1.00000000   (36)
     none 0.9659864 (3408) 0.03401361  (120)
  pungent 0.0000000    (0) 1.00000000  (256)
    spicy 0.0000000    (0) 1.00000000  (576)
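
The proportions in this combined output are still printed as long decimals. One way to tidy that up is janitor's adorn_pct_formatting, placed between adorn_percentages and adorn_ns. This is a sketch on a toy data frame (invented for illustration), and the choice of one decimal place is arbitrary.

```r
library(janitor)
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(
  odor  = c("almond", "none", "none", "none", "foul"),
  class = c("edible", "edible", "edible", "poisonous", "poisonous")
)

formatted <- toy %>%
  tabyl(odor, class) %>%
  adorn_percentages("row") %>%          # row-wise proportions
  adorn_pct_formatting(digits = 1) %>%  # proportions become percentage strings
  adorn_ns()                            # append the counts in parentheses
formatted
```

Note that after formatting, the cells are character strings, so this step belongs at the end of a pipeline, after any numeric work is done.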

… or add the totals to the rows and columns.

mush_tabyl %>% adorn_totals(c("row", "col"))

     odor edible poisonous Total
   almond    400         0   400
    anise    400         0   400
 creosote      0       192   192
    fishy      0       576   576
     foul      0      2160  2160
    musty      0        36    36
     none   3408       120  3528
  pungent      0       256   256
    spicy      0       576   576
    Total   4208      3916  8124
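
The header of a two-way tabyl names only the row variable (odor). If the column variable should be visible too, adorn_title can be added to the same pipe. Below is a sketch on toy data invented for illustration, using the "combined" placement, which puts both variable names into the first column header.

```r
library(janitor)
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(
  odor  = c("almond", "none", "none", "none", "foul"),
  class = c("edible", "edible", "edible", "poisonous", "poisonous")
)

with_totals <- toy %>%
  tabyl(odor, class) %>%
  adorn_totals(c("row", "col")) %>%  # add a Total row and a Total column
  adorn_title("combined")            # first header names both variables
with_totals
```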

Conclusion

The tabyl() function from the janitor package in R offers a user-friendly and flexible solution for creating one-way, two-way, or three-way contingency tables. It excels in automatically computing proportions and producing tidy data frames that integrate seamlessly with the tidyverse ecosystem, especially dplyr. Its outputs are well-structured and easy to interpret, and it can be further enhanced with adorn functions, simplifying the overall process of generating informative frequency tables. This makes tabyl() a highly beneficial tool in data analysis in R.

