
Apriori Analysis in R beyond sales


The apriori algorithm is frequently used in so-called "basket analysis" to determine whether a given item is bought more frequently in combination with other items (like the famous beer-and-diapers example). As this is quite a niche use case, I wanted to test whether I could detect any patterns in a dataset that has nothing to do with sales. I started off with the mushroom dataset (available from the UCI Machine Learning Repository), which contains mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms, limited to 23 species of gilled mushrooms in the Agaricus and Lepiota family. The underlying hypothesis is that I will be able to re-extract the different species based on their descriptions. By the way, one species of Agaricus is the well-known, widespread salad/pizza/soup/place-your-dish-here ingredient: the common button mushroom Agaricus bisporus:

source: By Darkone, CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=671026

First, we read in the data file and take the column names from the agaricus-lepiota.names file. To present the table nicely, I had to flip it, so the columns and rows are swapped, but you can still decipher all column names and the first few rows. The dataset contains 23 columns (the edibility label plus 22 features) and 8124 rows. A quick check of the value distribution within each feature shows that the column "veil_type" contains only one value, so we can get rid of it, as it provides no information for classifying the mushrooms.

path <- "my path on computer"
df <- read.table(file.path(path, "agaricus-lepiota.data"), sep=",")
names <- c("edible","cap_shape", "cap_surface", "cap_color", "bruises","odor","gill_attachment","gill_spacing", "gill_size","gill_color","stalk_shape","stalk_root","stalksurface_abovering","stalksurface_belowring","stalkcolor_abovering","stalkcolor_belowring","veil_type","veil_color","ring_number","ring_type","sporeprint_color","population","habitat")
colnames(df) <- names
as.data.frame(t(head(df))) #the table is flipped for demonstration purposes
#check, if there are columns of constant values
for (i in 1:ncol(df)){
  if(length(table(df[,i], exclude=NULL))<2){
    print(colnames(df)[i])
  }
}
table(df$veil_type, exclude=NULL)
#veil_type has only 1 value
df <- df[ , -which(names(df) %in% c("veil_type"))]
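
The constant-column check above can also be written without an explicit loop. The following is only a minimal sketch on a made-up toy table (the values and the names `toy`, `n_levels`, `constant_cols` and `toy_clean` are hypothetical, not part of the original analysis):

```r
# Toy stand-in for the mushroom table (hypothetical values, not the real data)
toy <- data.frame(
  edible    = c("e", "p", "e", "p"),
  veil_type = c("p", "p", "p", "p"),  # constant column
  odor      = c("n", "f", "n", "a"),
  stringsAsFactors = FALSE
)

# Number of distinct values per column (NA would count as a value of its own)
n_levels <- vapply(toy, function(col) length(unique(col)), integer(1))

# Columns with a single distinct value carry no information
constant_cols <- names(n_levels)[n_levels < 2]
print(constant_cols)

# Drop them by name; setdiff() also behaves well when nothing is constant
toy_clean <- toy[, setdiff(names(toy), constant_cols), drop = FALSE]
print(names(toy_clean))
```

One design note: negative indexing with which() returns zero columns when no name matches, which is why this sketch selects the columns to keep with setdiff() instead.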

Next, we load the library and transform the dataset to a form which the Apriori Algorithm implemented in the arules package expects.

library(arules)
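#prefix each value with its column name, e.g. "odor.n"
for (i in 1:ncol(df)){
  df[,i] <- paste(colnames(df)[i], df[,i], sep=".")
}
## create transactions
#(the coercion is documented for factor or logical columns, so the character columns are converted on the fly)
fungi <- as(as.data.frame(lapply(df, factor)), "transactions")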

To get a feeling for the algorithm, I first try to find some rules associated with the edibility of the mushroom (though, as already mentioned, this is not the goal of the analysis; the strength of apriori should be its ability to identify features irrespective of a target, unlike common machine learning classification algorithms, which usually need one). A rule is a combination of features associated with the feature "edible", in the form "{x, y, z} => edible" (given x, y, z, the feature "edible" follows).

However, I need to set a threshold on how confident I want to be about a certain rule. This can be achieved by varying different criteria. The support tells us how frequently a particular feature combination has been observed within the dataset; e.g. a minimum support of 0.1 corresponds to 0.1 x 8124 ≈ 812 samples, meaning that at least 812 samples comply with the rule "x&y&z&edible". The confidence of a rule is the frequency of the full feature combination relative to the frequency of its left-hand side: confidence({x, y, z} => edible) = count(x&y&z&edible) / count(x&y&z). It answers the question: of all the mushroom samples in the dataset that comply with "x&y&z", what fraction also comply with "x&y&z&edible"? The algorithm as implemented in the arules package also allows controlling the minimal and maximal length of a rule (minlen, maxlen). Let’s have a look:
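
Before the full run, the two measures can be checked by hand on a toy table. This is only a sketch in base R; the ten rows and the column names `no_odor`, `broad_gills` and `edible` are made up for illustration and are not taken from the mushroom data:

```r
# Hypothetical toy data: 10 mushrooms, two binary features and the label
toy <- data.frame(
  no_odor     = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  broad_gills = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  edible      = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

lhs  <- toy$no_odor & toy$broad_gills   # left-hand side {no_odor, broad_gills}
both <- lhs & toy$edible                # left-hand side and "edible" together

# Support: share of all samples that match the whole rule (3 of 10)
support_rule <- mean(both)

# Confidence: share of left-hand-side samples that are also edible (3 of 4)
confidence_rule <- sum(both) / sum(lhs)

print(support_rule)     # 0.3
print(confidence_rule)  # 0.75
```

With the definitions checked on toy data, here is the grid over support and confidence on the real dataset: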

library(tidyverse)
supps <- c(); confs <- c(); nums <- c()
for (supp in seq(0.3, 0.7, by=0.05)){
  for (conf in seq(0.5, 0.9, by=0.05)){
    #run apriori algorithm looping over different support and confidence levels
    rules <- apriori(fungi,parameter = list(supp = supp, conf = conf, target = "rules", minlen=2,maxlen=22))
    #limit set of rules to ==>"edible"
    rules_sub <- subset(rules, subset = rhs %in% "edible=edible.e")
    #extract results
    #extract results (length() of a rule set is the number of rules)
    supps <- c(supps,supp); confs <- c(confs, conf); nums <- c(nums,length(rules_sub))
  }
}
#combine results to data frame
results <- data.frame(support=supps,confidence=confs, count=nums)
results %>% arrange(desc(confidence)) %>% head(.)

We see that at a support of 0.4 and a confidence of 0.9 the algorithm finds two rules with a =>"edible" condition, each of them covering at least 0.4 x 8124 ≈ 3250 mushroom samples. Not bad. Let’s have a closer look at those:

#repeat modeling with pre-selected thresholds for support and confidence
rules <- apriori(fungi,parameter = list(supp = 0.4, conf = 0.9, target = "rules", minlen=2,maxlen=22))
#limit set of rules to ==>"edible"
rules_sub <- subset(rules, subset = rhs %in% "edible=edible.e")
inspect(rules_sub)

And here it is. The apriori algorithm has identified the rule that if there is no odor, the mushroom is edible. This rule has a confidence of 0.97, which means that 97% of all odorless mushrooms were edible (in our dataset!). The lift of > 1 (which can be read as a kind of correlation between odor and edibility) tells us that there is a positive association between those two features. The second rule is a bit more interesting. It tells us that broad gills (gill_size) together with a smooth stalk surface above the ring are positively associated with the edibility of the mushroom. At least this holds for 94% of the broad-gill, smooth-stalk specimens of Agarici. Let’s briefly check with a quick-and-dirty Random Forest whether we get similar features discriminating for edibility:
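
As a side note on the lift mentioned above: it is the confidence of a rule divided by the overall frequency of its right-hand side, so a value above 1 means the two sides co-occur more often than expected if they were independent. A minimal sketch, with a hypothetical helper `lift_from_counts` and made-up counts that are not taken from the mushroom data:

```r
# Lift from raw counts (all counts below are hypothetical)
lift_from_counts <- function(n_total, n_lhs, n_rhs, n_both) {
  confidence <- n_both / n_lhs    # share of lhs samples that also match rhs
  base_rate  <- n_rhs / n_total   # share of all samples that match rhs
  confidence / base_rate
}

# 1000 samples, 400 match the left-hand side, 500 are edible, 360 match both:
# confidence = 0.9, base rate = 0.5
example_lift <- lift_from_counts(n_total = 1000, n_lhs = 400, n_rhs = 500, n_both = 360)
print(example_lift)

# If lhs and rhs were independent, both would occur 400 * 500 / 1000 = 200 times
independent_lift <- lift_from_counts(n_total = 1000, n_lhs = 400, n_rhs = 500, n_both = 200)
print(independent_lift)
```

Back to the mushrooms and the Random Forest check: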

library(caret)
table(df$edible)
fit_control <- trainControl(method = "cv",number = 10)
set.seed(123456)
rf_fit <- train(as.factor(edible) ~ .,data = df,method = "rf",trControl = fit_control)
rf_fit
saveRDS(rf_fit, file.path(path, "fungiEdibility_RF_model.rds"))
rf_fit <- readRDS(file.path(path, "fungiEdibility_RF_model.rds"))
varImp(rf_fit)

Nice! Random Forest also selects odor as the most important feature for discriminating between edible and poisonous mushrooms. But this was not the goal of our analysis. We are still looking for the 23 different species or any interesting patterns within the dataset.

Therefore, I run the algorithm again, now looking not for rules but for frequent feature combinations. I set the minimal length of the feature set to 15.

#run apriori (conf is not used when mining frequent itemsets)
itemset <- apriori(fungi,parameter = list(supp = 0.1, conf = 0.98,target = "frequent itemsets", minlen=15,maxlen=22))
#get top items (one column per itemset; this step assumes all itemsets have the same length)
top_items <- data.frame(as(items(itemset), "list"))
#strip the "feature=" prefix, leaving "feature.level", and append the quality measures
top_items <- cbind(as.data.frame(t(apply(top_items,2,function(x) sub("[[:alpha:]]+_?[[:alpha:]]+=", "", x)))),quality(itemset))
n_item_cols <- ncol(top_items) - ncol(quality(itemset))
#spread top items into a tidy dataframe
df2 <- df[0,]
colnames(df2)
for (rownr in 1:nrow(top_items)){
  for (cols in 1:n_item_cols){
    item <- as.character(top_items[rownr,cols])
    #split "feature.level" at the first literal dot (fixed=TRUE, as "." is a regex wildcard)
    dot <- regexpr(".", item, fixed=TRUE)
    feature <- substring(item, 1, dot-1)
    type <- substring(item, dot+1)
    for (i in 1:ncol(df2)){
      if(feature==colnames(df2)[i]){
        df2[rownr,i] <- type
      }
    }
  }
}
#exclude columns which have never been selected by apriori 
non_na_cols <- c()
for (k in 1:ncol(df2)){
  if(sum(is.na(df2[,k]))!=nrow(df2)){
    non_na_cols <- c(non_na_cols,colnames(df2)[k])
  }
}
df2[,non_na_cols]
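
The reshaping step above, from item labels to one row per itemset, can be illustrated in isolation. The following base R sketch uses made-up item labels in the "feature=feature.level" form; the helper `parse_itemset` and the example itemsets are hypothetical and only stand in for the real output of apriori:

```r
# Hypothetical item labels in the "feature=feature.level" form
itemsets <- list(
  c("edible=edible.e", "odor=odor.n", "gill_size=gill_size.b"),
  c("edible=edible.e", "odor=odor.n", "habitat=habitat.d")
)

# Turn one itemset into a named character vector: names = features, values = levels
parse_itemset <- function(labels) {
  feature <- sub("=.*$", "", labels)    # text before "="
  level   <- sub("^.*\\.", "", labels)  # text after the last literal dot
  setNames(level, feature)
}

parsed   <- lapply(itemsets, parse_itemset)
features <- unique(unlist(lapply(parsed, names)))

# One row per itemset, one column per feature; features absent from an itemset stay NA
wide <- as.data.frame(
  do.call(rbind, lapply(parsed, function(p) unname(p[features]))),
  stringsAsFactors = FALSE
)
colnames(wide) <- features
print(wide)
```

Because each itemset is looked up by feature name, this variant does not require all itemsets to have the same length.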

Here I again flip the table, so that you can read it better. I have also added the translation for the selected levels of the different features that were identified by the algorithm.

Interestingly, the algorithm identified patterns for edible mushrooms only, presumably because there is higher variability among the poisonous Agarici than among the edible ones. We also see that there is a sub-feature set common to all feature sets (edible&woods&bruises&no odor&free gills&close gill spacing&broad gills&tapering stalk&bulbous root&smooth stalk surface&white veil&one pendant ring). This representative feature set strongly reminds me of the button mushroom.

Let’s check how far this description gets us in nailing down the species. The [webpage](https://www.mushroomexpert.com/agaricales.html) offers a key to different mushroom species. Starting from the very first page, we go through the description. As we have not read anything about our mushroom growing on other mushrooms (but rather solitary in woods), question 1 leads to question 2, which we answer positively (our mushroom has gills). We follow the key further: 3) not as above → 4) gills not running down the stem (our gills were free) → Gilled Mushrooms. So far, so good. The link takes us to the next page. Following the key we get: 1) with veil and not fishy → 2) spores not pink → 3) gills not yellow and not running down the stem → 4) spore print not orange → 6) spore print dark (brown, black). We have this feature in two feature sets only, but I assume that if the digression to pale-spored mushrooms were of importance, it would have been picked up by the algorithm. Following the key further (I omit the individual key entries here; you can walk through them if you wish), we get: 1 → 3 → 12 → 21 → 26 → Agaricus.

Great, the algorithm has picked up some features common to Agaricus. The webpage offers the not too optimistic statement that "you should not expect to be able to identify every Agaricus collection you make". And indeed, already the first two questions leave us without an answer, as neither flesh coloring upon slicing nor association with Monterey cypress is addressed in the dataset. I guess the original key of the Audubon Society Field Guide might shed some light on this question, yet unfortunately I don’t have access to this resource. The only comfort is that the button mushroom Agaricus bisporus appears if we answer "yes" to white cap color.

So, what is my takeaway from this toy analysis? The algorithm definitely picked up a feature set common to all mushrooms of the Agaricus genus. However, I would have guessed that with 8124 samples / 23 species ≈ 353 samples per species, the algorithm would also have picked up feature sets characteristic of the individual species. And indeed, running a quick principal component analysis on the dataset clearly shows that there are distinct clusters within the edible (green) and poisonous (red) samples, most likely corresponding to the different species.

theme_set(theme_classic())
#one-hot encode the factors, as PCA works on numeric features only
dummy <- dummyVars(" ~ .", data=df, fullRank=T)
df_hotone <- data.frame(predict(dummy, newdata = df))
mypca <- prcomp(df_hotone) ;  myscores <- mypca$x[,1:3] #scores
#add PCA results to table 
df_hotone_PCA <- cbind(df_hotone,myscores)
ggplot(df_hotone_PCA, aes(PC1,PC2))+
  geom_point(aes(color=factor(edibleedible.p, levels=c("1","0"))))+
  theme(legend.position = "none")+
  labs(title="Principal component analysis on mushroom dataset",subtitle = "colored by edibility", caption = "red = poisonous; green = edible")
Image by author
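
For readers without caret at hand, the encode-then-project idea behind the plot can be sketched in base R. Note that this swaps dummyVars for model.matrix, so it is not the code used for the figure, and the toy table below is entirely made up:

```r
# Hypothetical toy data with two categorical features
toy <- data.frame(
  odor    = factor(c("n", "n", "f", "f", "a", "n")),
  habitat = factor(c("d", "g", "d", "g", "d", "d"))
)

# One-hot encode with model.matrix(): without an intercept, the first factor
# keeps all of its levels, later factors drop their first level
onehot <- model.matrix(~ . - 1, data = toy)
print(colnames(onehot))

# PCA on the 0/1 indicator columns (centered by default, not scaled)
pca    <- prcomp(onehot)
scores <- pca$x[, 1:2]   # coordinates of each sample on PC1 and PC2

# Share of the total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))
```

The `scores` matrix plays the same role as `myscores` in the code above and could be passed to ggplot in the same way.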

This concludes this rather disappointing analysis. I guess that in order to decipher the feature sets corresponding to the different species in the dataset, one needs to tune the apriori algorithm further or turn to another method. There is definitely room for improvement, so let’s roll up our sleeves and continue analyzing… in another story.

