Coffee Data Science

Comparing Coffee Q-Graders

Hoping to see how the objectivity of Q-grading could be measured

Robert McKeon Aloe
Towards Data Science
5 min read · Feb 16, 2021

A few months ago, I reached out to Boomtown Coffee Roasters. They had a good share of their Q-score data, and I thought they might have a bit more I could look at. They kindly shared with me 400+ Q-scores across 200+ coffees. What was most interesting was that each coffee had two graders. So I did some analysis to try to understand how well their scores correlated with each other.

I assume the graders don’t share too much information on their scores while grading. In Q-grading, the aim is for different graders to grade the same across coffees. So this is an interesting test to understand how true this turns out to be.

Boomtown Data

They used Cropster to record and store information about the roast and their individual grades. Half of the coffees were graded by Dean and Chris or Dean and Savannah, but only one coffee lot was graded by all three. There are multiple entries for some coffees but not many.

All images by author

Q-Scores

A Q-score combines 10 factors recorded during coffee cupping. I’ve summarized each metric as I understand it from the SCA Cupping Protocol, which defines how to prepare the coffee samples, how to taste, and how to score. A quick search turns up plenty of online resources that explain how to do cupping.

The final score can also lose points for defective beans in the batch, as there is a standard for how many defects are allowed.

Similarity Metric

In pattern recognition, a feature vector is used to compare two items, usually two signals or images. A score is computed between the vectors to determine how similar or dissimilar they are to one another.

To compute a similarity score for two coffees, each vector of sub-metrics was compared to all the others using Root-mean-square:

However, in the breakdowns below, I adjusted each graph to be between 0% and 100%. 100% doesn’t mean perfect match, and 0% isn’t no match. It’s relative to the data in each chart where 100% is the maximum similarity (most similar) and 0% is the minimum similarity (least similar).
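As a rough illustration of the two steps above, here is a minimal Python sketch. It assumes the comparison is the root-mean-square of the differences between two sub-metric vectors and that the 0% to 100% adjustment is a min-max rescaling within each chart; the exact formula is not spelled out here, so treat both as assumptions. The function names and the scores are made up for the example.

```python
import math


def rms_distance(a, b):
    """Root-mean-square difference between two equal-length score vectors.

    Assumption: smaller distance means more similar grading.
    """
    if len(a) != len(b):
        raise ValueError("score vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def relative_similarity(distances):
    """Rescale distances so the smallest maps to 100% and the largest to 0%.

    The result is relative to the values passed in, not an absolute match.
    """
    lo, hi = min(distances), max(distances)
    if hi == lo:
        # All pairs are equally far apart, so they are all "most similar".
        return [100.0 for _ in distances]
    return [100.0 * (hi - d) / (hi - lo) for d in distances]


# Hypothetical sub-metric scores for one coffee (not from the Boomtown data).
dean = [8.0, 7.75, 7.5, 8.0]
chris = [8.0, 7.5, 7.5, 7.75]
savannah = [7.0, 7.0, 7.0, 7.0]

distances = [rms_distance(dean, chris), rms_distance(dean, savannah)]
similarity = relative_similarity(distances)
```

With only two pairs in the chart, the closer pair lands at 100% and the other at 0%, which shows why these percentages only make sense relative to the other values in the same chart.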

We can look over regions to see how scores differ. Chris’s scores for Asian coffees correlate the least with anyone else’s scores. Otherwise, most of their scores have a high similarity to one another, which indicates either that their Q-grading training was done well and is objective, or that by grading coffees together they have adjusted their scoring to be in sync with each other.

We can focus here on coffees from the same region and on the same graders. Savannah and Dean have the most in common except for South American beans; otherwise, Savannah matches Dean more closely than Chris matches either Dean or Savannah.

Correlation

Correlation is a metric that says how similar two variables are to each other. High correlation doesn’t mean one variable causes the other, only that the two variables tend to go up or down together. I would assume from the start that some grading variables would have a high correlation because they assess taste at different points in time.
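For readers who want to try this on their own scores, here is a minimal sketch of a correlation calculation in plain Python. It assumes the Pearson correlation coefficient, which is the usual default but is not confirmed here as the exact measure used. The function name and the sample scores are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Co-variation of the two lists around their means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for two sub-metrics across five coffees.
flavor = [7.5, 8.0, 7.75, 8.25, 7.25]
aftertaste = [7.25, 7.75, 7.75, 8.0, 7.0]

r = pearson(flavor, aftertaste)
```

A value near 1 means the two sub-metrics rise and fall together, a value near -1 means they move in opposite directions, and a value near 0 means there is no linear relationship.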

There are slight differences between the two in terms of correlation.

Correlation between regions and graders has some strange results. In theory, you should be most correlated with yourself, but that doesn’t always seem to be true here. However, I doubt small variations in correlation are statistically significant.

Raw Differences

We can look at the raw differences and compare Chris and Savannah to Dean’s scores because Dean graded almost all the coffees.

Another way to look at the data is the average difference across all the sub-metrics to see where the differences were in raw numbers. There are two ways to view this: the average over all samples, and the average over only those samples where the difference was not zero.
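The two averages can be sketched in a few lines of Python. This is an illustration under assumptions: the differences are taken as signed (Dean’s score minus the other grader’s), and the function names and scores are made up rather than taken from the Boomtown data.

```python
def mean_difference(reference, other):
    """Average signed difference (reference minus other) over all samples."""
    diffs = [r - o for r, o in zip(reference, other)]
    return sum(diffs) / len(diffs)


def mean_nonzero_difference(reference, other):
    """Average signed difference over only the samples where the scores differ."""
    diffs = [r - o for r, o in zip(reference, other) if r != o]
    # If the graders never disagree, report a difference of zero.
    return sum(diffs) / len(diffs) if diffs else 0.0


# Hypothetical scores for one sub-metric across four coffees.
dean = [8.0, 7.5, 8.0, 7.0]
chris = [8.0, 7.25, 8.0, 7.5]

avg_all = mean_difference(dean, chris)              # -0.0625
avg_nonzero = mean_nonzero_difference(dean, chris)  # -0.125
```

The second average is larger in magnitude because it leaves out the coffees where both graders gave the same score, so it shows how far apart they are when they do disagree.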

We can break these down into regional differences, which is another fun way to view the scores.

Dean vs Chris has no data for N/A

Both of these metrics ultimately tell us that there is not much difference between the two.

This was a unique dataset given that there are not many datasets of the same tasting sessions for multiple graders. In fact, I suspect this is the biggest such dataset that has been compiled, simply because the use of data in coffee is so new. The good news is that even though the differences across sub-metrics add up to some difference in the final score, that difference is not very large. Nor is the difference for any of the sub-metrics.

This gives high confidence in one of two things: either all three of these Q-graders were well trained to grade objectively, or their scoring was normalized by living in a similar area, drinking similar coffees, and grading together. I hope this could help encourage other graders to share their data to help study the effectiveness of Q-grading as an objective measure of coffee.


I’m in love with my Wife, my Kids, Espresso, Data Science, tomatoes, cooking, engineering, talking, family, Paris, and Italy, not necessarily in that order.