Coffee Data Science

Specialty Coffee: Comparing Grading Methods

Exploring different rulers of quality for coffee

Robert McKeon Aloe
Towards Data Science
6 min read · Dec 4, 2020


When I started getting interested in coffee, I didn’t care so much about grades or processes or flavors. I wanted rich, textured espresso, and I explored quite a bit. I only started paying attention to coffee grades when I started home roasting 6 years ago. Recently, I decided to look at the available data and gain a better understanding of how coffees are graded.

The dominant method of grading coffee is called Q-grading, developed by the Specialty Coffee Association (SCA). It was developed to standardize coffee grading across the world, which meant that it had to overcome some of the subjective nature of taste. The Q-grading criteria are also supposed to separate specialty coffee (Q-score ≥ 80) from lower-quality coffee.

Q-grading occurs during a blind cupping of multiple coffees, and a grader is trained to grade coffees in a standard way. They have to recalibrate every so often on some standard sets of coffees.

Sweet Maria’s has been selling green beans for over two decades. They originally scored coffee using the SCA grading scale, but they changed the way they graded coffee in response to criticism that the SCA scale can be confusing, switching to a descriptive cupping system.

I’ve reviewed both of these scoring systems in previous posts, and I thought it would be good to compare them to each other. The aim is to determine what they share in common, and to do this, I used a similarity match score found in pattern recognition. If the score systems are similar, then for similar regions and processing types, their match scores will compare well. If they don’t compare well, it means the metrics don’t trend together.

This is the third of three articles using this SW dataset. The first one focused on how cupping grades and flavors relate to themselves and each other. The second article focused on using Sweet Maria’s cupping grades and flavors as similarity metrics to compare coffees.

Coffee Data

The Coffee Quality Institute (CQI) is a trust founded by the SCA. People submit green coffee samples which are then cupped and graded. A nice person built a database of over 1300 coffees from the CQI website, and I’ve previously looked at the metrics for their usefulness.

Sweet Maria’s (SW) has a database of their coffee grades and flavors going back a few years. I spent some time extracting this database (as it was not in a format that was easy to pull), and I looked at some comparisons between grades and flavors.

To compare CQI (SCA) grades with SW grades, I had to select only the metrics they have in common. For SCA scores, the categories of Uniformity, Clean Cup, and Sweetness are generally useless at comparing coffees. There is almost no variation in these scores, but they probably matter more for determining whether coffee is specialty or not.

SW on the left, SCA on the right

So I took these metrics based on their descriptions, and I found six metrics in common between them.

Similarity

In pattern recognition, a feature vector is used to compare two items, usually two signals or images. A score is computed between the vectors to determine how similar or dissimilar they are to one another.

To compute a similarity score for two coffees, each vector of sub-metrics was compared to all the others using Root-mean-square:

This is for 10 sub-metrics, so this equation would be adjusted to 6
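As a minimal sketch of this kind of score, assuming it is the root-mean-square of the element-wise differences between two sub-metric vectors (so a lower value means a closer match): the function name `rms_difference` and the sample scores are made up for illustration and are not taken from either dataset.

```python
import math


def rms_difference(a, b):
    """Root-mean-square of the element-wise differences between two vectors."""
    if len(a) != len(b):
        raise ValueError("both coffees need the same number of sub-metrics")
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    return math.sqrt(sum(squared) / len(a))


# Hypothetical scores for the six shared sub-metrics of two coffees.
coffee_a = [8.0, 8.5, 7.5, 8.0, 8.0, 8.5]
coffee_b = [7.5, 8.0, 7.5, 8.5, 8.0, 8.0]

score = rms_difference(coffee_a, coffee_b)
print(f"RMS difference: {score:.3f}")
```

Dividing by `len(a)` rather than a fixed 10 is what lets the same function handle six sub-metrics.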

However, in the breakdowns below, I adjusted each graph to be between 0% and 100%. 100% doesn’t mean perfect match, and 0% isn’t no match. It’s relative to the data in each chart where 100% is the maximum similarity (most similar) and 0% is the minimum similarity (least similar).
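One way to get this kind of relative 0% to 100% scale is a min-max rescale within each chart. The sketch below assumes the raw score is a difference (lower means more similar) and flips it so the smallest difference maps to 100%. The function name and the sample values are hypothetical, and the exact scaling used for the charts is not stated.

```python
def to_relative_similarity(differences):
    """Rescale raw differences so the smallest maps to 100% and the largest to 0%."""
    lo, hi = min(differences), max(differences)
    if hi == lo:
        # Every pair is equally similar, so there is no spread to rescale.
        return [100.0 for _ in differences]
    return [100.0 * (hi - d) / (hi - lo) for d in differences]


# Hypothetical raw differences for three comparisons in one chart.
raw = [0.2, 0.5, 0.8]
relative = to_relative_similarity(raw)
print([round(r, 1) for r in relative])
```

Because the minimum and maximum come from the chart's own data, percentages from two different charts are not directly comparable.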

Comparison using Similarity Scores

Regions

I started with the raw scores, but as you can see below, the CQI and SW scores are not similar at all.

All images by author

So I normalized each SW submetric by shifting it to have the same average as the corresponding CQI submetric. Then I recomputed the similarity scores. This got them into the same ballpark. More complex normalization techniques could have been used, but I prefer to stay as simple as possible.
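The mean shift can be sketched in a few lines. This assumes a plain additive offset per submetric, with no change to the spread. The function name and the sample values are illustrative only.

```python
from statistics import mean


def shift_to_match_mean(values, target_mean):
    """Add a constant offset so the values end up with the target mean."""
    offset = target_mean - mean(values)
    return [v + offset for v in values]


# Hypothetical scores for one shared submetric in each dataset.
sw_metric = [8.0, 8.5, 9.0]
cqi_metric = [7.0, 7.5, 8.0, 7.5]

sw_shifted = shift_to_match_mean(sw_metric, mean(cqi_metric))
print(sw_shifted, mean(sw_shifted))
```

A shift leaves the differences between SW coffees untouched, so SW vs SW comparisons are unaffected. Only the SW vs CQI comparisons change.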

This chart shows a few things. For both SW and CQI metrics, South American beans are most similar to themselves and even most similar when comparing SW to CQI. African beans and Indonesia & Asia beans have the largest gap in similarity across CQI and SW.

The pattern of regions vs themselves and each other varies a lot for SW scores but not so much for CQI. This could mean that SW scores better capture regional differences.
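A region-vs-region chart like this can be built by averaging the pairwise scores for every pairing of coffees from two regions. The sketch below assumes a simple mean of RMS differences. How the scores were actually aggregated is not stated, and the region data here is invented.

```python
import math
from itertools import product
from statistics import mean


def rms_difference(a, b):
    """Root-mean-square of the element-wise differences between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def group_difference(group_a, group_b):
    """Mean RMS difference over every pairing of one coffee from each group."""
    return mean(rms_difference(a, b) for a, b in product(group_a, group_b))


# Hypothetical six-metric scores, two coffees per region.
coffees = {
    "South America": [[8.0, 8.0, 8.0, 8.0, 8.0, 8.0],
                      [8.0, 8.0, 8.0, 8.0, 8.0, 8.5]],
    "Africa": [[9.0, 8.0, 9.0, 8.0, 9.0, 8.0],
               [8.0, 9.0, 8.0, 9.0, 8.0, 9.0]],
}

# Within-region entries include each coffee paired with itself (difference 0).
matrix = {
    (r1, r2): group_difference(c1, c2)
    for r1, c1 in coffees.items()
    for r2, c2 in coffees.items()
}

for (r1, r2), value in matrix.items():
    print(f"{r1} vs {r2}: {value:.3f}")
```

In this toy data each region is closer to itself than to the other, which is the pattern to look for on the diagonal of the chart.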

Green Bean Processing

We can examine green bean processing, but just Dry, Honey, and Wet processing because I don’t have enough data for subcategories of each. Honey processing is the most similar to itself for CQI and SW as well as CQI vs SW.

The other metrics don’t do so well. It seems CQI data is very homogeneous compared to SW data.

Regions by Green Bean Processing

It is very important to look at regional differences across processes. The Africa column and row for honey-processed CQI are empty because there were not enough beans of that description.

This shows some finer details: for honey-processed beans, South America scores poorly vs itself and others in the SW scores, while in the CQI scores that’s not the case.

For Dry SW scores, it seems that the African beans have a heavy influence on how well things match. For just South American and Central American, they have good similarity across Dry and Wet processed.

We can reorganize this table for more direct CQI vs SW comparisons. From this chart, it becomes clearer that South American and Central American beans have more similar scores across CQI and SW and for CQI vs SW.

This table can be reduced to regions vs themselves and regions vs others. The biggest thing here is that for Regions vs Self, the pattern for CQI vs itself and SW vs itself is very similar. The main differences come when comparing to others or when comparing CQI to SW.

This fun exploration showed that there are a lot of similarities between the CQI scores and the SW scores, but it seems the SW scores have more variation. This could possibly be due to a data imbalance, since the CQI dataset is 3 times the size of the SW dataset. Central American and South American beans have the most similarities between CQI and SW scores across processing methods.


I’m in love with my Wife, my Kids, Espresso, Data Science, tomatoes, cooking, engineering, talking, family, Paris, and Italy, not necessarily in that order.