I first came across Cohen’s Kappa on Kaggle during the Data Science Bowl competition. Although I did not actively compete, and the metric used was the quadratic weighted kappa, I forked a kernel to play around with the metric and see how it works, since I had never seen it before. The launch of the University of Liverpool – Ion Switching competition has given me another opportunity to understand the metric better, and this time I will be competing actively, since competing more is something I told myself I would do this year. Now that I have made myself publicly accountable, let’s break down Cohen’s Kappa.
For those who follow my posts, you will know that I like to break a post down into smaller sub-sections for comprehension and then connect them all together. This post is no different, and by the end you will be able to:
- Distinguishing between reliability and validity
- Explaining Cohen’s Kappa
- Evaluating Cohen’s Kappa
Validity and Reliability
Before getting into Cohen’s Kappa, I’d first like to lay down an important foundation about validity and reliability. When we talk of validity, we are concerned with the degree to which a test measures what it claims to measure; in other words, how accurate the test is. Reliability, on the other hand, is concerned with the degree to which a test produces similar results under consistent conditions; to put it another way, the precision of a test.
Using the Ion Switching competition problem, I will put these two measures into perspective so that we know how to distinguish them from one another.
The problem that the University of Liverpool have, and the reason they have provided us with the exciting Ion Switching competition, is that the existing methods of detecting when an ion channel is open are slow and laborious. They want data scientists to employ machine learning techniques to enable rapid, automatic detection of an ion channel’s current events in raw data. Validity would measure whether the scores obtained truly represent whether the ion channels are open, and reliability would measure whether the scores obtained are consistent at identifying whether a channel is open or closed.

For the results of an experiment to be useful, the observers of the test have to agree on its interpretation; otherwise, subjective interpretation by each observer can come into play, which is why good reliability is important. Reliability can be broken down into two types: intra-rater reliability and inter-rater reliability.
Intra-rater reliability is related to the degree of agreement among multiple measurements made by the same person.

Inter-rater reliability is related to the degree of agreement between two or more raters.

What is Cohen’s Kappa?
Cohen’s kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories.¹
A simple way to think of this is that Cohen’s Kappa is a quantitative measure of reliability for two raters rating the same thing, corrected for how often the raters may agree by chance.
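In practice you won’t usually compute this by hand; scikit-learn ships an implementation. Below is a minimal sketch, assuming two hypothetical arrays of rater labels (the names rater_1 and rater_2 are purely for illustration).

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two raters scoring the same ten items
# (1 = channel open, 0 = channel closed)
rater_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
rater_2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Agreement between the two raters, corrected for chance agreement
print(cohen_kappa_score(rater_1, rater_2))  # ~0.58
```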
Evaluating Cohen’s Kappa
The value of kappa can be less than 0 (negative). A score of 0 means that the agreement between the raters is no better than random chance, whereas a score of 1 means there is complete agreement between them. A score below 0 therefore means there is less agreement than would be expected by chance. Below, I will show you the formula to work this out, but it is important that you acquaint yourself with figure 4 to gain a strong understanding.

The reason I have highlighted two of the cells will become clear in a moment, but for now, let me break down each cell.
A => The total number of instances that both raters said were correct. The raters are in agreement.
B => The total number of instances that Rater 2 said were incorrect, but Rater 1 said were correct. This is a disagreement.
C => The total number of instances that Rater 1 said were incorrect, but Rater 2 said were correct. This is also a disagreement.
D => The total number of instances that both raters said were incorrect. The raters are in agreement.
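The actual counts inside figure 4’s grid aren’t reproduced in the text, so for the worked numbers in the next few steps I will assume a hypothetical grid with A = 20, B = 5, C = 10 and D = 15, which gives 50 instances in total.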
In order to work out the kappa value, we first need to know the probability of agreement (this explains why I highlighted the agreement diagonal). This is derived by adding up the number of instances in which the raters agree and dividing it by the total number of instances. Using the example from figure 4, that would mean (A + D) / (A + B + C + D).
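With the hypothetical counts above, that works out to (20 + 15) / 50 = 0.7, meaning the raters agree on 70% of the instances.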

Perfect! The next step is to work out the probability of random agreement. Using figure 4 as a guide, this is the total number of times that Rater 1 said correct divided by the total number of instances, multiplied by the total number of times that Rater 2 said correct divided by the total number of instances, added to the total number of times that Rater 1 said incorrect divided by the total number of instances, multiplied by the total number of times that Rater 2 said incorrect divided by the total number of instances. That is a lot of information to take in, so in figure 6 I have formulated this equation using the grid above.
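Sticking with the hypothetical counts: Rater 1 said correct 25 times out of 50 and Rater 2 said correct 30 times out of 50, so the probability of them agreeing on correct by chance is (25/50) × (30/50) = 0.3. Likewise, the probability of them agreeing on incorrect by chance is (25/50) × (20/50) = 0.2, giving a probability of random agreement of 0.3 + 0.2 = 0.5.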

Lastly, the formula for Cohen’s Kappa is the probability of agreement minus the probability of random agreement, divided by 1 minus the probability of random agreement.
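With the hypothetical numbers, that gives (0.7 - 0.5) / (1 - 0.5) = 0.4. To tie the three steps together, here is a minimal Python sketch, assuming the hypothetical counts A = 20, B = 5, C = 10, D = 15 from above; the hand-rolled value is cross-checked against scikit-learn’s cohen_kappa_score.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical counts for the 2x2 grid (rows = Rater 1, columns = Rater 2)
A, B, C, D = 20, 5, 10, 15
N = A + B + C + D                     # 50 instances in total

p_agree = (A + D) / N                 # probability of agreement = 0.7
p_random = ((A + B) / N) * ((A + C) / N) + ((C + D) / N) * ((B + D) / N)  # = 0.5
kappa = (p_agree - p_random) / (1 - p_random)   # (0.7 - 0.5) / (1 - 0.5) = 0.4

# Rebuild the raw labels from the counts to cross-check against scikit-learn
rater_1 = np.array([1] * (A + B) + [0] * (C + D))   # 1 = correct, 0 = incorrect
rater_2 = np.array([1] * A + [0] * B + [1] * C + [0] * D)

print(kappa, cohen_kappa_score(rater_1, rater_2))   # both print 0.4
```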

Great! You are now able to distinguish between reliability and validity, explain Cohen’s Kappa and evaluate it. This statistic is very useful, and now that I understand how it works, I believe it may be under-utilized when optimizing algorithms against a specific metric. Additionally, Cohen’s Kappa does a good job of handling both multi-class and imbalanced-class problems.
P.S. If there is something related to Data Science that you want me to cover, you can direct message me on Twitter @KurtisPykes or leave a response to this post.
¹ Wikipedia Definition. Cohen’s Kappa. https://en.wikipedia.org/wiki/Cohen%27s_kappa
MINT TMS Tutorials by Christian Hollmann. (Apr 5, 2015). Kappa Coefficient. https://www.youtube.com/watch?v=fOR_8gkU3UE
Zaiontz, Charles. Cohen’s Kappa. http://www.real-statistics.com/reliability/interrater-reliability/cohens-kappa/