
C is for Classification

Differences between classification and regression tasks.

A brief overview of the distinction between classification and regression problems.

Photo by Alexander Schimmeck on Unsplash

What is classification?

Classification is one of two types of supervised machine learning tasks (i.e., tasks where we have a labeled dataset), the other being regression.

Key point to remember: supervised learning tasks use features to predict targets, or, in non-tech speak, they use attributes/characteristics to predict something. For instance, we can take a basketball player’s height, weight, age, foot speed, and other attributes to predict how many points they’ll score or whether they will be an all-star.

So what’s the difference between the two?

  • Regression tasks predict a continuous value (e.g., how many points someone will score)
  • Classification tasks predict a discrete category (e.g., whether someone will be an all-star)
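
To make the contrast concrete, here is a minimal sketch in plain Python using a simple nearest-neighbours rule. The player data is invented purely for illustration, and the helper names (`nearest`, `predict_points`, `predict_all_star`) are hypothetical. The same features feed both tasks; only the target, and how the neighbours' targets are combined, differs.

```python
from collections import Counter
from statistics import mean

# Made-up players: (height_cm, weight_kg, age), points per game, all-star?
players = [
    ((198, 95, 27), 24.5, True),
    ((201, 100, 30), 21.0, True),
    ((185, 82, 22), 8.2, False),
    ((190, 88, 35), 6.5, False),
    ((206, 110, 25), 18.3, True),
    ((183, 80, 29), 9.9, False),
]

def nearest(features, k=3):
    """Return the k players with the closest features.

    Uses squared Euclidean distance on unscaled features, for simplicity.
    """
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], features))
    return sorted(players, key=dist)[:k]

def predict_points(features, k=3):
    """Regression: average the neighbours' continuous target."""
    return mean(points for _, points, _ in nearest(features, k))

def predict_all_star(features, k=3):
    """Classification: majority vote over the neighbours' labels."""
    votes = Counter(label for _, _, label in nearest(features, k))
    return votes.most_common(1)[0][0]

new_player = (200, 98, 26)
print(predict_points(new_player))    # a number on a continuous scale
print(predict_all_star(new_player))  # one of a fixed set of labels
```

The regression output can be any number within the range of the averaged values, while the classification output can only ever be one of the labels already seen in the data.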

How do I know which technique to use?

Answer the following question:

"Does my target variable have an order to it?"

For example, my project predicting the recommended age of a reader was a regression task because I was predicting a precise age (e.g., 4 years old). If I had instead been trying to identify whether a book was suitable for teens or not, it would have been a classification task, since the answer would have been either yes or no.

OK, so classification is only for yes/no, true/false, cat/dog problems, right?

Nope, those are just the easy examples 😄

Example 1: Sorting People into Groups

Imagine a scenario where you get a new batch of students every year and have to sort them into houses based on their personality traits.

Photo by emrecan arık on Unsplash

In this situation, the houses have no sequence or ranking. Sure, Harry definitely didn’t want to be housed in Slytherin, and the Sorting Hat clearly took that into consideration, but that doesn’t mean Slytherin is "closer" to Gryffindor than to any other house in the way that 25 is closer to 30 than it is to 19.

Example 2: Applying Labels

Similarly, if we had a dataset containing the ingredients of dishes and attempted to predict each dish’s country of origin, we’d be solving a classification problem. Why? Because country names have no numerical order. We can say that Russia has the largest land area on Earth or that China is among the most populous countries, but those are attributes of the country (i.e., land size and population), not something intrinsic to its name.
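
One way to see why names carry no numerical order: if we coded countries as integers, we would invent distances that mean nothing. The sketch below uses an arbitrary list of countries chosen for illustration, and `one_hot` and `squared_distance` are hypothetical helper names. One-hot coding is a common alternative that keeps every pair of different labels equally far apart.

```python
countries = ["China", "Italy", "Mexico", "Russia"]  # arbitrary example labels

# Naive integer coding: the numbers depend only on list position.
as_integer = {name: i for i, name in enumerate(countries)}

def one_hot(name):
    """One slot per label, with no ordering implied."""
    return [1 if name == c else 0 for c in countries]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Integer codes claim China is "closer" to Italy than to Russia.
print(abs(as_integer["China"] - as_integer["Italy"]))   # 1
print(abs(as_integer["China"] - as_integer["Russia"]))  # 3

# One-hot codes keep every pair of different countries equally far apart.
print(squared_distance(one_hot("China"), one_hot("Italy")))   # 2
print(squared_distance(one_hot("China"), one_hot("Russia")))  # 2
```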

Got it! Numbers are regression, words are classification

Sorry, but no.

Think back to the book recommendation project I mentioned earlier and answer the following question: if I wanted to predict whether a book was for

  • small children (2–5 years old)
  • primary aged children (6–10)
  • tweens (11–12)
  • young adults (13–17)
  • adults (18+)

what would I use: regression or classification?

Photo by Markus Winkler on Unsplash

Well, given that the labels have a clear order, you’d want to treat this as a regression problem by coding the groups in sequence, from ‘small children’ as ‘1’ up to ‘adults’ as ‘5’.
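
As a rough sketch of that coding, the snippet below maps the five groups listed earlier to 1 through 5 and shows one possible way to turn a regression output back into a label. The `band_from_prediction` helper is hypothetical, and the round-and-clamp rule it uses is an assumption made for illustration, not something the original project prescribes.

```python
# Reader groups in order, coded 1..5.
bands = [
    "small children",
    "primary aged children",
    "tweens",
    "young adults",
    "adults",
]
code_of = {name: i for i, name in enumerate(bands, start=1)}

def band_from_prediction(value):
    """Map a regression output (any number) to the nearest band label.

    Rounds to the nearest code, then clamps into the valid 1..5 range.
    """
    code = min(max(round(value), 1), len(bands))
    return bands[code - 1]

print(code_of["small children"], code_of["adults"])  # 1 5
print(band_from_prediction(3.8))  # young adults
print(band_from_prediction(7.2))  # adults (clamped to the top band)
```

Because the codes preserve the order of the groups, a prediction of 3.8 is treated as nearer to ‘young adults’ than to ‘tweens’, which is exactly the kind of closeness the house and country examples lack.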

Conclusion

Thank you so much for reading! I hope this brief introduction to the key differences between classification and regression tasks has cleared up the questions you had and/or solidified what you already knew.

In closing, if you take away nothing else, I hope you never forget the following:

Always start by asking ‘What am I attempting to predict?’

Why? Because once that question is answered, the rest becomes far simpler; as Carl Jung once stated, "To ask the right question is already half the problem solved."

Further Reading

Explaining supervised learning to a kid (or your boss)


Originally published at https://educatorsrlearners.github.io on May 24, 2020.

