YOLO Is Sheep-Obsessed: Environmental Context in Unified Object Detection

James Browning
Towards Data Science
3 min read · Dec 26, 2018


I recently adapted Joseph Redmon’s YOLO algorithm for MATLAB (so far without non-max suppression) and spent a fun hour testing it on a series of images. In general, the algorithm works great:
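A quick aside for the curious: non-max suppression is the post-processing step that collapses the many overlapping boxes YOLO predicts for a single object down to one detection. My MATLAB port doesn’t include it yet, but here is a minimal NumPy sketch of the standard greedy version (an illustration, not my code, assuming boxes given as [x1, y1, x2, y2] corners with one confidence score each):

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop anything that
    overlaps it by more than iou_threshold, repeat with what's left.

    boxes  : (N, 4) array of [x1, y1, x2, y2] corners
    scores : (N,) array of confidence scores
    returns: indices of the boxes to keep
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of box i with every remaining box.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]
    return keep
```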

However, it doesn’t take long to realize that YOLO has a bit of an obsession with sheep. Note the mislabeling of the shepherd and dogs in the two images below:

Sheep, Sheep, Sheep….
Sheep, sheep, and some more sheep…

That guard dog looks like it could use a vacation. Somewhere she can feel like a dog, and not just another sheep in the flock:

Impressive Photoshop skills, I know.

That’s better. And the farmer and his collie? How about a trip to the moon?

A shepherd and his sneaky looking cow on the old Apollo studio lot in Hollywood (#moonlandingfaked)

Well, to be fair, collies do look like sneaky cows.

It appears that this unified object detector has learned to make its decisions within the context of an object’s environment. Unlike region-proposal CNN detectors, YOLO feeds the entire image, including the surrounding visual context, through the network at once. I like to think of a CNN as learning a series of progressively more complex visual features in order to classify an object. However, I think YOLO has also learned a convenient logical shortcut to the problem of detecting sheep: sheep are generally found in flocks, so an object close to a sheep is more likely to be a sheep.
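To make the “whole image at once” point concrete, here is a hedged sketch of running a pretrained YOLOv3 through OpenCV’s dnn module (the file paths and the 0.5 cutoff are placeholders; the calls should exist in any recent OpenCV build). A single forward pass produces every box prediction in the image, and each one is computed with the full scene, flock and all, in view:

```python
import cv2
import numpy as np

# Placeholder paths: the standard public Darknet config and weights files.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

img = cv2.imread("sheep_field.jpg")

# The whole image is resized and pushed through the network in ONE pass,
# so every grid cell's prediction is computed with the full scene in view.
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

for out in outputs:          # one array per YOLO output scale
    for row in out:          # row = [cx, cy, w, h, objectness, 80 class scores]
        class_scores = row[5:]
        class_id = int(np.argmax(class_scores))
        if class_scores[class_id] > 0.5:     # arbitrary confidence cutoff
            # class_id indexes the standard coco.names list, where
            # "person", "dog", "sheep", and "cow" all live side by side.
            print("class", class_id, "box (normalized cx,cy,w,h):", row[:4])
```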

Let’s test whether the visual context YOLO is responding to is really the flock of sheep, or a correlated visual cue such as the surrounding landscape. Here is our shepherd and his species-confused collie in a field. Not just any field, but a field in Australia that looks an awful lot like the sort of hardscrabble place you’d be likely to find sheep:

Interesting. A shepherd without his sheep is still a person — even in a field (and a collie is still a sneaky cow).
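If you’d like to run this kind of context probe more systematically than my Photoshop experiments, here is a sketch of one way to automate it: composite the same alpha-matted cut-out onto a series of backgrounds and compare what the detector calls it. The `detect` callable here is a hypothetical stand-in for whatever detector wrapper you have (for example, one built around the OpenCV forward pass sketched above):

```python
from PIL import Image

def paste_on_background(cutout_path, background_path, position=(50, 50)):
    """Composite an alpha-matted cut-out (e.g. the collie) onto a new scene."""
    bg = Image.open(background_path).convert("RGB")
    cutout = Image.open(cutout_path).convert("RGBA")
    bg.paste(cutout, position, mask=cutout)   # alpha channel used as the mask
    return bg

def context_probe(detect, cutout_path, background_paths):
    """Same object, different surroundings: does the label change?

    `detect` is any callable taking a PIL image and returning a list of
    (label, confidence) pairs; a stand-in for your detector wrapper.
    """
    for bg_path in background_paths:
        composite = paste_on_background(cutout_path, bg_path)
        print(bg_path, "->", detect(composite))
```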

This behavior of the YOLO detection algorithm reminds me of an argument I have heard a lot lately regarding the potential future dangers of AI. By not specifying in sufficient detail how a learning algorithm should find a solution, or by not putting sufficient bounds on the algorithm’s solution space, we may end up with solutions to complex, important problems that are not to our liking. An example I heard recently (perhaps from AI party pooper Elon Musk?) is that an algorithm may “decide” that the easiest way to rid the world of cancer is to rid the world of all animals.

Of course, this sheep example of an unexpected shortcut to object recognition is far less dire. Unless, that is, the algorithm is being used to automatically sort livestock for butchering in a stockyard…

Has anyone else noticed similar errors in YOLO or other unified object detection algorithms?
