Image recognition: we can not get there from here with what we’ve got

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at:


Leopards (or jaguars) are complex 3-dimensional shapes with quite a lot of degrees of freedom (considering all the body parts that can move independently). These shapes can produce a lot of different 2D contours when projected onto the camera sensor: sometimes you can see a distinct silhouette featuring a face and a full set of paws, and sometimes it’s just a back and a curled tail. Such complex objects can be handled by a CNN very efficiently by using a simple rule: “take all these little spotty-pattern features and collect as many matches as possible from the entire image”. A CNN’s local filters sidestep the problem of differing 2D shapes by not trying to analyze the leopard’s spatial structure at all: they just look for black spots, and, thanks to nature, there are a lot of them in any leopard picture. The good thing here is that we don’t have to care about the object’s pose and orientation, and the bad thing is that, well, we are now vulnerable to some specific kinds of sofas.
And this is really not good. A CNN’s use of local features lets it achieve transformation invariance, but this comes at the price of knowing neither the object’s structure nor its orientation. A CNN cannot distinguish between a cat sitting on the floor and a cat sitting on the ceiling upside down, which might be fine for Google image search, but for any other application involving interactions with actual cats it’s kinda not.
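The "collect as many spotty matches as possible" rule is easy to caricature in code. The sketch below is a toy model, not how a real CNN is trained or built: it assumes a single hand-written 3x3 "spot" template and a global count of exact matches, and every name in it (SPOT, spot_score, leopard, sofa) is made up for illustration. It is only meant to show why a score pooled over the whole image cannot tell apart images that contain the same local features in different arrangements.

```python
# Toy model: one local "spot" filter followed by global pooling.
# Every name here is hypothetical; a real CNN learns many filters.

SPOT = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]

def match_at(image, row, col):
    """Return True if the 3x3 patch at (row, col) equals the SPOT template."""
    return all(
        image[row + dr][col + dc] == SPOT[dr][dc]
        for dr in range(3)
        for dc in range(3)
    )

def spot_score(image):
    """Count template matches over every position (global sum pooling)."""
    rows, cols = len(image), len(image[0])
    return sum(
        match_at(image, r, c)
        for r in range(rows - 2)
        for c in range(cols - 2)
    )

def blank(rows, cols):
    """Create an all-zero image of the given size."""
    return [[0] * cols for _ in range(rows)]

def stamp(image, row, col):
    """Draw one spot with its top-left corner at (row, col)."""
    for dr in range(3):
        for dc in range(3):
            image[row + dr][col + dc] = SPOT[dr][dc]

# "Leopard": three spots laid out along a diagonal.
leopard = blank(12, 12)
for r, c in [(0, 0), (4, 4), (8, 8)]:
    stamp(leopard, r, c)

# "Sofa": the same three spots lined up in a row.
sofa = blank(12, 12)
for r, c in [(4, 0), (4, 4), (4, 8)]:
    stamp(sofa, r, c)

# "Cat on the ceiling": the leopard image flipped top to bottom.
upside_down = leopard[::-1]

print(spot_score(leopard), spot_score(sofa), spot_score(upside_down))
```

All three images get the same score, although they are different images. The flip case works here only because the template happens to be symmetric; real learned filters and real pooling are less extreme than this, so treat it as an exaggeration of the tendency described above rather than a faithful model.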
If that doesn’t look convincing, take a look at Hinton’s paper from 2011, where he says that convolutional networks are doomed for precisely the same reason. The rest of the paper is about an alternative approach, his capsule theory, which is definitely worth reading too.
We’re doing it wrong

Maybe not all wrong, and of course, convolutional networks are extremely useful things, but think about it: sometimes it almost looks like we’re already there. We’re using huge datasets like ImageNet and organizing competitions and challenges, where we have, for example, decreased the MNIST recognition error rate from 0.87 to 0.23 (in three years), even though no one really knows what error rate a human brain can achieve. There’s a lot of talk about GPU implementations, as if it’s just a matter of computational power now, and the theory is all fine. It’s not. And the problem won’t be solved by collecting even larger datasets and using more GPUs, because leopard print sofas are inevitable. There is always going to be an anomaly; lots of them, actually, considering all the things painted in different patterns. Something has to change. Good recognition algorithms have to understand the structure of the image and be able to find its elements, like paws or a face or a tail, despite the issues of projection and occlusion.