
Every moment, our sensory system collects a myriad of signals and sends them to the brain in a format it can understand. Within a split second of looking at a scene, we can identify the individual objects in our view. Even in a crowd, we can tell people from trees, and we can quickly spot familiar faces. How do we do this? And can a machine see the way we see?
Mechanism of Human Vision
The basic mechanism of our vision is as follows: light from the external visual field enters our eyes and is projected onto the receptive fields of photoreceptor cells, which convert it into electrical signals. The signals from both eyes meet at a hub called the optic chiasm, where they are routed to one hemisphere or the other according to the side of the visual field they came from.
These separated signals travel along different routes before arriving at the cortex. One main route, the tectopulvinar pathway, detects spatial information; another, the geniculostriate pathway, processes patterns and colors.

So now our brain has composed the light signals into patterns, colors, orientations, locations, and so on. The primary visual cortex at the back of the brain receives these signals and distributes them deeper into the brain, one of the primary paths running through the inferior temporal cortex. As the signals move through the inferior temporal cortex, they converge and represent increasingly complex features. The brain can then infer what this combination of signals means, based on the knowledge we have already stored. This is, roughly, the process our brain goes through to recognize and identify an object.
One of the key insights here comes from the 1981 Nobel Prize winners in Physiology or Medicine, David H. Hubel and Torsten N. Wiesel. They discovered that the receptive fields that elicit responses at each layer of these sensory pathways are built on top of the prior layers, adapting to inputs of a different scale at each step of the way. In other words, each stage of transmission consolidates the responses of the previous one into a more complex structure before passing its own response signal onward: from light, to edges and colors, to objects within a scene, to motions, and so on…
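To make the idea of an orientation-selective receptive field concrete, here is a minimal, self-contained sketch (not from Hubel and Wiesel themselves; the filter values and toy image are illustrative assumptions): a hand-crafted 3x3 filter that, when slid across an image, responds most strongly at vertical edges, loosely analogous to a simple cell's receptive field.

```python
import numpy as np

# A toy 8x8 "image": a bright vertical bar on a dark background.
image = np.zeros((8, 8))
image[:, 3:5] = 1.0

# A hand-crafted filter that responds to vertical edges, loosely
# analogous to an orientation-selective receptive field.
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

def convolve2d(img, kernel):
    """Slide the kernel over the image (valid padding) and record
    how strongly each patch matches the kernel's pattern."""
    kh, kw = kernel.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

response = convolve2d(image, vertical_edge)
print(response)  # strong positive/negative responses flank the bar's edges
```

Stacking such filters layer upon layer is exactly the "growing receptive field" idea: a unit in a deeper layer pools over many first-layer responses, so it effectively sees, and can respond to, a larger and more complex region of the original image.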

This idea of sequential, accumulating abstraction is the basis of the architecture of the Convolutional Neural Network (CNN, ConvNet). A CNN applies the idea of growing receptive fields at each convolution layer, so that successive layers process increasingly complex inputs and build up a hierarchical perception of the image. It then imitates the abstraction in our visual processing with pooling, which summarizes each local neighborhood of responses. Finally, we can match semantic labels to the detected segments, which we will discuss in a separate post. A minimal sketch of this conv-then-pool hierarchy follows.
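Here is that hierarchy written out in PyTorch (one illustrative framework choice; the layer sizes, input resolution, and 10-class output are arbitrary assumptions, not anything prescribed by the post):

```python
import torch
import torch.nn as nn

# A minimal conv -> pool hierarchy; sizes are illustrative assumptions.
model = nn.Sequential(
    # Early layer: small receptive fields pick up edges and colors.
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),  # pooling abstracts away exact positions

    # Deeper layer: each unit now "sees" a larger region of the
    # original image, so it can respond to more complex patterns.
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),

    # Final stage: flatten the pooled features and map them to
    # class scores, i.e., match labels to what was detected.
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),
)

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
print(model(x).shape)           # torch.Size([1, 10])
```

Note how each pooling step halves the spatial resolution: by the second convolution, one unit's receptive field already covers a sizable patch of the original image, mirroring the convergence described for the inferior temporal cortex.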

In a convolutional neural network, just as in a simple artificial neural network (for a refresher: HERE), we update the weights of each convolution filter so that it learns the most predictive parameters for its small sub-region of the input. The filter responses are then abstracted to output gists that capture the salient properties of each region.
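For completeness, a minimal sketch of that weight update, again in PyTorch and under stated assumptions (random data standing in for a real image and target, a plain SGD optimizer, and an MSE loss chosen purely for illustration):

```python
import torch
import torch.nn as nn

# One 3x3 convolution filter whose weights we will update.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1)
optimizer = torch.optim.SGD(conv.parameters(), lr=0.1)

x = torch.randn(1, 1, 8, 8)       # fake single-channel input
target = torch.randn(1, 1, 8, 8)  # fake target response map

for step in range(5):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(conv(x), target)
    loss.backward()    # gradients flow back into the 3x3 filter weights
    optimizer.step()   # nudge the filter toward a better local detector
    print(step, loss.item())
```

The same loop, scaled up to many filters, many layers, and a classification loss, is what tunes every receptive field in the network at once.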

We discussed the basics of human vision and how they relate to Computer Vision. But let's take a moment to think about the process. When we look at the picture above, we see a dog; even though there is far more in the frame than the dog, we prioritize what is in our view and sequentially identify what matters most. The key to successfully recognizing the scene is our attention. In the next post, we will compare human attention to attentional networks in more depth, and discuss how predictive coding theory (covered in the previous post HERE) has been applied to enhance object recognition networks.
References
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1), 106–154.
Previously on this series…
Happy Learning!
