Deep Learning (DL) is a type of machine learning (ML) that tackles problems that are hard for computer programs to handle and, at the same time, easy for us, such as speech and image recognition [1].
When we see a car, we know in a fraction of a second what it is. The shape, the angle and the lighting don’t matter. It’s easy for us to identify the patterns and distinguish the car from other objects. Why can’t traditional machine learning systems do this, when they work well for tasks such as recommendation systems?
The difficulty lies in the vast and unpredictable number of variations that an image or a spoken word can have. A recommendation system is, in general, about knowing one person’s preferences and those of other people [2]. For an image or speech, it’s more complicated.
Imagine that you describe a car as having four wheels, windows and one steering wheel, and you train the program by feeding it this information. Easy, right? But then, at a particular time of day, a shadow hides two of the wheels, or a person is leaning on the car. Think about speech recognition! How many ways can a certain word be pronounced? Imagine all the different accents. This makes it hard for machines. Just when the program seems to have learned the object, another variation comes up.
Another example is recognising handwritten digits. In how many ways can one write something?! See Figure 1. You will see that, for example, the nine is written in different ways. The same applies to the other digits. These variations result in completely different pixel values.
![Figure 1. Part of the digits from MNIST dataset [1]](https://towardsdatascience.com/wp-content/uploads/2021/01/1XoqbFSIKbYG3F8e_voPsxA.png)
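To make the point about pixel values concrete, here is a minimal sketch. It assumes two made-up 5×5 black-and-white drawings of the digit one (they are not taken from MNIST), where the second is the same stroke moved one column to the right. The names `ONE_CENTERED`, `ONE_SHIFTED`, `to_pixels` and `differing_pixels` are hypothetical and used only for illustration.

```python
# Two hypothetical 5x5 drawings of the digit 1 ("1" = ink, "0" = blank).
# The second drawing is the same stroke shifted one column to the right.
ONE_CENTERED = [
    "00100",
    "00100",
    "00100",
    "00100",
    "00100",
]
ONE_SHIFTED = [
    "00010",
    "00010",
    "00010",
    "00010",
    "00010",
]


def to_pixels(rows):
    """Flatten the rows of a drawing into one list of 0/1 pixel values."""
    return [int(ch) for row in rows for ch in row]


def differing_pixels(a, b):
    """Count the positions where two pixel lists disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


centered = to_pixels(ONE_CENTERED)
shifted = to_pixels(ONE_SHIFTED)

ink_total = sum(centered) + sum(shifted)
print("ink pixels in both drawings:", ink_total)
print("pixels that differ:", differing_pixels(centered, shifted))
```

Both drawings are clearly the same digit to us, yet every single ink pixel lands in a different position. A program that compares raw pixels one by one would see two unrelated images.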
Deep learning solves the problem by breaking down the object into smaller and simpler components. Although the images appear different, they share parts that look the same or similar. There is a pattern.
It handles the object/concept as a layered structure, with each layer being an abstraction built from the simpler layer before it. It can go deep (hence the name), depending on how granular the composition gets [1]. For example, a Medium story can be divided into paragraphs, which can then be divided into phrases, then words, and lastly characters.
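The story analogy can be sketched in a few lines of code. This is only an illustration of the layered idea, not of how a neural network works internally. The function name `decompose` is hypothetical, and splitting phrases on `". "` is a deliberately naive assumption.

```python
def decompose(story):
    """Split a story into paragraphs -> phrases -> words -> characters."""
    tree = []
    for paragraph in story.split("\n\n"):      # layer 1: paragraphs
        phrases = []
        for phrase in paragraph.split(". "):   # layer 2: phrases (naive split)
            # layer 3: words, each stored as layer 4: a list of characters
            words = [list(word) for word in phrase.split()]
            phrases.append(words)
        tree.append(phrases)
    return tree


story = "Deep nets learn. They stack layers.\n\nEach layer is simple."
tree = decompose(story)

print("paragraphs:", len(tree))
print("phrases in the first paragraph:", len(tree[0]))
print("characters of the first word:", tree[0][0][0])
```

Each level is described only in terms of the simpler level beneath it, which is the same idea behind stacking layers in a deep network.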

There are three types of layers: the input, the hidden ones and the output. They are made of nodes, or neurons (an analogy to the brain). In the input layer, each node is assigned one pixel of the image.
- Input: It’s the visible image. It contains all the pixels. If the image is a 10×10 grid, the layer will contain 100 nodes.
- Hidden: The parts that make up the image. They are called ‘hidden’ because they are not part of the data given [1]. The first hidden layer can contain the edges; the second, the corners and contours; and later layers, the more concrete parts. The number of hidden layers can vary, and so can the number of nodes in each one. It depends on which parts/concepts are useful to explain the observed object [1].
- Output: The object identified. It has one node for each possible object.
Figure 3 below demonstrates the layered structure. It’s a high-level representation, so it doesn’t show all the nodes or actual values.
![Figure 3. Layered structure [1].](https://towardsdatascience.com/wp-content/uploads/2021/01/1IYlQAYf18xF5ou41xTG5mw.png)
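The layered structure can also be sketched in code. This is a minimal illustration, not a trained model. It assumes a 10×10 image (100 input nodes), two hidden layers of 16 and 8 nodes, and 3 output nodes for three possible objects. The layer sizes, the random weights, the ReLU and softmax functions and all the names below are assumptions chosen for the example.

```python
import math
import random


def dense(inputs, weights, biases):
    """Each output node is a weighted sum of all the input nodes plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    """Keep positive values and replace negative ones with zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into positive numbers that add up to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_in, n_out, rng):
    """Create untrained weights and zero biases for one layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(pixels, layers):
    """Pass the pixels through the hidden layers and then the output layer."""
    activations = pixels
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(activations, weights, biases))


rng = random.Random(0)
LAYER_SIZES = [100, 16, 8, 3]  # input, hidden 1, hidden 2, output (assumed sizes)
layers = [random_layer(a, b, rng) for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:])]

# A made-up 10x10 grayscale image, flattened into 100 input nodes.
image = [[rng.random() for _ in range(10)] for _ in range(10)]
pixels = [p for row in image for p in row]

scores = forward(pixels, layers)
print("input nodes:", len(pixels))
print("output nodes:", len(scores))
```

Because the weights are random, the scores mean nothing yet. Training is what adjusts the weights so that the hidden layers pick up useful parts and the right output node gets the highest score.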
Final Thoughts
Thanks for reading. I hope you found it useful. Although it doesn’t explain the math and the details of the algorithm, it provides a good foundation to get you started. At least, that was my intent.
One last thing. It’s common to read that DL tries to mimic how we humans think. That’s correct to some extent. However, we don’t know for sure how we recognise objects (as of this writing, according to my research). There is little understanding [3] of the algorithm used by our brains. Nevertheless, I believe this is not important as long as machines can do it right. After all, a plane can fly without exactly following the mechanics of birds.
References
[1] Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning, 2016
[2] A Machine-Learning Item Recommendation System for Video Games https://scholar.google.com/scholar?q=recommendation+system+machine+learning&hl=en&as_sdt=0&as_vis=1&oi=scholart#d=gs_qabs&u=%23p%3D8hjsHbe4fwkJ
[3] How does the brain solve visual object recognition? https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3306444/