
Table of contents
- Introduction: attention in the human brain
- Attention mechanisms in deep learning
  - 2.1. RNNSearch
  - 2.2. What exactly are keys, queries, and values in attention mechanisms?
- Categorization of attention mechanisms
  - 3.1. The softness of attention
  - 3.2. Forms of input feature
  - 3.3. Input representation
  - 3.4. Output representation
- Research frontiers and challenges
- Conclusion
- References
1. Introduction: attention in the human brain
Attention is a cognitive and behavioral function that gives us the ability to selectively concentrate on a tiny portion of the incoming information that is relevant to the task at hand. It allows the brain to confine the volume of its inputs by ignoring irrelevant perceptible information and selecting high-value information. When we observe a scene with a specific part that is important to the task we are doing, we extract that part and process it more meticulously; we can also learn to focus on those parts more efficiently when similar scenes appear again.
According to J.K. Tsotsos et al. [1], attention mechanisms can be categorized into two classes:
- bottom-up unconscious attention
- top-down conscious attention
The first category is bottom-up unconscious attention – saliency-based attention – which is stimulated by external factors. For example, louder voices can be heard more easily than quieter ones. We can produce similar behavior in deep learning models using max-pooling and gating mechanisms, which pass larger (i.e., more salient) values to the next layer (S. Hochreiter et al. [2]). The second type is top-down conscious attention – focused attention – which has a predetermined goal and follows specific tasks. Using focused attention, we can therefore concentrate on a specific object or phenomenon consciously and actively.
Attention can be described as the allocation of a cognitive resource with limited processing capacity [3]. It manifests as an attentional bottleneck, which limits the amount of information passed on to subsequent processing steps. Hence, it can drastically enhance performance by focusing on the more important parts of the information. As a result, efforts have been made to reproduce the human brain's attention mechanisms and incorporate spatial and temporal attention into various tasks. For instance, researchers brought this concept to machine vision by introducing computational visual saliency models that capture potentially salient regions of an image [4]. By adaptively selecting a sequence of regions, V. Mnih et al. proposed a novel recurrent model that extracts the most relevant regions of an image and processes only the selected ones. Fig. 2 illustrates an example of how visual attention operates. Bahdanau et al. [5] used attention mechanisms to allow a model to automatically search for the parts of the source sentence needed for translating the current target word.

Attention mechanisms have become a crucial part of modern neural network architectures used in various tasks, such as machine translation (Vaswani et al. [7]), text classification, image caption generation, action recognition, speech recognition, recommendation, and graph-based learning. They also have several use cases in real-world applications and have achieved splendid success in autonomous driving, medicine, human-computer interaction, emotion detection, finance, meteorology, behavior and action analysis, and industry.
2. Attention mechanisms in deep learning
Researchers in machine learning have long been inspired by the biological fundamentals of the brain, and although it is still not entirely clear how the human brain attends to different surrounding phenomena, they have tried to model it mathematically. To delve into how deep learning incorporates attention mechanisms, I will go through Bahdanau's attention architecture [5], a machine translation model. Fig. 1 shows typical common attention mechanisms.
2.1. RNNSearch
Prior to the model proposed by Bahdanau et al. [5], most architectures for neural machine translation fell under the umbrella of encoder-decoder models. These models encode a source sentence into a fixed-length vector, from which the decoder generates a translation in the target language (see fig. 3). One main issue with this approach is that when sentences become longer than those in the training corpus, it becomes difficult for the encoder to cope. Therefore, the authors proposed an efficient solution to this challenge: a new method that jointly learns translation and alignment. The idea is that, while translating a word at each step, the model (soft-)searches for the most relevant information located at different positions in the source sentence. It then generates the translation of the source word with respect to the context vector of these relevant positions and the previously generated words jointly.

RNNSearch comprises a bidirectional recurrent neural network (BiRNN) as its encoder and a decoder that emulates searching through the source sentence while decoding the translation (see fig. 4).
The inputs $(x_1, \ldots, x_T)$ are fed into the forward RNN to produce the forward hidden states

$$\overrightarrow{h}_i = f\left(x_i, \overrightarrow{h}_{i-1}\right),$$

and the backward RNN reads the inputs in reverse order (from $x_T$ to $x_1$), resulting in the backward hidden states

$$\overleftarrow{h}_i = f\left(x_i, \overleftarrow{h}_{i+1}\right).$$

The model generates an annotation for $x_i$ by concatenating the forward and backward hidden states, resulting in

$$h_i = \left[\overrightarrow{h}_i^{\top}; \overleftarrow{h}_i^{\top}\right]^{\top}.$$

Another recurrent neural network (RNN) and an attention block make up the decoder. The attention block computes the context vector $c_t$, which represents the relationship between the current output and the entire input. The context vector is computed as a weighted sum of the annotations $h_j$ at each time step:

$$c_t = \sum_{j=1}^{T} \alpha_{tj}\, h_j,$$

where $\alpha_{tj}$ is the attention weight for each annotation $h_j$, computed as

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T} \exp(e_{tk})},$$

and $e_{tj}$ is

$$e_{tj} = a(s_{t-1}, h_j),$$

where $a$ is an alignment model that scores how well the annotation $h_j$ matches the next hidden state $s_t$, given the previous state $s_{t-1}$. The decoder state $s_t$ is calculated as

$$s_t = f(s_{t-1}, y_{t-1}, c_t).$$

The model then generates the most probable output $y_t$ at the current step:

$$p(y_t \mid y_1, \ldots, y_{t-1}, \mathbf{x}) = g(y_{t-1}, s_t, c_t).$$
Intuitively, this formulation lets the model selectively focus on the important parts of the input sequence at each time step and spread the information of the source sentence across the entire sequence of annotations instead of compressing it into a single fixed-length vector.
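To make the formulation concrete, here is a minimal NumPy sketch of a single decoding step of this additive (RNNSearch-style) attention. The dimensions, the random weights, and the specific alignment parameterization $v^\top \tanh(W s_{t-1} + U h_j)$ are illustrative assumptions, not a trained model.

```python
# One decoding step of additive (Bahdanau-style) attention, toy dimensions.
import numpy as np

rng = np.random.default_rng(0)

T, d_h, d_s, d_a = 6, 8, 8, 10           # source length, annotation/decoder/alignment sizes
H = rng.normal(size=(T, d_h))            # annotations h_1..h_T (concatenated BiRNN states)
s_prev = rng.normal(size=(d_s,))         # previous decoder state s_{t-1}

# alignment model a(s_{t-1}, h_j) = v^T tanh(W s_{t-1} + U h_j)  (assumed parameterization)
W = rng.normal(size=(d_a, d_s))
U = rng.normal(size=(d_a, d_h))
v = rng.normal(size=(d_a,))

e = np.tanh(s_prev @ W.T + H @ U.T) @ v  # energy scores e_{t1}..e_{tT}, shape (T,)
alpha = np.exp(e) / np.exp(e).sum()      # attention weights via softmax
c = alpha @ H                            # context vector c_t = sum_j alpha_tj * h_j
print(alpha.round(3), c.shape)
```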

Fig. 4 illustrates the attention mechanism, integrating the underlying mathematics with the model visualization.
2.2. What exactly are keys, queries, and values in attention mechanisms?
Subsequent to the adoption of the Bahdanau model in machine translation, various attention mechanisms were devised by researchers. Generally speaking, they share two main steps: first, the computation of an attention distribution over the inputs, and second, the calculation of a context vector based on that distribution.
In the first step of computing the attention distribution, the attention model infers the keys, $K$, from the source data; depending on the task, these can take a variety of forms and representations. For instance, $K$ can be text or document embeddings, a portion of image features (an area of the image), or the hidden states of a sequential model (RNN, etc.). The query, $Q$, is the vector, matrix [7], or pair of vectors for which the attention is calculated; simply put, it plays the role of $s_{t-1}$ (the previous decoder hidden state) in RNNSearch. The goal is to figure out the relationship (weights) between $Q$ and all the $K$s through a scoring function $f$ (also called an energy or compatibility function), which produces the energy scores indicating the importance of each key with respect to $Q$ before generating the next output:

$$e = f(Q, K).$$
There are several scoring functions that compute the connection between queries and keys. A number of commonly used scoring functions are shown in Table 1; additive attention [6] and multiplicative (dot-product) attention [8] are among the most widely used.
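As a rough illustration of how such scoring functions differ, the sketch below computes additive and scaled dot-product energies for one query against a handful of keys; the weight matrices, dimensions, and scaling by the square root of the dimension are illustrative assumptions.

```python
# Two common scoring functions f(Q, K): additive and (scaled) dot-product.
import numpy as np

rng = np.random.default_rng(1)
d = 16
q = rng.normal(size=(d,))        # a single query
K = rng.normal(size=(5, d))      # five keys

# additive: f(q, k_i) = v^T tanh(W_q q + W_k k_i)
W_q, W_k, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d,))
e_additive = np.tanh(q @ W_q.T + K @ W_k.T) @ v

# multiplicative (dot-product), scaled by sqrt(d) to keep scores well-behaved
e_dot = (K @ q) / np.sqrt(d)

print(e_additive.shape, e_dot.shape)   # both (5,): one energy score per key
```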

In the next step, these energy scores are fed into an attention distribution function $g$ (such as the softmax layer in RNNSearch), which computes the attention weights $\alpha$ by normalizing all energy scores into a probability distribution:

$$\alpha = g(e).$$
The softmax(z) function has drawbacks: it assigns a non-zero weight to every element of z, which causes computational overhead in cases where a sparse probability distribution is required. Consequently, researchers proposed a new probability distribution function called sparsemax [9], which can assign exactly zero to irrelevant elements of the output. The logistic sigmoid [10] has also been proposed, which scales each energy score to the [0, 1] range.
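The difference between these distribution functions is easy to see numerically. Below is a small sketch, assuming NumPy and a hand-picked score vector, comparing softmax, sparsemax, and the logistic sigmoid; the sparsemax routine follows the simplex-projection recipe of Martins and Astudillo [9].

```python
# Comparing attention distribution functions g over a toy energy-score vector z.
import numpy as np

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def sparsemax(z):
    # Euclidean projection of z onto the probability simplex;
    # scores below the threshold tau get exactly zero weight.
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1) / k_z
    return np.maximum(z - tau, 0.0)

z = np.array([2.0, 1.5, -1.0, -2.0])
print(softmax(z))             # dense: every score gets some probability mass
print(sparsemax(z))           # sparse: low scores are exactly zero, e.g. [0.75, 0.25, 0, 0]
print(1 / (1 + np.exp(-z)))   # sigmoid: each score squashed to [0, 1] independently
```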
Instead of using the keys as the representation of the input data for both the context vector and the attention distribution (which makes training difficult), it is better to use another feature representation vector, V, and separate these feature representations explicitly. To make this more tangible: in key-value attention mechanisms, K and V are different representations of the same input data, and in the case of self-attention, K, Q, and V are all separate embeddings of the same data, i.e., the inputs.
After calculating the attention weights, the context vector is computed as

$$c = \phi\left(\{\alpha_i\}, \{v_i\}\right),$$

where $\phi$ is usually a weighted sum of $V$, represented as a single vector:

$$c = \sum_{i=1}^{n} z_i,$$

and

$$z_i = \alpha_i\, v_i,$$

where $z_i$ is a weighted representation of the elements of $V$ and $n$ is the size of $Z$.
Simply put, common attention mechanisms "can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key," as expressed by Vaswani et al. [7].
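Read literally, that description fits in a few lines of code. The following is a minimal sketch of such a query/key-value mapping with a softmax compatibility step; the shapes and inputs are made up for illustration, and the scaled dot-product score is one possible choice of compatibility function.

```python
# Mapping queries and key-value pairs to outputs: weighted sums of the values.
import numpy as np

def attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_kv, d_k), V: (n_kv, d_v) -> outputs of shape (n_q, d_v)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])                    # compatibility of each query with each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of the values

rng = np.random.default_rng(2)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 16))
print(attention(Q, K, V).shape)   # (3, 16): one output vector per query
```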
As for evaluating attention mechanisms, researchers usually embed them inside deep learning models and measure performance with and without attention to assess their effect, i.e., an ablation study. Attention weights can also be visualized, as illustrated in Fig. 5, which in turn can be used for evaluation, although not quantitatively.
![Fig. 5. Attention distribution on yelp reviews. Darker colors mean higher attention. Photo from [27]](https://towardsdatascience.com/wp-content/uploads/2022/07/1hlLkNWAIy4WKFQ12k4PH4w.png)
3. Categorization of attention mechanisms
So far, I have gone through the mathematical modeling of attention mechanisms. In this section, I will provide more details on the taxonomy of these mechanisms. Although attention mechanisms share a similar underlying concept, they vary considerably in their details in order to cope with the difficulty of applying them to different tasks. Multiple, largely similar taxonomies have been proposed so far. Here, I will use the models proposed by Niu et al. 2021 [6] and Chaudhari et al. 2021 [11].

Note that these criteria are not mutually exclusive; a given attention mechanism will most likely combine several of them.
3.1. The softness of attention
The softness of attention can be divided into four types:
- Soft: uses a weighted average of all keys to build the context vector.
- Hard: context vector is computed from stochastically sampled keys.
- Local: soft attention in a window around a position.
- Global: similar to soft attention.
In soft attention – first proposed by Bahdanau et al. [5] – the attention module is differentiable with respect to the inputs, thus, the whole model is trainable by standard back-propagation methods.
On the other hand, hard attention – first proposed by Xu et al. 2015 [12] – computes the context vector from stochastically sampled keys. Correspondingly, the attention weight $\alpha_{t,i}$ is treated as the probability that annotation $i$ is selected at step $t$; a one-hot selection variable $s_t$ is sampled from this distribution and determines the context vector:

$$p(s_{t,i} = 1) = \alpha_{t,i}, \qquad c_t = \sum_{i} s_{t,i}\, h_i.$$
Due to stochastic sampling, hard attention is computationally less expensive than soft attention, which computes attention weights over all inputs at each step. Obviously, making a hard decision on input features has its own setbacks, such as non-differentiability, which makes the model difficult to optimize. Therefore, the whole model needs to be optimized by maximizing an approximate variational lower bound or with REINFORCE.
Subsequently, Luong et al. 2015 [13] proposed local and global attention for machine translation. As previously stated, global attention and soft attention are similar mechanisms; however, local attention can be seen as a mixture of hard and soft attention, by which I mean it uses a subset of the input features instead of the whole vector. Compared to soft or global attention, this approach reduces the computational complexity and, unlike hard attention, it is differentiable, which makes it easy to implement and optimize.
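To contrast the three flavors, here is a toy sketch of how each one would form a context vector; the random annotations, the arbitrary window center and half-width, and the ancestral sampling for the hard case are all assumptions for illustration.

```python
# Soft (global), hard, and local attention for a single query over T annotations.
import numpy as np

rng = np.random.default_rng(3)
T, d = 10, 8
H = rng.normal(size=(T, d))              # annotations / keys
e = rng.normal(size=(T,))                # some energy scores
alpha = np.exp(e) / np.exp(e).sum()      # attention distribution

# soft (global): deterministic weighted average over all positions
c_soft = alpha @ H

# hard: sample one position from alpha and use its annotation (non-differentiable)
idx = rng.choice(T, p=alpha)
c_hard = H[idx]

# local: soft attention restricted to a window around an aligned position p_t
p_t, D = 4, 2                            # window center and half-width (illustrative)
window = slice(max(0, p_t - D), min(T, p_t + D + 1))
w = alpha[window] / alpha[window].sum()  # renormalize inside the window
c_local = w @ H[window]
```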
3.2. Forms of input feature
- Item-wise
- Location-wise
In the former, the input representation is a sequence of explicit items, or equivalently an encoding of the inputs. For example, Bahdanau et al. [5] use a single word embedding in RNNSearch, and a single feature map is used in SENet. The attention model encodes each item separately and calculates its respective weight during decoding. Combined with the soft/hard distinction: item-wise soft attention calculates a weight for each item and then combines them linearly, while item-wise hard attention stochastically selects one or more items based on their probabilities.
The latter deals with input features for which it is difficult to pin down a discrete definition of items, notably visual features. Generally speaking, location-wise attention is used in visual tasks; the decoder processes a multi-resolution crop of the inputs at each step, and some works transform the task-related region into a canonical, intended pose to allow easier inference throughout the entire model. When combined with soft attention, a whole feature map is fed into the model and a transformed output is produced. When combined with hard attention, the sub-regions that are most probable with respect to the attention module are selected stochastically.

3.3. Input representation
Attention mechanisms accept several forms of input representations, among which a few are more common, such as distinctive attention, presented by Chaudhari et al. [11], in which:
- there is a single input and a corresponding output sequence, and
- the keys and queries belong to two independent sequences;
and co-attention, as well as hierarchical attention models, which accept multiple inputs, as in the visual question answering task presented by Lu et al. 2016 [14]. Co-attention can be performed in two ways: a) parallel, which simultaneously produces visual and question attention; and b) alternating, which sequentially alternates between the two attentions.
Fan et al. 2018 [15] proposed fine-grained and coarse-grained attention models. The coarse-grained model uses the embedding of the other input as the query for each input to compute the attention weights, whereas the fine-grained model calculates the effect of each element of one input on the elements of the other.
Co-attention has been deployed successfully in several use cases, such as sentiment classification, text matching, named entity recognition, entity disambiguation, and emotion cause analysis.
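Below is a simplified, hypothetical sketch of parallel co-attention: an affinity matrix between two inputs (say, question words and image regions) is pooled to produce one attention distribution per modality. The mean-pooling and the dimensions are illustrative simplifications, not the exact formulation of Lu et al. [14].

```python
# Parallel co-attention (simplified): attend over two inputs via their affinity matrix.
import numpy as np

rng = np.random.default_rng(4)
d, n_regions, n_words = 16, 6, 5
V = rng.normal(size=(n_regions, d))     # visual region features
Q = rng.normal(size=(n_words, d))       # question word features
W_b = rng.normal(size=(d, d))           # bilinear affinity weights (assumed)

C = np.tanh(Q @ W_b @ V.T)              # affinity between every word and every region

def softmax(z):
    p = np.exp(z - z.max())
    return p / p.sum()

a_v = softmax(C.mean(axis=0))           # attention over regions (pooled over words)
a_q = softmax(C.mean(axis=1))           # attention over words (pooled over regions)
v_att, q_att = a_v @ V, a_q @ Q         # attended visual and question vectors
print(v_att.shape, q_att.shape)         # (16,) (16,)
```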
Self (inner) attention, proposed by Wang et al. [16], is another model that uses only the inputs themselves to calculate the attention weights. The keys, queries, and values are representations of the same input sequence in different spaces. The effectiveness of self-attention has been reproduced by several researchers in different ways; among these, the Transformer [7] is widely acclaimed as the first sequence transduction model relying entirely on self-attention, without RNNs.
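A minimal sketch of this idea, where the keys, queries, and values are simply three linear projections of the same sequence X (the projection sizes and random weights are arbitrary assumptions):

```python
# Self-attention: K, Q, and V are different projections of the same input X.
import numpy as np

rng = np.random.default_rng(5)
n, d_model, d_k = 7, 32, 16
X = rng.normal(size=(n, d_model))                       # one input sequence
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v                     # three views of the same data
scores = Q @ K.T / np.sqrt(d_k)
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)                   # each position attends to all positions
out = A @ V
print(out.shape)                                        # (7, 16)
```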
To extend attention to other levels of the inputs' embeddings, Yang et al. [17] proposed a hierarchical attention model (HAM) for document classification. It uses two levels of attention: word-level attention, which allows HAM to aggregate relevant words into a sentence representation, and sentence-level attention, which aggregates key sentences into a document representation. Researchers have even extended it to higher (user) levels and employed it at the document level. On the other extreme, a top-down approach was presented by Zhao and Zhang [18]. Some works have also brought hierarchical attention to computer vision, using it for object-level and part-level attention; this was also the first image classification method that did not use extra information to calculate attention weights.
![Fig. 6. An example of hierarchical attention and co-attention. Photo from [14]](https://towardsdatascience.com/wp-content/uploads/2022/07/1GrbR-mi6W4v0s5RdeMoVaQ.png)
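As a toy illustration of such a two-level scheme, the sketch below applies word-level attention within each sentence and sentence-level attention over the resulting sentence vectors. The encoders are omitted, and the scoring function with learned context vectors u_w and u_s is a simplified stand-in for the one used in [17].

```python
# Hierarchical attention sketch: words -> sentence vectors -> document vector.
import numpy as np

rng = np.random.default_rng(6)

def attend(H, u):
    """Weighted average of the rows of H, scored against a context vector u."""
    e = np.tanh(H) @ u
    a = np.exp(e - e.max())
    a = a / a.sum()
    return a @ H

d, n_sentences, n_words = 12, 4, 6
doc = rng.normal(size=(n_sentences, n_words, d))   # word representations per sentence
u_w, u_s = rng.normal(size=(d,)), rng.normal(size=(d,))

sentence_vecs = np.stack([attend(words, u_w) for words in doc])  # word-level attention
doc_vec = attend(sentence_vecs, u_s)                             # sentence-level attention
print(doc_vec.shape)   # (12,): a single document representation
```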
3.4. Output representation
Another criterion for categorizing attention models is the way they represent their outputs. Single-output attention, which outputs one and only one vector of energy scores at each time step, is the most common approach. Although a single output suffices in various situations, some downstream tasks require a more comprehensive context. Therefore, other methods, such as multi-head and multi-dimensional attention (which fall under the umbrella of multi-output attention models), were proposed by researchers. By way of example, it has been shown that in computer vision tasks, as well as in some sequence-based models, single-output attention may not adequately represent the context of the inputs. In the Transformer, Vaswani et al. [7] linearly projected the input vectors (K, Q, V) into multiple subspaces, applied scaled dot-product attention in each, and concatenated the results (multi-head attention, shown in Fig. 7), which allows the model to simultaneously calculate attention weights over several representation subspaces at different positions.
![Fig. 7. Multi-head attention. Photo from [7]](https://towardsdatascience.com/wp-content/uploads/2022/07/1_gts9qF-PZoQ_XjFix0FOQ.png)
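A compact sketch of this project, attend, and concatenate pattern follows; the head count, dimensions, and random projection matrices are assumptions, and layer normalization, masking, and the feed-forward sublayer of the full Transformer block are omitted.

```python
# Multi-head attention: per-head projections, scaled dot-product, concatenation.
import numpy as np

rng = np.random.default_rng(7)
n, d_model, n_heads = 7, 32, 4
d_head = d_model // n_heads
X = rng.normal(size=(n, d_model))

def scaled_dot_attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    a = np.exp(s - s.max(axis=-1, keepdims=True))
    a = a / a.sum(axis=-1, keepdims=True)
    return a @ V

heads = []
for _ in range(n_heads):
    # each head gets its own projections into a lower-dimensional subspace
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(scaled_dot_attention(X @ W_q, X @ W_k, X @ W_v))

W_o = rng.normal(size=(d_model, d_model))
out = np.concatenate(heads, axis=-1) @ W_o    # (n, d_model)
print(out.shape)
```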
To further reinforce the diversity of multi-head attention weights, disagreement regularizations were added to the subspaces, the attended positions, and the output representations, making different heads more likely to represent features in distinct ways.
Next, multi-dimensional attention was devised to calculate a feature-wise score matrix for K rather than a single weight-score vector. It allows the model to infer several attention distributions from the same data. This can help with a prominent issue in natural language understanding known as polysemy, where a single representation cannot capture the coexistence of multiple meanings of the same word or phrase.
![Fig. 8. Multi-dimensional attention. Photo from [19].](https://towardsdatascience.com/wp-content/uploads/2022/07/1s9x7tboJ_IVt9Yayq_sC3w.png)
As with multi-head attention, researchers added penalties (e.g., Frobenius penalties) to encourage the model to learn more distinct features from the inputs, and demonstrated the effectiveness of this approach on several tasks, such as author profiling, sentiment classification, textual entailment, and distantly supervised relation extraction.
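Here is a small sketch in the spirit of the structured self-attentive model [27]: a matrix A holds r attention distributions over the same sequence, and a Frobenius-norm penalty on $AA^\top - I$ discourages the rows from collapsing onto the same positions. The dimensions and random weights are illustrative assumptions.

```python
# Multi-dimensional (structured) self-attention with a Frobenius diversity penalty.
import numpy as np

rng = np.random.default_rng(8)
n, d, d_a, r = 9, 16, 12, 4                 # sequence length, feature size, hidden size, #distributions
H = rng.normal(size=(n, d))                 # e.g., BiLSTM hidden states
W_s1, W_s2 = rng.normal(size=(d_a, d)), rng.normal(size=(r, d_a))

scores = W_s2 @ np.tanh(W_s1 @ H.T)         # (r, n): r score vectors over the sequence
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A = A / A.sum(axis=1, keepdims=True)        # row-wise softmax: r attention distributions
M = A @ H                                   # (r, d): r different summaries of the same input

penalty = np.linalg.norm(A @ A.T - np.eye(r), "fro") ** 2   # encourages diverse rows
print(M.shape, round(penalty, 3))
```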
4. Research frontiers and challenges
Although attention mechanisms have been widely adopted across several research directions, there is still great potential and room to maneuver. Some of the challenges researchers face are listed below:
- As shown by Zhu et al. [20] and Tay et al. [21], combining K (keys) and Q (queries) has resulted in outstanding performance. It therefore remains an open question whether it is beneficial to combine keys, queries, and values in self-attention.
- A number of recent studies by Tsai et al. [22] and Katharopoulos et al. [23] show that the performance of attention models can be significantly improved by reducing the complexity of the attention function, as it greatly affects their computational cost.
- Applying attention models devised for specific tasks, such as NLP, to other fields, like computer vision, is also a promising direction. For example, when self-attention is applied to computer vision, it improves performance while negatively affecting efficiency, as demonstrated by Wang et al. [24].
- Combining adaptive mechanisms with attention mechanisms may lead to results resembling those of hierarchical attention without any explicit architectural design.
- Devising new ways of evaluating the models is also of great importance. Sen et al. [25] have proposed multiple evaluation methods to quantitatively measure the similarities between attention in the human brain and neural networks using novel attention-map similarity metrics.
- Memory modeling is becoming a trend in deep learning research. Common models mostly suffer from a lack of explicit memory; hence, attention over memory can be studied further. The Neural Turing Machine [26] is an example of such hybrid models and has the potential to be explored more thoroughly.
5. Conclusion
In this article, I discussed attention models from various angles: a brief overview of attention and the human brain; their use cases in deep learning; the underlying mathematics; a unified model for attention mechanisms (derived from Niu et al. [6]); their taxonomy based on a range of criteria; and multiple frontiers to be explored.
Needless to say, attention mechanisms have been a significant milestone on the path toward artificial intelligence. They have revolutionized machine translation, text classification, image caption generation, action recognition, speech recognition, computer vision, and recommendation. As a result, they have achieved great success in real-world applications such as autonomous driving, medicine, human-computer interaction, emotion recognition, and many more.
6. References
[1] J.K. Tsotsos, S.M. Culhane, W.Y.K. Wai, Y. Lai, N. Davis, F. Nuflo, Modeling visual attention via selective tuning, Artif. Intell. 78 (1995) 507–545.
[2] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997) 1735–1780.
[3] J.R. Anderson, Cognitive Psychology and Its Implications, Worth Publishers, 2005.
[4] S. Lu, J.H. Lim, Saliency modeling from image histograms, in: Computer Vision – ECCV 2012, Lecture Notes in Computer Science, vol. 7578, Springer, Berlin, Heidelberg, 2012.
[5] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: ICLR.
[6] Z. Niu, G. Zhong, H. Yu, A review on the attention mechanism of deep learning, Neurocomputing 452 (2021) 78–88.
[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: NIPS, pp. 5998–6008.
[8] D. Britz, A. Goldie, M. Luong, Q.V. Le, Massive exploration of neural machine translation architectures, CoRR abs/1703.03906 (2017).
[9] A.F.T. Martins, R.F. Astudillo, From softmax to sparsemax: A sparse model of attention and multi-label classification, in: ICML, vol. 48 of JMLR Workshop and Conference Proceedings, JMLR.org, 2016, pp. 1614–1623.
[10] Y. Kim, C. Denton, L. Hoang, A.M. Rush, Structured attention networks, arXiv: Computation and Language (2017).
[11] S. Chaudhari, V. Mithal, G. Polatkan, R. Ramanath, An attentive survey of attention models, ACM Transactions on Intelligent Systems and Technology, 2021, Article No. 53, pp. 1–32.
[12] K. Xu, J. Ba, R. Kiros, K. Cho, A.C. Courville, R. Salakhutdinov, R.S. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: ICML, vol. 37 of JMLR Workshop and Conference Proceedings, JMLR.org, 2015, pp. 2048–2057.
[13] T. Luong, H. Pham, C.D. Manning, Effective approaches to attention-based neural machine translation, in: EMNLP, The Association for Computational Linguistics, 2015, pp. 1412–1421.
[14] J. Lu, J. Yang, D. Batra, D. Parikh, Hierarchical question-image co-attention for visual question answering, in: NIPS, pp. 289–297.
[15] F. Fan, Y. Feng, D. Zhao, Multi-grained attention network for aspect-level sentiment classification, in: EMNLP, Association for Computational Linguistics, 2018, pp. 3433–3442.
[16] B. Wang, K. Liu, J. Zhao, Inner attention based recurrent neural networks for answer selection, in: ACL (1), The Association for Computational Linguistics, 2016.
[17] Z. Yang, D. Yang, C. Dyer, X. He, A.J. Smola, E.H. Hovy, Hierarchical attention networks for document classification, in: HLT-NAACL, The Association for Computational Linguistics, 2016, pp. 1480–1489.
[18] S. Zhao, Z. Zhang, Attention-via-attention neural machine translation, in: AAAI, AAAI Press, 2018, pp. 563–570.
[19] J. Du, J. Han, A. Way, D. Wan, Multi-level structured self-attentions for distantly supervised relation extraction, in: EMNLP, Association for Computational Linguistics, 2018, pp. 2216–2225.
[20] X. Zhu, D. Cheng, Z. Zhang, S. Lin, J. Dai, An empirical study of spatial attention mechanisms in deep networks, in: ICCV 2019, IEEE, 2019, pp. 6687–6696.
[21] Y. Tay, D. Bahri, D. Metzler, D. Juan, Z. Zhao, C. Zheng, Synthesizer: Rethinking self-attention in transformer models, CoRR abs/2005.00743 (2020).
[22] Y.H. Tsai, S. Bai, M. Yamada, L. Morency, R. Salakhutdinov, Transformer dissection: A unified understanding of transformer's attention via the lens of the kernel, in: EMNLP-IJCNLP 2019, Association for Computational Linguistics, 2019, pp. 4343–4352.
[23] A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret, Transformers are RNNs: Fast autoregressive transformers with linear attention, CoRR abs/2006.16236 (2020).
[24] X. Wang, R.B. Girshick, A. Gupta, K. He, Non-local neural networks, in: CVPR, IEEE Computer Society, 2018, pp. 7794–7803.
[25] C. Sen, T. Hartvigsen, B. Yin, X. Kong, E.A. Rundensteiner, Human attention maps for text classification: Do humans and neural networks focus on the same words?, in: ACL 2020, Association for Computational Linguistics, 2020, pp. 4596–4608.
[26] A. Graves, G. Wayne, I. Danihelka, Neural Turing Machines, arXiv preprint arXiv:1410.5401.
[27] Z. Lin, M. Feng, C.N. dos Santos, M. Yu, B. Xiang, B. Zhou, Y. Bengio, A structured self-attentive sentence embedding, in: ICLR (Poster), OpenReview.net, 2017.