<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Attention on Oriol Alàs Cercós</title><link>https://oriolac.github.io/tags/attention/</link><description>Recent content in Attention on Oriol Alàs Cercós</description><generator>Hugo -- 0.150.0</generator><language>en-us</language><copyright>Oriol Alàs Cercós</copyright><lastBuildDate>Mon, 17 Feb 2025 12:31:23 +0100</lastBuildDate><atom:link href="https://oriolac.github.io/tags/attention/index.xml" rel="self" type="application/rss+xml"/><item><title>Introduction to Attention Mechanism and Transformers</title><link>https://oriolac.github.io/posts/20241029-attention/</link><pubDate>Mon, 17 Feb 2025 12:31:23 +0100</pubDate><guid>https://oriolac.github.io/posts/20241029-attention/</guid><description>&lt;p&gt;Transformers have demonstrated excellent capabilities and they overcome challenges such &lt;em&gt;NLP&lt;/em&gt;, &lt;em&gt;Text-To-Image Generation&lt;/em&gt; or &lt;em&gt;Image Completion&lt;/em&gt;
with large datasets, large model sizes and enough compute.
Talking about transformers nowadays is as casual as talking about &lt;em&gt;CNNs&lt;/em&gt;, &lt;em&gt;MLPs&lt;/em&gt; or &lt;em&gt;Linear Regressions&lt;/em&gt;. Why not take a look at this state-of-the-art architecture?&lt;/p&gt;
&lt;p&gt;In this post, we’ll introduce the Sequence-to-Sequence (Seq2Seq) paradigm, explore the attention mechanism, and provide a detailed,
step-by-step explanation of the components that make up transformer architectures.&lt;/p&gt;</description><content:encoded><![CDATA[<p>Transformers have demonstrated excellent capabilities and they overcome challenges such as <em>NLP</em>, <em>Text-To-Image Generation</em> or <em>Image Completion</em>
with large datasets, large model sizes and enough compute.
Talking about transformers nowadays is as casual as talking about <em>CNNs</em>, <em>MLPs</em> or <em>Linear Regressions</em>. Why not take a look at this state-of-the-art architecture?</p>
<p>In this post, we’ll introduce the Sequence-to-Sequence (Seq2Seq) paradigm, explore the attention mechanism, and provide a detailed,
step-by-step explanation of the components that make up transformer architectures.</p>
<h2 id="sequence-to-sequence-paradigm">Sequence-to-sequence paradigm</h2>
<p><strong>Seq2Seq</strong> was initially introduced with <strong>Recurrent Neural Networks (RNN)</strong> and later enhanced by <strong>Long Short-Term Memory (LSTM)</strong> networks.
This architecture splits the task into two primary components:</p>
<ol>
<li><strong>Encoder</strong>. It processes and compresses the input sequence into a fixed-length vector, commonly referred to as the context vector or <strong>hidden state</strong>.</li>
<li><strong>Decoder</strong>. It sequentially generates the target output using the information encoded in the context vector.</li>
</ol>
<p>In essence, the encoder reads the source language and the decoder writes the target language. Because the input and output sequences are handled separately, they may differ in length and word order. For example, English uses a Subject-Verb-Object (SVO) order, while Japanese uses Subject-Object-Verb (SOV) and often omits the subject altogether. This flexibility lets Seq2Seq models adapt to these quirks and capture the meaning and flow of translations.</p>
<p>
<figure>
  <img loading="lazy" src="/posts/2024/seq2seq.png" alt="Seq2Seq architecture"  title="Fig. 1. The encoder model processes each token of the input sentence (`How are you?`), updating its hidden state with each step. Upon encountering the End of Sequence (`&lt;EOS&gt;`) token, the final hidden state is passed to the decoder model. The decoder then generates the output sequence (`お元気ですか`) token by token, starting with the Start of Sequence (`&lt;SOS&gt;`) token and continuing until the End of Sequence (`EOS`) token is reached."  /> 
  <figcaption
    style="
      font-size: 15px;
      color: #7a7a7a;
      margin-top: 0.5em;
      text-align: center;
      font-weight: 100;
    "
  >
  Fig. 1. The encoder model processes each token of the input sentence (<code>How are you?</code>), updating its hidden state with each step. Upon encountering the End of Sequence (<code>&lt;EOS&gt;</code>) token, the final hidden state is passed to the decoder model. The decoder then generates the output sequence (<code>お元気ですか</code>) token by token, starting with the Start of Sequence (<code>&lt;SOS&gt;</code>) token and continuing until the End of Sequence (<code>&lt;EOS&gt;</code>) token is reached.

  </figcaption>
  
</figure>
</p>
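<p>To make the fixed-length context vector concrete, here is a minimal sketch of an encoder built from a plain tanh RNN cell. The function name <code>rnn_encode</code>, the sizes and the random weights are assumptions made for illustration, not the architecture of any particular model.</p>

```python
import numpy as np

def rnn_encode(inputs, W_x, W_h, b):
    """Run a plain tanh RNN over the inputs and return the final hidden state."""
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:
        # Each step folds one more token into the same fixed-size vector.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 6
W_x = rng.normal(size=(hidden_dim, embed_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

short_sentence = rng.normal(size=(3, embed_dim))   # 3 token embeddings
long_sentence = rng.normal(size=(30, embed_dim))   # 30 token embeddings

context_short = rnn_encode(short_sentence, W_x, W_h, b)
context_long = rnn_encode(long_sentence, W_x, W_h, b)
print(context_short.shape, context_long.shape)  # (6,) (6,)
```

<p>Whether the sentence has 3 or 30 tokens, the decoder receives a vector of the same size.</p>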
<h3 id="sequence-to-sequence-with-attention">Sequence-to-sequence with attention</h3>
<p>What are the next steps before talking about transformers? You may already know that the basis of transformers is the <strong>attention mechanism</strong>. But what is attention?</p>
<p>One of the main challenges with Seq2Seq models is the <strong>fixed-length context vector</strong> passed from the encoder to the decoder. Because the whole input must fit into this single vector, which is updated token by token, it tends to retain more information about the last tokens than the first ones. Moreover, the decoder cannot focus on specific parts of the input sentence. For longer sentences, this bottleneck can result in loss of information, making translations less accurate or meaningful.</p>
<p>Attention addresses this issue by allowing the decoder to focus on specific parts of the input sequence at each step of the generation process. Instead of relying only on the final context vector, the attention mechanism calculates a <strong>weighted combination</strong> of all encoder hidden states. This gives the decoder access to the most relevant information at every step.</p>
<blockquote>
<p>Attention is a mechanism that allows a model to <strong>focus</strong> on the most relevant parts of an input when making a prediction. We can say that it calculates, from a token, the weights (importance) of the other tokens on the fly.</p></blockquote>
<p>
<figure>
  <img loading="lazy" src="/posts/2024/Seq2SeqArchAtt.png" alt="Seq2Seq architecture with attention"  title="Fig. 2. The encoder model processes each token of the input sentence (`How are you?`), updating its hidden state with each step. The hidden states are stored at each encoding step until encountering the End of Sequence token. The decoder then generates the output sequence (`お元気ですか`) token by token, starting with the Start of Sequence (`&lt;SOS&gt;`) token and continuing until the End of Sequence (`EOS`) token is reached. The hidden state passed to the decoder is built using the attention mechanism at each step using the hidden states of the encoder model and the previous hidden state of the decoder model."  /> 
  <figcaption
    style="
      font-size: 15px;
      color: #7a7a7a;
      margin-top: 0.5em;
      text-align: center;
      font-weight: 100;
    "
  >
  Fig. 2. The encoder model processes each token of the input sentence (<code>How are you?</code>), updating its hidden state with each step. The hidden states are stored at each encoding step until encountering the End of Sequence token. The decoder then generates the output sequence (<code>お元気ですか</code>) token by token, starting with the Start of Sequence (<code>&lt;SOS&gt;</code>) token and continuing until the End of Sequence (<code>&lt;EOS&gt;</code>) token is reached. At each step, the hidden state passed to the decoder is built by the attention mechanism from the hidden states of the encoder model and the previous hidden state of the decoder model.

  </figcaption>
  
</figure>
</p>
<ol>
<li><strong>Alignment</strong>. At each decoding step, the attention mechanism calculates a <strong>score</strong> to determine the <strong>relevance</strong> of each encoder hidden state to the current decoder state. There are many alignment score functions, but the most popular are <em>Bahdanau</em> (Additive Attention) and <em>Scaled Dot-Product Attention</em>.</li>
<li><strong>Weighting</strong>. These scores are normalized using <em>softmax</em> to generate a set of attention weights.</li>
<li><strong>Context vector</strong>. The attention weights are used to compute a weighted sum of the encoder hidden states, producing a context vector specific to the current output generation.</li>
<li><strong>Output Generation</strong>. The context vector is then combined with the decoder&rsquo;s state to generate the next token.</li>
</ol>
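<p>The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes a plain dot product as the alignment score and a simple concatenation for the last step; the names <code>attention_step</code>, <code>encoder_states</code> and <code>decoder_state</code> and the toy sizes are hypothetical.</p>

```python
import numpy as np

def softmax(x):
    # Subtract the maximum for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_step(decoder_state, encoder_states):
    # 1. Alignment: one dot-product score per encoder hidden state (an assumed choice).
    scores = encoder_states @ decoder_state
    # 2. Weighting: softmax turns the scores into weights that sum to 1.
    weights = softmax(scores)
    # 3. Context vector: weighted sum of the encoder hidden states.
    context = weights @ encoder_states
    return weights, context

# Toy data (hypothetical): 3 encoder hidden states of size 4.
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(3, 4))
decoder_state = rng.normal(size=4)

weights, context = attention_step(decoder_state, encoder_states)
# 4. Output generation: here the context is simply concatenated with the
#    decoder state; a real decoder would feed this into further layers.
combined = np.concatenate([context, decoder_state])
print(weights.shape, context.shape, combined.shape)  # (3,) (4,) (8,)
```

<p>The context vector is recomputed at every decoding step, so each output token can look at a different part of the input.</p>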
<p>The attention mechanism is useful in tasks like translation, where alignment between input and output sequences is important. Also, the <strong>selective focus</strong> makes the model more interpretable, since it can provide insights into which parts of the input the model considers relevant, offering a form of explainability.</p>
<p>Nevertheless, the attention mechanism requires computing attention scores between each decoder step and all encoder outputs. For long input sentences, this results in a large number of computations, increasing computation time and memory consumption.</p>
<p>Finally, it&rsquo;s important to note that the decoder in a Seq2Seq architecture operates in an <strong>autoregressive manner</strong>, generating tokens one at a time. The sequential process limits parallelization during decoding, resulting in slower inference times compared to non-autoregressive models, which can generate multiple tokens simultaneously.</p>
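<p>The autoregressive loop can be sketched as follows. The function <code>toy_next_token</code> is a hypothetical stand-in for a trained decoder, and the token ids are made up; the point is only that each step needs the tokens produced by the previous steps.</p>

```python
SOS, EOS = 0, 1  # hypothetical ids for the start and end tokens

def toy_next_token(generated):
    # Hypothetical stand-in for a trained decoder: emits 2, 3, 4 and then EOS.
    script = [2, 3, 4, EOS]
    return script[len(generated) - 1]

def greedy_decode(next_token_fn, max_len=10):
    generated = [SOS]
    for _ in range(max_len):
        # The next token depends on everything generated so far,
        # so the steps cannot run in parallel.
        token = next_token_fn(generated)
        generated.append(token)
        if token == EOS:
            break
    return generated

tokens = greedy_decode(toy_next_token)
print(tokens)  # [0, 2, 3, 4, 1]
```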
<p>You may find more information in the tag <a href="http://oriolac.github.io/tags/seq2seq/" target="_blank" rel="noopener">seq2seq</a>.</p>
<h2 id="transformer-architecture">Transformer architecture</h2>
<p>Transformers [1] emerged as a way to build encoder-decoder architectures to solve machine translation problems. While <em>RNNs</em> and <em>LSTMs</em> use recurrent steps and can suffer from vanishing gradients and limited parallelization, transformers bypass this by processing sequences in parallel.</p>
<p>The <strong>transformer neural network</strong> is composed of an encoder-decoder architecture, much like an <strong>RNN</strong>-based Seq2Seq model. The difference is that the whole input sequence can be processed <strong>in parallel</strong>. Since this removes any notion of order, a positional encoding is passed along with each token, as a token might have a different meaning depending on its position.</p>
<figure class="align-center ">
    <img loading="lazy" src="/posts/2024/transf_arch.PNG#center"
         alt="Transformer model architecture" width="40%"/> <figcaption>
            <p>Fig. 3. Transformer model architecture <a href="https://arxiv.org/pdf/1706.03762" target="_blank" rel="noopener">Attention is all you need</a>.</p>
        </figcaption>
</figure>

<p>The input and the positional encoding are passed into the <strong>encoder block</strong>. The job of the encoder is to map the whole input sequence into an abstract continuous representation that holds the learned information for that entire sequence. The encoder block has \(N\) identical encoder layers.
The main objective of the encoder is to capture the attention between tokens of the same sentence. This is <strong>self-attention</strong>, and because it looks in both directions it is also called <strong>bidirectional attention</strong>. This means that, for each token, we attempt to capture the relevant parts from all the tokens of the sentence (even those that come after it). Hence, the encoder is non-autoregressive.</p>
<p>The decoder block has several similarities with the <strong>encoder block</strong>. Both have \(N\) identical layers and a positional encoding applied at the input. However, the multi-head attention layers of the decoder block have a different job from those of the encoder. The decoder is <strong>auto-regressive</strong>: it takes its own previous outputs and the encoder output vector as inputs. The encoder can use all the elements of the input sentence, but the decoder can only use the previously generated elements of the output sentence. The attention captured in the decoder's self-attention is therefore called <strong>causal attention</strong>.</p>
<h3 id="positional-encoding">Positional Encoding</h3>
<p><strong>Positional encoding</strong> is the process of producing a vector that gives context based on the position of an element in a sentence, so we end up with a matrix of encoded positions. We could simply use a one-dimensional vector of natural numbers like \([1, 2, ..., n]\). But we want the positional encoding to convey not only the positions but also their relationships. Therefore, the authors came up with a way to capture both absolute and relative positions with a <strong>smooth representation of the position information</strong> (taking into account that the difference between 1000 and 1001 is &ldquo;smaller&rdquo; than the difference between 1 and 2) while providing richer <strong>high-dimensional contextual information</strong>.</p>
\[ PE_{(pos, 2i)} = \sin\left(\frac{pos}{n^{\frac{2i}{d_{model}}}}\right)\]<p>
</p>
\[PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{n^{\frac{2i}{d_{model}}}}\right)\]<p>where</p>
<ul>
<li>\(pos\) is the position index of the token</li>
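<li>\(i\) is the index of the sine/cosine pair, so that \(2i\) and \(2i+1\) are the encoding dimensions, with \(i = 0, 1, ..., d_{model}/2 - 1\)</li>
<li>\(d_{model}\) is the dimensionality of the model&rsquo;s embedding space</li>
<li>\(n\) is the base of the frequency scaling factor, set to \(10000\) in the original paper [1]</li>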
</ul>
<p>For every even dimension index (\(2i\)) they use the sine function, while for every odd dimension index (\(2i+1\)) they use the cosine function. These functions have a convenient linear property: for any fixed offset \(k\), \(PE_{pos+k}\) can be expressed as a linear function of \(PE_{pos}\), which the model can easily learn to attend to. The resulting positional vectors are added element-wise to their corresponding input embedding vectors.</p>
<p>The most difficult part of this formula to understand might be the denominator. It ensures that each dimension oscillates at a different frequency. Lower dimensions of the positional encoding capture higher-frequency variations, while higher dimensions capture lower-frequency variations, allowing them to encode larger positional distances smoothly. The following figure might help to understand the magic of it.</p>
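<p>As a minimal sketch, the two formulas can be turned into a full positional encoding matrix with NumPy. The function name <code>positional_encoding</code> and the toy sizes are assumptions; the sketch also assumes an even \(d_{model}\) and, as in the original paper [1], that the encoding is summed with the token embeddings.</p>

```python
import numpy as np

def positional_encoding(seq_len, d_model, n=10000):
    # Assumes d_model is even so that sine and cosine columns pair up.
    positions = np.arange(seq_len)[:, np.newaxis]        # pos, shape (seq_len, 1)
    pair_index = np.arange(d_model // 2)[np.newaxis, :]  # i, shape (1, d_model / 2)
    angles = positions / np.power(n, 2 * pair_index / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)  # odd dimensions 2i + 1
    return pe

pe = positional_encoding(seq_len=50, d_model=16)

# Placeholder embeddings (all zeros) just to show how the two are combined.
embeddings = np.zeros((50, 16))
encoder_input = embeddings + pe
print(pe.shape)  # (50, 16)
```

<p>At position 0 every sine entry is 0 and every cosine entry is 1, which is a quick sanity check for an implementation.</p>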
<p>
<figure>
  <img loading="lazy" src="/posts/2024/att_pe_plot.png" alt="Positional encoding sine values"  title="Fig. 4. Example of positional encoding values of the sine function in some encoding dimensions."  /> 
  <figcaption
    style="
      font-size: 15px;
      color: #7a7a7a;
      margin-top: 0.5em;
      text-align: center;
      font-weight: 100;
    "
  >
  Fig. 4. Example of positional encoding values of the sine function in some encoding dimensions.

  </figcaption>
  
</figure>
</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> numpy <span style="color:#66d9ef">as</span> np
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> matplotlib.pyplot <span style="color:#66d9ef">as</span> plt
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">plot_sinus</span>(k, ax, d<span style="color:#f92672">=</span><span style="color:#ae81ff">512</span>, n<span style="color:#f92672">=</span><span style="color:#ae81ff">10000</span>):
</span></span><span style="display:flex;"><span>    x <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>arange(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">200</span>, <span style="color:#ae81ff">1</span>)
</span></span><span style="display:flex;"><span>    denominator <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>power(n, <span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>x<span style="color:#f92672">/</span>d)
</span></span><span style="display:flex;"><span>    y <span style="color:#f92672">=</span> np<span style="color:#f92672">.</span>sin(k<span style="color:#f92672">/</span>denominator)
</span></span><span style="display:flex;"><span>    ax<span style="color:#f92672">.</span>plot(x, y)
</span></span><span style="display:flex;"><span>    ax<span style="color:#f92672">.</span>set_title(<span style="color:#e6db74">&#39;k = &#39;</span> <span style="color:#f92672">+</span> str(k))
</span></span><span style="display:flex;"><span>    ax<span style="color:#f92672">.</span>set_xlabel(<span style="color:#e6db74">&#34;Dimension&#34;</span>)
</span></span><span style="display:flex;"><span>    ax<span style="color:#f92672">.</span>set_ylim([<span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">1</span>])
</span></span><span style="display:flex;"><span>    ax<span style="color:#f92672">.</span>set_xlim([<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">200</span>])
</span></span><span style="display:flex;"><span> 
</span></span><span style="display:flex;"><span>fig, axs <span style="color:#f92672">=</span> plt<span style="color:#f92672">.</span>subplots(<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">4</span>, figsize<span style="color:#f92672">=</span>(<span style="color:#ae81ff">16</span>, <span style="color:#ae81ff">4</span>))
</span></span><span style="display:flex;"><span>fig<span style="color:#f92672">.</span>tight_layout()
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> i, ax <span style="color:#f92672">in</span> enumerate(axs<span style="color:#f92672">.</span>flat):
</span></span><span style="display:flex;"><span>    plot_sinus(i<span style="color:#f92672">*</span><span style="color:#ae81ff">4</span>, ax)
</span></span></code></pre></div><h3 id="scaled-dot-product-attention">Scaled Dot-Product Attention</h3>
<p>There are several attention mechanisms: additive (Bahdanau [2]), content-based&hellip; Transformers introduced a new mechanism called <strong>Scaled Dot-Product Attention</strong>. An attention function operates on <strong>queries (Q)</strong>, <strong>keys (K)</strong> and <strong>values (V)</strong>.</p>
<figure class="align-center ">
    <img loading="lazy" src="/posts/2024/scaled-dot-product.PNG#center"
         alt="Scaled Dot-Product Attention" width="25%"/> <figcaption>
            <p>Fig. 5. Scaled Dot-Product Attention <a href="https://arxiv.org/pdf/1706.03762" target="_blank" rel="noopener">Attention is all you need</a>.</p>
        </figcaption>
</figure>

<p>All three are vectors computed from the token representations. The <strong>query</strong> represents what we are looking for, the <strong>key</strong> represents possible matches in the input and the <strong>value</strong> represents the actual information associated with each key. Each query is compared against every key, and the resulting similarity decides how much of each value reaches the output.</p>
\[\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V \]<p>where \(d_k\) is the dimensionality of the key vectors. Dividing by \(\sqrt{d_k}\) keeps the dot products from growing with the dimensionality, which would push the softmax into regions with very small gradients. The idea is practically the same as in Seq2Seq attention: the result is a weighted sum of values, where more relevant elements contribute more to the output.</p>
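<p>The formula maps almost one-to-one to NumPy. The following is a minimal sketch for a single, unbatched sequence; the function name and the toy shapes are assumptions chosen for illustration.</p>

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise maximum for numerical stability.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    # Similarity between every query and every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)       # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)    # each row sums to 1
    # Weighted sum of the values for each query.
    return weights @ V, weights

# Toy shapes (hypothetical): 2 queries, 5 key/value pairs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 16))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (2, 16) (2, 5)
```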
<h3 id="multi-head-attention-layer">Multi-Head Attention Layer</h3>
<p>To give the self-attention of the model more representational power, the authors created the <strong>Multi-Head Attention Layer</strong>. Instead of computing a single attention function, the Scaled Dot-Product Attention is split into several blocks called heads, each running in parallel. Each head independently computes attention, and the outputs of all heads are then concatenated to form the final output.</p>
<figure class="align-center ">
    <img loading="lazy" src="/posts/2024/multihead.PNG#center"
         alt="Multi-Head Attention Layer" width="30%"/> <figcaption>
            <p>Fig. 6. Multi-Head Attention Layer <a href="https://arxiv.org/pdf/1706.03762" target="_blank" rel="noopener">Attention is all you need</a>.</p>
        </figcaption>
</figure>

<p>The input of each head is first fed into three distinct fully connected layers to create the query, key and value vectors. These transformations allow the network to learn different types of relationships between tokens. The idea is that the attention block matches each query against the set of keys to obtain the attention weights, which are then applied to the values.</p>
<figure class="align-center ">
    <img loading="lazy" src="/posts/2024/multilinear.png#center"
         alt="Multi-Head Attention Layer representation" width="85%"/> <figcaption>
            <p>Fig. 7. Example of Multi-Head Attention Layer mechanism.</p>
        </figcaption>
</figure>

<p>After computing attention in each head, the results are concatenated and passed through a linear projection layer. This ensures that the output has the same dimensionality as the input, allowing for seamless integration with subsequent layers.</p>
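<p>Putting the pieces together, a multi-head self-attention layer might look like the sketch below. It assumes one \(d_{model} \times d_{model}\) projection per role that is then split across heads (equivalent to giving each head its own smaller projection), omits biases, and uses random weights; the names <code>W_q</code>, <code>W_k</code>, <code>W_v</code> and <code>W_o</code> are hypothetical.</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Three distinct linear projections create queries, keys and values.
    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    # Scaled dot-product attention, computed for all heads at once.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores, axis=-1) @ V           # (num_heads, seq_len, d_head)
    # Concatenate the heads and apply the final linear projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (4, 8)
```

<p>The output has the same shape as the input, which is what lets these layers be stacked \(N\) times.</p>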
<h4 id="masked-multi-head-attention-block">Masked Multi-Head Attention block</h4>
<p>The encoder block has <strong>two</strong> sub-layers: a <strong>multi-head attention</strong> layer and a <strong>feed forward</strong> layer. Each sub-layer is wrapped in a residual connection followed by layer normalization. The <em>residual connections</em> help the network train by allowing gradients to flow directly through the network, while the normalization stabilizes training.</p>
<p>The decoder block consists of <strong>three</strong> sub-layers: two multi-head attention layers and one feed-forward layer. The first multi-head attention layer in the decoder is masked to prevent it from attending to future tokens. This is achieved by applying a mask to the attention score matrix before computing softmax, ensuring that predictions for a given token do not depend on future tokens.</p>
<p>In the second multi-head attention layer, the keys and values come from the encoder’s output, while the queries are derived from the output of the first (masked) attention layer in the decoder. This mechanism enables the decoder to integrate information from the encoder while maintaining the structure of previously generated tokens. The final output is then processed by a feed-forward layer before being passed to a linear layer and a softmax function, which converts it into a probability distribution over possible output tokens.</p>
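<p>The "add and normalize" wrapper around a sub-layer can be sketched as follows, here applied to the feed-forward sub-layer. This is a simplified illustration: the learnable gain and bias of layer normalization are omitted, and the helper names, sizes and random weights are assumptions.</p>

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    # The learnable gain and bias of a full layer norm are omitted here.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward network with a ReLU in between.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def add_and_norm(x, sublayer):
    # Residual connection followed by layer normalization.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = add_and_norm(x, lambda t: feed_forward(t, W1, b1, W2, b2))
print(out.shape)  # (4, 8)
```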
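<p>A minimal sketch of the masking step: positions above the diagonal of the score matrix (the future tokens) are set to \(-\infty\) before the softmax, so their weights become exactly zero. The helper names and the all-zero placeholder scores are assumptions for illustration.</p>

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal marks the future positions to hide.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_softmax(scores, mask):
    # Masked scores become -inf, so exp() maps them to a weight of 0.
    scores = np.where(mask, -np.inf, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Placeholder scores: all equal, so the visible tokens share the weight evenly.
scores = np.zeros((4, 4))
weights = masked_softmax(scores, causal_mask(4))
print(weights[1])  # [0.5 0.5 0.  0. ]
```

<p>Row \(t\) of the weights only covers tokens \(0\) to \(t\), which is exactly the causal constraint.</p>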
<h3 id="cross-attention">Cross-Attention</h3>
<p>The interaction between the encoder and decoder is facilitated by <strong>cross-attention</strong>. In the second multi-head attention layer of the decoder, the queries originate from the decoder’s previous sub-layer, while the keys and values come from the encoder’s output. This allows the decoder to focus on relevant parts of the input sequence when generating each token in the output. Cross-attention is essential for tasks such as machine translation, where the output sequence depends heavily on the input sequence.</p>
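<p>In code, cross-attention differs from self-attention only in where the inputs come from. The single-head sketch below uses hypothetical names, toy sizes and random weights: queries are projected from the decoder states, keys and values from the encoder output.</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, W_q, W_k, W_v):
    Q = decoder_states @ W_q   # queries come from the decoder
    K = encoder_output @ W_k   # keys come from the encoder
    V = encoder_output @ W_v   # values come from the encoder
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)   # (target_len, source_len)
    return weights @ V, weights

# Toy sizes (hypothetical): 6 source tokens, 3 target tokens generated so far.
rng = np.random.default_rng(0)
source_len, target_len, d_model = 6, 3, 8
encoder_output = rng.normal(size=(source_len, d_model))
decoder_states = rng.normal(size=(target_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = cross_attention(decoder_states, encoder_output, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (3, 8) (3, 6)
```

<p>Each row of the weights shows how strongly one target position attends to every source token, and no causal mask is needed because the whole input sentence is already known.</p>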
<h1 id="references">References</h1>
<p>[1] Vaswani et al. <a href="https://arxiv.org/pdf/1706.03762" target="_blank" rel="noopener">Attention is all you need</a>, NIPS 2017.</p>
<p>[2] Bahdanau et al. <a href="https://arxiv.org/pdf/1409.0473" target="_blank" rel="noopener">Neural Machine Translation by Jointly Learning to Align and Translate</a>, ICLR 2015.</p>
<h1 id="citation">Citation</h1>
<blockquote>
<p>Alàs Cercós, Oriol. (Feb 2025). Introduction to Attention Mechanism and Transformers. <a href="https://oriolac.github.io/posts/2024-10-29-attention/" target="_blank" rel="noopener">https://oriolac.github.io/posts/2024-10-29-attention/</a>.</p></blockquote>
<pre tabindex="0"><code>@article{alas2025,
  title   = &#34;Introduction to Attention Mechanism and Transformers.&#34;,
  author  = &#34;Alàs Cercós, Oriol&#34;,
  journal = &#34;oriolac.github.io&#34;,
  year    = &#34;2025&#34;,
  month   = &#34;February&#34;,
  url     = &#34;https://oriolac.github.io/posts/2024-10-29-attention/&#34;
}
</code></pre>]]></content:encoded></item></channel></rss>