<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Graph-Neural-Networks on Oriol Alàs Cercós</title><link>https://oriolac.github.io/tags/graph-neural-networks/</link><description>Recent content in Graph-Neural-Networks on Oriol Alàs Cercós</description><generator>Hugo -- 0.150.0</generator><language>en-us</language><copyright>Oriol Alàs Cercós</copyright><lastBuildDate>Wed, 24 Jun 2026 17:10:23 +0100</lastBuildDate><atom:link href="https://oriolac.github.io/tags/graph-neural-networks/index.xml" rel="self" type="application/rss+xml"/><item><title>Under the Hood of Graph Neural Networks: Message Passing, Over-Smoothing and Attention</title><link>https://oriolac.github.io/posts/20260624-gnns/</link><pubDate>Wed, 24 Jun 2026 17:10:23 +0100</pubDate><guid>https://oriolac.github.io/posts/20260624-gnns/</guid><description>&lt;p&gt;In this post we will present an introduction of how &lt;strong&gt;Spatial Graph Neural Networks (GNNs)&lt;/strong&gt; or &lt;strong&gt;Graph Convolutional
Neural
Networks (GCNs)&lt;/strong&gt; work. First, we are going to define graph data structures. Then, we are going to explain the mechanism
on GNNs. And finally, we will explain how to incorporate an attention mechanism in the network.&lt;/p&gt;
&lt;div class="callout callout-info" role="note"&gt;
&lt;div class="callout-body"&gt;
&lt;p class="callout-title"&gt;
&lt;svg xmlns="http://www.w3.org/2000/svg" width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"&gt;&lt;circle cx="12" cy="12" r="10"/&gt;&lt;path d="M12 16v-4"/&gt;&lt;path d="M12 8h.01"/&gt;&lt;/svg&gt;
Notation of GNNs&lt;/p&gt;
&lt;div class="callout-content"&gt;&lt;blockquote&gt;
&lt;p&gt;During the whole text, we will use the notation of GNN as Spatial Graph Neural Network, although GCN or Graph
Convolutional Neural Network is another notation to say it. There are other types of GNNs like Spectral Graph Neural
Networks, but in this post we will focus on the first mentioned ones.&lt;/p&gt;</description><content:encoded><![CDATA[<p>In this post we will present an introduction of how <strong>Spatial Graph Neural Networks (GNNs)</strong> or <strong>Graph Convolutional
Neural
Networks (GCNs)</strong> work. First, we are going to define graph data structures. Then, we are going to explain the mechanism
on GNNs. And finally, we will explain how to incorporate an attention mechanism in the network.</p>
<div class="callout callout-info" role="note">
  <div class="callout-body">
    <p class="callout-title">
    <svg xmlns="http://www.w3.org/2000/svg" width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><circle cx="12" cy="12" r="10"/><path d="M12 16v-4"/><path d="M12 8h.01"/></svg>

      Notation of GNNs</p>
    <div class="callout-content"><blockquote>
<p>During the whole text, we will use the notation of GNN as Spatial Graph Neural Network, although GCN or Graph
Convolutional Neural Network is another notation to say it. There are other types of GNNs like Spectral Graph Neural
Networks, but in this post we will focus on the first mentioned ones.</p></blockquote></div>
  </div>
</div>

<h2 id="what-are-graphs">What are graphs?</h2>
<p>Graphs are data structures with two main elements: the <strong>nodes</strong> and their relationships between elements, named
<strong>edges</strong>. The core of graphs is that we can define very easily the interaction of entities. Here, the links are key to
the power of relational-data. Graphs are used in several applications, for instance, in social networks, distribution
networks, and state-machines.</p>
<h3 id="first-steps">First steps</h3>
<p>We have already seen three clear concepts:</p>
<ul>
<li><strong>Graph</strong>. A data type consisting of nodes and edges.</li>
<li><strong>Nodes/Vertices</strong>. Endpoint in a graph, they are also called vertices or points.</li>
<li><strong>Edges</strong>. The link or relationship between nodes.</li>
</ul>
<p>When trying to describe how is a specific graph, we could always draw the elements and the link between them if
necessary. Although it is great to see some properties in a straight-forward way, we could not calculate anything
from it. One of the most common mathematical formalisms for describing relations in graphs are <strong>adjacency matrices</strong>.</p>
<figure class="align-center ">
    <img loading="lazy" src="/posts/2026/gnn/adjacency_matrix.png#center"
         alt="Adjacency matrix of a graph." width="40%"/> <figcaption>
            <p>Fig. 1. Adjacency matrix of a graph.</p>
        </figcaption>
</figure>

<p>Graphs can be classified as directed or unidrected. In <strong>undirected graphs</strong>, the adjacency matrix is a symmetric matrix
across the diagonals. Each column and row
correspond to
a node and their values are their corresponding links to the nodes of the graph. For example, in column \(k\) we can see
the links of node \(v_k\) and in row \(k\) we can see the same links too, but transposed.</p>
<blockquote>
<p>Given two nodes \(v_j\) and \(v_i\), if the value of \(e_{i, j}\) or \(e_{j, i}\) is zero, that means that no link
exists between these nodes. Otherwise, there is a link between these nodes.</p></blockquote>
<p>If the adjacency matrix is not symmetric, we have a <strong>directed graph</strong>. The links of directed graphs have directions, so
that means that a node \(v_j\) can interact to \(v_i\), but it does not have to be the other way around. Directed edges
are usually represented by an arrow, denoting a one-way relationship.</p>
\[
G =
\begin{pmatrix}
1 & 0 & 0 & 1\\
0 & 0 & \textcolor{blue}{1} & \textcolor{blue}{1}\\
0 & \textcolor{red}{0} & 0 & 1\\
1 & \textcolor{red}{0} & 1 & 0
\end{pmatrix}
\qquad
\text{This is a directed graph.}
\]<p>If not all the non-zero values of the adjacency matrix are one, we are talking about <strong>weighted graphs</strong>. Weighted
graphs can have weighted links.</p>
<h3 id="other-key-terms">Other key terms</h3>
<p>Other key terms about edges and nodes are:</p>
<ul>
<li><strong>Self-loop</strong>. An edge that connects a node to itself.</li>
<li><strong>Parallel edges</strong>. Multiple edges that connect the same two nodes.</li>
<li><strong>Joint nodes or neighbours</strong> are those nodes that are directly connected via an edge. A node \(v_i\) is <strong>adjacent</strong>
to another node \(v_j\) if there is a edge.
The set of all the neighbours from a node is called its <strong>neighbourhood</strong>.</li>
</ul>
\[
\mathcal{N}(v_i) = \{ v_j \in V \mid (v_i, v_j) \in E \}
\]<figure class="align-center ">
    <img loading="lazy" src="/posts/2026/gnn/other_key_terms.png#center"
         alt="A graph that has a self-loop, parallel edges. The diagram shows the neighbourhood of node n (green node)." width="50%"/> <figcaption>
            <p>Fig. 2. A graph that has a self-loop, parallel edges. The diagram shows the neighbourhood of node.</p>
        </figcaption>
</figure>

<p>Once defined the different kind of graphs depending on the nodes and the edges, we can see other properties
checking the structure of a specific node.</p>
<ul>
<li><strong>Size</strong> is the number of edges.</li>
<li><strong>Order</strong> is the number of nodes.</li>
<li>The <strong>degree of a node</strong> is the number of edges associated to a node. It is the count of its adjacent nodes.</li>
<li>The <strong>degree distribution</strong> is the distribution of all the degrees of all nodes in a graph. In directed graphs, there
are two types of degrees: the <strong>in-degrees</strong> for edges directed to the node and the <strong>out-degrees</strong>, for edges
directed outward from the node.</li>
<li>The <strong>diameter</strong> of a graph is the maximum number of the longest shortest path.</li>
</ul>
<figure class="align-center ">
    <img loading="lazy" src="/posts/2026/gnn/basic-info.png#center"
         alt="Degrees and diameter of the graph. The diameter is 4 given the walk (0,1,2,5,6)."/> <figcaption>
            <p>Fig. 3. Degrees and diameter of the graph. The diameter is 4 given the walk (0,1,2,5,6).</p>
        </figcaption>
</figure>

<ul>
<li>A <strong>connected graph</strong> is a graph where all its nodes are connected. Otherwise, it is a <strong>disconnected graph</strong>.
<ul>
<li>For a disconnected graph, each disconnected piece is called a <strong>component</strong>.</li>
<li>For a directed graph, a <strong>strongly connected graph</strong> is when it is always possible to reach any node from any
other node. In directed graphs, we can have <strong>strongly connected components</strong> and <strong>weakly connected components</strong>.
<ul>
<li>Strongly connected components are the subgraphs that are connected to each other and any node in
the subgraph can reach any other node in the subgraph.</li>
<li>Weakly connected components
are subgraphs where any node can be reached but not all the nodes can reach each other.</li>
</ul>
</li>
</ul>
</li>
</ul>
<div class="callout callout-curiosity" role="note">

  <div class="callout-body">
    <p class="callout-title">
    <svg xmlns="http://www.w3.org/2000/svg" width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M15 14c.2-1 .7-1.7 1.5-2.5 1-.9 1.5-2.2 1.5-3.5A6 6 0 0 0 6 8c0 1 .2 2.2 1.5 3.5.7.7 1.3 1.5 1.5 2.5"/><path d="M9 18h6"/><path d="M10 22h4"/></svg>
      Graph Traversals</p>
    <div class="callout-content"><p>When inspecting a graph we might ask ourselves a lot of questions. For example, in a social network how many connections
do I have to hop to meet Meryl Streep? In a distribution system, which is the smallest number of hops from one node to
another? Which is the largest number of hops without repeating in a graph?</p>
<p>The trip that we have to do to travel from a given node to a second node is called <strong>traversal</strong> or <strong>walk</strong>. A walk is
<strong>open</strong> when the ending node is different from the starting node. If we start and end with the same node, we call it
<strong>closed walk</strong>.</p>
<p>A <strong>path</strong> is a walk when no node is repeated. When the path is closed, we call it <strong>cycle</strong>. The
<strong>diameter of a graph</strong> is the maximum number of steps of a path. A <strong>diameter</strong> is also called the <strong>longest shortest
path</strong>.</p>
<p>A <strong>trail</strong> is when the walk do not repeat edges. A <strong>circuit</strong> is when we have a closed trail.</p>
</div>
  </div>
</div>

<h2 id="how-graph-neural-networks-work">How Graph Neural Networks work?</h2>
<p>A Graph Neural Network (GNN) is a deep learning model that allows you to represent and learn from graphs
(Scarselli et al., 2009). The reason why
they were invented is due to the lack of <strong>inductive biases</strong> in the nature of traditional machine learning and deep
learning models such as <strong>MLP</strong>. In machine learning involving tabular data, images, or text, our data is organized in
an
expected way, with implicit and more explicit rules. When dealing with tabular data, rows are treated as observations
while columns as features. But in graph data, relations between &ldquo;rows&rdquo; or instances are meaningful. IN <strong>GNNs</strong>, we have
the information of each node, called <strong>embeddings</strong> and the links that connects each node to its neighbourhood.</p>
<div class="callout callout-info" role="note">
  <div class="callout-body">
    <p class="callout-title">
    <svg xmlns="http://www.w3.org/2000/svg" width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><circle cx="12" cy="12" r="10"/><path d="M12 16v-4"/><path d="M12 8h.01"/></svg>

      Main idea of GNNs</p>
    <div class="callout-content"><blockquote>
<p>The essential idea of graph neural networks is to <strong>iteratively update the node representations</strong> by combining the
representations of their neighbours and their own representations.</p></blockquote>
<blockquote>
<p>The main objective of a Graph Neural Network is to encode the data structure to then predict. But, what? There are
several tasks that we could come up with, but they can be classified as:</p>
<ul>
<li><strong>Node-level tasks</strong>. Given a graph, classify the nodes, use regression, or detect anomalies.</li>
<li><strong>Edge-level tasks</strong>. Similar to the node-level tasks, but for edges.</li>
<li><strong>Graph-level tasks</strong>. Graph classification or regression. <em>Graph AutoEncoders (GAE)</em> could be also in this
category.</li>
</ul></blockquote>
</div>
  </div>
</div>

<p>In section 1 we have already understood the importance of the links when defining a graph. GNNs came up to use this
nature in the network architecture: use a mechanism to <strong>encode and exchange information</strong> across the
<strong>graph structure</strong> during the inference: the <strong>graph convolutional layers (GCL)</strong>. The GCN layer \(l\) is
mathematically
defined as</p>
\[
\mathbf{x}_i^{(l)} = \sum_{j \in \mathcal{N}(i) \cup \{ i \}} \frac{1}{\sqrt{\deg(i)} \cdot \sqrt{\deg(j)}} \cdot
\left(\mathbf{W}^{\top} \cdot \mathbf{x}_j^{(l-1)} \right) + \mathbf{b},
\]<p>where neighboring node features are first transformed by a learnable weight matrix \(\mathbf{W}^{\top}\),
normalized by their degree, and finally summed up. Lastly, we apply the bias vector to the aggregated output.</p>
<p>However, when we talk unformally about GCL, we talk about a bundle of sequential layers:</p>
<ul>
<li><strong>Message Passing layer</strong>: where information is aggregated from neighbours and updated for each node.</li>
<li><strong>Activation Layer</strong>: where the information is passed to the next layer.</li>
<li><strong>Dropout layer</strong>: switching off some neurons to improve generalization and performance.</li>
<li><strong>Normalization Layer</strong>: normally Batch Normalization, that means the activated outputs to zero with a variance of 1.</li>
</ul>
<h3 id="the-message-passing-method">The message-passing method</h3>
<p><strong>Message passing</strong> is designed specifically for asking about the graph data structure. For each node in the graph, each
message passing step represents a communication that spans nodes one hop away. If we want our node representations to
take account node from 5 hops from each node, we should need 5 message passing layers.</p>
<p>Message passing can be understood as a form of <strong>convolution</strong> but applied to the neighbourhood of the nodes instead of
a neighbourhood of pixels. Through convolution operators, we are encoding neighbour states to the current node and
gathering more global information of the graph.</p>
<div class="callout callout-curiosity" role="note">

  <div class="callout-body">
    <p class="callout-title">
    <svg xmlns="http://www.w3.org/2000/svg" width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M15 14c.2-1 .7-1.7 1.5-2.5 1-.9 1.5-2.2 1.5-3.5A6 6 0 0 0 6 8c0 1 .2 2.2 1.5 3.5.7.7 1.3 1.5 1.5 2.5"/><path d="M9 18h6"/><path d="M10 22h4"/></svg>
      Inductive biases of GNNs</p>
    <div class="callout-content"><blockquote>
<p>Hence, Graph Neural Networks can be considered an abstraction of <strong>Convolutional Neural Networks</strong>.</p></blockquote></div>
  </div>
</div>

<p>A popular way to introduce GCLs is to break down the filter into two operations:</p>
<ul>
<li><strong>AGGREGATE-NODES</strong>. Given a node \(v_i\) and its linked nodes \(\mathcal{N}(v_i)\), we aggregate the information of
each neighbour node:</li>
</ul>
\[
m_i^{(l)} = \operatorname{AGGREGATE}
\left(
\left\{ h_j^{(l)} \mid v_j \in \mathcal{N}(v_i) \right\}
\right)
\]<p>The aggregation function \(AGGREGATE\) is commonly a sum operation, although it can be the mean, minimum, maximum, or
multiplication.</p>
<figure class="align-center ">
    <img loading="lazy" src="/posts/2026/gnn/graph-message-agg.png#center"
         alt="Aggregate function of node 6. This picture" width="90%"/> <figcaption>
            <p>Fig. 4. Aggregate function of node \(6\). This step aggregates the embeddings of the neighbours of the node by multiplying or summing. The resulting value is the message \(m_6^{(l)}\)</p>
        </figcaption>
</figure>

<ul>
<li><strong>UPDATE-EMBEDDING</strong>. Then, we update the embedding of the node \(v_i\) with the aggregated information applying
linear transformations:</li>
</ul>
\[
h_i^{(l+1)} = \sigma \left(W^{(l)} \cdot \operatorname{CONCAT} \left(h_i^{(l)}, m_i^{(l)} \right)  + b^{(l)} \right)
\]<figure class="align-center ">
    <img loading="lazy" src="/posts/2026/gnn/graph-first-update.png#center"
         alt="Update embedding function: concat operator and bias addition" width="90%"/> <figcaption>
            <p>Fig. 5. We concat the current embedding of the node \(6\) with the aggregate message of its neighbours. The resulting value is multiplied by the kernel matrix \(W^{(l)}\) and added to the bias vector \(b^{(l)}\).</p>
        </figcaption>
</figure>

<p>Since we will be working on convolutions, a key item must be introduced: <strong>the kernel</strong> \(W^{(l)}\). The kernel is a
matrix that we will be using to transform the input data (from the neighbours) and highlight specific features from
them. The kernel is the learnable weight matrix to be optimized by our loss function, along with the bias \(b^{(l)}\).</p>
<div class="callout callout-info" role="note">
  <div class="callout-body">
    <p class="callout-title">
    <svg xmlns="http://www.w3.org/2000/svg" width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><circle cx="12" cy="12" r="10"/><path d="M12 16v-4"/><path d="M12 8h.01"/></svg>

      Inductive biases of GNNs</p>
    <div class="callout-content"><blockquote>
<p>Graph-based learning techniques focuses on approaches that are <strong>permutation invariant or equivariant</strong>. This means
that the model is not influenced by the ordering of the graph representation. Therefore, if we shuffle the rows and
the columns of the adjacency matrix, our results should not change.</p></blockquote></div>
  </div>
</div>

<p>Although this example helps to understand how GNNs work at first glance, it is not the most used algorithm. The generic
message-passing equation is</p>
\[
\mathbf{x}_i^{(l)} = \gamma^{(l)}
\left(
\mathbf{x}_i^{(l-1)},
\bigoplus_{j \in \mathcal{N}(i)}
\phi^{(l)}
\left(
\mathbf{x}_i^{(l-1)},
\mathbf{x}_j^{(l-1)},
\mathbf{e}_{j,i}
\right)
\right)
\]<p>where \(\bigoplus\) is the aggregator function, \(\phi^{(l)}\) is the message function at layer \(l\) and
\(\gamma^{(l)}\) is the update function. At first, it might look a bit messy since the order is not defined as it is in
the example above. So let&rsquo;s break it by parts again.</p>
<ul>
<li>The message function \(\phi^{(l)}\) defines what information each neighbor sends to node \(i\). In the first example,
the message function is based solely on sending the embedding information \(h_j^{(l)}\) to node \(i\). Nevertheless,
differentiable functions such as MLPs (Multi Layer Perceptrons) can be applied. In addition, the message function can
also have as an input the node embedding \(h_i^{(l)}\) as well as the link between the neighbours
(useful for weighted graphs).
In our example,
\[ \phi^{(l)} \left(\mathbf{x}_i^{(l-1)}, \mathbf{x}_j^{(l-1)}, \mathbf{e}_{j,i} \right) = x_j^{(l-1)}\]</li>
<li>The aggregator function \(\bigoplus\) is the permutation invariant function: sum, max, mean&hellip;</li>
<li>The update function \(\gamma^{(l)}\) is the function that updates the node embedding using the old embedding plus the
aggregated message. As you can see, the generic message-passing joins the update with the concat function, which is
not needed to be a neither concatenation. Normally, the update functions are also differentiable
functions such as MLPs. The following equation is the update function for the first example:
\[ \gamma^{(l)} \left(\mathbf{x}_i^{(l-1)}, \bigoplus_{j \in \mathcal{N}(i)} x_j^{(l-1)} \right) = 
\sigma \left(W^{(l)} \cdot 
\operatorname{CONCAT} \left(x_i^{(l-1)}, \bigoplus_{j \in \mathcal{N}(i)} x_j^{(l-1)} \right) + b^{(l)} \right)\]</li>
</ul>
<h3 id="how-to-apply-gnns-to-downstreaming-tasks">How to apply GNNs to downstreaming tasks?</h3>
<p>Since now, we have introduced how to process the data of the nodes with the information of their neighbours. However,
we did not apply the information of the nodes to the downstream tasks. In a lot of cases, the GNNs are used only as
encoders, while decoders or <strong>head networks</strong>, such as MLPs, are used to apply the learned graph representation to
downstreaming tasks such as node classification, edge prediction, or graph classification.</p>
<p>The GNNs will output the node embeddings, we must think another way around to apply downstream tasks. For example, we
can have an MLP to classify the node given the output embeddings given a node
(treating then nodes as instances or rows). We could also apply a function to the embeddings of two nodes to get the
distance between them and, then, predict if they are connected or not. Regarding graph classification, we could apply
global pooling to the embeddings of the nodes to get the final representation of the graph and, then, use an MLP to
classify it.</p>
<figure class="align-center ">
    <img loading="lazy" src="/posts/2026/gnn/downstream-tasks.png#center"
         alt="Downstream tasks in GNNs" width="90%"/> <figcaption>
            <p>Fig. 6. Given the output of the GNN, we can apply downstream tasks such as node, edge, or graph prediction (regression or classification).</p>
        </figcaption>
</figure>

<h2 id="attention-mechanism-in-gnns">Attention mechanism in GNNs</h2>
<p>A GNN with a lot of layers makes the nodes to be more similar to each other, since they will have more global
information than first layers. Although it can be great for some tasks, the embedding nodes tend to be more similar to
each other, vanishing local information. This problem is called <strong>over-smoothing</strong>. A way to check if our GNN
over-smooths the embeddings of the nodes is to calculate the similarity between each embedding and the average
embedding of the nodes. If the similarity of the node and the average is close to 1, it means that the GNN is clearly
over-smoothing.</p>
<div class="callout callout-curiosity" role="note">

  <div class="callout-body">
    <p class="callout-title">
    <svg xmlns="http://www.w3.org/2000/svg" width="18" height="18" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M15 14c.2-1 .7-1.7 1.5-2.5 1-.9 1.5-2.2 1.5-3.5A6 6 0 0 0 6 8c0 1 .2 2.2 1.5 3.5.7.7 1.3 1.5 1.5 2.5"/><path d="M9 18h6"/><path d="M10 22h4"/></svg>
      To over-smooth or not to over-smooth?</p>
    <div class="callout-content">Depending on the case, we do not care if some nodes are &ldquo;over-smoothed&rdquo;, especially those that are central or has
<strong>closeness centrality</strong>. Furthermore, in some downstream tasks, these nodes might be important.</div>
  </div>
</div>

<p>There are different techniques to avoid over-smoothing such as applying skip connections,
similar to U-Net (Ronneberger et al., 2015) or
ResNets (He et al., 2016).
As you could notice, nodes that have more links tend to be more similar to the average embedding.
Specifically, one of the techniques that helps to prevent over-smoothing in these cases is to include <strong>attention</strong>
in the message passing. Therefore, the attention mechanism allows to learn which nodes the network has
to put emphasis on.</p>
<div class="callout callout-seealso" role="complementary">
    <div class="callout-body">
        <p class="callout-title">
                <svg xmlns="http://www.w3.org/2000/svg" width="18" height="18" viewBox="0 0 24 24" fill="none"
                 stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round">
                <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07l-1.72 1.71"/>
                <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 0 0 7.07 7.07l1.71-1.71"/>
            </svg>
            See also
        </p>
        <a href="/posts/20241029-attention/" class="callout-seealso-link">Introduction to Attention Mechanism and Transformers <span aria-hidden="true">→</span></a>
    </div>
</div>

<p>There are two different attention mechanisms that became popular in the literature: <strong>GAT</strong> (Veličković et al., 2018)
and <strong>GATv2</strong> (Brody et al., 2022). The
difference between them is where they apply the attention mechanism.</p>
<h3 id="graph-attention-network-gat">Graph Attention Network (GAT)</h3>
<p>Computes the attention weights once per training loop by using individual node and neighborhood features, being static
across all layers.</p>
<p>In standard GAT, the unnormalized attention score is usually written as:</p>
\[
e_{i, j} = \text{LeakyReLU} (a^T [W \cdot h_i || W \cdot h_j])
\]<p>where \(a\) is the learnable attention vector, and \(W\) another learnable parameter. We can have also two different
learnable parameters \(W_s\) (for the source \(h_i\)) and \(W_t\) (for the target \(h_j\)).</p>
<figure class="align-center ">
    <img loading="lazy" src="/posts/2026/gnn/gat.png#center"
         alt="Unnormalised Attention Score mechanism in GAT." width="90%"/> <figcaption>
            <p>Fig. 7. Unnormalised Attention Score mechanism in GAT.</p>
        </figcaption>
</figure>

<p>Then, softmax is applied to normalize the attention scores.</p>
\[
\alpha_{i, j} = \text{softmax}_{j \in \mathcal{N}(i)} (e_{i, j})
\]<p>We can express GAT and GATv2 as specific choices of: \(\phi^{(l)}\). For example:</p>
\[
\phi^{(l)}_{\text{GAT}} (x_i^{(l-1)}, x_j^{(l-1)}, e_{i,j}) = \alpha_{i, j} \cdot x_j^{(l-1)}
\]<p>We can also add a learnable parameter to process the attention vector alongside with the neighbour node embedding.</p>
\[
\phi^{(l)}_{\text{GAT}} (x_i^{(l-1)}, x_j^{(l-1)}, e_{i,j}) = \alpha_{i, j} \cdot W^{(l)} \cdot x_j^{(l-1)}
\]<p>The problem found in GAT is subtle: the attention vector can be decomposed as:</p>
\[
a^T [W h_i || W h_j] = a_s^T \; W \; h_i + a_t^T \; W \; h_j
\]<p>For a fixed source node \(i\), the term \(a_s^T \; W \; h_i \) is constant to all its neighbours \(j\). Therefore, the
ranking is mostly determined by \(a_t^T \; W \; h_j \). This kind of attention mechanism is called <strong>static</strong>, since
the ranking of the neighbours does not depend on the source node but only the target node. That means that the
attention could be calculated <strong>before knowing the link between the nodes</strong>. In other words, the attention can be
calculated and cached for the whole graph.</p>
<h3 id="gatv2">GATv2</h3>
<p>GATv2 changes the order of operations in the attention mechanism.</p>
\[
e_{i, j} = a^T \text{LeakyReLU} ( W [ h_i || h_j]) =  a^T \text{LeakyReLU} ( W_s h_i + W_t h_j)
\]<p>As in GAT, we can have two learnable parameters \(W_s\) and \(W_t\) for the attention scores, one for the
source node and one for the target node plus the attention vector \(a\).
Then, we will apply softmax to normalize the attention scores.</p>
\[
\alpha_{i,j} = \text{softmax}_j \left( e_{i, j} \right)
\]<p>The main difference is that the non-linearity is applied before the final attention projection.
This allows the source node \(i\) to influence the ranking of the neighbours \(j\). This small change makes GATv2
<strong>dynamic attention</strong>, showing more expressive attention than GAT while keeping the same parametric cost.
Since this mechanism implies calculating the attention for each link, the attention cannot be calculated for the whole
graph but for each node. Hence, the number of links from the nodes (the order) increases the cost.</p>
<p>Using the generic message-passing equation, we can rewrite the GATv2 attention function as it was with GAT:</p>
\[
\phi^{(l)}_{\text{GATv2}} (x_i^{(l-1)}, x_j^{(l-1)}, e_{i,j}) = \alpha_{i,j} \; \text{LeakyReLU} ( W_s h_i + W_t h_j),
\]<p>but the difference is in the attention score already explained. We can keep \(\gamma^{(l)} \) and \(\bigoplus\)
the same.</p>
<p>It is important to note that the attention mechanism adds new learnable parameters to the model, which can be
significantly increased in the number of parameters. However, the attention mechanism can help not only to prevent
over-smoothing but detect specific important links in the graph.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Graph Neural Networks extend deep learning to data where the relationships between entities are as important as the
entities themselves. Instead of processing nodes independently, GNNs exploit the graph structure through message
passing, allowing each node to update its representation by combining its own features with information coming
from its neighbourhood.</p>
<p>Although GNNs adds a new dimension to deep learning and graph understanding, stacking many message-passing layers can
lead to over-smoothing. Attention mechanisms help address this issue by allowing the model to learn which neighbours
should contribute more strongly to each node update.</p>
<hr>
<h1 id="references">References</h1>
<p>Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., &amp; Monfardini, G. (2009). The graph neural network model. IEEE
Transactions on Neural Networks, 20(1), 61–80. <a href="https://doi.org/10.1109/TNN.2008.2005605" target="_blank" rel="noopener">https://doi.org/10.1109/TNN.2008.2005605</a></p>
<p>He, K., Zhang, X., Ren, S., &amp; Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778). IEEE. <a href="https://doi.org/10.1109/CVPR.2016.90" target="_blank" rel="noopener">https://doi.org/10.1109/CVPR.2016.90</a></p>
<p>Ronneberger, O., Fischer, P., &amp; Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In
N. Navab, J. Hornegger, W. M. Wells, &amp; A. F. Frangi (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015 (pp. 234–241). Springer. <a href="https://doi.org/10.1007/978-3-319-24574-4_28" target="_blank" rel="noopener">https://doi.org/10.1007/978-3-319-24574-4_28</a></p>
<p>Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., &amp; Bengio, Y. (2018). Graph attention networks.
International Conference on Learning Representations. <a href="https://openreview.net/forum?id=rJXMpikCZ" target="_blank" rel="noopener">https://openreview.net/forum?id=rJXMpikCZ</a></p>
<p>Brody, S., Alon, U., &amp; Yahav, E. (2022). How attentive are graph attention networks? International Conference on
Learning Representations. <a href="https://openreview.net/forum?id=F72ximsx7C1" target="_blank" rel="noopener">https://openreview.net/forum?id=F72ximsx7C1</a></p>
<hr>
<h1 id="citation">Citation</h1>
<pre tabindex="0"><code>@article{alas2026,
  title   = &#34;Under the Hood of Graph Neural Networks: Message Passing, Over-Smoothing and Attention&#34;,
  author  = &#34;Alàs Cercós, Oriol&#34;,
  journal = &#34;oriolac.github.io&#34;,
  year    = &#34;2026&#34;,
  month   = &#34;June&#34;,
  url     = &#34;https://oriolac.github.io/posts/20260624-gnns/&#34;
}
</code></pre>]]></content:encoded></item></channel></rss>