圖神經網絡(gnn)是一類功能強大的神經網絡，它對圖結構資料進行操作。它們通過從節點的局部鄰域聚合資訊來學習節點表示(嵌入)。這個概念在圖表示學習文獻中被稱為“消息傳遞”。

消息(嵌入)通過多個GNN層在圖中的節點之間傳遞。每個節點聚合來自其鄰居的消息以更新其表示。這個過程跨層重複，允許節點獲得編碼有關圖的更豐富資訊的表示。gnn的一主要變體有GraphSAGE[2]、Graph Convolution Network[3]等。

圖注意力網絡(GAT)[1]是一類特殊的gnn，主要的改進是消息傳遞的方式。他們引入了一種可學習的注意力機制，通過在每個源節點和目标節點之間配置設定權重，使節點能夠在聚合來自本地鄰居的消息時決定哪個鄰居節點更重要，而不是以相同的權重聚合來自所有鄰居的資訊。

圖注意力網絡在節點分類、連結預測和圖分類等任務上優于許多其他GNN模型。他們在幾個基準圖資料集上也展示了最先進的性能。

在這篇文章中，我們将介紹原始“Graph Attention Networks”（by Veličković ）論文的關鍵部分，并使用PyTorch實作論文中提出的概念，這樣以更好地掌握GAT方法。

論文引言

在第1節“引言”中對圖表示學習文獻中的現有方法進行了廣泛的回顧之後，論文介紹了圖注意網絡(GAT)。

然後将論文的方法與現有的一些方法進行比較，并指出它們之間的一般異同，這是論文的常用格式，就不多介紹了。

GAT的架構

本節是本文的主要部分，對圖注意力網絡的體系結構進行了詳細的闡述。為了進一步解釋，假設所提出的架構在一個有N個節點的圖上執行(V = {V′};i=1，…，N)，每個節點用向量h ^ (F個元素)表示，節點之間存在任意邊。

作者首先描述了單個圖注意力層的特征，以及它是如何運作的（因為它是圖注意力網絡的基本建構塊）。一般來說，單個GAT層應該将具有給定節點嵌入(表示)的圖作為輸入，将資訊傳播到本地鄰居節點，并輸出更新後的節點表示。

如上所述，ga層的所有輸入節點特征向量(h′)都是線性變換的(即乘以一個權重矩陣W)，在PyTorch中，通常是這樣做的:

import torch
from torch import nn
# in_features -> F and out_feature -> F'
in_features = ...
out_feature = ...
# instanciate the learnable weight matrix W (FxF')
W = nn.Parameter(torch.empty(size=(in_features, out_feature)))
# Initialize the weight matrix W
nn.init.xavier_normal_(W)
# multiply W and h (h is input features of all the nodes -> NxF matrix)
h_transformed = torch.mm(h, W)

獲得了輸入節點特征(嵌入)的轉換版本後我們先跳到最後檢視和了解GAT層的最終目标是什麼。

如論文所述，在圖注意層的最後，對于每個節點i，我們需要從其鄰域獲得一個新的特征向量，該特征向量更具有結構和上下文感覺性。

這是通過計算相鄰節點特征的權重和，然後是非線性激活函數σ來完成的。根據Graph ML文獻，這個權重和在一般GNN層操作中也被稱為“聚合”步驟。

論文的這些權重α′ⱼ∈[0,1]是通過一種關注機制來學習和計算的，該機制表示在消息傳遞和聚合過程中節點i的鄰居j特征的重要性。

每一對節點i和它的鄰居j計算這些注意權值α′ⱼ的計算方法如下

其中e ^ⱼ是注意力得分，在應用Softmax函數後，有權重都會在[0,1]區間内，并且和為1。現在通過注意函數a(…)計算每個節點i和它的鄰居j∈N′之間的注意分數e′ⱼ，如下所示:

上圖中的||表示兩個轉換後的節點嵌入的連接配接，a是大小為2 * F '(轉換後嵌入大小的兩倍)的可學習參數(即注意力參數)向量。而(a¹)是向量a的轉置，導緻整個表達式a¹[Wh′|| Whⱼ]是“a”與轉換後的嵌入的連接配接之間的點(内)積。

整個操作說明如下:

在PyTorch中，我們采用了一種稍微不同的方法。因為計算所有節點對之間的e′ⱼ然後隻選擇代表節點之間現有邊的那些是更有效的。來計算所有的e′ⱼ

# instanciate the learnable attention parameter vector `a`
a = nn.Parameter(torch.empty(size=(2 * out_feature, 1)))
# Initialize the parameter vector `a`
nn.init.xavier_normal_(a)
# we obtained `h_transformed` in the previous code snippet
# calculating the dot product of all node embeddings
# and first half the attention vector parameters (corresponding to neighbor messages)
source_scores = torch.matmul(h_transformed, self.a[:out_feature, :])
# calculating the dot product of all node embeddings
# and second half the attention vector parameters (corresponding to target node)
target_scores = torch.matmul(h_transformed, self.a[out_feature:, :])
# broadcast add 
e = source_scores + target_scores.T
e = self.leakyrelu(e)

代碼片段的最後一部分(# broadcast add)将所有一對一的源和目标分數相加，得到一個包含所有e′ⱼ分數的NxN矩陣。(下圖所示)

到目前為止，我們假設圖是完全連接配接的，我們計算的是所有可能的節點對之間的注意力得分。但是其實大部分情況下圖不可能是完全連接配接的，是以為了解決這個問題，在将LeakyReLU激活應用于注意力分數之後，注意力分數基于圖中現有的邊被屏蔽，這意味着我們隻保留與現有邊對應的分數。

它可以通過給不存在邊的節點之間的分數矩陣中的元素配置設定一個大的負分數(近似于-∞)來完成，這樣它們對應的注意力權重在softmax之後變為零（還記得我們以前發的注意力掩碼麼，就是一樣的道理）。

這裡的注意力掩碼是通過使用圖的鄰接矩陣來實作的。鄰接矩陣是一個NxN矩陣，如果節點i和j之間存在一條邊，則在第i行和第j列處為1，在其他地方為0。是以，我們通過将鄰接矩陣的零元素指派為-∞并在其他地方指派為0來建立掩碼。然後将掩碼添加到分數矩陣中。然後在它的行上應用softmax函數。

connectivity_mask = -9e16 * torch.ones_like(e)
# adj_mat is the N by N adjacency matrix
e = torch.where(adj_mat > 0, e, connectivity_mask) # masked attention scores

# attention coefficients are computed as a softmax over the rows
# for each column j in the attention score matrix e
attention = F.softmax(e, dim=-1)

最後，根據論文描述，在獲得注意力分數并将其與現有的邊進行掩碼遮蔽後，通過對分數矩陣的行執行softmax，得到注意力權重α¹ⱼ。

我們通過一個完整的可視化圖過程如下：

最後就是計算節點嵌入的權重和:

# final node embeddings are computed as a weighted average of the features of its neighbors
h_prime = torch.matmul(attention, h_transformed)

以上一個一個注意力頭的工作流程和原理，論文還引入了多頭的概念，其中所有操作都是通過多個并行的操作流來完成的。

多頭注意力和聚合過程如下圖所示:

節點1在其鄰域中的多頭注意力(K = 3個頭)，不同的箭頭樣式和顔色表示獨立的注意力計算。将來自每個頭部的聚合特征連接配接或平均以獲得h '。

為了以更簡潔的子產品化形式(作為PyTorch子產品)封裝實作并合并多頭注意力的功能，整個Graph關注層的實作如下:

import torch
from torch import nn
import torch.nn.functional as F
################################
### GAT LAYER DEFINITION ###
################################
class GraphAttentionLayer(nn.Module):
def __init__(self, in_features: int, out_features: int,
n_heads: int, concat: bool = False, dropout: float = 0.4,
leaky_relu_slope: float = 0.2):
super(GraphAttentionLayer, self).__init__()
self.n_heads = n_heads # Number of attention heads
self.concat = concat # wether to concatenate the final attention heads
self.dropout = dropout # Dropout rate
if concat: # concatenating the attention heads
self.out_features = out_features # Number of output features per node
assert out_features % n_heads == 0 # Ensure that out_features is a multiple of n_heads
self.n_hidden = out_features // n_heads
else: # averaging output over the attention heads (Used in the main paper)
self.n_hidden = out_features
# A shared linear transformation, parametrized by a weight matrix W is applied to every node
# Initialize the weight matrix W 
self.W = nn.Parameter(torch.empty(size=(in_features, self.n_hidden * n_heads)))
# Initialize the attention weights a
self.a = nn.Parameter(torch.empty(size=(n_heads, 2 * self.n_hidden, 1)))
self.leakyrelu = nn.LeakyReLU(leaky_relu_slope) # LeakyReLU activation function
self.softmax = nn.Softmax(dim=1) # softmax activation function to the attention coefficients
self.reset_parameters() # Reset the parameters
def reset_parameters(self):
nn.init.xavier_normal_(self.W)
nn.init.xavier_normal_(self.a)
def _get_attention_scores(self, h_transformed: torch.Tensor):

source_scores = torch.matmul(h_transformed, self.a[:, :self.n_hidden, :])
target_scores = torch.matmul(h_transformed, self.a[:, self.n_hidden:, :])
# broadcast add 
# (n_heads, n_nodes, 1) + (n_heads, 1, n_nodes) = (n_heads, n_nodes, n_nodes)
e = source_scores + target_scores.mT
return self.leakyrelu(e)
def forward(self, h: torch.Tensor, adj_mat: torch.Tensor):
n_nodes = h.shape[0]
# Apply linear transformation to node feature -> W h
# output shape (n_nodes, n_hidden * n_heads)
h_transformed = torch.mm(h, self.W)
h_transformed = F.dropout(h_transformed, self.dropout, training=self.training)
# splitting the heads by reshaping the tensor and putting heads dim first
# output shape (n_heads, n_nodes, n_hidden)
h_transformed = h_transformed.view(n_nodes, self.n_heads, self.n_hidden).permute(1, 0, 2)

# getting the attention scores
# output shape (n_heads, n_nodes, n_nodes)
e = self._get_attention_scores(h_transformed)
# Set the attention score for non-existent edges to -9e15 (MASKING NON-EXISTENT EDGES)
connectivity_mask = -9e16 * torch.ones_like(e)
e = torch.where(adj_mat > 0, e, connectivity_mask) # masked attention scores

# attention coefficients are computed as a softmax over the rows
# for each column j in the attention score matrix e
attention = F.softmax(e, dim=-1)
attention = F.dropout(attention, self.dropout, training=self.training)
# final node embeddings are computed as a weighted average of the features of its neighbors
h_prime = torch.matmul(attention, h_transformed)
# concatenating/averaging the attention heads
# output shape (n_nodes, out_features)
if self.concat:
h_prime = h_prime.permute(1, 0, 2).contiguous().view(n_nodes, self.out_features)
else:
h_prime = h_prime.mean(dim=0)
return h_prime

最後将上面所有的代碼整合成一個完整的GAT模型：

class GAT(nn.Module):
def __init__(self,
in_features,
n_hidden,
n_heads,
num_classes,
concat=False,
dropout=0.4,
leaky_relu_slope=0.2):
super(GAT, self).__init__()
# Define the Graph Attention layers
self.gat1 = GraphAttentionLayer(
in_features=in_features, out_features=n_hidden, n_heads=n_heads,
concat=concat, dropout=dropout, leaky_relu_slope=leaky_relu_slope
)

self.gat2 = GraphAttentionLayer(
in_features=n_hidden, out_features=num_classes, n_heads=1,
concat=False, dropout=dropout, leaky_relu_slope=leaky_relu_slope
)
def forward(self, input_tensor: torch.Tensor , adj_mat: torch.Tensor):
# Apply the first Graph Attention layer
x = self.gat1(input_tensor, adj_mat)
x = F.elu(x) # Apply ELU activation function to the output of the first layer
# Apply the second Graph Attention layer
x = self.gat2(x, adj_mat)
return F.softmax(x, dim=1) # Apply softmax activation function

方法對比

作者對GATs和其他一些現有GNN方法/架構進行了比較：

由于GATs能夠計算注意力權重并并行執行局部聚合，是以它比現有的一些方法計算效率更高。
GATs可以在聚合消息時為節點的鄰居配置設定不同的重要性，這可以實作模型容量的飛躍并提高可解釋性。
GAT不考慮節點的完整鄰域(不需要從鄰域采樣)，也不假設節點内部有任何排序。
通過将僞坐标函數設定為u(x, y) = f(x)||f(y)， GAT可以重新表述為MoNet的一個特定執行個體(Monti等人，2016)，其中f(x)表示(可能是mlp轉換的)節點x的特征，而||是連接配接;權函數為wj(u) = softmax(MLP(u))

基準測試

在論文的第三部分中，作者描述了評估GAT的基準、資料集和任務。然後，他們提出了他們對模型的評估結果。

論文中用作基準的資料集分為兩種類型的任務，轉換和歸納。

歸納學習:這是一種監督學習任務，其中模型僅在一組标記的訓練樣例上進行訓練，并且在訓練過程中完全未觀察到的樣例上對訓練後的模型進行評估和測試。這是一種被稱為普通監督學習的學習類型。

傳導學習:在這種類型的任務中，所有的資料，包括訓練、驗證和測試執行個體，都在訓練期間使用。但是在每個階段，模型隻通路相應的标簽集。這意味着在訓練期間，模型隻使用由訓練執行個體和标簽産生的損失進行訓練，但測試和驗證特征用于消息傳遞。這主要是因為示例中存在的結構和上下文資訊。

論文使用四個基準資料集來評估GATs，其中三個對應于傳導學習，另一個用作歸納學習任務。

轉導學習資料集，即Cora、Citeseer和Pubmed (Sen et al.， 2008)資料集都是引文圖，其中節點是已釋出的文檔，邊(連接配接)是它們之間的引用，節點特征是文檔的詞包表示的元素。

歸納學習資料集是一個蛋白質-蛋白質互相作用(PPI)資料集，其中包含不同人體組織的圖形(Zitnik & Leskovec, 2017)。資料集的較長的描述如下:

作者報告了四個基準測試的以下性能，顯示了GATs與現有GNN方法的可比結果。

總結

通過閱讀這篇文章并試用代碼，希望你能夠對GATs的工作原理以及如何在實際場景中應用它們有一個紮實的了解。

最後還有引用

[1] — Graph Attention Networks (2017), Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, Yoshua Bengio. arXiv:1710.10903v3

[2] — Inductive Representation Learning on Large Graphs (2017), William L. Hamilton, Rex Ying, Jure Leskovec. arXiv:1706.02216v4

[3] — Semi-Supervised Classification with Graph Convolutional Networks (2016), Thomas N. Kipf, Max Welling. arXiv:1609.02907v4

作者：Ebrahim Pichka

圖注意力網絡論文詳解和PyTorch實作

論文引言

GAT的架構

方法對比

基準測試

總結