PyTorch hooks

Introduction

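A hook lets you attach extra functionality without modifying the main code. In PyTorch the main code is the forward and backward pass, and the extra functionality operates on the model's variables: extracting feature maps, extracting the gradients of non-leaf tensors, modifying tensor gradients, and so on. With hooks you can read or change the values and gradients of a network's intermediate layers without changing the structure of its inputs and outputs. Hooks are widely used to visualize the features and gradients of intermediate layers.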

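There are three kinds of hooks in general: the tensor hook, the module forward hook, and the module backward hook.

In the computation graph used as the running example, x, y, and w are leaf nodes, and z is an intermediate variable.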

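In a PyTorch computation graph, only the gradients of the output with respect to leaf nodes are kept. Gradients of intermediate variables are used only during backpropagation and are freed as soon as it finishes (even though their requires_grad is True), which saves memory. You can also keep an intermediate node's gradient with retain_grad(), but that increases memory usage.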
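As a minimal sketch of the behaviour described above (the tensor names a and b are illustrative and not from the original), the following shows that an intermediate tensor's .grad stays None after backward() unless retain_grad() was called on it:

```python
import warnings

import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)  # leaf tensor
a = x * 3        # intermediate tensor, gradient not retained
b = a + 1        # intermediate tensor, gradient retained below
b.retain_grad()  # ask autograd to keep b.grad after backward()

out = b.sum()
out.backward()

# Reading .grad on a non-leaf tensor that did not retain it emits a
# warning in recent PyTorch versions, so silence it for this check.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    a_grad = a.grad

print(x.grad)  # leaf gradient is kept
print(a_grad)  # intermediate gradient was freed
print(b.grad)  # kept because of retain_grad()
```

retain_grad() is convenient when you only need to read the gradient afterwards; a hook is the tool to reach for when you also want to change it.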

torch.Tensor.register_hook(hook_fn)

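hook_fn(grad) -> Tensor or None, where grad is the gradient of this tensor.

hook_fn is a function we define ourselves. For the intermediate node z above, its input is the gradient of z and its output is a Tensor or None (None is typical when the hook only prints or records the gradient). During backpropagation, once the gradient reaches z, it is passed to hook_fn before it propagates further toward the earlier nodes. If hook_fn returns None, the gradient continues unchanged. If hook_fn returns a Tensor, that Tensor replaces z's original gradient and is propagated instead. Changing the gradient of an intermediate variable therefore also affects the gradients of the variables before it (x and y).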
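The effect of returning a Tensor can be sketched as follows. The graph here (z = x * y, out = w * z) is an assumption chosen for illustration, since the original figure is not available, and scale_grad is a hypothetical hook name:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
w = torch.tensor(4.0, requires_grad=True)

z = x * y  # intermediate node

def scale_grad(grad):
    # Returning a tensor replaces the gradient that flows on to x and y.
    return grad * 10

z.register_hook(scale_grad)

out = w * z
out.backward()

# d(out)/dz = w = 4, which the hook turns into 40.
print(x.grad)  # 40 * y = 120
print(y.grad)  # 40 * x = 80
print(w.grad)  # z = 6, not affected by the hook on z
```

Only the nodes upstream of z (x and y) see the scaled gradient; w receives its gradient through a path that does not pass through the hook.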

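Purpose: registers a backward hook function hook_fn. It is a method of the Tensor class and runs automatically whenever the tensor's gradient is computed.

Why backward? Because this hook is attached to a tensor, and the only thing a tensor has that is freed after the computation is its gradient, so this is a backward hook.

Example use: inside hook_fn you can change the gradient that the tensor ends up with. The PyTorch documentation says the hook should not modify its argument, so prefer returning a new tensor over operating on grad in place.

Here is a simple example that saves the grad of an intermediate node:

(Note: if you modify an intermediate node's gradient, the gradients of the variables before it also change through the chain rule.)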

import torch

def grad_hook(grad):
    y_grad.append(grad)

y_grad = list()
x = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
y = x+1
y.register_hook(grad_hook)

out = torch.mean(y*y)
out.backward()
print("type(y): ", type(y))
print("y.grad: ", y.grad)
print("y_grad[0]: ", y_grad[0])


Output:

type(y):  <class 'torch.Tensor'>
y.grad:  None
y_grad[0]:  tensor([[1.0000, 1.5000],
        [2.0000, 2.5000]])

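In the code above, x is a leaf node and y is an intermediate node. After backpropagation finishes, y.grad is None, which shows that the gradient of out with respect to y was freed.

The custom hook function grad_hook saves y's gradient into the global list y_grad, so after out.backward() finishes we can still read it from y_grad[0]: tensor([[1.0, 1.5], [2.0, 2.5]]). This equals y/2, because out is the mean of y*y over four elements.

Here is a hook that modifies grad: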

import torch

w = torch.tensor([1.], requires_grad=True)
x = torch.tensor([2.], requires_grad=True)
a = torch.add(w, x)
b = torch.add(w, 1)
y = torch.mul(a, b)

def grad_hook(grad):
    # Build new tensors instead of using grad *= 2: the PyTorch docs say
    # a hook should not modify its argument.
    grad = grad * 2
    return grad * 3

handle = w.register_hook(grad_hook)

y.backward()

# Inspect the gradient
print("w.grad: ", w.grad)
handle.remove()
# w.grad:  tensor([30.]) = 5*2*3

register_hook returns a handle; calling handle.remove() removes the hook that was added.
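A short sketch of the handle in use, assuming a hypothetical list named calls to record invocations: once remove() has been called, later backward passes no longer trigger the hook.

```python
import torch

calls = []  # records every gradient the hook sees

x = torch.tensor([1.0, 2.0], requires_grad=True)

def record_grad(grad):
    calls.append(grad.clone())  # returns None, so the gradient is unchanged

handle = x.register_hook(record_grad)

(x * 2).sum().backward()  # hook fires once
handle.remove()           # detach the hook
(x * 2).sum().backward()  # hook no longer fires

print(len(calls))  # number of times the hook ran
print(x.grad)      # gradients from both passes accumulate in x.grad
```

Removing hooks when they are no longer needed avoids accumulating stale references and unexpected side effects in later training steps.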

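In practice, for convenience, you can use a lambda expression instead of a named function. Here z stands for any tensor that requires grad:

# Takes grad and returns 2*grad, doubling the gradient. Once an intermediate
# node's gradient is modified, the gradients of earlier nodes change too,
# through the chain rule.
z.register_hook(lambda grad: 2 * grad)

z.register_hook(lambda grad: print(grad))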

A variable can have several hook_fn functions bound to it. During backpropagation they run in the order in which they were registered.
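The ordering can be checked with a small sketch. The hook names first and second are illustrative, and the chaining shown (each hook receiving the gradient returned by the previous one) is the behaviour observed in common PyTorch versions rather than something stated in the original text:

```python
import torch

order = []  # records the order in which hooks run

x = torch.tensor(2.0, requires_grad=True)

def first(grad):
    order.append("first")
    return grad + 1   # 3 -> 4

def second(grad):
    order.append("second")
    return grad * 10  # 4 -> 40

x.register_hook(first)
x.register_hook(second)

out = 3 * x  # d(out)/dx = 3
out.backward()

print(order)   # hooks ran in registration order
print(x.grad)  # (3 + 1) * 10
```

Because the hooks are chained, swapping the registration order would change the final gradient, so order matters whenever more than one hook returns a value.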

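Next come the hooks for network modules:

Unlike the tensors in the previous section, a module does not have an explicit variable name you can access directly; it is wrapped inside the network. Usually we only get the inputs and outputs of the whole network. For a module in the middle, it is hard to obtain the gradients of its inputs and outputs, and even their values are out of reach, unless the network was designed so that forward returns the intermediate module's output, or you go through the tedious process of splitting the network by module name and reassembling it to expose the features of the intermediate layers.

To solve this, PyTorch provides two hooks, register_forward_hook and register_backward_hook, which capture the features and gradients at a module's inputs and outputs during the forward and backward pass. They make it much easier to inspect the flow of information inside a model. In recent PyTorch releases, register_backward_hook is deprecated in favor of register_full_backward_hook.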

nn.Module.register_forward_hook(hook_fn)

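hook_fn(module, input, output) -> None or modified output. Changing input inside the hook has no effect on the forward pass, because the hook runs after forward() has been called.

This is the hook for a Module's forward pass: after the module's forward pass finishes, hook_fn is called automatically. Its purpose is to capture the inputs and outputs of each module during the forward pass.

The arguments of hook_fn are the module, the module's input, and the module's output. input is a tuple of the positional arguments passed to the module. A forward hook usually returns nothing and only observes; if it does return a value, that value replaces the module's output. With this hook we can conveniently extract features, such as feature maps, from a pretrained network without changing its structure.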

import torch
from torch import nn

# First, define a model
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(3, 4)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(4, 1)
        self.initialize()
    
    # For easy verification, set specific weights and biases
    def initialize(self):
        with torch.no_grad():
            self.fc1.weight = torch.nn.Parameter(
                torch.Tensor([[1., 2., 3.],
                              [-4., -5., -6.],
                              [7., 8., 9.],
                              [-10., -11., -12.]]))

            self.fc1.bias = torch.nn.Parameter(torch.Tensor([1.0, 2.0, 3.0, 4.0]))
            self.fc2.weight = torch.nn.Parameter(torch.Tensor([[1.0, 2.0, 3.0, 4.0]]))
            self.fc2.bias = torch.nn.Parameter(torch.Tensor([1.0]))

    def forward(self, x):
        o = self.fc1(x)
        o = self.relu1(o)
        o = self.fc2(o)
        return o

# Global lists that store the features of intermediate layers
total_feat_out = []
total_feat_in = []

# Define the forward hook function
def hook_fn_forward(module, input, output):
    print(module)  # identifies the module
    print('input', input)  # print first
    print('output', output)
    total_feat_out.append(output)  # then store both in the global lists
    total_feat_in.append(input)


model = Model()

modules = model.named_children()
for name, module in modules:
    module.register_forward_hook(hook_fn_forward)
    # module.register_backward_hook(hook_fn_backward)

# Note the shape of x below: it is 2D, and the first dimension
# is the batch size.

x = torch.Tensor([[1.0, 1.0, 1.0]]).requires_grad_() 
o = model(x)
o.backward()

print('==========Saved inputs and outputs==========')
for idx in range(len(total_feat_in)):
    print('input: ', total_feat_in[idx])
    print('output: ', total_feat_out[idx])
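A more compact variant of the same idea is sketched below, assuming a small nn.Sequential network and hypothetical names (features, make_hook). It stores each layer's output under its name, removes the hooks afterwards, and then shows that a value returned from a forward hook replaces the module's output:

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))

features = {}  # layer name -> detached output

def make_hook(name):
    def hook(module, inputs, output):
        features[name] = output.detach()  # observe only, return None
    return hook

handles = [m.register_forward_hook(make_hook(n))
           for n, m in net.named_children()]

x = torch.ones(1, 3)
y = net(x)

for h in handles:
    h.remove()  # stop collecting features

# A hook that returns a tensor replaces the module's output.
zero_handle = net[1].register_forward_hook(
    lambda module, inputs, output: torch.zeros_like(output))
y_zeroed = net(x)  # the last layer now sees all zeros, so y_zeroed is its bias
zero_handle.remove()

print(sorted(features))     # names of the hooked layers
print(features["1"].shape)  # output of the ReLU layer
```

Keying the stored features by layer name avoids relying on the order in which hooks fire, and detaching the outputs keeps the stored tensors from holding on to the autograd graph.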

nn.Module.register_backward_hook(hook_fn)

Like register_forward_hook, register_backward_hook captures the gradients at the input and output of each module during backpropagation. The signature of hook_fn is:

hook_fn(module, grad_input, grad_output) -> Tensor or None

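Its arguments are the module, the gradients at the module's input side, and the gradients at the module's output side. Note that input and output here are defined from the point of view of the forward pass, not the backward pass. For a linear module o = W*x + b, the input side is W, x, and b, and the output side is o. You can observe that a later layer's grad_input is related to the preceding layer's grad_output (they may be identical).

If a module has several inputs or outputs, grad_input and grad_output are tuples. For the linear module o = W*x + b, the input side consists of W, x, and b, so grad_input is a tuple of three elements.

Note the differences from a forward hook:

1. In a forward hook, input is x and does not include W and b. In a backward hook, grad_input holds the gradients of all three: b, x, and W.

2. A backward hook returns a Tensor or None. It cannot modify its arguments directly, but it can return a new grad_input, which is then propagated back to the previous module.

Here is an example: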

import torch
from torch import nn

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(3, 4)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(4, 1)
        self.initialize()

    def initialize(self):
        with torch.no_grad():
            self.fc1.weight = torch.nn.Parameter(
                torch.Tensor([[1., 2., 3.],
                              [-4., -5., -6.],
                              [7., 8., 9.],
                              [-10., -11., -12.]]))

            self.fc1.bias = torch.nn.Parameter(torch.Tensor([1.0, 2.0, 3.0, 4.0]))
            self.fc2.weight = torch.nn.Parameter(torch.Tensor([[10.0, 20.0, 30.0, 40.0]]))
            self.fc2.bias = torch.nn.Parameter(torch.Tensor([2.0]))

    def forward(self, x):
        o = self.fc1(x)
        o = self.relu1(o)
        o = self.fc2(o)
        return o

total_grad_out = []
total_grad_in = []

def hook_fn_backward(module, grad_input, grad_output):
    print(module)  # identifies the module
    # To follow the order of backpropagation, print grad_output first
    print('grad_output', grad_output)
    # then print grad_input
    print('grad_input', grad_input)
    # Save to the global lists
    total_grad_in.append(grad_input)
    total_grad_out.append(grad_output)

model = Model()

modules = model.named_children()
for name, module in modules:
    module.register_backward_hook(hook_fn_backward)

# requires_grad matters here: without it, when the backward hook reaches the first layer, the gradient with respect to x is None.
# Note the shape of x again: do not write torch.Tensor([1.0, 1.0, 1.0]).requires_grad_(), or the backward hook will misbehave.
x = torch.Tensor([[1.0, 1.0, 1.0]]).requires_grad_()
o = model(x)
o.backward()

print('==========Saved inputs and outputs==========')
for idx in range(len(total_grad_in)):
    print('grad output: ', total_grad_out[idx])
    print('grad input: ', total_grad_in[idx])

Output:

Linear(in_features=4, out_features=1, bias=True)
grad_output (tensor([[1.]]),)
grad_input (tensor([1.]), tensor([[10., 20., 30., 40.]]), tensor([[ 7.],
        [ 0.],
        [27.],
        [ 0.]]))
ReLU()
grad_output (tensor([[10., 20., 30., 40.]]),)
grad_input (tensor([[10.,  0., 30.,  0.]]),)
Linear(in_features=3, out_features=4, bias=True)
grad_output (tensor([[10.,  0., 30.,  0.]]),)
grad_input (tensor([10.,  0., 30.,  0.]), tensor([[220., 260., 300.]]), tensor([[10.,  0., 30.,  0.],
        [10.,  0., 30.,  0.],
        [10.,  0., 30.,  0.]]))
==========Saved inputs and outputs==========
grad output:  (tensor([[1.]]),)
grad input:  (tensor([1.]), tensor([[10., 20., 30., 40.]]), tensor([[ 7.],
        [ 0.],
        [27.],
        [ 0.]]))
grad output:  (tensor([[10., 20., 30., 40.]]),)
grad input:  (tensor([[10.,  0., 30.,  0.]]),)
grad output:  (tensor([[10.,  0., 30.,  0.]]),)
grad input:  (tensor([10.,  0., 30.,  0.]), tensor([[220., 260., 300.]]), tensor([[10.,  0., 30.,  0.],
        [10.,  0., 30.,  0.],
        [10.,  0., 30.,  0.]]))

Let z = x*W1 + b1, c = ReLU(z), and y = c*W2 + b2. You can verify the gradients yourself with the chain rule.
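One way to carry out that check is sketched below. It rebuilds the same computation with plain tensors and compares a hand-written chain rule against autograd. The names dy_dc, dy_dz, and dy_dx are illustrative, and because nn.Linear stores its weight as (out_features, in_features), the products are written with W.t():

```python
import torch

W1 = torch.tensor([[1., 2., 3.],
                   [-4., -5., -6.],
                   [7., 8., 9.],
                   [-10., -11., -12.]])
b1 = torch.tensor([1., 2., 3., 4.])
W2 = torch.tensor([[10., 20., 30., 40.]])
b2 = torch.tensor([2.])
x = torch.tensor([[1., 1., 1.]], requires_grad=True)

# Forward pass, matching fc1 -> relu1 -> fc2
z = x @ W1.t() + b1  # [[7, -13, 27, -29]]
c = torch.relu(z)    # [[7, 0, 27, 0]]
y = c @ W2.t() + b2
y.backward()

# Chain rule by hand
dy_dc = W2                                # gradient entering ReLU from fc2
dy_dz = dy_dc * (z.detach() > 0).float()  # ReLU passes gradient where z > 0
dy_dx = dy_dz @ W1                        # gradient with respect to x

print(dy_dz)   # matches grad_output of fc1 in the log above
print(dy_dx)   # matches the x entry of fc1's grad_input
print(x.grad)  # autograd agrees with the manual result
```

The intermediate values line up with the hook output above: [[10, 0, 30, 0]] at the input of ReLU and [[220, 260, 300]] for x.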

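Backward hooks are called in backpropagation order, so the output shows fc2 first, then relu1, and finally fc1. The derivative of ReLU is 1 where the input is greater than 0 and 0 otherwise.

Personal understanding:

For a Linear module, grad_input holds the gradients of y with respect to b1, x, and W1.

For a Linear module, grad_output is the gradient of y with respect to z.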
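The three-element grad_input shown above comes from the older register_backward_hook. As a hedged sketch of the newer API, the following uses register_full_backward_hook on an equivalent nn.Sequential model (the records list and the hook name are assumptions for illustration). With the full backward hook, grad_input contains only the gradients with respect to the module's inputs, so a Linear layer reports a single entry for x:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))
with torch.no_grad():  # same weights as the example above
    model[0].weight.copy_(torch.tensor([[1., 2., 3.],
                                        [-4., -5., -6.],
                                        [7., 8., 9.],
                                        [-10., -11., -12.]]))
    model[0].bias.copy_(torch.tensor([1., 2., 3., 4.]))
    model[2].weight.copy_(torch.tensor([[10., 20., 30., 40.]]))
    model[2].bias.copy_(torch.tensor([2.]))

records = []  # (module class name, grad_input, grad_output)

def hook_fn_full_backward(module, grad_input, grad_output):
    records.append((type(module).__name__, grad_input, grad_output))

for module in model.children():
    module.register_full_backward_hook(hook_fn_full_backward)

x = torch.ones(1, 3, requires_grad=True)
model(x).backward()

for name, grad_input, grad_output in records:
    print(name, "grad_output:", grad_output, "grad_input:", grad_input)
```

The hooks still fire in backpropagation order (last Linear, ReLU, first Linear). Gradients of W and b are no longer part of grad_input; they can be read from the parameters' .grad attributes after backward().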
