Common Activation Functions: Sigmoid, ReLU, Swish, Mish, GELU

sigmoid

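The sigmoid activation function is widely used in neural network models and often serves as the output layer for binary classification tasks. Its output range is (0, 1).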

Expression:

$$\sigma(z) = \frac{1}{1+e^{-z}}$$

Its derivative:

$$\sigma'(z) = \frac{0 - 1 \cdot (-e^{-z})}{(1+e^{-z})^2} = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{e^{-z}}{1+e^{-z}} \cdot \frac{1}{1+e^{-z}} = \frac{1+e^{-z}-1}{1+e^{-z}} \cdot \sigma(z) = (1-\sigma(z)) \cdot \sigma(z)$$
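The closed form σ'(z) = σ(z)(1 − σ(z)) can be checked numerically. The following is a minimal sketch in plain Python rather than the article's own code; the function names and the finite-difference step `h` are illustrative assumptions.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, branching on the sign of z so exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sigmoid_grad(z: float) -> float:
    """Closed-form derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def numeric_grad(f, z: float, h: float = 1e-5) -> float:
    """Central finite difference, used only as an independent check."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


for z in (-2.0, 0.0, 3.0):
    # The two columns should agree to many decimal places.
    print(z, sigmoid_grad(z), numeric_grad(sigmoid, z))
```

The derivative is largest at z = 0, where it equals 0.25.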

[Figure: the sigmoid curve]

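Advantages:

Smooth and easy to differentiate

Disadvantages:

Prone to vanishing gradients, because the derivative approaches 0 when |z| is large
The output is not zero-centered
Computing exp is relatively expensive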
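To make the vanishing-gradient point concrete, the sketch below prints the sigmoid derivative at a few inputs and a rough upper bound on the product of derivatives across a stack of layers. The depth `num_layers` is a hypothetical value, and the bound ignores the weights entirely, so it is an illustration rather than an analysis of any real network.

```python
import math


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


# The derivative peaks at z = 0 and decays quickly as |z| grows.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:4.1f}  sigma'(z) = {sigmoid_grad(z):.2e}")

# Each sigmoid contributes a factor of at most 0.25 to the backpropagated
# gradient, so a chain of them shrinks it at least this fast (weights ignored).
num_layers = 10  # hypothetical depth
upper_bound = 0.25 ** num_layers
print(f"bound after {num_layers} layers: {upper_bound:.2e}")
```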

tanh

tanh is the hyperbolic tangent function. Its output range is (-1, 1).

Expression:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Its curve can be viewed as the sigmoid curve stretched and shifted down, since $\tanh(x) = 2\sigma(2x) - 1$.

[Figure: the tanh curve; source: 激活函數總結 (Activation Function Summary)]

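Characteristics of the tanh activation function:

Compared with sigmoid:

The output range is (-1, 1), which fixes sigmoid's lack of zero-centered output;
The cost of computing exp remains;
Vanishing gradients are alleviated to some extent but still occur.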
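The comparison with sigmoid can be made concrete through the identity tanh(x) = 2σ(2x) − 1, and through the derivative 1 − tanh²(x), which peaks at 1 instead of 0.25. The sketch below assumes nothing beyond Python's `math` module; the helper names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x: float) -> float:
    """tanh written as a rescaled, shifted sigmoid: 2 * sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


for x in (-3.0, -0.5, 0.0, 1.2, 4.0):
    # The second and third columns should match up to rounding error.
    print(x, math.tanh(x), tanh_via_sigmoid(x), tanh_grad(x))
```

The gradient still goes to 0 for large |x|, which is why the vanishing-gradient issue is reduced rather than removed.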

ReLU

ReLU stands for Rectified Linear Unit.

Formula:

$$f(x) = \max(0, x)$$

[Figure: the ReLU curve]

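Advantages:

For positive inputs the gradient is constantly 1, which avoids vanishing gradients there and gives faster convergence than sigmoid and tanh; because the output is unbounded, exploding gradients still need to be guarded against;
Cheaper to compute than sigmoid and tanh (no exp), which improves speed;
In practice it often yields better models.

Disadvantages:

For negative inputs the output is always 0 and so is the gradient, so the neuron is not activated and may stop updating (the "dying ReLU" problem).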
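A small sketch of ReLU and its gradient shows both the advantage and the drawback listed above. The sample inputs are made up, and using 0 as the gradient at exactly x = 0 is a convention assumed here, since the derivative is undefined at that point.

```python
def relu(x: float) -> float:
    """ReLU: max(0, x)."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    # 1 for x > 0, 0 for x < 0; the value at x == 0 is a convention (0 here).
    return 1.0 if x > 0 else 0.0


pre_activations = [-3.0, -0.5, 0.0, 0.5, 3.0]  # made-up inputs
print([relu(x) for x in pre_activations])       # [0.0, 0.0, 0.0, 0.5, 3.0]
print([relu_grad(x) for x in pre_activations])  # [0.0, 0.0, 0.0, 1.0, 1.0]
```

A unit whose pre-activation stays negative for every training sample receives zero gradient on every step, so its weights never change.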

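Variants of ReLU

Leaky ReLU

$$f(x) = \begin{cases} x, & x > 0 \\ ax, & x \le 0 \end{cases}$$

Here a is a constant, typically set to 0.01.

PReLU

PReLU has the same form as Leaky ReLU, but a is a learnable parameter that is updated during training.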
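The difference between the two variants can be sketched as follows, assuming the usual form f(x) = x for x > 0 and f(x) = a·x otherwise. The initial slope, learning rate, and upstream gradient are hypothetical values chosen only to show one manual update of a; in a real framework the optimizer would do this.

```python
def leaky_relu(x: float, a: float = 0.01) -> float:
    """Leaky ReLU / PReLU forward pass: x for x > 0, a * x otherwise."""
    return x if x > 0 else a * x


def grad_wrt_a(x: float) -> float:
    """Partial derivative of the output with respect to the slope a."""
    return 0.0 if x > 0 else x


# Leaky ReLU: a is fixed.
print(leaky_relu(-2.0))  # small negative output instead of 0

# PReLU: a is trained. One hand-written gradient-descent step:
a = 0.25          # hypothetical initial slope
lr = 0.1          # hypothetical learning rate
x = -2.0          # sample input
upstream = 1.0    # hypothetical gradient arriving from the loss
a_new = a - lr * upstream * grad_wrt_a(x)
print(a_new)
```

Only negative inputs contribute to the gradient of a, since the positive branch does not depend on it.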

Swish

Swish is unbounded above, bounded below, smooth, and non-monotonic. On deep models it has been reported to outperform ReLU.

Expression:

$$\mathrm{swish}(x) = x \cdot \mathrm{sigmoid}(\beta x)$$

β is a constant or a trainable parameter.
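The properties listed for Swish can be observed with a few evaluations. This is a plain-Python sketch with β passed as an ordinary argument (an assumption for illustration; a trainable β would live in a framework parameter), and the sample points are chosen by hand.

```python
import math


def sigmoid(x: float) -> float:
    """Sigmoid that avoids overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return ex / (1.0 + ex)


def swish(x: float, beta: float = 1.0) -> float:
    """swish(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


# Non-monotonic: the value dips below zero and then rises back toward 0
# as x goes to minus infinity.
for x in (-5.0, -1.278, 0.0, 5.0):
    print(f"swish({x}) = {swish(x):.4f}")

# beta = 0 reduces Swish to the linear function x / 2.
print(swish(3.0, beta=0.0))
```

For large β the sigmoid factor approaches a step function, so Swish approaches ReLU.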


hard-Swish

This activation function was proposed in the MobileNetV3 paper. Compared with swish, it is numerically more stable and faster to compute.

$$\text{h-swish}(x) = x \cdot \frac{\mathrm{ReLU6}(x+3)}{6}$$

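import torch.nn as nn
import torch.nn.functional as F


class Hswish(nn.Module):
    """Hard-swish: x * ReLU6(x + 3) / 6."""

    def __init__(self, inplace=True):
        super().__init__()
        self.inplace = inplace

    def forward(self, x):
        # x + 3. is a new tensor, so the in-place ReLU6 never modifies x itself.
        return x * F.relu6(x + 3., inplace=self.inplace) / 6.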
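To see how closely h-swish tracks swish, the scalar sketch below evaluates both on a grid and reports the largest gap. It uses plain Python instead of PyTorch so it runs without extra dependencies; the grid range and helper names are assumptions made for illustration.

```python
import math


def relu6(x: float) -> float:
    """ReLU6: clamp x to the interval [0, 6]."""
    return min(max(x, 0.0), 6.0)


def hard_swish(x: float) -> float:
    """h-swish(x) = x * ReLU6(x + 3) / 6, piecewise polynomial, no exp."""
    return x * relu6(x + 3.0) / 6.0


def swish(x: float) -> float:
    """swish with beta = 1, for comparison."""
    return x / (1.0 + math.exp(-x))


grid = [i / 10.0 for i in range(-60, 61)]  # -6.0 ... 6.0 in steps of 0.1
max_gap = max(abs(hard_swish(x) - swish(x)) for x in grid)
print(f"largest |h-swish - swish| on the grid: {max_gap:.3f}")
```

h-swish is exactly 0 for x ≤ -3 and exactly x for x ≥ 3, which is where the two curves differ most.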

Mish

Expression:

$$\mathrm{Mish}(x) = x \cdot \tanh(\ln(1+e^{x}))$$


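import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish: x * tanh(softplus(x)), where softplus(x) = ln(1 + e^x)."""

    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x * torch.tanh(F.softplus(x))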
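Evaluating ln(1 + e^x) literally overflows for large x (for example, `math.exp(1000)` raises `OverflowError` in Python). A scalar sketch of Mish with a numerically stable softplus might look like this; the rewriting of softplus is a standard identity, and the function names are illustrative.

```python
import math


def softplus(x: float) -> float:
    """ln(1 + e^x), rewritten as max(x, 0) + ln(1 + e^(-|x|)) to avoid overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish(x) = x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for x in (-1000.0, -1.0, 0.0, 1.0, 1000.0):
    print(x, mish(x))
```

For large positive x the tanh factor saturates at 1 and Mish behaves like the identity; for large negative x the output goes to 0.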

GELU

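GELU stands for Gaussian Error Linear Unit. It builds in the idea of stochastic regularization and can be read as a probabilistic description of a neuron's input. The formula is:

$$\mathrm{GELU}(x) = x P(X \le x) = x \Phi(x)$$

where Φ(x) is the cumulative distribution function of the normal distribution.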

When X is assumed to follow the standard normal distribution, the paper gives an approximate formula for GELU(x):

$$\mathrm{GELU}(x) \approx 0.5x\left(1+\tanh\left(\sqrt{\frac{2}{\pi}}\,(x+0.044715x^{3})\right)\right)$$

Code:

import math

import torch


def gelu(x):
    """Exact GELU, x * Phi(x), computed with the Gaussian error function.

    For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
    0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
    Also see https://arxiv.org/abs/1606.08415
    """
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
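The erf-based form and the tanh approximation can be compared directly. The sketch below is a scalar version using only Python's `math` module; the grid over [-5, 5] is an arbitrary choice for illustration, not something taken from the paper.

```python
import math


def gelu_exact(x: float) -> float:
    """x * Phi(x), with Phi written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


grid = [i / 10.0 for i in range(-50, 51)]  # -5.0 ... 5.0 in steps of 0.1
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"largest gap on [-5, 5]: {max_gap:.1e}")
```

The two forms agree closely on this range, which is why the "slightly different results" mentioned in the docstring are usually negligible in practice.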

Reference: https://zhuanlan.zhihu.com/p/73214810

Reference: https://www.cnblogs.com/makefile/p/activation-function.html
