Text Classification with Deep Learning
word2vec
The basic idea of the word2vec model is to predict the words that appear in a context window. For each piece of input text, a context window and a center word are chosen, and the model predicts the probability of the other words in the window given that center word. As a result, a word2vec model can conveniently learn vector representations for new words from newly added corpora.
The core idea of word2vec is that words and their contexts predict each other. The two corresponding algorithms are:
- Skip-gram (SG): predict the context
- Continuous Bag of Words (CBOW): predict the target word
Intuitively, Skip-gram predicts the context given the input word, while CBOW predicts the input word given its context.
word2vec has two parts: first build the model, then extract word vectors from it. The overall modeling process is similar in spirit to an auto-encoder: train a neural network on the corpus, then take the hidden-layer weight matrix and use its rows as the "word vectors".
The word2vec paper also proposes two more efficient training methods:
- Hierarchical softmax
- Negative sampling
Skip-gram: principle and network structure
The Skip-gram process
Suppose we have the sentence "the dog barked at the mailman".
- First pick one word as the input word, e.g. "dog".
- Define the parameter skip_window, the number of words taken from one side (left or right) of the current input word. With skip_window=2 we take 2 words on each side, the window spans span=4 words, and the words in the window (including the input word) are ['the', 'dog', 'barked', 'at']. A second parameter, num_skips, is how many distinct words from the window are used as output words. With skip_window=2 and num_skips=2 we obtain two (input word, output word) training pairs: ('dog', 'barked') and ('dog', 'the').
- Train the neural network on these pairs so that it outputs a probability distribution: for every word in the vocabulary, the probability that it appears as an output word for the given input word. The previous step produced two training pairs with skip_window=2 and num_skips=2. If we train on ('dog', 'barked'), the model learns, for every word in the vocabulary, how likely that word is to be the output word when 'dog' is the input word.
In other words, the output probabilities tell us how likely each word in the vocabulary is to co-occur with the input word. For example, feeding the word "Soviet" into the trained network, the output probabilities for related words such as "Union" and "Russia" will be far higher than for unrelated words such as "watermelon" and "kangaroo", because "Union" and "Russia" are much more likely to appear in the window around "Soviet".
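The windowing and pair-generation procedure above can be sketched in a few lines of plain Python (a minimal illustration of the definitions above, not the original word2vec implementation; the function name `skip_gram_pairs` is ours):

```python
import random

def skip_gram_pairs(tokens, skip_window=2, num_skips=2, seed=0):
    """Generate (input_word, output_word) training pairs as described above."""
    rng = random.Random(seed)
    pairs = []
    for i, center in enumerate(tokens):
        # words inside the window, excluding the center word itself
        lo, hi = max(0, i - skip_window), min(len(tokens), i + skip_window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        # sample num_skips distinct context words as output words
        for w in rng.sample(context, min(num_skips, len(context))):
            pairs.append((center, w))
    return pairs

pairs = skip_gram_pairs("the dog barked at the mailman".split())
```

With 'dog' as the input word, the two sampled output words all come from its window ['the', 'barked', 'at'], matching the example in the text.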
As shown in the figure below, take the sentence "The quick brown fox jumps over the lazy dog" with a window size of 2 (window_size=2), i.e. only the two words before and after the input word are combined with it. The blue box marks the input word and the green boxes mark the words inside the window.
Both the input word and the output word are one-hot encoded. After one-hot encoding, most dimensions are 0 (exactly one position is 1), so the vector is extremely sparse and multiplying with it directly wastes a great deal of computation. For efficiency, the model simply selects the weight-matrix row whose index corresponds to the dimension that equals 1.
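The row-selection trick is easy to verify with NumPy (a toy-sized illustration): multiplying a one-hot vector by the weight matrix returns exactly one row of the matrix, so real implementations replace the matrix multiply with an index lookup.

```python
import numpy as np

vocab_size, embedding_size = 6, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((vocab_size, embedding_size))  # hidden-layer weight matrix

idx = 3                       # index of the word whose one-hot bit is 1
one_hot = np.zeros(vocab_size)
one_hot[idx] = 1.0

# multiplying by a one-hot vector just selects one row of W
assert np.allclose(one_hot @ W, W[idx])
```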
Training Skip-gram
As noted above, the word2vec model is a very large neural network (its weight matrices are huge). For example, with a vocabulary of 10,000 words and 300-dimensional embeddings, both the input-to-hidden and the hidden-to-output weight matrices have 10,000 × 300 = 3 million weights. Running gradient descent over such a large network is slow; worse, a huge amount of training data is needed to fit these weights without overfitting.
Solutions:
- Treat frequent word combinations (word pairs) or phrases as single "words"
- Subsample frequent words to reduce the number of training examples
- Use **"negative sampling" for the optimization objective, so that each training example updates only a small fraction of the model's weights**, reducing the computational load
Word pairs and "phrases"
Some word combinations (phrases) mean something entirely different from their parts. For example, "Boston Globe" is the name of a newspaper, a meaning that the individual words "Boston" and "Globe" do not convey. So whenever "Boston Globe" appears in text, it should be treated as a single token with its own word vector rather than split apart. "New York" and "United States" are similar examples.
Subsampling frequent words
Using the earlier example "The quick brown fox jumps over the lazy dog", common high-frequency words such as "the" cause two problems:
- A training pair such as ("fox", "the") tells us little extra about the semantics of "fox", because "the" appears in almost every word's context
- Since common words like "the" occur with high probability, we get a huge number of ("the", …) training pairs, far more than are needed to learn a good vector for "the"
word2vec addresses this with "subsampling": every word encountered in the training text has some probability of being deleted from the text, and that deletion probability depends on the word's frequency.
For a word $w$, let $Z(w)$ be the fraction of the corpus occupied by $w$. The probability of keeping $w$ is defined as:

$$P(w) = \left( \sqrt{\frac{Z(w)}{0.001}} + 1 \right) \times \frac{0.001}{Z(w)}$$
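The keep probability can be evaluated numerically (a small helper we define here; 0.001 is the sample threshold from the formula above):

```python
import math

def keep_prob(z, t=0.001):
    """P(w): probability of keeping a word whose corpus frequency is z."""
    return (math.sqrt(z / t) + 1.0) * t / z

# a word at the threshold frequency has P(w) = 2.0, i.e. it is always kept ...
assert abs(keep_prob(0.001) - 2.0) < 1e-12
# ... while a very frequent word (5% of the corpus) is kept only ~16% of the time
assert keep_prob(0.05) < 0.2
```

Any $P(w) \ge 1$ simply means the word is never subsampled; only words well above the threshold frequency are dropped aggressively.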
Negative sampling
Training a neural network means feeding it training examples and continually adjusting the neuron weights to improve its predictions; every training example triggers a weight update. The size of the vocabulary therefore determines how large the Skip-gram network's weight matrices are, and all of those weights would be adjusted by billions of training examples, which is computationally expensive and very slow in practice.
Negative sampling solves this problem; it is a technique that both speeds up training and improves the quality of the resulting word vectors. Instead of updating all of the weights for every training example, each example updates only a small subset, which greatly reduces the work done during gradient descent. For example, when training on the pair (input word: "fox", output word: "quick"), both "fox" and "quick" are one-hot encoded. With a vocabulary of 10,000 words, the desired output at the output layer is 1 for the neuron corresponding to "quick" and 0 for the other 9,999 neurons; the words corresponding to those 9,999 neurons are called "negative" words.
With negative sampling we randomly pick a small number of negative words (say 5) and update only their weights, together with the weights of the "positive" word ("quick" in this example). The paper suggests 5-20 negative words for small datasets and only 2-5 for large datasets. word2vec selects negative words from a "unigram distribution": the probability of a word being chosen as a negative sample depends on its frequency, with more frequent words more likely to be chosen. The selection probability for each word is computed as follows:
$$P(w) = \frac{Z(w)^{3/4}}{\sum_i Z(w_i)^{3/4}}$$

where $Z(w)$ is the frequency of word $w$, and the power $3/4$ was chosen empirically.
In the negative-sampling code, the unigram table is an array of 100 million elements filled with word indices from the vocabulary, with repetitions, so some words appear many times. The number of times a word's index appears in the array equals its computed negative-sampling probability $\times\, 10^8$. With this table, sampling a negative word only requires drawing a random number in $[0, 10^8)$ and taking the word stored at that position. The larger a word's negative-sampling probability, the more entries it has in the table and the more likely it is to be picked.
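A miniature version of this table can be built as follows (illustrative only; the real implementation uses a $10^8$-element array, and `build_unigram_table` is our name):

```python
import random

def build_unigram_table(freqs, table_size=10_000, power=0.75):
    """freqs: {word: count}. Returns a table where a word's share of slots
    is proportional to count**0.75, as in the formula above."""
    total = sum(c ** power for c in freqs.values())
    table = []
    for w, c in freqs.items():
        table.extend([w] * int(round(table_size * (c ** power) / total)))
    return table

table = build_unigram_table({'the': 1000, 'fox': 10, 'axolotl': 1})
# sampling a negative word = indexing the table at a random position
rng = random.Random(0)
negatives = [table[rng.randrange(len(table))] for _ in range(5)]
```

Note how the 3/4 power flattens the distribution: "the" is 1000× more frequent than "axolotl" but gets only about 178× as many slots, so rare words are still sampled occasionally.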
Hierarchical Softmax
Softmax itself was covered in the previous article; here we look at the structure of hierarchical softmax.
Huffman trees
- Input: $n$ nodes with weights $(w_1, w_2, \ldots, w_n)$
- Output: the corresponding Huffman tree
Building a Huffman tree:
- Treat $(w_1, w_2, \ldots, w_n)$ as a forest of $n$ trees, each with a single node
- Pick the two trees with the smallest root weights and merge them into a new tree, with the two as its left and right subtrees; the new root's weight is the sum of the two subtree roots' weights
- Remove the two merged trees from the forest and add the new tree
- Repeat steps 2 and 3 until only one tree remains
A concrete example: six nodes (a, b, c, d, e, f) with weights (16, 4, 8, 6, 20, 3). First the two smallest, b and f, merge into a new tree with root weight 7. The forest now holds five trees with root weights 16, 8, 6, 20, 7. The two smallest roots, 6 and 7, merge next, and so on, until the Huffman tree shown below is obtained.
Next the tree is encoded. Heavier leaves sit closer to the root and lighter leaves farther away, so high-weight nodes get short codes and low-weight nodes get long codes. This minimizes the weighted path length of the tree and matches the information-theoretic view that frequent words should have shorter codes. By convention, at every non-root node the left branch is encoded 0 and the right branch 1; in the figure above, c's code is then 00.
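The construction and encoding steps above can be sketched with a heap (an illustrative implementation under the left=0/right=1 convention; gensim's internal builder differs in details):

```python
import heapq
import itertools

def huffman_codes(weights):
    """weights: {symbol: weight}. Returns {symbol: code string},
    with left child = '0' and right child = '1'."""
    counter = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(w, next(counter), {s: ''}) for s, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two smallest roots...
        w2, _, right = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in left.items()}   # ...merge them
        merged.update({s: '1' + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(counter), merged))
    return heap[0][2]

codes = huffman_codes({'a': 16, 'b': 4, 'c': 8, 'd': 6, 'e': 20, 'f': 3})
```

With the example weights, the heaviest leaf e (20) gets one of the shortest codes, the lightest leaf f (3) the longest, and c comes out as '00' under this tie-breaking, matching the text.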
The hierarchical softmax process
To avoid computing softmax probabilities over every word, word2vec uses a Huffman tree in place of the mapping from the hidden layer to the softmax output layer.
Building the Huffman tree:
- Build the tree from the labels and their frequencies (the more frequent a label, the shorter its path in the tree)
- Every leaf of the Huffman tree corresponds to one label
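A toy sketch of why the tree helps: a label's probability is a product of sigmoid decisions along its root-to-leaf path, so computing it costs O(tree depth) instead of O(number of labels), and the leaf probabilities still sum to 1. The tree layout and parameter vectors below are invented for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def leaf_probs(node, h, probs=None, p=1.0):
    """node: a leaf label, or a tuple (inner_vector, left, right).
    At each inner node, go left with prob sigmoid(v.h), right with 1 - sigmoid(v.h)."""
    if probs is None:
        probs = {}
    if not isinstance(node, tuple):   # reached a leaf (= one label)
        probs[node] = p
        return probs
    v, left, right = node
    s = sigmoid(sum(vi * hi for vi, hi in zip(v, h)))
    leaf_probs(left, h, probs, p * s)
    leaf_probs(right, h, probs, p * (1.0 - s))
    return probs

# toy tree over 4 labels; each inner node holds its own parameter vector v
tree = ([0.3, -0.2], ([0.1, 0.5], 'a', 'b'), ([-0.4, 0.2], 'c', 'd'))
probs = leaf_probs(tree, h=[0.7, -0.1])
assert abs(sum(probs.values()) - 1.0) < 1e-12
```

During training, only the inner-node vectors along one leaf's path are updated, which is what makes hierarchical softmax cheap.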
Word2Vec parameters
from gensim.models.word2vec import Word2Vec
model = Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5,
max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH)
- sentences: can be a list; for large corpora, build it with BrownCorpus, Text8Corpus, or LineSentence
- size: dimensionality of the feature vectors, default 100
- alpha: the initial learning rate, default 0.025; it decays linearly to min_alpha during training
- window: window size, the maximum distance within a sentence between the current word and a predicted word
- min_count: truncates the dictionary; words occurring fewer than min_count times are discarded, default 5
- max_vocab_size: RAM limit during vocabulary building; None means no limit
- sample: threshold for randomly subsampling high-frequency words, default 1e-3, useful range (0, 1e-5)
- seed: seed for the random number generator, used when initializing the word vectors
- workers: number of parallel training threads
- min_alpha: minimum value of the learning rate
- sg: training algorithm; 0 (default) selects CBOW, sg=1 selects skip-gram
- hs: if 1, hierarchical softmax is used; if 0 (default), negative sampling is used
- negative: if > 0, negative sampling is used, with this many noise words (typically 5-20)
- cbow_mean: if 0, use the sum of the context word vectors; if 1 (default), use their mean; only applies when using CBOW
- hashfxn: hash function used to initialize weights, default Python's hash
- iter: number of iterations (epochs), default 5
- trim_rule: vocabulary trimming rule specifying which words to keep and which to discard; may be None (min_count is then used)
- sorted_vocab: if 1 (default), words are sorted by descending frequency before word indices are assigned
- batch_words: number of words passed to worker threads per batch, default 10000
Parameter choices and trade-offs:
- skip-gram (slower to train, works well for rare words) vs. CBOW (faster to train); skip-gram is the usual choice
- Training method: hierarchical softmax (better for rare words) vs. negative sampling (better for frequent words and low-dimensional vectors)
- Subsampling frequent words improves both accuracy and speed (threshold around 1e-3 to 1e-5)
- Window size: around 10 for skip-gram, around 5 for CBOW
Training word vectors with word2vec
from gensim.models.word2vec import Word2Vec
sentences = train_df['text'].apply(lambda x: [str(i) for i in x.split(' ')])
model = Word2Vec(sentences,size=300)
model.save("test_01.model") # save the model for reuse
print('Vocabulary contents:')
print(model.wv.index2word)
# find the words most similar to a given word
print('\nWords similar to 57:')
print(model.wv.most_similar('57'))
# pick the word that does not belong in the set
print('\nIn the set 4464 486 6352 5619 2465 4802 1452, the odd one out is:')
print(model.wv.doesnt_match('4464 486 6352 5619 2465 4802 1452'.split()))
# similarity between two words
print('\nSimilarity between 1124 and 2000:')
print(model.wv.similarity('1124','2000'))
# similarity between two sets of words
print('\nSimilarity between the two documents:')
print(model.wv.n_similarity('4464 486 6352 5619 2465 4802 1452 3137 5778'.split(),
                            '3646 3055 3055 2490 4659 6065 3370 5814 2465'.split()))
Loading the model:
model = gensim.models.Word2Vec.load('test_01.model')
Training on additional data:
model = gensim.models.Word2Vec.load('test_01.model')
model.train(more_sentences)
TextCNN
TextCNN uses a CNN (convolutional neural network) to extract text features: convolution kernels of different sizes extract n-gram features, max-pooling keeps the largest value from each feature map, and the pooled values are concatenated into a single vector representing the text. Following the settings of the original TextCNN paper, we use 100 kernels each of sizes 2, 3, and 4, giving a final text vector of 100 × 3 = 300 dimensions.
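As a sanity check on the shape arithmetic (100 kernels each of widths 2, 3, 4 → a 300-dimensional text vector), here is a NumPy sketch of the convolve-then-max-pool computation with random, untrained weights (illustrative only; the TF implementation follows later):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, num_filters = 50, 64, 100
X = rng.standard_normal((seq_len, emb_dim))          # one sentence's word embeddings

pooled = []
for k in (2, 3, 4):                                  # n-gram sizes
    W = rng.standard_normal((k, emb_dim, num_filters))
    # convolution: each window of k rows -> num_filters feature values
    feats = np.stack([np.tensordot(X[i:i + k], W, axes=([0, 1], [0, 1]))
                      for i in range(seq_len - k + 1)])  # (positions, num_filters)
    pooled.append(feats.max(axis=0))                 # max-pooling over positions
text_vec = np.concatenate(pooled)                    # 100 * 3 = 300 dims
assert text_vec.shape == (300,)
```

Max-pooling makes the representation independent of sentence length: whatever `seq_len` is, the output is always 300-dimensional.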
TextRNN
TextRNN uses an RNN (recurrent neural network) to extract text features. Text is inherently sequential, and LSTMs are a natural fit for modeling sequences. TextRNN feeds the word vectors of a sentence, one by one, into a bidirectional two-layer LSTM, and concatenates the hidden states at the last valid position of each direction into a single vector representing the text.
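The bidirectional idea can be sketched in NumPy (using a plain tanh RNN cell instead of an LSTM for brevity, with random untrained weights): run the sequence forward and backward, then concatenate the two final hidden states.

```python
import numpy as np

def rnn_pass(X, Wx, Wh, b):
    """Vanilla RNN over the rows of X; returns the final hidden state."""
    h = np.zeros(Wh.shape[0])
    for x in X:
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h

rng = np.random.default_rng(0)
seq_len, emb_dim, hidden = 20, 32, 16
X = rng.standard_normal((seq_len, emb_dim))
Wx_f, Wx_b = rng.standard_normal((2, hidden, emb_dim))
Wh_f, Wh_b = rng.standard_normal((2, hidden, hidden)) * 0.1
b = np.zeros(hidden)

# forward over the sequence, backward over the reversed sequence,
# then concatenate the two final hidden states as the text vector
h_fw = rnn_pass(X, Wx_f, Wh_f, b)
h_bw = rnn_pass(X[::-1], Wx_b, Wh_b, b)
text_vec = np.concatenate([h_fw, h_bw])
assert text_vec.shape == (2 * hidden,)
```

In the real model, "last valid position" matters because sentences are zero-padded to a fixed length; the TF code below handles this with a sequence-length mask.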
Text classification with HAN
Hierarchical Attention Network for Document Classification (HAN) uses hierarchical attention: it encodes at the word level and the sentence level separately, uses attention to obtain a document representation, and then classifies with a softmax. The word encoder, whose job is to produce sentence representations, can be replaced by the TextCNN or TextRNN above, or by BERT (covered in the next section).
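The attention pooling at the heart of HAN can be sketched as follows (random untrained values; `attention_pool` is our illustrative name): score each hidden state against a learned context vector, softmax the scores over time steps, and take the weighted sum.

```python
import numpy as np

def attention_pool(H, u):
    """H: (steps, dim) encoder hidden states; u: (dim,) learned context vector.
    Returns the attention-weighted sum of the hidden states and the weights."""
    scores = H @ u
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over time steps
    return weights @ H, weights

rng = np.random.default_rng(0)
H = rng.standard_normal((10, 8))             # e.g. word-level encoder outputs
u = rng.standard_normal(8)
vec, w = attention_pool(H, u)
assert vec.shape == (8,) and abs(w.sum() - 1.0) < 1e-12
```

HAN applies this twice: once over the words of each sentence (giving sentence vectors) and once over the sentences of the document (giving the document vector).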
Implementing TextCNN
Reference code: a TensorFlow-based TextCNN
Import the required libraries:
import pandas as pd
import numpy as np
import tensorflow as tf
import os
import time
import datetime
Read in the data:
train_df = pd.read_csv('./data/train_set.csv', sep='\t', nrows=15000)
train_df['content'] = train_df['text'].apply(lambda x: [str(i) for i in x.split(' ')])
dfx = train_df['content']
dfy = train_df['label']
from tensorflow.contrib import learn
max_document_length = 500
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length,tokenizer_fn=list) # build the vocabulary
x = np.array(list(vocab_processor.fit_transform(dfx))) # represent each text by its vocabulary indices
from sklearn.preprocessing import LabelBinarizer
lb=LabelBinarizer()
lb.fit(dfy)
y=lb.transform(dfy) # binarize the label matrix: each row looks like [0,1,0,...,0], with the 1 marking the class
np.random.seed(10)
shuffle_indices = np.random.permutation(np.arange(len(y)))
x_shuffled = x[shuffle_indices]
y_shuffled = y[shuffle_indices]
dev_sample_index = -1 * int(0.1 * float(len(y)))
x_train, x_dev = x_shuffled[:dev_sample_index], x_shuffled[dev_sample_index:] # split into training and validation data
y_train, y_dev = y_shuffled[:dev_sample_index], y_shuffled[dev_sample_index:]
del train_df, dfx, dfy, x, y, x_shuffled, y_shuffled
Build the TextCNN:
# Define the TextCNN class
class TextCNN(object):
    """
    A CNN for text classification.
    Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer.
    """
    def __init__(
            self, sequence_length, num_classes, vocab_size,
            embedding_size, filter_sizes, num_filters, l2_reg_lambda=0.0):  # initialization
        # sequence_length: sentence length
        # num_classes: number of classes
        # vocab_size: vocabulary size
        # embedding_size: embedding dimensionality, i.e. each word is represented by a vector of this length
        # filter_sizes: sizes of the convolution filters (e.g. [3, 4, 5])
        # num_filters: number of filters per filter size
        # l2_reg_lambda: weight of the L2 regularization term

        # Placeholders for input, output and dropout (part of the TF graph definition)
        self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")
        self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")
        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
        # The first dimension of each placeholder's shape is batch_size; None means the
        # dimension may take any value, leaving it to be determined at run time
        l2_loss = tf.constant(0.0)

        # Embedding layer: represent each word index by a dense vector
        with tf.device('/cpu:0'), tf.name_scope("embedding"):
            self.W = tf.Variable(
                tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
                name="W")
            self.embedded_chars = tf.nn.embedding_lookup(self.W, self.input_x)
            self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
        # tf.device('/cpu:0') forces the embedding op onto the CPU;
        # by default TensorFlow places ops on the GPU (if one exists), but this embedding op errors on GPU
        # tf.name_scope("embedding") adds the ops to a name scope:
        # grouping them under a top-level "embedding" node gives TensorBoard a clean hierarchy
        # W is the embedding matrix learned during training, initialized from a random uniform distribution
        # tf.nn.embedding_lookup performs the actual lookup; its result is a 3-D tensor
        # of shape [None, sequence_length, embedding_size]
        # embedded_chars_expanded has shape [None, sequence_length, embedding_size, 1]

        # Build a convolution layer for each filter size, followed by max-pooling
        pooled_outputs = []  # pooled results
        for i, filter_size in enumerate(filter_sizes):  # iterate over the filter sizes
            with tf.name_scope("conv-maxpool-%s" % filter_size):
                # Convolution layer
                filter_shape = [filter_size, embedding_size, 1, num_filters]
                W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
                b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
                conv = tf.nn.conv2d(
                    self.embedded_chars_expanded,
                    W,
                    strides=[1, 1, 1, 1],
                    padding="VALID",
                    name="conv")  # tf.nn.conv2d is TF's convolution op
                # Add the bias to the convolution result and apply relu (the activation function)
                h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
                # Maxpooling
                pooled = tf.nn.max_pool(
                    h,
                    ksize=[1, sequence_length - filter_size + 1, 1, 1],
                    strides=[1, 1, 1, 1],
                    padding='VALID',
                    name="pool")
                pooled_outputs.append(pooled)

        # Concatenate all the pooled results
        num_filters_total = num_filters * len(filter_sizes)
        self.h_pool = tf.concat(pooled_outputs, 3)
        self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])
        # h_pool_flat has shape [batch_size, num_filters_total]; -1 in tf.reshape
        # tells TensorFlow to flatten that dimension where possible

        # Dropout layer
        # Dropout randomly "disables" a fraction of the neurons (they neither update their
        # weights nor take part in the forward pass), which helps prevent overfitting
        # The kept fraction is dropout_keep_prob; it is set to 1 at test time (dropout disabled)
        with tf.name_scope("dropout"):
            self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)

        # Scores and predictions
        with tf.name_scope("output"):
            W = tf.get_variable(
                "W",
                shape=[num_filters_total, num_classes],
                initializer=tf.contrib.layers.xavier_initializer())  # W is a learned weight matrix
            b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")  # b is the bias
            l2_loss += tf.nn.l2_loss(W)
            l2_loss += tf.nn.l2_loss(b)
            self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")  # computes X*W + b
            self.predictions = tf.argmax(self.scores, 1, name="predictions")  # predicted class = index of the largest score

        # Mean cross-entropy loss
        with tf.name_scope("loss"):
            losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)
            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss

        # Accuracy
        with tf.name_scope("accuracy"):
            correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
Generate batches:
# Split data into batches
# A batch holds batch_size examples; training on one batch and updating the weights once is called a step
# One full pass over data is called an epoch; an epoch takes num_batches_per_epoch steps
# num_epochs is the number of epochs to run
def batch_iter(data, batch_size, num_epochs, shuffle=True):
    data = np.array(data)
    data_size = len(data)
    num_batches_per_epoch = int((len(data)-1)/batch_size) + 1
    for epoch in range(num_epochs):
        # Shuffle the data at the start of each epoch
        if shuffle:
            shuffle_indices = np.random.permutation(np.arange(data_size))
            shuffled_data = data[shuffle_indices]
        else:
            shuffled_data = data
        for batch_num in range(num_batches_per_epoch):
            start_index = batch_num * batch_size
            end_index = min((batch_num + 1) * batch_size, data_size)
            yield shuffled_data[start_index:end_index]  # yield makes this function a generator
Model parameters:
# Model hyperparameters
# tf.flags is TF's mechanism for declaring command-line parameters
tf.flags.DEFINE_integer("embedding_dim", 64, "Dimensionality of character embedding (default: 128)")
tf.flags.DEFINE_string("filter_sizes", "3,4,5", "Comma-separated filter sizes (default: '3,4,5')")
tf.flags.DEFINE_integer("num_filters", 64, "Number of filters per filter size (default: 128)")
tf.flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability (default: 0.5)")
tf.flags.DEFINE_float("l2_reg_lambda", 0.0, "L2 regularization lambda (default: 0.0)")
# Training parameters
tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)")
tf.flags.DEFINE_integer("num_epochs", 20, "Number of training epochs (default: 200)")
tf.flags.DEFINE_integer("evaluate_every", 50, "Evaluate model on dev set after this many steps (default: 100)")  # evaluate on the dev set every this many steps
tf.flags.DEFINE_integer("checkpoint_every", 50, "Save model after this many steps (default: 100)")  # save the model every this many steps
tf.flags.DEFINE_integer("num_checkpoints", 5, "Number of checkpoints to store (default: 5)")  # maximum number of checkpoints to keep
# Misc parameters
tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement")
tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices")
FLAGS = tf.flags.FLAGS
Define the training function:
# Define the training function
def train(x_train, y_train, vocab_processor, x_dev, y_dev):
    # Create the graph explicitly so its resources can be released after training
    with tf.Graph().as_default():
        session_conf = tf.ConfigProto(
            allow_soft_placement=FLAGS.allow_soft_placement,
            log_device_placement=FLAGS.log_device_placement)
        sess = tf.Session(config=session_conf)
        with sess.as_default():
            cnn = TextCNN(
                sequence_length=x_train.shape[1],
                num_classes=y_train.shape[1],
                vocab_size=len(vocab_processor.vocabulary_),
                embedding_size=FLAGS.embedding_dim,
                filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))),
                num_filters=FLAGS.num_filters,
                l2_reg_lambda=FLAGS.l2_reg_lambda)
            # Minimize the loss with the Adam optimizer
            global_step = tf.Variable(0, name="global_step", trainable=False)
            optimizer = tf.train.AdamOptimizer(1e-3)
            grads_and_vars = optimizer.compute_gradients(cnn.loss)
            train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
            # Summarize gradient information
            grad_summaries = []
            for g, v in grads_and_vars:
                if g is not None:
                    grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g)  # tf.summary.histogram(): record the value distribution of a tensor of any shape
                    sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))
                    # tf.summary.scalar(): record a scalar statistic
                    grad_summaries.append(grad_hist_summary)
                    grad_summaries.append(sparsity_summary)
            grad_summaries_merged = tf.summary.merge(grad_summaries)
            # Use the run timestamp to define the output directory
            timestamp = str(int(time.time()))
            out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
            print("Writing to {}\n".format(out_dir))
            # Summaries for loss and accuracy
            loss_summary = tf.summary.scalar("loss", cnn.loss)
            acc_summary = tf.summary.scalar("accuracy", cnn.accuracy)
            # Train Summaries
            train_summary_op = tf.summary.merge([loss_summary, acc_summary, grad_summaries_merged])
            train_summary_dir = os.path.join(out_dir, "summaries", "train")
            train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)
            # Dev summaries
            dev_summary_op = tf.summary.merge([loss_summary, acc_summary])
            dev_summary_dir = os.path.join(out_dir, "summaries", "dev")
            dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)
            # Create the checkpoint directory
            checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))
            checkpoint_prefix = os.path.join(checkpoint_dir, "model")
            if not os.path.exists(checkpoint_dir):
                os.makedirs(checkpoint_dir)
            saver = tf.train.Saver(tf.global_variables(), max_to_keep=FLAGS.num_checkpoints)
            vocab_processor.save(os.path.join(out_dir, "vocab"))
            # Initialize all variables
            sess.run(tf.global_variables_initializer())

            # Work done in one training step
            def train_step(x_batch, y_batch):
                """
                A single training step
                """
                feed_dict = {
                    cnn.input_x: x_batch,
                    cnn.input_y: y_batch,
                    cnn.dropout_keep_prob: FLAGS.dropout_keep_prob
                }  # inputs for this step
                _, step, summaries, loss, accuracy = sess.run(
                    [train_op, global_step, train_summary_op, cnn.loss, cnn.accuracy],
                    feed_dict)
                time_str = datetime.datetime.now().isoformat()
                print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
                train_summary_writer.add_summary(summaries, step)

            # Work done in one validation step
            def dev_step(x_batch, y_batch, writer=None):
                """
                Evaluates model on a dev set
                """
                feed_dict = {
                    cnn.input_x: x_batch,
                    cnn.input_y: y_batch,
                    cnn.dropout_keep_prob: 1.0
                }
                step, summaries, loss, accuracy = sess.run(
                    [global_step, dev_summary_op, cnn.loss, cnn.accuracy],
                    feed_dict)
                time_str = datetime.datetime.now().isoformat()
                print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
                if writer:
                    writer.add_summary(summaries, step)

            # Generate batches
            batches = batch_iter(
                list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs)
            for batch in batches:
                x_batch, y_batch = zip(*batch)
                train_step(x_batch, y_batch)
                current_step = tf.train.global_step(sess, global_step)
                if current_step % FLAGS.evaluate_every == 0:
                    print("\nEvaluation:")
                    dev_step(x_dev, y_dev, writer=dev_summary_writer)
                    print("")
                if current_step % FLAGS.checkpoint_every == 0:
                    path = saver.save(sess, checkpoint_prefix, global_step=current_step)
                    print("Saved model checkpoint to {}\n".format(path))
Run the training:
train(x_train, y_train, vocab_processor, x_dev, y_dev)
Running it prints log output like the following:
The model weights, network structure, and related information are all stored under the path reported in the log; to use the trained model to predict on new data, simply load it from that path.
Define the test data (note: the test data must be represented with the same vocabulary as the training data):
train_df = pd.read_csv('./data/train_set.csv', sep='\t', nrows=15000)  # reload: train_df was deleted above
train_df['content'] = train_df['text'].apply(lambda x: [str(i) for i in x.split(' ')])
dfx = train_df['content']
y_test = train_df['label'][12000:15000]
max_document_length = 500
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length,tokenizer_fn=list)
x_test = np.array(list(vocab_processor.fit_transform(dfx)))[12000:15000]  # re-fitting on the same texts reproduces the same vocabulary
Load the trained model and run the test:
# Load the previously trained model for testing
graph = tf.Graph()
with graph.as_default():
    session_conf = tf.ConfigProto(
        allow_soft_placement=FLAGS.allow_soft_placement,
        log_device_placement=FLAGS.log_device_placement)
    sess = tf.Session(config=session_conf)
    with sess.as_default():
        saver = tf.train.import_meta_graph(r'C:\Users\modiker\NLP學習\runs\1596162730\checkpoints\model-4200.meta')
        saver.restore(sess, tf.train.latest_checkpoint(r'C:\Users\modiker\NLP學習\runs\1596162730\checkpoints'))
        input_x = graph.get_operation_by_name("input_x").outputs[0]
        dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0]
        predictions = graph.get_operation_by_name("output/predictions").outputs[0]
        batches = batch_iter(list(x_test), FLAGS.batch_size, 1, shuffle=False)
        all_predictions = []
        for x_test_batch in batches:
            batch_predictions = sess.run(predictions, {input_x: x_test_batch, dropout_keep_prob: 1.0})
            all_predictions = np.concatenate([all_predictions, batch_predictions])
# Print accuracy if y_test is defined
if y_test is not None:
    correct_predictions = float(sum(all_predictions == y_test))
    print("Total number of test examples: {}".format(len(y_test)))
    print("Accuracy: {:g}".format(correct_predictions/float(len(y_test))))
The results of the run are as follows:
Implementing TextRNN
Reference code: a TensorFlow-based TextRNN
The construction closely follows the TextCNN implementation above.
Read in the data:
train_df = pd.read_csv('./data/train_set.csv', sep='\t', nrows=17000)
train_df['content'] = train_df['text'].apply(lambda x: [str(i) for i in x.split(' ')])
dfx = train_df['content']
dfy = train_df['label']
from tensorflow.contrib import learn
max_document_length = 500
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length,tokenizer_fn=list)
x = np.array(list(vocab_processor.fit_transform(dfx)))[:15000]
from sklearn.preprocessing import LabelBinarizer
lb=LabelBinarizer()
lb.fit(dfy)
y=lb.transform(dfy)[:15000]
# keep 2000 examples as test data
test_x = np.array(list(vocab_processor.fit_transform(dfx)))[15000:]
test_y = dfy[15000:]
np.random.seed(10)
shuffle_indices = np.random.permutation(np.arange(len(y)))
x_shuffled = x[shuffle_indices]
y_shuffled = y[shuffle_indices]
dev_sample_index = -1 * int(0.1 * float(len(y)))
x_train, x_dev = x_shuffled[:dev_sample_index], x_shuffled[dev_sample_index:]
y_train, y_dev = y_shuffled[:dev_sample_index], y_shuffled[dev_sample_index:]
del train_df, dfx, dfy, x, y, x_shuffled, y_shuffled
Build the TextRNN:
# Build the network structure (the class implements the TextRCNN variant, a bidirectional recurrent structure)
class TextRCNN:
    def __init__(self, sequence_length, num_classes, vocab_size, word_embedding_size, context_embedding_size,
                 cell_type, hidden_size, l2_reg_lambda=0.0):
        # Placeholders for input, output and dropout
        self.input_text = tf.placeholder(tf.int32, shape=[None, sequence_length], name='input_text')
        self.input_y = tf.placeholder(tf.float32, shape=[None, num_classes], name='input_y')
        self.dropout_keep_prob = tf.placeholder(tf.float32, name='dropout_keep_prob')
        l2_loss = tf.constant(0.0)
        text_length = self._length(self.input_text)

        # Embeddings
        with tf.device('/cpu:0'), tf.name_scope("embedding"):
            self.W_text = tf.Variable(tf.random_uniform([vocab_size, word_embedding_size], -1.0, 1.0), name="W_text")
            self.embedded_chars = tf.nn.embedding_lookup(self.W_text, self.input_text)

        # Bidirectional (Left & Right) Recurrent Structure
        with tf.name_scope("bi-rnn"):
            fw_cell = self._get_cell(context_embedding_size, cell_type)
            fw_cell = tf.nn.rnn_cell.DropoutWrapper(fw_cell, output_keep_prob=self.dropout_keep_prob)
            bw_cell = self._get_cell(context_embedding_size, cell_type)
            bw_cell = tf.nn.rnn_cell.DropoutWrapper(bw_cell, output_keep_prob=self.dropout_keep_prob)
            (self.output_fw, self.output_bw), states = tf.nn.bidirectional_dynamic_rnn(cell_fw=fw_cell,
                                                                                      cell_bw=bw_cell,
                                                                                      inputs=self.embedded_chars,
                                                                                      sequence_length=text_length,
                                                                                      dtype=tf.float32)

        with tf.name_scope("context"):
            shape = [tf.shape(self.output_fw)[0], 1, tf.shape(self.output_fw)[2]]
            self.c_left = tf.concat([tf.zeros(shape), self.output_fw[:, :-1]], axis=1, name="context_left")
            self.c_right = tf.concat([self.output_bw[:, 1:], tf.zeros(shape)], axis=1, name="context_right")

        with tf.name_scope("word-representation"):
            self.x = tf.concat([self.c_left, self.embedded_chars, self.c_right], axis=2, name="x")
            embedding_size = 2 * context_embedding_size + word_embedding_size

        with tf.name_scope("text-representation"):
            W2 = tf.Variable(tf.random_uniform([embedding_size, hidden_size], -1.0, 1.0), name="W2")
            b2 = tf.Variable(tf.constant(0.1, shape=[hidden_size]), name="b2")
            self.y2 = tf.tanh(tf.einsum('aij,jk->aik', self.x, W2) + b2)

        with tf.name_scope("max-pooling"):
            self.y3 = tf.reduce_max(self.y2, axis=1)

        with tf.name_scope("output"):
            W4 = tf.get_variable("W4", shape=[hidden_size, num_classes], initializer=tf.contrib.layers.xavier_initializer())
            b4 = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b4")
            l2_loss += tf.nn.l2_loss(W4)
            l2_loss += tf.nn.l2_loss(b4)
            self.logits = tf.nn.xw_plus_b(self.y3, W4, b4, name="logits")
            self.predictions = tf.argmax(self.logits, 1, name="predictions")

        # Calculate mean cross-entropy loss
        with tf.name_scope("loss"):
            losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.logits, labels=self.input_y)
            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss

        # Accuracy
        with tf.name_scope("accuracy"):
            correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, axis=1))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32), name="accuracy")

    @staticmethod
    def _get_cell(hidden_size, cell_type):
        if cell_type == "vanilla":
            return tf.nn.rnn_cell.BasicRNNCell(hidden_size)
        elif cell_type == "lstm":
            return tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
        elif cell_type == "gru":
            return tf.nn.rnn_cell.GRUCell(hidden_size)
        else:
            print("ERROR: '" + cell_type + "' is a wrong cell type !!!")
            return None

    # Length of the sequence data
    @staticmethod
    def _length(seq):
        relevant = tf.sign(tf.abs(seq))
        length = tf.reduce_sum(relevant, reduction_indices=1)
        length = tf.cast(length, tf.int32)
        return length

    # Extract the output of last cell of each sequence
    # Ex) The movie is good -> length = 4
    #     output = [ [1.314, -3.32, ..., 0.98]
    #                [0.287, -0.50, ..., 1.55]
    #                [2.194, -2.12, ..., 0.63]
    #                [1.938, -1.88, ..., 1.31]
    #                [  0.0,   0.0, ...,  0.0]
    #                ...
    #                [  0.0,   0.0, ...,  0.0] ]
    # The output we need is 4th output of cell, so extract it.
    @staticmethod
    def last_relevant(seq, length):
        batch_size = tf.shape(seq)[0]
        max_length = int(seq.get_shape()[1])
        input_size = int(seq.get_shape()[2])
        index = tf.range(0, batch_size) * max_length + (length - 1)
        flat = tf.reshape(seq, [-1, input_size])
        return tf.gather(flat, index)
Generate batches:
Same as for TextCNN.
Model parameters:
# Model Hyperparameters
tf.flags.DEFINE_string("cell_type", "vanilla", "Type of RNN cell. Choose 'vanilla' or 'lstm' or 'gru' (Default: vanilla)")
tf.flags.DEFINE_string("word2vec", None, "Word2vec file with pre-trained embeddings")
tf.flags.DEFINE_integer("word_embedding_dim", 128, "Dimensionality of word embedding (Default: 300)")
tf.flags.DEFINE_integer("context_embedding_dim", 128, "Dimensionality of context embedding(= RNN state size) (Default: 512)")
tf.flags.DEFINE_integer("hidden_size", 64, "Size of hidden layer (Default: 512)")
tf.flags.DEFINE_float("dropout_keep_prob", 0.7, "Dropout keep probability (Default: 0.7)")
tf.flags.DEFINE_float("l2_reg_lambda", 0.5, "L2 regularization lambda (Default: 0.5)")
# Training parameters
tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (Default: 64)")
tf.flags.DEFINE_integer("num_epochs", 5, "Number of training epochs (Default: 10)")
tf.flags.DEFINE_integer("display_every", 10, "Number of iterations to display training info.")
tf.flags.DEFINE_integer("evaluate_every", 100, "Evaluate model on dev set after this many steps")
tf.flags.DEFINE_integer("checkpoint_every", 100, "Save model after this many steps")
tf.flags.DEFINE_integer("num_checkpoints", 5, "Number of checkpoints to store")
tf.flags.DEFINE_float("learning_rate", 1e-3, "Which learning rate to start with. (Default: 1e-3)")
# Misc Parameters
tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement")
tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices")
FLAGS = tf.flags.FLAGS
FLAGS._parse_flags()
print("\nParameters:")
for attr, value in sorted(FLAGS.__flags.items()):
    print("{} = {}".format(attr.upper(), value))
print("")
Define the training function:
def train(x_train, y_train, vocab_processor, x_dev, y_dev):
    with tf.Graph().as_default():
        session_conf = tf.ConfigProto(
            allow_soft_placement=FLAGS.allow_soft_placement,
            log_device_placement=FLAGS.log_device_placement)
        sess = tf.Session(config=session_conf)
        with sess.as_default():
            rcnn = TextRCNN(
                sequence_length=x_train.shape[1],
                num_classes=y_train.shape[1],
                vocab_size=len(vocab_processor.vocabulary_),
                word_embedding_size=FLAGS.word_embedding_dim,
                context_embedding_size=FLAGS.context_embedding_dim,
                cell_type=FLAGS.cell_type,
                hidden_size=FLAGS.hidden_size,
                l2_reg_lambda=FLAGS.l2_reg_lambda
            )
            # Define Training procedure
            global_step = tf.Variable(0, name="global_step", trainable=False)
            train_op = tf.train.AdamOptimizer(FLAGS.learning_rate).minimize(rcnn.loss, global_step=global_step)
            # Output directory for models and summaries
            timestamp = str(int(time.time()))
            out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))
            print("Writing to {}\n".format(out_dir))
            # Summaries for loss and accuracy
            loss_summary = tf.summary.scalar("loss", rcnn.loss)
            acc_summary = tf.summary.scalar("accuracy", rcnn.accuracy)
            # Train Summaries
            train_summary_op = tf.summary.merge([loss_summary, acc_summary])
            train_summary_dir = os.path.join(out_dir, "summaries", "train")
            train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)
            # Dev summaries
            dev_summary_op = tf.summary.merge([loss_summary, acc_summary])
            dev_summary_dir = os.path.join(out_dir, "summaries", "dev")
            dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)
            # Checkpoint directory. Tensorflow assumes this directory already exists so we need to create it
            checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))
            checkpoint_prefix = os.path.join(checkpoint_dir, "model")
            if not os.path.exists(checkpoint_dir):
                os.makedirs(checkpoint_dir)
            saver = tf.train.Saver(tf.global_variables(), max_to_keep=FLAGS.num_checkpoints)
            # Write vocabulary
            vocab_processor.save(os.path.join(out_dir, "text_vocab"))
            # Initialize all variables
            sess.run(tf.global_variables_initializer())
            # Pre-trained word2vec
            if FLAGS.word2vec:
                # initial matrix with random uniform
                initW = np.random.uniform(-0.25, 0.25, (len(vocab_processor.vocabulary_), FLAGS.word_embedding_dim))
                # load any vectors from the word2vec
                print("Load word2vec file {0}".format(FLAGS.word2vec))
                with open(FLAGS.word2vec, "rb") as f:
                    header = f.readline()
                    vocab_size, layer1_size = map(int, header.split())
                    binary_len = np.dtype('float32').itemsize * layer1_size
                    for line in range(vocab_size):
                        word = []
                        while True:
                            ch = f.read(1).decode('latin-1')
                            if ch == ' ':
                                word = ''.join(word)
                                break
                            if ch != '\n':
                                word.append(ch)
                        idx = vocab_processor.vocabulary_.get(word)
                        if idx != 0:
                            initW[idx] = np.fromstring(f.read(binary_len), dtype='float32')
                        else:
                            f.read(binary_len)
                sess.run(rcnn.W_text.assign(initW))
                print("Success to load pre-trained word2vec model!\n")
            # Generate batches
            batches = batch_iter(
                list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs)
            # Training loop. For each batch...
            for batch in batches:
                x_batch, y_batch = zip(*batch)
                # Train
                feed_dict = {
                    rcnn.input_text: x_batch,
                    rcnn.input_y: y_batch,
                    rcnn.dropout_keep_prob: FLAGS.dropout_keep_prob
                }
                _, step, summaries, loss, accuracy = sess.run(
                    [train_op, global_step, train_summary_op, rcnn.loss, rcnn.accuracy], feed_dict)
                train_summary_writer.add_summary(summaries, step)
                # Training log display
                if step % FLAGS.display_every == 0:
                    time_str = datetime.datetime.now().isoformat()
                    print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))
                # Evaluation
                if step % FLAGS.evaluate_every == 0:
                    print("\nEvaluation:")
                    feed_dict_dev = {
                        rcnn.input_text: x_dev,
                        rcnn.input_y: y_dev,
                        rcnn.dropout_keep_prob: 1.0
                    }
                    summaries_dev, loss, accuracy = sess.run(
                        [dev_summary_op, rcnn.loss, rcnn.accuracy], feed_dict_dev)
                    dev_summary_writer.add_summary(summaries_dev, step)
                    time_str = datetime.datetime.now().isoformat()
                    print("{}: step {}, loss {:g}, acc {:g}\n".format(time_str, step, loss, accuracy))
                # Model checkpoint
                if step % FLAGS.checkpoint_every == 0:
                    path = saver.save(sess, checkpoint_prefix, global_step=step)
                    print("Saved model checkpoint to {}\n".format(path))
Run the training:
train(x_train, y_train, vocab_processor, x_dev, y_dev)
Reference:
[NLP] 秒懂詞向量Word2vec的本質 (an intuitive explanation of what word2vec really is)