Deep Learning with Python: Study Notes (11)

Classifying IMDB reviews as positive or negative with pretrained word embeddings

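The model in this section resembles the one from the previous section: embed each sentence as a sequence of vectors, flatten the result, and train a Dense layer on top. This time, though, we use pretrained word embeddings. We also start from scratch by downloading the raw IMDB text data instead of using the pre-tokenized IMDB data built into Keras.

First, download the raw IMDB dataset from http://mng.bz/0tIo and extract it.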

The folder structure looks like this:

aclImdb
├─test
│ ├─neg
│ └─pos
└─train
  ├─neg
  └─pos

Opening any text file in a neg folder shows the raw text of one review. Unlike the IMDB dataset built into Keras, nothing here has been converted from words to integer sequences for us.

Processing the labels of the raw IMDB data

# Load the raw training reviews and their labels
import os

imdb_dir = 'C:\\Users\\Administrator\\Desktop\\Keras_learn\\aclImdb'
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            # The with-block closes the file even if reading fails
            with open(os.path.join(dir_name, fname), encoding='UTF-8') as f:
                texts.append(f.read())
            # 0 = negative review, 1 = positive review
            labels.append(0 if label_type == 'neg' else 1)

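This reads the text of every file in the neg and pos folders of the training set and appends a label at the matching index of labels: 0 for a negative review, 1 for a positive one.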
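The loading loop can be tried without the real dataset. The sketch below wraps the same logic in a helper, load_reviews (a name introduced here, not part of the original notes), and runs it on a tiny temporary directory whose file names and review texts are made up for illustration.

```python
import os
import tempfile


def load_reviews(split_dir):
    """Return (texts, labels) for a directory holding neg/ and pos/ subfolders."""
    texts, labels = [], []
    for label, label_type in enumerate(['neg', 'pos']):  # neg -> 0, pos -> 1
        dir_name = os.path.join(split_dir, label_type)
        for fname in sorted(os.listdir(dir_name)):
            if fname.endswith('.txt'):
                with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                    texts.append(f.read())
                labels.append(label)
    return texts, labels


# Build a tiny stand-in for aclImdb/train (contents are invented)
tmp = tempfile.mkdtemp()
samples = {'neg': ['Dull and far too long.'],
           'pos': ['A lovely film.', 'Great cast.']}
for label_type, reviews in samples.items():
    os.makedirs(os.path.join(tmp, label_type))
    for i, review in enumerate(reviews):
        path = os.path.join(tmp, label_type, '%d_0.txt' % i)
        with open(path, 'w', encoding='utf-8') as f:
            f.write(review)

texts, labels = load_reviews(tmp)
print(len(texts), labels)
```

Sorting the file names is an addition for reproducibility; os.listdir makes no ordering guarantee, which does not matter in the notes because the data is shuffled later.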

Let's check that the import worked:

print(len(texts))
print(texts[0])

The first value printed is the number of reviews: positive and negative together, there are 25,000.

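Processing the data

Keras has a built-in tokenizer that assigns an integer index to each word and can restrict the vocabulary to the most frequent words.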

# Tokenize and pad the reviews
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 100                # keep at most 100 tokens per review
training_samples = 200      # train on 200 samples
validation_samples = 10000  # validate on 10,000 samples
max_words = 10000           # consider only the 10,000 most frequent words

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)  # words -> integer sequences

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# Shorter sequences are padded with 0. By default pad_sequences pads and
# truncates at the front, so long reviews keep their last maxlen tokens.
data = pad_sequences(sequences, maxlen=maxlen)

labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# Shuffle, because the files were read with all neg reviews first, then pos.
# np.random.shuffle permutes in place; np.random.choice would sample with
# replacement, duplicating some reviews and dropping others.
indices = np.arange(data.shape[0])
np.random.shuffle(indices)

data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples:training_samples + validation_samples]
y_val = labels[training_samples:training_samples + validation_samples]

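To make the tokenizer and padding steps concrete, here is a rough pure-Python approximation of what they do. It is a sketch, not the Keras implementation: it splits on whitespace only (the real Tokenizer also strips punctuation), and the helper names and toy sentences are invented for illustration.

```python
from collections import Counter


def fit_word_index(texts):
    """Index words by descending frequency, starting at 1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]


def pad(seq, maxlen):
    """Keep the last maxlen items and left-pad with 0, mirroring pad_sequences' defaults."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


toy_texts = ['the film was good', 'the film was bad bad bad', 'good']
toy_index = fit_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index, num_words=5)
toy_data = [pad(s, maxlen=4) for s in toy_sequences]

print(toy_index)
for row in toy_data:
    print(row)
```

With num_words=5 only indices 1 to 4 survive, so the fifth-ranked word is dropped from every sequence. The second review is longer than maxlen and keeps only its last four tokens, which is the front truncation described above.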

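Downloading the GloVe word embeddings

Go to https://nlp.stanford.edu/projects/glove and download the precomputed embeddings trained on 2014 English Wikipedia. The download is an 822 MB archive named glove.6B.zip containing 100-dimensional embedding vectors for 400,000 words (or non-word tokens). Extract the archive.

Loading the embedding parameters

We load the word-to-vector mapping into a dictionary.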

# Parse the GloVe word-embedding file
glove_dir = 'C:\\Users\\Administrator\\Desktop\\Keras_learn\\glove'
# Each line of the file holds a word followed by its embedding vector
embeddings_index = {}
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]                                 # the word
        coefs = np.asarray(values[1:], dtype='float32')  # its embedding vector
        embeddings_index[word] = coefs                   # word -> vector

print('Found %s word vectors.' % len(embeddings_index))

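The file layout the parser relies on is one token per line followed by its coefficients, separated by spaces. The sketch below applies the same parsing to three made-up lines held in memory, using 3-dimensional vectors instead of 100 and plain lists instead of NumPy arrays so that it runs without the download.

```python
import io

# Three invented lines in the GloVe text layout: a token, then its coefficients
fake_glove = io.StringIO(
    "the 0.1 -0.2 0.3\n"
    "film 0.4 0.5 -0.6\n"
    ", 0.0 0.7 0.8\n"
)

toy_embeddings = {}
for line in fake_glove:
    values = line.split()
    # First field is the token; the rest are the vector components
    toy_embeddings[values[0]] = [float(v) for v in values[1:]]

print(len(toy_embeddings), toy_embeddings['film'])
```

The third line shows why the notes say "words (or non-word tokens)": punctuation marks can appear as tokens too.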

Preparing the GloVe embedding matrix

To load these parameters into the model, the dictionary has to be converted into a matrix.

# Build the GloVe embedding matrix: row i holds the vector of the word with index i
embedding_dim = 100

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
            # Words not found in GloVe keep an all-zero embedding

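Note that every word is mapped to a 100-dimensional vector, and that row 0 stays all zeros because index 0 is reserved for padding rather than assigned to a word.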
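A toy version of the same loop shows which rows end up filled. All names and values below are hypothetical (a 4-row, 3-dimensional matrix and an invented vocabulary), and nested lists stand in for the NumPy array so the sketch has no dependencies.

```python
max_words_toy = 4  # hypothetical vocabulary cap (the notes use 10,000)
dim_toy = 3        # hypothetical embedding size (the notes use 100)

# Invented tokenizer output and invented pretrained vectors
toy_word_index = {'the': 1, 'film': 2, 'zzyzx': 3, 'rare': 4}
toy_vectors = {
    'the': [0.1, -0.2, 0.3],
    'film': [0.4, 0.5, -0.6],
    'rare': [0.9, 0.9, 0.9],
}

matrix = [[0.0] * dim_toy for _ in range(max_words_toy)]
for word, i in toy_word_index.items():
    if i < max_words_toy:                 # skip words beyond the vocabulary cap
        vector = toy_vectors.get(word)
        if vector is not None:            # words without a vector stay all zero
            matrix[i] = list(vector)

for row in matrix:
    print(row)
```

Row 0 is the padding row, row 3 stays zero because 'zzyzx' has no pretrained vector, and 'rare' is skipped because its index falls outside the cap even though a vector exists for it.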

Defining the model

# Model definition
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(max_words,embedding_dim,input_length=maxlen))
#max_words==10 000, embedding_dim==100, maxlen==100
model.add(Flatten())
model.add(Dense(32,activation='relu'))
model.add(Dense(1,activation='sigmoid'))
model.summary()


Loading the pretrained word embeddings into the Embedding layer

# The first layer of the model is the Embedding layer
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False
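To see what freezing the Embedding layer changes, the parameter counts can be worked out by hand from the layer sizes in the model definition. This is plain arithmetic under the usual Dense layout (one weight per input-output pair plus one bias per unit), not output copied from model.summary().

```python
max_words, embedding_dim, maxlen = 10000, 100, 100

embedding_params = max_words * embedding_dim  # one 100-d row per word index
flatten_size = maxlen * embedding_dim         # 100 tokens x 100 dims, no parameters
dense1_params = flatten_size * 32 + 32        # weights + biases of Dense(32)
dense2_params = 32 * 1 + 1                    # weights + bias of Dense(1)

total = embedding_params + dense1_params + dense2_params
trainable_after_freeze = dense1_params + dense2_params

print(total)                   # 1320065
print(trainable_after_freeze)  # 320065
```

About three quarters of the parameters sit in the Embedding layer, so freezing it leaves far fewer weights to fit from only 200 training samples.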

Compiling and training the model

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train,y_train,
                   epochs=10,
                   batch_size=32,
                   validation_data=(x_val,y_val))
model.save_weights('pre_trained_glove_model.h5')


Plotting the results

# Plot the training and validation curves
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1,len(acc)+1)

plt.figure("acc")
plt.plot(epochs,acc,'bo',label="Training acc")
plt.plot(epochs,val_acc,'b',label="Validation acc")
plt.title("Training and validation accuracy")
plt.legend()

plt.figure("loss")
plt.plot(epochs,loss,'bo',label="Training loss")
plt.plot(epochs,val_loss,'b',label="Validation loss")
plt.title("Training and validation loss")
plt.legend()

plt.show()

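The plots make it clear that with only 200 training samples the model overfits heavily: training accuracy approaches 1, while validation accuracy is only about 0.56. Looked at another way, rising above chance level (0.5 on this binary task) from just 200 training samples is no small feat.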
