
Running Log | Miscellaneous Notes on Training ELMo Word Vectors for Chinese

Contents

  • 1 What is ELMo?
    • Key characteristics of ELMo
  • 2 Which projects are useful for training ELMo?
    • **Projects with a training pipeline**
    • **Pretrained models:**
  • 3 ELMo training workflow
    • 3.1 The training workflow
    • 3.2 How do you fine-tune ELMo to another domain?
    • 3.3 Ways to use ELMo
  • 4 English pretrained models
    • 4.1 Top pick: [Elmo Embeddings in Keras with TensorFlow hub](https://towardsdatascience.com/elmo-embeddings-in-keras-with-tensorflow-hub-7eb6f0145440)
    • 4.2 Official usage of allenai/bilm-tf
    • 4.3 UKPLab/elmo-bilstm-cnn-crf
    • 4.4 Using ELMo programmatically
  • 5 Chinese training and practical experience
    • 5.1 Related training projects
    • 5.2 Practical notes on ELMo
      • 5.2.1 Note one
      • 5.2.2 Note two
      • 5.2.3 Note three
      • 5.2.4 Note four

1 What is ELMo?

Reference: "Classic models and recent advances in text embeddings"

A large number of word-embedding methods have been proposed. The most commonly used models are word2vec and GloVe, both unsupervised methods based on the distributional hypothesis (words that appear in the same contexts tend to have similar meanings).

While some work augments these unsupervised approaches with supervised semantic or syntactic knowledge, purely unsupervised methods saw very interesting developments in 2017-2018, most notably FastText (an extension of word2vec) and ELMo (state-of-the-art contextual word vectors).


In ELMo, each word is given a representation that is a function of the entire sentence it appears in. The embeddings are computed from the internal states of a two-layer bidirectional language model (biLM), hence the name ELMo: Embeddings from Language Models.

ELMo embeddings paper: "Deep contextualized word representations" (Peters et al., NAACL 2018).


Key characteristics of ELMo:

  • ELMo's input is characters rather than words, so it can exploit sub-word units to compute meaningful representations even for out-of-vocabulary words (such as the token "FastText").
  • ELMo is a concatenation of the activations from several layers of the biLM. Different layers of the language model encode different kinds of information about a word (for example, in the bidirectional LSTM, part-of-speech tagging is captured better by the lower layers, while word-sense disambiguation is captured better by the upper layers). Concatenating all layers lets a downstream task freely combine these representations and improves performance; a small sketch of what "combining layers" looks like follows this list.
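As an illustration only, the numpy sketch below concatenates, and alternatively averages with task-specific weights, three hypothetical biLM layer outputs; the shapes and weights here are made up for illustration and are not tied to any particular checkpoint.

import numpy as np

# Hypothetical biLM output for one sentence: 3 layers x 7 tokens x 1024 dims.
# In practice this tensor would come from the biLM (e.g. ElmoEmbedder.embed_sentence).
layers = np.random.randn(3, 7, 1024).astype(np.float32)

# Option 1: concatenate all layers -> one 3072-dim vector per token.
concatenated = np.concatenate([layers[0], layers[1], layers[2]], axis=-1)  # (7, 3072)

# Option 2: task-specific weighted sum (the "scalar mix" of the ELMo paper),
# shown here with illustrative fixed weights; in the paper they are learned per task.
s = np.array([0.2, 0.5, 0.3], dtype=np.float32)
weighted = np.tensordot(s, layers, axes=1)  # (7, 1024)

print(concatenated.shape, weighted.shape)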

2 Which projects are useful for training ELMo?

Enough preamble; the theory is left to the reader. Since I did not want to re-implement everything from scratch, I went looking for shortcuts.

Projects with a training pipeline

It comes down to two projects: allenai/bilm-tf and UKPLab/elmo-bilstm-cnn-crf.

  • ELMo originates from allenai/bilm-tf, which targets Python 3.5.
  • TensorFlow 1.2 is enough to run it, although at times you also need to install their allennlp package; the original repo ships with its own training module.
  • Building on it, UKPLab adapted a version, UKPLab/elmo-bilstm-cnn-crf, configured for Python 3 + TensorFlow 1.8 and applied to a BiLSTM-CNN-CRF task. Since the two repos require different TensorFlow versions, it is best to use their Docker images.

Pretrained models:

In addition, TensorFlow Hub hosts an English pretrained model (in two versions, v1 and v2) that can be used directly, and several projects build on it:

  • Project one: PrashantRanjan09/WordEmbeddings-Elmo-Fasttext-Word2Vec, which compares several kinds of word vectors: 0 – Word2vec, 1 – Gensim FastText, 2 – Fasttext (FAIR), 3 – ELMo. It only calls the pretrained Hub model, though, and has no training module of its own.

  • Project two: strongio/keras-elmo, the "Elmo Embeddings in Keras with TensorFlow hub" tutorial, which trains a simple binary sentiment classifier in Keras on top of the Hub model. An excellent tutorial, but again it only calls the Hub model and provides no training module.

There are plenty of other small projects, mostly little applications built on the TF-Hub pretrained model (note that the Hub module requires TensorFlow 1.7+). Which raises the question: English has pretrained models, but what about Chinese?

I found searobbersduck/ELMo_Chin, whose author trained a model on a novel corpus. Following the instructions I was able to train as well, but the author only describes the training process, not how to use the trained model afterwards, so it needs to be read alongside the projects above.

3 ELMo training workflow

3.1 The training workflow

The answer given by AllenNLP; the steps for computing ELMo are as follows (a hedged sketch of the corresponding commands follows the list):

  • Prepare input data and a vocabulary file.
  • Train the biLM.
  • Test (compute the perplexity of) the biLM on heldout data.
  • Write out the weights from the trained biLM to a hdf5 file.(checkpoint -> hdf5)
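For concreteness, here is a rough sketch of how those steps map onto the helper scripts in allenai/bilm-tf, driven from Python via subprocess. The script names and flags (bin/train_elmo.py, bin/run_test.py, bin/dump_weights.py with --vocab_file, --train_prefix, --save_dir, --outfile) are taken from that repo's README as I remember it; treat them as assumptions and check the repo before running.

import subprocess

# Assumed locations; adjust to your own data and checkpoint directory.
vocab = 'vocab.txt'                # vocabulary file, one token per line
train_prefix = 'data/train/*'      # glob of sharded, tokenized training files
heldout_prefix = 'data/heldout/*'  # heldout shards for perplexity
save_dir = 'checkpoints/'          # where TensorFlow checkpoints are written

# 1-2. Prepare the data and vocabulary, then train the biLM.
subprocess.run(['python', 'bin/train_elmo.py',
                '--train_prefix', train_prefix,
                '--vocab_file', vocab,
                '--save_dir', save_dir], check=True)

# 3. Compute perplexity on heldout data.
subprocess.run(['python', 'bin/run_test.py',
                '--test_prefix', heldout_prefix,
                '--vocab_file', vocab,
                '--save_dir', save_dir], check=True)

# 4. Dump the trained weights (checkpoint -> hdf5).
subprocess.run(['python', 'bin/dump_weights.py',
                '--save_dir', save_dir,
                '--outfile', 'weights.hdf5'], check=True)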

3.2 How do you fine-tune ELMo to another domain?

First download the checkpoint files above. Then prepare the dataset as described in the section “Training a biLM on a new corpus”, with the exception that we will use the existing vocabulary file instead of creating a new one. Finally, use the script bin/restart.py to restart training with the existing checkpoint on the new dataset. For small datasets (e.g. < 10 million tokens) we only recommend tuning for a small number of epochs and monitoring the perplexity on a heldout set, otherwise the model will overfit the small dataset.
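A minimal sketch of that restart step, again assuming the bilm-tf script and flag names (bin/restart.py with --train_prefix, --vocab_file, --save_dir); verify them against the repo before use.

import subprocess

# Restart training from the downloaded checkpoint on the new in-domain corpus,
# reusing the existing vocabulary file rather than building a new one.
subprocess.run(['python', 'bin/restart.py',
                '--train_prefix', 'data/new_domain/*',
                '--vocab_file', 'vocab.txt',     # existing vocab, not a new one
                '--save_dir', 'checkpoints/'], check=True)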

3.3 Ways to use ELMo

From allennlp's "Using pre-trained models": there are three ways to integrate ELMo representations into a downstream task, depending on your use case (the mode used later in this post is the third: vectorize the whole passage/dataset in one pass and save it).

  • Compute representations on the fly from raw text using character input. This is the most general method and will handle any input text. It is also the most computationally expensive.
  • Precompute and cache the context-independent token representations, then compute context-dependent representations using the biLSTMs for input data. This method is less computationally expensive than #1, but is only applicable with a fixed, prescribed vocabulary.

  • Precompute the representations for your entire dataset and save to a file.

4 English pretrained models

To get interested readers started, here is a short rundown of how to use the English pretrained models.

4.1 Top pick: Elmo Embeddings in Keras with TensorFlow hub

The code comes from strongio/keras-elmo; only the key parts are shown:

# Import our dependencies
import tensorflow as tf
import pandas as pd
import tensorflow_hub as hub
import os
import re
from keras import backend as K
import keras.layers as layers
from keras.models import Model
import numpy as np

# Initialize session
sess = tf.Session()
K.set_session(sess)

# Now instantiate the elmo model
elmo_model = hub.Module("https://tfhub.dev/google/elmo/1", trainable=True)
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())

# Build our model

# We create a function to integrate the tensorflow model with a Keras model
# This requires explicitly casting the tensor to a string, because of a Keras quirk
def ElmoEmbedding(x):
    return elmo_model(tf.squeeze(tf.cast(x, tf.string)), signature="default", as_dict=True)["default"]
 
input_text = layers.Input(shape=(1,), dtype=tf.string)
embedding = layers.Lambda(ElmoEmbedding, output_shape=(1024,))(input_text)
dense = layers.Dense(256, activation='relu')(embedding)
pred = layers.Dense(1, activation='sigmoid')(dense)

model = Model(inputs=[input_text], outputs=pred)

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

# Fit!
model.fit(train_text, 
          train_label,
          validation_data=(test_text, test_label),
          epochs=5,
          batch_size=32)

>>> Train on 25000 samples, validate on 25000 samples
>>> Epoch 1/5
>>>  1248/25000 [>.............................] - ETA: 3:23:34 - loss: 0.6002 - acc: 0.6795           


The key steps are opening the Hub module with `hub.Module("https://tfhub.dev/google/elmo/1", trainable=True)` and wrapping it into the Keras model with `embedding = layers.Lambda(ElmoEmbedding, output_shape=(1024,))(input_text)` to obtain the ELMo embeddings.

4.2 Official usage of allenai/bilm-tf

These are the three usage modes mentioned in section 3: usage_cached.py, usage_character.py and usage_token.py. The snippet below follows the token-based variant (usage_token.py):

import tensorflow as tf
import os
from bilm import TokenBatcher, BidirectionalLanguageModel, weight_layers, \
    dump_token_embeddings

# Paths to the vocabulary, options and weight files produced by training
# (placeholders; point them at your own files).
vocab_file = 'vocab.txt'
options_file = 'options.json'
weight_file = 'weights.hdf5'

# Dump the token embeddings to a file. Run this once for your dataset.
token_embedding_file = 'elmo_token_embeddings.hdf5'
dump_token_embeddings(
    vocab_file, options_file, weight_file, token_embedding_file
)
tf.reset_default_graph()

# Create a TokenBatcher to map tokenized sentences to token ids.
batcher = TokenBatcher(vocab_file)

# Input placeholders for the context and question token ids.
context_token_ids = tf.placeholder('int32', shape=(None, None))
question_token_ids = tf.placeholder('int32', shape=(None, None))

# Build the biLM graph.
bilm = BidirectionalLanguageModel(
    options_file,
    weight_file,
    use_character_inputs=False,
    embedding_weight_file=token_embedding_file
)

# Get ops to compute the LM embeddings.
context_embeddings_op = bilm(context_token_ids)
question_embeddings_op = bilm(question_token_ids)

elmo_context_input = weight_layers('input', context_embeddings_op, l2_coef=0.0)
with tf.variable_scope('', reuse=True):
    # Reuse the same ELMo scalar weights for the question.
    elmo_question_input = weight_layers('input', question_embeddings_op, l2_coef=0.0)

# Tokenized inputs (replace with your own data).
tokenized_context = [['Pretrained', 'biLMs', 'compute', 'representations', '.']]
tokenized_question = [['What', 'are', 'biLMs', '?']]

# run
with tf.Session() as sess:
    # It is necessary to initialize variables once before running inference.
    sess.run(tf.global_variables_initializer())

    # Create batches of data.
    context_ids = batcher.batch_sentences(tokenized_context)
    question_ids = batcher.batch_sentences(tokenized_question)

    # Compute ELMo representations (here for the input only, for simplicity).
    elmo_context_input_, elmo_question_input_ = sess.run(
        [elmo_context_input['weighted_op'], elmo_question_input['weighted_op']],
        feed_dict={context_token_ids: context_ids,
                   question_token_ids: question_ids}
    )


4.3 UKPLab/elmo-bilstm-cnn-crf

From elmo-bilstm-cnn-crf/Keras_ELMo_Tutorial.ipynb; like the top pick above, it trains a binary classifier in Keras. ELMo is included as a preprocessing step:

  • We read in the dataset (here the IMDB dataset)
  • Text is tokenized and truncated to a fix length
  • Each text is fed as a sentence to the AllenNLP ElmoEmbedder to get a 1024 dimensional embedding for each word in the document
  • These embeddings are then fed to the neural network that we train

import keras
import os
import sys
from allennlp.commands.elmo import ElmoEmbedder
import numpy as np
import random
from keras.models import Sequential
from keras.layers import Dense, Conv1D, GlobalMaxPooling1D, GlobalAveragePooling1D, Activation, Dropout

# Lookup the ELMo embeddings for all documents (all sentences) in our dataset. Store those
# in a numpy matrix so that we must compute the ELMo embeddings only once.
def create_elmo_embeddings(elmo, documents, max_sentences = 1000):
    num_sentences = min(max_sentences, len(documents)) if max_sentences > 0 else len(documents)
    print("\n\n:: Lookup of "+str(num_sentences)+" ELMo representations. This takes a while ::")
    embeddings = []
    labels = []
    tokens = [document['tokens'] for document in documents]
    
    documentIdx = 0
    for elmo_embedding in elmo.embed_sentences(tokens):  
        document = documents[documentIdx]
        # Average the 3 layers returned from ELMo
        avg_elmo_embedding = np.average(elmo_embedding, axis=0)
             
        embeddings.append(avg_elmo_embedding)        
        labels.append(document['label'])
            
        # Some progress info
        documentIdx += 1
        percent = 100.0 * documentIdx / num_sentences
        line = '[{0}{1}]'.format('=' * int(percent / 2), ' ' * (50 - int(percent / 2)))
        status = '\r{0:3.0f}%{1} {2:3d}/{3:3d} sentences'
        sys.stdout.write(status.format(percent, line, documentIdx, num_sentences))
        
        if max_sentences > 0 and documentIdx >= max_sentences:
            break
            
    return embeddings, labels


# train_data / test_data are lists of {'tokens': [...], 'label': 0/1} dicts produced
# by the notebook's IMDB reading and tokenization step.
elmo = ElmoEmbedder(cuda_device=1) #Set cuda_device to the ID of your GPU if you have one
train_x, train_y = create_elmo_embeddings(elmo, train_data, 1000)
test_x, test_y  = create_elmo_embeddings(elmo, test_data, 1000)

# :: Pad the x matrix to uniform length ::
max_tokens = 100  # maximum document length; the notebook fixes this during truncation (value assumed here)
def pad_x_matrix(x_matrix):
    for sentenceIdx in range(len(x_matrix)):
        sent = x_matrix[sentenceIdx]
        sentence_vec = np.array(sent, dtype=np.float32)
        padding_length = max_tokens - sentence_vec.shape[0]
        if padding_length > 0:
            x_matrix[sentenceIdx] = np.append(sent, np.zeros((padding_length, sentence_vec.shape[1])), axis=0)

    matrix = np.array(x_matrix, dtype=np.float32)
    return matrix

train_x = pad_x_matrix(train_x)
train_y = np.array(train_y)

test_x = pad_x_matrix(test_x)
test_y = np.array(test_y)

print("Shape Train X:", train_x.shape)
print("Shape Test Y:", test_x.shape)

# Simple model for sentence / document classification using CNN + global max pooling
model = Sequential()
model.add(Conv1D(filters=250, kernel_size=3, padding='same', input_shape=(max_tokens, 1024)))
model.add(GlobalMaxPooling1D())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_x, train_y, validation_data=(test_x, test_y), epochs=10, batch_size=32)           


Here it is `ElmoEmbedder` that loads the pretrained TensorFlow model.

4.4 Using ELMo programmatically

A snippet from allennlp's "Using ELMo programmatically":

from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

elmo = Elmo(options_file, weight_file, 2, dropout=0)

# use batch_to_ids to convert sentences to character ids
sentences = [['First', 'sentence', '.'], ['Another', '.']]
character_ids = batch_to_ids(sentences)

embeddings = elmo(character_ids)

# embeddings['elmo_representations'] is length two list of tensors.
# Each element contains one layer of ELMo representations with shape
# (2, 3, 1024).
#   2    - the batch size
#   3    - the sequence length of the batch
#   1024 - the length of each ELMo vector           


If you are not training a pytorch model, and just want numpy arrays as output then use allennlp.commands.elmo.ElmoEmbedder.
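For example, a minimal sketch of that numpy-only route (with no arguments the ElmoEmbedder constructor downloads AllenNLP's default English options and weights; pass options_file/weight_file to use another model):

from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # or ElmoEmbedder(options_file=..., weight_file=...)

# embed_sentence takes a list of tokens and returns a numpy array of shape
# (3 layers, num_tokens, 1024); no pytorch tensors on the caller's side.
vectors = elmo.embed_sentence(['First', 'sentence', '.'])
print(vectors.shape)  # (3, 3, 1024)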

5 Chinese training and practical experience

5.1 Related training projects

There are three sources for training Chinese ELMo.

(1) searobbersduck/ELMo_Chin, although there seem to be some issues along the way whose cause I have not yet confirmed.

(2) The blog post "How to use ELMo word vectors for Chinese", which uses GloVe vectors for initialization. The workflow is as follows (a hedged sketch of building the initial embedding matrix follows the list):

  • Load the pretrained word vectors.
  • Modify the bilm-tf code:
    • the options section;
    • add code that assigns initial values to the embedding weights;
    • add code that saves the embedding weights.
  • Start training to obtain the checkpoint and option files.
  • Run the script to obtain the language-model weight file.
  • Save the embedding weights as an hdf5 file.
  • Run the script to convert the corpus into ELMo embeddings.
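As a hedged illustration of the first step only, the sketch below builds an initial embedding matrix from a GloVe-style text file in the order of the ELMo vocabulary file. The file names and dimensionality are placeholders; the actual assignment of this matrix to the biLM's embedding weight is the part of bilm-tf that the blog post modifies.

import numpy as np

embedding_dim = 300                      # dimensionality of the pretrained vectors (placeholder)
vocab_file = 'vocab_seg_words_elmo.txt'
glove_file = 'glove_chinese_300d.txt'    # hypothetical GloVe-format file: "word v1 v2 ... v300"

# Read the GloVe vectors into a dict.
glove = {}
with open(glove_file, encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

# Build the initial embedding matrix in vocabulary order; words missing from
# GloVe (including <S>, </S>, <UNK>) keep small random vectors.
with open(vocab_file, encoding='utf-8') as f:
    vocab = [line.strip() for line in f]

init_embedding = np.random.uniform(-0.1, 0.1, (len(vocab), embedding_dim)).astype(np.float32)
for i, word in enumerate(vocab):
    if word in glove:
        init_embedding[i] = glove[word]

np.save('init_embedding.npy', init_embedding)  # later assigned to the biLM's embedding weight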

(3) HIT-SCIR/ELMoForManyLangs, the multilingual ELMo models from HIT's entry in this year's CoNLL shared task, which include Traditional Chinese.

The tutorial mainly requires preparing three things:

  • The vocabulary file vocab_seg_words_elmo.txt. The file must begin with <S> </S> <UNK> (case sensitive), and the remaining words should be sorted by descending frequency; the three special symbols can simply be added by hand. For example:
立足
酸甜
冷笑
吃飯
市民
熟
金剛
日月同輝
光           


  • The tokenized data source vocab_seg_words.txt:

有 德克士 吃 [ 色 ] , 心情 也 是 開朗 的
首選 都 是 德克士 [ 酷 ] [ 酷 ]
德克士 好樣 的 , 偶 也 發現 了 鮮萃 檸檬 飲 。
有 德克士 , 能 讓 你 真正 的 幸福 哦
以後 多 給 我們 推出 這麼 到位 的 搭配 , 德克士 我們 等 着
貼心 的 德克士 , 吃貨 們 分享 起來
又 學到 好 知識 了 , 感謝 德克士 [ 吃驚 ]
德克士 一直 久存 于心           


  • The configuration file option.json, in which a few parameters deserve attention (a hedged sketch of such an options dict follows this list):

    • n_train_tokens: the total number of tokens in the training set;
    • max_characters_per_token: the maximum character length of a single token;
    • n_tokens_vocab: the size of the vocabulary in vocab_seg_words_elmo.txt;
    • n_characters: n_tokens_vocab + max_characters_per_token - 1 (the author is not certain about this).
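For orientation, here is a hedged sketch of such an options dict in roughly the shape bilm-tf expects; the values are illustrative only. Note that in the official bilm-tf training example n_characters is 261 (with 262 used at inference time), so double-check the repo rather than relying on the formula above.

import json

options = {
    'bidirectional': True,
    'dropout': 0.1,
    'n_epochs': 10,
    'batch_size': 128,
    'n_train_tokens': 20000000,       # total tokens in the training corpus
    'n_tokens_vocab': 50000,          # number of lines in vocab_seg_words_elmo.txt
    'unroll_steps': 20,               # sequence length for truncated BPTT
    'char_cnn': {
        'activation': 'relu',
        'embedding': {'dim': 16},
        'filters': [[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]],
        'max_characters_per_token': 50,
        'n_characters': 261,          # value used by the official training example (262 at inference)
        'n_highway': 2,
    },
    'lstm': {
        'dim': 4096,
        'n_layers': 2,
        'projection_dim': 512,
        'cell_clip': 3,
        'proj_clip': 3,
        'use_skip_connections': True,
    },
}

with open('option.json', 'w') as f:
    json.dump(options, f, indent=2)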

As for how to use the trained model, I followed usage_token.py from the three scripts in section 4.2; the other two kept throwing errors for me.

5.2 Practical notes on ELMo

5.2.1 Note one

Some of these answers come from 劉一佳 on Zhihu. HIT's ELMo models were trained on raw corpora of about 20M words, partly with their own training code, which is a bit more GPU-memory-efficient and somewhat more stable in training than bilm-tf. They also share the following observations:

  • For syntactic tasks, ELMo helps most on data with a high OOV rate; across languages, the OOV rate correlates most clearly with the gain ELMo brings.
  • ELMo does better when training data is scarce, or in near zero-shot / one-shot settings.
  • With plenty of training data, e.g. the DuReader dataset, ELMo brings little benefit.
  • Some companies that tried it saw clear gains and even put it into production, while others saw little improvement; it depends on the specific task.

5.2.2 Note two

The blog post "吾愛NLP (5): word-vector techniques, from word2vec to ELMo" explains the biggest difference between ELMo and word2vec:

Contextual: The representation for each word depends on the entire context in which it is used. 

(That is, a word's vector is not fixed; it changes with the context, which is a major difference from word2vec or GloVe.)

For example, take the polysemous word w = "蘋果" (apple):
Text sequence 1 = "我 買了 六斤 蘋果。" ("I bought six jin of apples.")
Text sequence 2 = "我 買了一個 蘋果 7。" ("I bought an Apple (iPhone) 7.")
The word "蘋果" appears in both sequences, but its meaning is clearly different: one belongs to the fruit domain, the other to consumer electronics. What if we want vectors for "蘋果" that capture each sense in its own context? The answer is to use ELMo.

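To make the point concrete, here is a small hedged sketch using the English AllenNLP model from section 4.4 (an English analogue of the 蘋果 example, since it avoids needing a Chinese checkpoint): the same token "apple" gets different ELMo vectors in a fruit context and a phone context. The sentences and weights are made up for illustration.

import numpy as np
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # default English model; for Chinese, pass your own options/weight files

sent_fruit = ['I', 'ate', 'an', 'apple', 'for', 'lunch', '.']
sent_phone = ['I', 'bought', 'a', 'new', 'apple', 'phone', '.']

# embed_sentence returns (3 layers, num_tokens, 1024); average the layers.
fruit_vec = np.average(elmo.embed_sentence(sent_fruit), axis=0)[3]  # vector for "apple" (fruit)
phone_vec = np.average(elmo.embed_sentence(sent_phone), axis=0)[4]  # vector for "apple" (phone)

cosine = np.dot(fruit_vec, phone_vec) / (np.linalg.norm(fruit_vec) * np.linalg.norm(phone_vec))
print(cosine)  # noticeably below 1.0: same surface form, different contextual vectors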

5.2.3 Note three

The blog post "NAACL 2018, a new embedding method (principles and experiments): Deep contextualized word representations (ELMo)" notes that:

  • ELMo works very well: on SQuAD it gave me roughly a 3-percentage-point gain in accuracy. Because the embeddings are context dependent, it goes some way towards handling polysemy.

  • ELMo is very slow. My dataset contains about 100,000 short documents of roughly 400 words each. With a batch size of 32, encoding with GloVe vectors and passing through 3 biLSTMs, 3 Linear layers and 3 softmax/log-softmax layers (ignoring dropout, ReLU and the like), one epoch including backprop takes about 15 minutes on a 1080Ti (roughly the same on a Titan XP). With ELMo, the encoding alone takes nearly an hour; if ELMo is used throughout, the very high dimensionality drives GPU memory usage up so much that several cards are needed, and with the scheduling and data-transfer overhead between cards one epoch takes 2+ hours on 4 GPUs.

The efficiency workaround proposed in that post:

Although ELMo encodes the same word differently in different contexts, its output is fixed once the context is fixed (assuming the LM parameters are not updated by backprop). The post also notes that different tasks are sensitive to different LM layers; SQuAD, for example, is mainly sensitive to the first and second layers. So when caching we can store only part of the ELMo encoding: keeping just the first two layers for SQuAD cuts storage by one third, to about 320 GB. And if the per-layer sensitivities (the $s_j^{task}$ mentioned above) are fixed in advance for a dataset, the three layer outputs can be collapsed directly into a single 1024-dimensional vector with $\sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}$, which brings the storage requirement down to about 160 GB.
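A rough numpy sketch of that caching trick, with the storage arithmetic in the comments; the weights and sizes are illustrative, not taken from the post.

import numpy as np

# Rough storage arithmetic for caching full ELMo outputs as float32:
# 100,000 docs x 400 tokens x 3 layers x 1024 dims x 4 bytes is roughly 460-490 GB.
# Keeping 2 of 3 layers cuts that by a third (about 320 GB); collapsing the layers
# with fixed weights s_j leaves one 1024-dim vector per token (about 160 GB).

def compress_elmo(layer_outputs, s):
    """layer_outputs: (3, num_tokens, 1024) array from the biLM for one document;
    s: fixed task-specific layer weights, e.g. estimated once on a dev set."""
    return np.tensordot(s, layer_outputs, axes=1)  # (num_tokens, 1024)

doc = np.random.randn(3, 400, 1024).astype(np.float32)   # stand-in for one encoded document
s_task = np.array([0.1, 0.6, 0.3], dtype=np.float32)     # illustrative weights (sum to 1)
cached = compress_elmo(doc, s_task)
print(cached.shape, cached.nbytes / 1024 / 1024, 'MB per document')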

5.2.4 Note four

Improving NLP tasks by transferring knowledge from large data

To conclude, both papers prove that NLP tasks can benefit from large data. By leveraging either parallel MT corpus or monolingual corpus, there are several killer features in contextual word representation:

  • Model is able to disambiguate the same word into different representation based on its context.

  • Thanks to the character-based convolution, representation of out-of-vocabulary tokens can be derived from morphological clues.

However, ELMo can be a cut above CoVe not only because of the performance improvement in tasks, but the type of training data. Because eventually, data is what matters the most in industry. Monolingual data do not require as much of the annotation thus can be collected more easily and efficiently.

Some evaluation results (figure omitted).
