How to Use BERT for Lexical Simplification
Lexical Simplification (LS) is the task of replacing complex words with simpler alternatives, which can help various groups of people, such as children and non-native speakers, better understand a given text.
Introduction
In this article I'll show how to build a lexical simplifier using NLTK, BERT, and Python; the main ideas are taken from this paper.
This project will be addressed in 3 steps:
1. Identify complex words in a given sentence: build a model that can detect or identify possibly complex words, a task called complex word identification (CWI).
2. Generate candidates: use BERT's masked language model to get possible replacement candidates.
3. Select the best candidates based on Zipf values: compute the Zipf value of each candidate and select the simplest one.
The diagram of the workflow is as follows:

1. Identify complex words
The first goal is to be able to select the words in a given sentence that should be simplified; for that we need to train a model able to detect complex words in sentences. To train this model we are going to use the labeled dataset you can find on this page, and we are going to use a sequential architecture based on a BiLSTM, which provides contextual information from both the left and right context of a target word.
After loading the dataset you'll end up with the following lengths:
And the structure:
You can see that the dataset has sentences with binary-labeled words (1 if the word is considered complex); we will use this to train the CWI model.
As usual in NLP, we need to apply some preprocessing to the text in order to feed the data into a deep learning model:
- Clean the text: delete non-alphanumeric characters, lowercase all words, etc.
- Create the vocabulary.
- Get the embedding vectors (for this article we are using GloVe); see the sketch below.
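Since the original post shows these steps only as screenshots, here is a minimal sketch of what they might look like; the helper names (clean_text, build_vocab, load_glove_matrix) and the GloVe file path are illustrative assumptions, not the notebook's exact code:

# Illustrative preprocessing sketch (function names are assumptions, not the notebook's exact API)
import re
import numpy as np

def clean_text(text):
    # lowercase and strip non-alphanumeric characters
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())

def build_vocab(sentences):
    # map every token to an integer id, reserving 0/1 for padding and unknown words
    vocab = {"<pad>": 0, "<unk>": 1}
    for sentence in sentences:
        for token in clean_text(sentence).split():
            vocab.setdefault(token, len(vocab))
    return vocab

def load_glove_matrix(glove_path, vocab, dim=300):
    # build an embedding matrix of shape (len(vocab), dim) from pretrained GloVe vectors
    matrix = np.random.normal(scale=0.1, size=(len(vocab), dim))
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] in vocab:
                matrix[vocab[parts[0]]] = np.asarray(parts[1:], dtype="float32")
    return matrix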
After preprocessing the dataset, we got the following:
i.e., a vocabulary of 52K words and an embedding matrix of shape 52K x 300.
Create the model for CWI
Next, we create the following model using Keras:
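The original post shows the model definition as an image; a minimal Keras sketch along those lines (layer sizes and max_len are assumptions, and embedding_matrix is the GloVe matrix built during preprocessing) could look like this:

from tensorflow.keras import Input, Sequential
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Embedding, TimeDistributed

max_len = 100   # padded sentence length (assumption)
vocab_size, embedding_dim = embedding_matrix.shape   # e.g. 52K x 300 from the GloVe step

model_cwi = Sequential([
    Input(shape=(max_len,)),
    Embedding(vocab_size, embedding_dim,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False),
    Bidirectional(LSTM(64, return_sequences=True)),
    # one complex / not-complex decision per token
    TimeDistributed(Dense(2, activation="softmax")),
])
model_cwi.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
model_cwi.summary()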
It is a simple BiLSTM model; we chose it for this post just to give you an idea of how to approach the problem. This model is going to be used only to identify possible complex words in a given sentence.
And then train it:
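Training is then a standard Keras fit call; a sketch, assuming X_train holds the padded token-id sequences and y_train the per-token 0/1 complexity labels (batch size and epoch count are arbitrary):

history = model_cwi.fit(
    X_train, y_train,            # padded token ids and per-token 0/1 complexity labels
    validation_split=0.2,
    batch_size=32,
    epochs=5,
)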
After training the CWI model, we will use BERT to generate candidates for the words the CWI model identified as complex; the candidates are based not only on synonyms of the word but also on its context.
2. Generate candidates using BERT
BERT is a powerful pretrained model from Google; however, this article is not about how BERT works. If you need some basic knowledge about it, please go to this blog.
We are using one of BERT's tasks, masked language modeling (MLM), which predicts missing tokens in a sequence given its left and right context.
For example, given the following sentence:
The sentence is modified (masked) before being put into BERT:
As we said, one of the tasks BERT was trained for is to predict the [MASK] word, so if we feed the masked sentence into a BERT model, it will output the most probable word given the context, for example:
We mask the complex words of each sentence and get the probability distribution over the vocabulary corresponding to the masked word.
As the paper suggests, we'll concatenate the original sequence and the sequence where we replace the complex word with the mask token as a sentence pair, and feed the sentence pair into BERT to obtain the probability distribution over the vocabulary corresponding to the masked word.
By using the sentence-pair approach, we not only consider the complex word itself, but also fit the context of the complex word:
Consider the sentence:
The complex word in this sentence is "thwart"; to get the simplest replacement candidates we will feed it into BERT in this way:
As we can see, we feed the sentence pair into BERT to obtain the masked word; in this case BERT gives us "stop", which is an appropriate, simpler replacement.
To do that, we'll use PyTorch and the BertForMaskedLM class from the very useful Hugging Face library; with that, using BERT's masked language modeling is as simple as:
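Here is a minimal sketch using the Hugging Face transformers library; get_candidates is an illustrative stand-in for the notebook's get_bert_candidates helper, and the example sentence is hypothetical since the article's own example appears only as an image:

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")
bert_mlm.eval()

def get_candidates(sentence, complex_word, num_candidates=10):
    # sentence pair: the original sentence plus a copy with the complex word masked
    masked = sentence.replace(complex_word, tokenizer.mask_token)
    inputs = tokenizer(sentence, masked, return_tensors="pt")
    mask_position = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = bert_mlm(**inputs).logits
    top_ids = logits[0, mask_position].topk(num_candidates, dim=-1).indices[0]
    return tokenizer.convert_ids_to_tokens(top_ids.tolist())

# hypothetical usage
print(get_candidates("A new gate was installed to thwart intruders.", "thwart"))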
3. Select the best candidates based on Zipf values
We use the Zipf frequency of a word, which is the base-10 logarithm of the number of times it appears per billion words, to rank the replacement candidates. The greater the value, the more common or familiar the word is.
To give you an idea, let's get the Zipf value for the simple word "stop":
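With the wordfreq package this is a one-liner (the exact number depends on the wordfreq version):

from wordfreq import zipf_frequency

print(zipf_frequency("stop", "en"))   # a common word: a high Zipf value, roughly 6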
Now a more complex/uncommon word, "thwart":
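And for the less common word, under the same assumptions:

print(zipf_frequency("thwart", "en"))   # an uncommon word: noticeably lower, roughly 3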
You can see that the word "stop" is more common than "thwart", so the chances that people are familiar with the word "stop" are higher.
Now you get the idea: for each candidate generated by BERT we compute the Zipf score using the Python package wordfreq.
In summary, for our lexical simplifier, we:
- Get 10 context-aware candidate words from BERT.
- Compute their Zipf values.
- Sort the candidates from highest to lowest Zipf value.
- Replace each complex word with its simpler counterpart.
The code snippet for these steps:
# assumes model_cwi, process_input, complete_missing_word and get_bert_candidates
# are defined earlier in the notebook
import re
import numpy as np
from wordfreq import zipf_frequency

for input_text in list_texts:
    new_text = input_text
    # 1. detect complex words with the CWI model
    input_padded, index_list, len_list = process_input(input_text)
    pred_cwi = model_cwi.predict(input_padded)
    pred_cwi_binary = np.argmax(pred_cwi, axis=2)
    complete_cwi_predictions = complete_missing_word(pred_cwi_binary, index_list, len_list)
    # 2. generate replacement candidates with BERT's masked language model
    bert_candidates = get_bert_candidates(input_text, complete_cwi_predictions)
    # 3. rank candidates by Zipf frequency and replace the complex word with the top one
    for word_to_replace, l_candidates in bert_candidates:
        tuples_word_zipf = []
        for w in l_candidates:
            if w.isalpha():
                tuples_word_zipf.append((w, zipf_frequency(w, 'en')))
        tuples_word_zipf = sorted(tuples_word_zipf, key=lambda x: x[1], reverse=True)
        new_text = re.sub(word_to_replace, tuples_word_zipf[0][0], new_text)
    print("Original text: ", input_text)
    print("Simplified text:", new_text, "\n")
And finally some results:
For example, the first original sentence:
“The Risk That Students Could Arrive at School With the Coronavirus As schools grapple with how to reopen, new estimates show that large parts of the country would probably see infected students if classrooms opened now.”
And the simplified one, where we mark the replaced words in bold:
“The Risk That Students Could Arrive at School With the disease As schools deal with how to open, new numbers show that large parts of the country would maybe see infected students if they opened now.”
Let's see one more
The original:
“The research does not prove that infected children are contagious, but it should influence the debate about reopening schools, some experts said.”
The simplified:
“The work does not show that infected children are sick, but it should change the question about open schools, some experts said.”
Final Words
The technique for lexical simplification of sentences/documents in this article leverages BERT's masked language model. It focuses on the context of the complex word.
Experimental results show that this approach produces pretty good candidates for replacement, making sentences simpler.
If you want to improve it, you can try another method or model for CWI, or use another score beyond Zipf values for candidate ranking.
The complete code can be found in this Jupyter notebook, and you can browse more projects on my GitHub.
You can also find me on LinkedIn.
If you need some help with data science related projects: https://www.disruptio-analytics.com/
Translated from: https://medium.com/@armandj.olivares/how-to-use-bert-for-lexical-simplification-6edbf5a4d15e