單詞拼寫糾錯

原文連結: http://chenhao.space/post/409250ae.html

所需資料集:

spell-errors.txt

testdata.txt

vocab.txt

詞典庫

# 詞典庫
vocab = set([line.rstrip() for line in open('vocab.txt')]) # 用set效率高一些(時間複雜度)

需要生成所有候選集合

# 需要生成所有候選集合
def generate_candidates(word):
    '''
    word: 給定的輸入(錯誤的輸入)
    傳回所有(valid)候選集合
    '''
    # 生成編輯距離為1的單詞
    # 1.insert 2.delete 3.replace
    # 假設使用26個字元
    letters = 'abcdefghijklmnopqrstuvwxyz'
    
    splits = [(word[:i], word[i:]) for i in range(len(word)+1)]
    # insert 操作
    inserts = [L+c+R for L, R in splits for c in letters] # insert操作後的結果
    # delete 操作
    deletes = [L+R[1:] for L, R in splits if R]   # if R 判斷R不為空
    # replace
    replaces = [L+c+R[1:] for L, R in splits if R for c in letters] # 把c替換為R中删掉的字元
    
    candidates = set(inserts+deletes+replaces)
    
    # 過濾掉不存在于詞典庫裡面的單詞
    return [word for word in candidates if word in vocab]

讀取語料庫

from nltk.corpus import reuters

# 讀取語料庫
categories = reuters.categories()   # 路透社語料庫的類别
corpus = reuters.sents(categories=categories) # sents()指定分類中的句子

建構語言模型

# 建構語言模型：bigram
term_count = {}
bigram_count = {}
for doc in corpus:
    doc = ['<s>'] + doc   # '<s>'表示開頭
    for i in range(0, len(doc)-1):
        # bigram: [i,i+1]
        term = doc[i]          # term是doc中第i個單詞
        bigram = doc[i:i+2]    # bigram為第i,i+1個單詞組成的
        
        if term in term_count:
            term_count[term] += 1   # 如果term存在term_count中，則加1
        else:
            term_count[term] = 1    # 如果不存在，則添加，置為1
        
        bigram = ' '.join(bigram)
        if bigram in bigram_count:
            bigram_count[bigram] += 1
        else:
            bigram_count[bigram] = 1

使用者打錯的機率

# 使用者打錯的機率統計 - channel probability
channel_prob = {}

for line in open('spell-errors.txt'):
    items = line.split(":")  # 按":"分割成兩個子項(在items清單中)
    correct = items[0].strip()
    mistakes = [item.strip() for item in items[1].strip().split(",")]
    channel_prob[correct] = {}
    for mis in mistakes:
        channel_prob[correct][mis] = 1.0/len(mistakes)

糾錯

import numpy as np

V = len(term_count.keys())

file = open("testdata.txt", 'r')
for line in file:
    items = line.rstrip().split('\t')   # 句子存放在items[2]中
    line = items[2].split()    # 得到句子中所有單詞
    for word in line:
        if word not in vocab:   # 若單詞不存在詞典中，則是錯誤的單詞
            # 需要替換word成正确的單詞
            # Step1: 生成所有的(valid)候選集合
            candidates = generate_candidates(word)
            
            # 若candidates為空
            # 一種方式：多生成幾個candidates，比如編輯距離不大于2
            # TODO: 根據條件生成更多的候選集合
            if len(candidates) < 1:
                continue   # 不建議這麼做
                
            probs = []
            
            # 對于每一個candidate，計算它的score
            # score = p(correct)*p(mistake|correct)
            #       = log p(correct) + log p(mistake|correct)
            # 傳回 score 最大的 candidate
            for candi in candidates:
                prob = 0
                # a. 計算 channel probability
                if candi in channel_prob and word in channel_prob[candi]:
                    prob += np.log(channel_prob[candi][word])
                else:   # word not in channel_prob[candi]
                    prob += np.log(0.0001)
            
                # b. 計算語言模型的機率
                idx = items[2].index(word)+1   # +1是因為每句前面加了<s>
                if items[2][idx-1] in bigram_count and condi in bigram_count[items[2][idx-1]]:
                    prob += np.log((bigram_count[items[2][idx-1]][candi] + 1.0) / (term_count[bigram_count[items[2][idx - 1]]] + V))
                # TODO: 也要考慮目前[word，post_word]，上面隻考慮了[pre_word, word]
                # prob += np.log(bigram機率)
                else:
                    prob += np.log(1.0 / V)
                    
                probs.append(prob)
            max_idx = probs.index(max(probs))
            print(word, candidates[max_idx])

結果

protectionst protectionist
products. products
long-run, long-run
gain. gain
17, 17
retaiation retaliation
cost. coste
busines, business
ltMC.T. ltMC.T
U.S., U.S.
Murtha, Murtha
worried. worried
seriousnyss seriousness
aganst against
us, us
named. named
year, year
sewll sell
dlrs, dlrs
world's worlds
largest. largest
markets, markets
importsi imports
Products, Products
Retaliation, Retaliation
Group. Group
Korea's Koreans
Korea, Korea
Japan. Japan
Koreva Korea
U.S., U.S.
1985. 1985
Malaysia, Malaysian
......

附: 語料庫基本函數表

示例	描述
fileids()	語料庫中的檔案
fileids([categories])	這些分類對應的語料庫中的檔案
categories()	語料庫中的分類
categories([fileids])	這些檔案對應的語料庫中的分類
raw()	語料庫的原始内容
raw(fileids=[f1,f2,f3])	指定檔案的原始内容
raw(categories=[c1,c2])	指定分類的原始内容
words()	整個語料庫中的詞彙
words(fileids=[f1,f2,f3])	指定檔案中的詞彙
words(categories=[c1,c2])	指定分類中的詞彙
sents()	指定分類中的句子
sents(fileids=[f1,f2,f3])	指定檔案中的句子
sents(categories=[c1,c2])	指定分類中的句子
abspath(fileid)	指定檔案在磁盤上的位置
encoding(fileid)	檔案的編碼（如果知道的話）
open(fileid)	打開指定語料庫檔案的檔案流
root()	到本地安裝的語料庫根目錄的路徑

單詞拼寫糾錯

詞典庫

需要生成所有候選集合

讀取語料庫

建構語言模型

使用者打錯的機率

糾錯

結果

附: 語料庫基本函數表

繼續閱讀

文本分類之 residual-connection+selfAttention的詞向量平均模型

新聞文本分類-06 基于Bert的文本分類

seq2seq模型 + Attention機制

elasticlunr.js 最新版本v0.6.7釋出啦應用示例為什麼你需要elasticlunr.js?

RNN/LSTM學習資料總結

使用中文維基百科進行GloVe實驗

從詞向量衡量标準到全局向量的詞嵌入模型GloVe再到一詞多義的解決方式衡量标準Evaluation引子全局向量的詞嵌入應用對一詞多義的思考Reference

NLP︱進階詞向量表達（一）——GloVe（理論、相關測評結果、R&python實作、相關應用）一、理論簡述二、測評三、Glove實作&R&python四、相關應用

GloVe與word2vec的差別，及GloVe的缺陷

更别緻的詞向量模型(一)：simpler glove

glove_python安裝（避免編譯錯誤）

python 分析qq聊天記錄

[一起學BERT]（一）：BERT模型的原理基礎Self-Attention機制理論Multi-head Self-Attention注意力機制位置編碼Transformer理論BERT理論

ELMO BERT GPT

BERT、Elmo、GPT一、發展曆史二、bert三、ERNIE四、GPT—transformer的decoder

人工智能如何有效地運用于自然語言處理