[NLP Study Notes] (3) Using gensim: Similarity Queries

2018-12-11 23:50:00

Similarity Queries

This article is mainly translated from https://radimrehurek.com/gensim/tut3.html

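In the previous tutorials, "Corpora and Vector Spaces" and "Topics and Transformations", we learned how to represent a corpus in the vector space model and how to transform it between different vector spaces. In practice, one of the most common reasons for doing this is to compare the similarity between two documents, or between one particular document and a set of other documents (for example, a user query against documents that have already been indexed).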

Loading the dictionary and corpus

As in the previous chapter, first load the dictionary and corpus saved in chapter one.

from gensim import corpora, models, similarities
import os
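if os.path.exists('./gensim_out/deerwester.dict'):
    dictionary = corpora.Dictionary.load('./gensim_out/deerwester.dict')
    corpus = corpora.MmCorpus('./gensim_out/deerwester.mm')
    print("Using the previously saved dictionary and corpus vectors")
else:
    # Without these files, `dictionary` and `corpus` stay undefined and the steps below will fail.
    print("Please generate deerwester.dict and deerwester.mm by following chapter one first")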

Step 1

Define an LSI model and convert the corpus into an index

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

index = similarities.MatrixSimilarity(lsi[corpus])
index.save('./gensim_out/deerwester.index')  # save the built index to disk
index = similarities.MatrixSimilarity.load('./gensim_out/deerwester.index')  # load the index back from the saved file

Step 2

Suppose we want to query with the new text 'human computer interaction'. We expect to find the three documents most similar to it.

doc = 'human computer interaction'
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
print(vec_lsi)
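To make the bag-of-words step above more concrete, here is a minimal pure-Python sketch of what a doc2bow-style conversion does: it maps each known token to an id, counts occurrences, and skips tokens that are not in the vocabulary. The `token2id` mapping and the helper name `doc2bow_sketch` are hypothetical and only for illustration; the real mapping lives in the loaded `dictionary`.

```python
from collections import Counter

# Hypothetical token-to-id mapping for illustration only;
# the real mapping comes from the saved deerwester.dict.
token2id = {"computer": 0, "human": 1, "interface": 2}


def doc2bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


doc = "human computer interaction"
vec_bow = doc2bow_sketch(doc.lower().split(), token2id)
print(vec_bow)  # "interaction" is not in this toy vocabulary, so it is dropped
```

In this toy vocabulary the word "interaction" is unknown, so only "human" and "computer" contribute to the vector.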

Step 3

Compare the new text vec_lsi against the documents in the corpus

sims = index[vec_lsi]
print(list(enumerate(sims)))  # print (document_number, document_similarity) 2-tuples

The output is:

[(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945), (5, -0.12416792), (6, -0.1063926), (7, -0.098794639), (8, 0.05004178)]

(0, 0.99809301) means that document 0 has a similarity of 0.99809301 to the new document.
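The similarity scores fall between -1 and 1 because MatrixSimilarity measures cosine similarity. As a rough illustration, the following pure-Python sketch computes the cosine of the angle between two vectors. The function name `cosine_similarity` and the 2-dimensional example vectors are made up for this sketch and are not taken from the LSI model above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


# Made-up 2-dimensional vectors (two dimensions, like num_topics=2)
query = [0.5, 0.25]
same_direction = [1.0, 0.5]
opposite_direction = [-1.0, -0.5]

print(cosine_similarity(query, same_direction))      # close to 1
print(cosine_similarity(query, opposite_direction))  # close to -1
```

Vectors pointing in the same direction score near 1 regardless of their length, which is why a short query can still match a longer document closely.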

Sort the results above by similarity in descending order

sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)

Result:

[(2, 0.99844527), # The EPS user interface management system
(0, 0.99809301), # Human machine interface for lab abc computer applications
(3, 0.9865886), # System and human system engineering testing of EPS
(1, 0.93748635), # A survey of user opinion of computer system response time
(4, 0.90755945), # Relation of user perceived response time to error measurement
(8, 0.050041795), # Graph minors A survey
(7, -0.098794639), # Graph minors IV Widths of trees and well quasi ordering
(6, -0.1063926), # The intersection graph of paths in trees
(5, -0.12416792)] # The generation of random binary unordered trees

As the results show, the three documents most similar to the query "human computer interaction" are documents 2, 0, and 3.
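When only the top few matches are needed, a full sort is not strictly necessary. The sketch below uses the similarity values printed in Step 3 and picks the three best matches with `heapq.nlargest` from the standard library. This is an alternative to the `sorted` call above, not something the original tutorial uses.

```python
import heapq

# (document_number, document_similarity) pairs copied from the Step 3 output
sims = [(0, 0.99809301), (1, 0.93748635), (2, 0.99844527),
        (3, 0.9865886), (4, 0.90755945), (5, -0.12416792),
        (6, -0.1063926), (7, -0.098794639), (8, 0.05004178)]

# Select the three pairs with the highest similarity score
top3 = heapq.nlargest(3, sims, key=lambda item: item[1])
print(top3)  # [(2, 0.99844527), (0, 0.99809301), (3, 0.9865886)]
```

For a corpus of nine documents the difference is negligible, but on a large index selecting the top N avoids ordering every result.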
