Table of Contents
- 1. Introduction
- 2. Calculation Process
- 3. Result Screenshot
- 4. Core Code
- 5. GitHub Source Code for This Project
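1. Introduction
I have recently been studying text similarity algorithms in NLP. This article uses TF-IDF feature vectors and Simhash fingerprints to compute the similarity between Chinese texts.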
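As a rough illustration of what a TF-IDF feature vector is, the following minimal sketch weights each token by its frequency in a document and by how rare it is across all documents. The function name `tf_idf_vectors`, the sample questions, and the smoothed IDF formula are assumptions made for this example; they are not taken from the project's own feature builder.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {token: weight} dict per tokenized document.

    Assumed weighting: tf = count / len(doc),
    idf = log((1 + N) / (1 + df)) + 1 (a common smoothed variant).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each token appears in.
    df = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            token: (count / total) * (math.log((1 + n_docs) / (1 + df[token])) + 1)
            for token, count in counts.items()
        })
    return vectors


# Hypothetical, already-tokenized sample questions.
docs = [
    ["如何", "重置", "密碼"],
    ["如何", "修改", "密碼"],
    ["查詢", "訂單", "狀態"],
]
vectors = tf_idf_vectors(docs)
```

In this sample, "重置" occurs in only one question while "如何" occurs in two, so "重置" receives the larger weight in the first vector.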
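2. Calculation Process
- Prepare the test data
- Preprocess the data that was read
- Load the data into a Map
- Take the user's question as input
- Use TF feature vectors and Simhash fingerprints to score the question against each entry in the preprocessed configuration file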
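The last step relies on Simhash fingerprints. The sketch below shows one common way to build such a fingerprint from weighted tokens and to compare two fingerprints by Hamming distance. It is only an illustration: the names `simhash` and `hamming_distance`, the 64-bit width, and the use of MD5 as the token hash are assumptions, not the project's actual builder.

```python
import hashlib

BITS = 64  # assumed fingerprint width


def simhash(weighted_tokens):
    """Compute a 64-bit Simhash fingerprint from (token, weight) pairs."""
    totals = [0.0] * BITS
    for token, weight in weighted_tokens:
        # Stable 64-bit hash of the token; MD5 is used here only for determinism.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Add the weight where bit i of the hash is set, subtract it otherwise.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical weighted tokens for one question.
q1 = [("如何", 0.4), ("重置", 0.6), ("密碼", 0.6)]
fp1 = simhash(q1)
```

Two texts whose weighted tokens mostly overlap tend to produce fingerprints with a small Hamming distance, which is why the fingerprint can serve as a cheap near-duplicate check alongside the vector score.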
3. Result Screenshot
4. Core Code
# Body of the matching function. question, nodeId, nodeMap, re_test, jt, fb and smb
# are assumed to be defined elsewhere in the project.
try:
    text = re_test.run(question)  # look up matching data with a regular expression
    doc_token = jt.tokens(text)  # preprocess and tokenize
    doc_feat = fb.compute(doc_token)
    # The loader object exposes two attributes:
    #   fingerprint: the Simhash fingerprint value
    #   feat_vec: a list of tuples
    doc_fl = DocFeatLoader(smb, doc_feat)
    # Preprocessed configuration data
    contentFlListMap = nodeMap
    p_score_list = []
    if nodeId in contentFlListMap:
        nodeFlList = contentFlListMap[nodeId]
        print("nodeFlList", nodeFlList)
        for node in nodeFlList:
            # The key spelling "lableDataFeatureVector" is kept as in the loaded data.
            dist = cosine_distance_nonzero(node["lableDataFeatureVector"].feat_vec, doc_fl.feat_vec, norm=False)
            p_score_list.append({
                "score": dist,
                "labelData": node["labelData"],
                "targetNodeId": node["targetNodeId"],
                "conditionId": node["conditionId"],
            })
    # Sort in descending order, so the highest-scoring entries come first.
    p_score_list = sorted(p_score_list, key=lambda item: item["score"], reverse=True)
    print("Sorted:", p_score_list)
    # Keep at most Complete_MayBeL4Max of the best-scoring entries.
    Complete_MayBeL4Max = 3
    top = p_score_list[:Complete_MayBeL4Max]
    Complete_MayBeL4 = [el["labelData"] for el in top]
    Complete_MayBeL4Score = [el["score"] for el in top]
    Complete_MayBeL4ID = [el["conditionId"] for el in top]
    print("************************************")
    print("User question:", question)
    print("Similar questions (Max=%s): %s" % (Complete_MayBeL4Max, Complete_MayBeL4))
    print("Scores (Max=%s): %s" % (Complete_MayBeL4Max, Complete_MayBeL4Score))
    print("ID:", Complete_MayBeL4ID)
    return "", "", "", "", "", ""
except Exception as e:
    print("************************************")
    print("Error textSimilarity:", str(e))
    print("************************************")
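The scoring and top-3 selection above can be tried in isolation with the following minimal sketch. It assumes that each feature vector is a list of `(feature_id, weight)` tuples, and it uses a plain cosine similarity as a stand-in for the project's `cosine_distance_nonzero`, whose exact behavior is not shown in this article. The names `cosine_similarity_sparse` and `top_k` and the sample data are hypothetical.

```python
import heapq
import math


def cosine_similarity_sparse(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (feature_id, weight) lists."""
    a, b = dict(vec_a), dict(vec_b)
    # Only features present in both vectors contribute to the dot product.
    dot = sum(w * b[f] for f, w in a.items() if f in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(candidates, query_vec, k=3):
    """Return the k candidates most similar to query_vec as (score, label) pairs."""
    scored = [(cosine_similarity_sparse(vec, query_vec), label)
              for label, vec in candidates]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Hypothetical candidate questions with made-up feature vectors.
candidates = [
    ("reset password", [(0, 1.0), (1, 1.0)]),
    ("change password", [(1, 1.0), (2, 1.0)]),
    ("order status", [(3, 1.0), (4, 1.0)]),
]
query = [(0, 1.0), (1, 1.0)]
best = top_k(candidates, query, k=2)
```

Here `best` lists "reset password" first and "change password" second, because the query shares both of its features with the former and only one with the latter. `heapq.nlargest` avoids sorting the whole list when only a few top entries are needed.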