An accurate sentence similarity coefficient: the Jaccard coefficient (with Python code)

2023-04-10 14:55:33

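1. The Jaccard coefficient of two sentences is defined as the size of their intersection divided by the size of their union. Two implementations are common online, but one of them is actually not very accurate.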
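Written out, with $A$ and $B$ the sets of characters in the two sentences, the definition is the first formula below. The second formula is one common count-based generalization that keeps repeated characters; the symbols $a_c$ and $b_c$ are notation introduced here (not in the original post) for the number of times character $c$ occurs in each sentence.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\qquad
J_{\text{count}}(a, b) = \frac{\sum_c \min(a_c, b_c)}{\sum_c \max(a_c, b_c)}
```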

Method 1: uses sklearn's CountVectorizer class together with numpy.

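import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def add_space(s):
    # Convert float input to a string so it can be split into characters
    if isinstance(s, float):
        s = str(s)
    # Join the characters with spaces so that each one becomes its own token
    return ' '.join(list(s))


def jaccard_similarity(s1, s2):
    print(s1, s2)
    # Insert a space between the characters
    s1, s2 = add_space(s1), add_space(s2)
    # Convert to a term-frequency (TF) matrix, one row per sentence
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    print(vectors)
    # Intersection: sum of the per-character minimum counts
    numerator = np.sum(np.min(vectors, axis=0))
    # Union: sum of the per-character maximum counts
    denominator = np.sum(np.max(vectors, axis=0))
    # Compute the Jaccard coefficient
    return 1.0 * numerator / denominator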
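The same min/max counting can be sketched without sklearn. The version below is only an illustration, not the article's code: the name `multiset_jaccard` is hypothetical, and it relies on `collections.Counter`, whose `&` and `|` operators keep the per-element minimum and maximum counts. Unlike `CountVectorizer`, it does not lowercase the input or skip whitespace, so results can differ on strings where that matters.

```python
from collections import Counter


def multiset_jaccard(s1, s2):
    """Character-level Jaccard coefficient that keeps duplicate characters."""
    c1, c2 = Counter(s1), Counter(s2)
    # Counter & Counter keeps the minimum count of each character
    intersection = sum((c1 & c2).values())
    # Counter | Counter keeps the maximum count of each character
    union = sum((c1 | c2).values())
    # Assumption: two empty strings are treated as having similarity 0.0
    return intersection / union if union else 0.0


print(multiset_jaccard("app怎麼綁定銀行卡", "app哪裡綁定銀行卡"))
```

For the example pair used in this article the ratio is 8/12, the same value the CountVectorizer version produces.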

Method 2: relies mainly on set operations, taking the union and intersection of the characters.

def jaccard_sim(a, b):
    # Converting to sets drops any repeated characters
    set_a, set_b = set(a), set(b)
    union = set_a.union(set_b)
    intersection = set_a.intersection(set_b)
    # Show the union and intersection together with their sizes
    print(union, len(union))
    print(intersection, len(intersection))
    return len(intersection) / len(union)

a, b = "app怎麼綁定銀行卡", "app哪裡綁定銀行卡"

For these two sentences, the first method gives a Jaccard similarity of 0.6666666 (8/12), and the second gives 0.63636 (7/11).

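The first is the accurate one, because characters that are repeated within a sentence should all be counted. The second simply discards them when each string is converted to a set. Here the repeated "p" in "app" is counted twice by the first method and only once by the second.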
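To see where the two numbers come from, the counts can be tallied by hand. This is a small sketch written for this explanation, not part of the original code; the variable names are made up, and it uses `collections.Counter` instead of `CountVectorizer`.

```python
from collections import Counter

a, b = "app怎麼綁定銀行卡", "app哪裡綁定銀行卡"
ca, cb = Counter(a), Counter(b)
chars = set(a) | set(b)  # every distinct character in either sentence

# Count-based version: the repeated "p" contributes 2 to both sums
count_inter = sum(min(ca[ch], cb[ch]) for ch in chars)
count_union = sum(max(ca[ch], cb[ch]) for ch in chars)

# Set-based version: every distinct character contributes at most 1
set_inter = len(set(a) & set(b))
set_union = len(chars)

print(count_inter, count_union, count_inter / count_union)  # 8 and 12
print(set_inter, set_union, set_inter / set_union)          # 7 and 11
```

The only difference between the two tallies is the second "p": it adds one to both the numerator and the denominator of the count-based ratio.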
