前言

平時的編碼中，我們經常需要判斷兩個文本的相似性，不管是用來做文本糾錯或者去重等等，那麼我們應該以什麼次元來判斷相似性呢？這些算法又怎麼實作呢？這篇文章對常見的計算方式做一個記錄。

jaccard相似度_如何計算兩個字元串之間的文本相似度？前言Jaccard 相似度Sorensen Dice 相似度系數Levenshtein漢明距離

Jaccard 相似度

首先是 Jaccard 相似度系數，下面是它在維基百科上的一個定義及計算公式。

The Jaccard index, also known as Intersection over Union and the Jaccard similarity coefficient (originally given the French name coefficient de communauté by Paul Jaccard), is a statistic used for gauging the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

jaccard相似度_如何計算兩個字元串之間的文本相似度？前言Jaccard 相似度Sorensen Dice 相似度系數Levenshtein漢明距離

其實總結就是一句話：集合的交集與集合的并集的比例.

java 代碼實作如下：

public static float jaccard(String a, String b) { if (a == null && b == null) { return 1f; } // 都為空相似度為 1 if (a == null || b == null) { return 0f; } Set aChar = a.chars().boxed().collect(Collectors.toSet()); Set bChar = b.chars().boxed().collect(Collectors.toSet()); // 交集數量 int intersection = SetUtils.intersection(aChar, bChar).size(); if (intersection == 0) return 0; // 并集數量 int union = SetUtils.union(aChar, bChar).size(); return ((float) intersection) / (float)union; }

Sorensen Dice 相似度系數

與 Jaccard 類似，Dice 系數也是一種計算簡單集合之間相似度的一種計算方式。與 Jaccard 不同的是，計算方式略有不同。下面是它的定義。

The Sørensen–Dice coefficient (see below for other names) is a statistic used to gauge the similarity of two samples. It was independently developed by the botanists Thorvald Sørensen[1] and Lee Raymond Dice,[2] who published in 1948 and 1945 respectively.

jaccard相似度_如何計算兩個字元串之間的文本相似度？前言Jaccard 相似度Sorensen Dice 相似度系數Levenshtein漢明距離

需要注意的是，他是：集合交集的 2 倍除以兩個集合相加。并不是并集.

java 代碼實作如下：

public static float SorensenDice(String a, String b) { if (a == null && b == null) { return 1f; } if (a == null || b == null) { return 0F; } Set aChars = a.chars().boxed().collect(Collectors.toSet()); Set bChars = b.chars().boxed().collect(Collectors.toSet()); // 求交集數量 int intersect = SetUtils.intersection(aChars, bChars).size(); if (intersect == 0) { return 0F; } // 全集，兩個集合直接加起來 int aSize = aChars.size(); int bSize = bChars.size(); return (2 * (float) intersect) / ((float) (aSize + bSize)); }

Levenshtein

萊文斯坦距離，又稱 Levenshtein 距離，是編輯距離的一種。指兩個字串之間，由一個轉成另一個所需的最少編輯操作次數。

簡單的說，就是用編輯距離表示字元串相似度, 編輯距離越小，字元串越相似。

java 實作代碼如下：

public static float Levenshtein(String a, String b) { if (a == null && b == null) { return 1f; } if (a == null || b == null) { return 0F; } int editDistance = editDis(a, b); return 1 - ((float) editDistance / Math.max(a.length(), b.length())); } private static int editDis(String a, String b) { int aLen = a.length(); int bLen = b.length(); if (aLen == 0) return aLen; if (bLen == 0) return bLen; int[][] v = new int[aLen + 1][bLen + 1]; for (int i = 0; i <= aLen; ++i) { for (int j = 0; j <= bLen; ++j) { if (i == 0) { v[i][j] = j; } else if (j == 0) { v[i][j] = i; } else if (a.charAt(i - 1) == b.charAt(j - 1)) { v[i][j] = v[i - 1][j - 1]; } else { v[i][j] = 1 + Math.min(v[i - 1][j - 1], Math.min(v[i][j - 1], v[i - 1][j])); } } } return v[aLen][bLen]; }

代碼中的編輯距離求解使用了經典的動态規劃求解法。

我們使用了** 1 - ( 編輯距離 / 兩個字元串的最大長度) ** 來表示相似度，這樣可以得到符合我們語義的相似度。

漢明距離

漢明距離是編輯距離中的一個特殊情況，僅用來計算兩個等長字元串中不一緻的字元個數。

是以漢明距離不用考慮添加及删除，隻需要對比不同即可，是以實作比較簡單。

我們可以用similarity=漢明距離/長度來表示兩個字元串的相似度。

java 代碼如下：

public static float hamming(String a, String b) { if (a == null || b == null) { return 0f; } if (a.length() != b.length()) { return 0f; } int disCount = 0; for (int i = 0; i < a.length(); i++) { if (a.charAt(i) != b.charAt(i)) { disCount++; } } return (float) disCount / (float) a.length(); }

下面是測試用例：

Assert.assertEquals(0.0f, StringSimilarity.hamming("java 開發

jaccard相似度_如何計算兩個字元串之間的文本相似度？前言Jaccard 相似度Sorensen Dice 相似度系數Levenshtein漢明距離

前言

Jaccard 相似度

Sorensen Dice 相似度系數

Levenshtein

漢明距離

繼續閱讀

jaccard相似度_相似度聚類

jaccard相似度_相似度距離度量公式

2000坐标系轉經緯度工具_經緯度也分54和80、2000嗎？

jaccard相似度_word,sentence相似度計算，WMD&amp;WRD

jaccard相似度_自然語言處理之文本相似度計算

c字元串截取一部分字元串_【Go】高效截取字元串的一些思考

c字元串截取一部分字元串_Shell處理字元串常用技巧截取子串簡單案例

.net mvc actionresult 傳回字元串_mysql 字元串定位、字元串截取

java 根據圓心計算圓弧上點的經緯度_【控制測量學】-高斯投影正算公式以及java代碼