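Introduction

theFuzz is an open-source Python library for fuzzy string matching and similarity scoring. It is useful when processing text data for tasks such as spelling correction, string similarity calculation, and fuzzy search.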

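How It Works

theFuzz computes the similarity between strings using several algorithms. The most commonly used is Levenshtein distance, which is the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn one string into the other. The smaller the distance relative to the lengths of the strings, the more similar the strings are.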
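To make the edit-distance idea concrete, here is a minimal sketch of the classic dynamic-programming calculation. The function name levenshtein is chosen for illustration and is not part of theFuzz; the library relies on optimized implementations rather than code like this.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] is the distance between the processed prefix of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i characters of a into an empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Turning "kitten" into "sitting" takes two substitutions and one insertion, hence a distance of 3.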

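In theFuzz, fuzz.ratio() computes the similarity of two strings and returns an integer between 0 and 100, read as a percentage. fuzz.partial_ratio() computes partial similarity: it scores the best-matching portion of the two strings and disregards the parts that do not match, which helps when one string is contained in a longer one. fuzz.token_sort_ratio() and fuzz.token_set_ratio() compare strings word by word: the first compares the words after sorting them, so word order does not matter, and the second compares the sets of words the two strings contain.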

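Installation

theFuzz can be installed with pip. Open a terminal (or command prompt) and run the command below. It installs the package under its original name, fuzzywuzzy, which is the name the examples in this article import, together with python-Levenshtein, an optional extension that speeds up the calculations. The renamed package is published as thefuzz (pip install thefuzz) and is imported with from thefuzz import fuzz.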

pip install fuzzywuzzy python-Levenshtein

Basic Usage

With the library installed, let's look at a few code examples.

from fuzzywuzzy import fuzz

str1 = "hello world"
str2 = "hello python"

# Overall similarity of the two strings, as an integer from 0 to 100
similarity = fuzz.ratio(str1, str2)
print(similarity)

The result is 61, meaning str1 and str2 are 61% similar.
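The score of 61 can be checked by hand. The two strings share 7 matching characters ("hello " plus the letter "o") out of 23 characters in total, and 2 × 7 / 23 ≈ 0.609. The sketch below reproduces that arithmetic with difflib.SequenceMatcher from the standard library. The helper simple_ratio is hypothetical and only approximates fuzz.ratio(); it is not the library's own code.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Hypothetical stand-in for fuzz.ratio(): 2*M/T as a rounded percentage.

    M is the number of matched characters, T the combined length of both strings.
    """
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


print(simple_ratio("hello world", "hello python"))  # 61
```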


from fuzzywuzzy import fuzz

str1 = "hello world"
str2 = "world hello"

# Similarity based on the best-matching portion of the two strings
partial_similarity = fuzz.partial_ratio(str1, str2)
print(partial_similarity)

Running the code above prints 45, meaning the partial similarity of the two strings is 45%.
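One way to picture partial matching is as a sliding window: the shorter string is compared against every equally long slice of the longer string, and the best score wins. The sketch below follows that idea using difflib from the standard library. The name sliding_partial_ratio is hypothetical, and this is a simplified model under that assumption, not the library's actual algorithm, so scores can differ from fuzz.partial_ratio() on other inputs.

```python
from difflib import SequenceMatcher


def sliding_partial_ratio(s1: str, s2: str) -> int:
    """Hypothetical sketch: best ratio of the shorter string against any
    same-length window of the longer string, as a rounded percentage."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


print(sliding_partial_ratio("world", "hello world"))        # 100
print(sliding_partial_ratio("hello world", "world hello"))  # 45
```

In this sketch, two strings of equal length leave only one window, so the score reduces to the plain ratio. That is why swapping the word order gives a low score here even though both strings contain the same words.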


from fuzzywuzzy import fuzz

str1 = "hello world"
str2 = "world hello"

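# Compare the words after sorting them, so word order is ignored
token_sort_similarity = fuzz.token_sort_ratio(str1, str2)
# Compare the sets of words the two strings contain
token_set_similarity = fuzz.token_set_ratio(str1, str2)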

print(token_sort_similarity)
print(token_set_similarity)

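Running the code above prints 100 and 100: both the token sort similarity and the token set similarity are 100%, because the two strings contain exactly the same words and differ only in word order.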
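The two token-based scores can be sketched in a few lines of standard-library Python. The helpers simple_ratio, token_sort_sketch, and token_set_sketch are hypothetical names used only for illustration. The sketch assumes whitespace tokenization and lower-casing, and leaves out the extra text cleanup the library performs, so treat it as an approximation rather than the library's implementation.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Hypothetical stand-in for fuzz.ratio(), as a rounded percentage."""
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))


def token_sort_sketch(s1: str, s2: str) -> int:
    """Sort the words of each string, then compare the re-joined strings."""
    a = " ".join(sorted(s1.lower().split()))
    b = " ".join(sorted(s2.lower().split()))
    return simple_ratio(a, b)


def token_set_sketch(s1: str, s2: str) -> int:
    """Compare the shared words against each string's shared-plus-unique words."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    common = " ".join(sorted(t1 & t2))
    only1 = " ".join(sorted(t1 - t2))
    only2 = " ".join(sorted(t2 - t1))
    combined1 = (common + " " + only1).strip()
    combined2 = (common + " " + only2).strip()
    # Take the best of the three pairwise comparisons
    return max(
        simple_ratio(common, combined1),
        simple_ratio(common, combined2),
        simple_ratio(combined1, combined2),
    )


print(token_sort_sketch("hello world", "world hello"))  # 100
print(token_set_sketch("hello world", "world hello"))   # 100
```

The set-based variant also ignores repeated words: token_set_sketch("hello world", "hello hello world") still gives 100, while the sort-based variant scores lower because the sorted strings differ in length.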


References

https://github.com/seatgeek/thefuzz
https://github.com/seatgeek/fuzzywuzzy
