Brief introduction
theFuzz is a Python library for fuzzy string matching and similarity scoring. It is useful when processing text data for tasks such as spelling correction, string similarity calculation, and fuzzy search.
Basic principle
theFuzz uses several algorithms to score the similarity between strings. One of the most commonly used is the Levenshtein distance, which is the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn one string into the other. The smaller the distance relative to the lengths of the strings, the more similar the strings are.
In theFuzz, the fuzz.ratio() function compares two strings and returns an integer from 0 to 100, where 100 means the strings are identical. The fuzz.partial_ratio() function calculates a partial similarity: it compares the shorter string against the best-matching part of the longer one, so extra text around the match does not lower the score. In addition, fuzz.token_sort_ratio() splits both strings into words and sorts them before comparing, so word order is ignored, and fuzz.token_set_ratio() compares the sets of words, so repeated words are ignored as well.
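To make the edit-distance idea concrete, here is a minimal pure-Python sketch of the Levenshtein distance. The function names levenshtein and similarity_percent are hypothetical helpers written for this illustration; the percentage formula is one simple normalization and is an assumption, not necessarily the formula theFuzz applies internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> int:
    """Hypothetical normalization (an assumption for illustration):
    1 - distance / length of the longer string, as a 0-100 integer."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


print(levenshtein("kitten", "sitting"))         # 3
print(similarity_percent("kitten", "sitting"))  # 57
```

The dynamic-programming table is kept one row at a time, so memory use grows with the length of the second string only.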
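The word-based variants can be sketched with the standard library alone. In the sketch below, simple_ratio, token_sort_sketch, and token_set_sketch are hypothetical names, and difflib.SequenceMatcher is swapped in as a stand-in scorer. It illustrates the preprocessing idea (sort the words, or compare word sets) and is not the library's actual implementation.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Stand-in 0-100 score based on difflib (an assumption, not theFuzz itself)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_sketch(a: str, b: str) -> int:
    """Sort the words of each string, then compare the re-joined strings."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return simple_ratio(sorted_a, sorted_b)


def token_set_sketch(a: str, b: str) -> int:
    """Compare the shared words against each string's full word set."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = (common + " " + only_a).strip()
    combined_b = (common + " " + only_b).strip()
    # Take the best of the three pairwise comparisons.
    return max(
        simple_ratio(common, combined_a),
        simple_ratio(common, combined_b),
        simple_ratio(combined_a, combined_b),
    )


print(simple_ratio("hello world", "world hello"))            # 45
print(token_sort_sketch("hello world", "world hello"))       # 100
print(token_set_sketch("hello world", "hello hello world"))  # 100
```

Sorting removes the effect of word order, and converting to sets additionally removes the effect of repeated words, which is why the last two scores reach 100 while the plain comparison does not.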
Installation
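theFuzz is the renamed successor of fuzzywuzzy and keeps the same fuzz API. The examples in this article import from the older fuzzywuzzy package, so the command below installs that package together with python-Levenshtein, which speeds up the calculations. To use the current package instead, run pip install thefuzz and replace fuzzywuzzy with thefuzz in the import lines. Open a terminal (command prompt) and run the following command: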
pip install fuzzywuzzy python-Levenshtein
Basic use
Once the package is installed, here are a few code examples, starting with fuzz.ratio():
from fuzzywuzzy import fuzz
str1 = "hello world"
str2 = "hello python"
similarity = fuzz.ratio(str1, str2)
print(similarity)
Running this code prints 61, which means that str1 and str2 receive a similarity score of 61 out of 100.
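The value 61 can be reproduced by hand. The sketch below assumes the score follows the formula documented for difflib.SequenceMatcher.ratio(), which is 2 * M / T, where M is the number of matched characters and T is the combined length of both strings. Whether the installed backend uses exactly this formula is an assumption; for this pair of strings it gives the same number.

```python
from difflib import SequenceMatcher

str1 = "hello world"
str2 = "hello python"

matcher = SequenceMatcher(None, str1, str2)

# Matched characters: "hello " (6 characters) plus the single "o" (1 character).
matched = sum(block.size for block in matcher.get_matching_blocks())
total = len(str1) + len(str2)

print(matched, total)                    # 7 23
print(round(100 * 2 * matched / total))  # 61
```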
from fuzzywuzzy import fuzz
str1 = "hello world"
str2 = "world hello"
partial_similarity = fuzz.partial_ratio(str1, str2)
print(partial_similarity)
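Running the above code prints 45, which means that the partial similarity score of the two strings is 45 out of 100. Because both strings have the same length, there is no shorter string to align against part of a longer one, so the swapped word order still lowers the score. The exact value may differ depending on the installed version and matching backend.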
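The equal-length case can be seen in a sliding-window sketch of partial matching. The function partial_ratio_sketch is a hypothetical name, and difflib.SequenceMatcher is used as a stand-in scorer. The sketch shows the general idea of comparing the shorter string against each same-length window of the longer one and is not the library's actual implementation.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best 0-100 score of the shorter string against any same-length
    window of the longer string (illustrative stand-in only)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    best = 0.0
    # Slide a window of len(shorter) characters across the longer string.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


# The shorter string appears in full inside the longer one.
print(partial_ratio_sketch("world", "hello world"))        # 100

# Equal lengths: the only window is the whole string, so nothing is skipped.
print(partial_ratio_sketch("hello world", "world hello"))  # 45
```

Partial matching is most useful when one string is expected to appear inside a longer one, such as a product name inside a full description.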
from fuzzywuzzy import fuzz
str1 = "hello world"
str2 = "world hello"
token_sort_similarity = fuzz.token_sort_ratio(str1, str2)
token_set_similarity = fuzz.token_set_ratio(str1, str2)
print(token_sort_similarity)
print(token_set_similarity)
Running the above code prints 100 and 100, indicating that both the word sort similarity and the word set similarity of the two strings are 100: the strings contain the same words, only in a different order.
References
https://github.com/seatgeek/thefuzz
https://github.com/seatgeek/fuzzywuzzy