17 text similarity algorithms and GIN indexes in PostgreSQL - pg_similarity

postgresql , text similarity , pg_similarity , pg_trgm , rum , fuzzystrmatch , gin , smlar

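Text similarity algorithms, combined with GIN, PostgreSQL's open index framework, make efficient text retrieval possible under a variety of similarity measures.

Common text similarity search extensions for PostgreSQL include rum, pg_trgm, fuzzystrmatch, pg_similarity, and smlar.

Of these, pg_similarity supports 17 algorithms.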

Introduction

pg_similarity is an extension that supports similarity queries on PostgreSQL.

The implementation is tightly integrated into the RDBMS in the sense that it defines operators, so instead of the traditional operators (= and <>) you can use ~~~ and ~!~ (each of these operators represents a similarity function).

pg_similarity has three main components:

Functions: a set of functions that implement similarity algorithms available in the literature. These functions can be used as UDFs and are the basis for the similarity operators.

Operators: a set of operators defined on top of the similarity functions. Each operator calls its similarity function to obtain a similarity value and compares that value with a user-defined threshold to decide whether the pair is a match.

Session variables: a set of variables that store similarity function parameters. These variables can be set at run time.
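To make the relationship between the three components concrete, here is a minimal Python sketch. It is not pg_similarity's C implementation. It assumes a normalized Levenshtein score of 1 - distance / max(length), which is one common convention and may differ from what the extension computes. The names `Settings`, `lev_similarity`, and `lev_matches` are hypothetical.

```python
from dataclasses import dataclass


def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


@dataclass
class Settings:
    """Hypothetical stand-in for a session variable holding the threshold."""
    threshold: float = 0.7


def lev_similarity(a: str, b: str) -> float:
    """The 'function' component: a score in [0, 1], 1.0 meaning identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def lev_matches(a: str, b: str, settings: Settings) -> bool:
    """The 'operator' component: compare the score with the threshold."""
    return lev_similarity(a, b) >= settings.threshold
```

With the default threshold of 0.7, `lev_matches("euler", "heuler", Settings())` is true because the score is 1 - 1/6, roughly 0.83. Raising the threshold to 0.9 turns the same pair into a non-match without touching the function itself.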

The supported algorithms are:

L1 distance (also known as city block or Manhattan distance);

Cosine distance;

Dice coefficient;

Euclidean distance;

Hamming distance;

Jaccard coefficient;

Jaro distance;

Jaro-Winkler distance;

Levenshtein distance;

Matching coefficient;

Monge-Elkan coefficient;

Needleman-Wunsch coefficient;

Overlap coefficient;

Q-gram distance;

Smith-Waterman coefficient;

Smith-Waterman-Gotoh coefficient;

Soundex distance.
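Several of these measures (for example Jaccard, Dice, and overlap) compare sets of tokens instead of raw character sequences. The Python sketch below illustrates the textbook formulas on character trigrams. The tokenizer used here (lowercased, unpadded trigrams) is an assumption made for illustration. pg_similarity has its own tokenizer settings, so its scores may differ from the ones this sketch produces.

```python
from __future__ import annotations


def qgrams(text: str, q: int = 3) -> set[str]:
    """Overlapping q-character tokens, lowercased, without padding."""
    s = text.lower()
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """|A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a: set[str], b: set[str]) -> float:
    """2 * |A & B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def overlap(a: set[str], b: set[str]) -> float:
    """|A & B| / min(|A|, |B|)."""
    smallest = min(len(a), len(b))
    if smallest == 0:
        return 1.0 if not a and not b else 0.0
    return len(a & b) / smallest


x = qgrams("postgres")    # 6 distinct trigrams
y = qgrams("postgresql")  # 8 distinct trigrams, 6 of them shared with x
print(jaccard(x, y), dice(x, y), overlap(x, y))
```

For this pair the three scores are 6/8, 12/14, and 6/6. The overlap coefficient reaches 1.0 because every trigram of the shorter string also occurs in the longer one, which is why the choice of measure matters when one string is a prefix of the other.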

Usage is as follows:

| Algorithm | Function | Operator | Use index? | Parameters |
| --- | --- | --- | --- | --- |
| L1 distance | block(text, text) returns float8 | `~++` | yes | pg_similarity.block_tokenizer (enum), pg_similarity.block_threshold (float8), pg_similarity.block_is_normalized (bool) |
| Cosine distance | cosine(text, text) returns float8 | `~##` | yes | pg_similarity.cosine_tokenizer (enum), pg_similarity.cosine_threshold (float8), pg_similarity.cosine_is_normalized (bool) |
| Dice coefficient | dice(text, text) returns float8 | `~-~` | yes | pg_similarity.dice_tokenizer (enum), pg_similarity.dice_threshold (float8), pg_similarity.dice_is_normalized (bool) |
| Euclidean distance | euclidean(text, text) returns float8 | `~!!` | yes | pg_similarity.euclidean_tokenizer (enum), pg_similarity.euclidean_threshold (float8), pg_similarity.euclidean_is_normalized (bool) |
| Hamming distance | hamming(bit varying, bit varying) returns float8; hamming_text(text, text) returns float8 | `~@~` | no | pg_similarity.hamming_threshold (float8), pg_similarity.hamming_is_normalized (bool) |
| Jaccard coefficient | jaccard(text, text) returns float8 | `~??` | yes | pg_similarity.jaccard_tokenizer (enum), pg_similarity.jaccard_threshold (float8), pg_similarity.jaccard_is_normalized (bool) |
| Jaro distance | jaro(text, text) returns float8 | `~%%` | no | pg_similarity.jaro_threshold (float8), pg_similarity.jaro_is_normalized (bool) |
| Jaro-Winkler distance | jarowinkler(text, text) returns float8 | `~@@` | no | pg_similarity.jarowinkler_threshold (float8), pg_similarity.jarowinkler_is_normalized (bool) |
| Levenshtein distance | lev(text, text) returns float8 | `~==` | no | pg_similarity.levenshtein_threshold (float8), pg_similarity.levenshtein_is_normalized (bool) |
| Matching coefficient | matchingcoefficient(text, text) returns float8 | `~^^` | yes | pg_similarity.matching_tokenizer (enum), pg_similarity.matching_threshold (float8), pg_similarity.matching_is_normalized (bool) |
| Monge-Elkan coefficient | mongeelkan(text, text) returns float8 | `~\|\|` | no | pg_similarity.mongeelkan_tokenizer (enum), pg_similarity.mongeelkan_threshold (float8), pg_similarity.mongeelkan_is_normalized (bool) |
| Needleman-Wunsch coefficient | needlemanwunsch(text, text) returns float8 | `~#~` | no | pg_similarity.nw_threshold (float8), pg_similarity.nw_is_normalized (bool) |
| Overlap coefficient | overlapcoefficient(text, text) returns float8 | `~**` | yes | pg_similarity.overlap_tokenizer (enum), pg_similarity.overlap_threshold (float8), pg_similarity.overlap_is_normalized (bool) |
| Q-gram distance | qgram(text, text) returns float8 | `~~~` | yes | pg_similarity.qgram_threshold (float8), pg_similarity.qgram_is_normalized (bool) |
| Smith-Waterman coefficient | smithwaterman(text, text) returns float8 | `~=~` | no | pg_similarity.sw_threshold (float8), pg_similarity.sw_is_normalized (bool) |
| Smith-Waterman-Gotoh coefficient | smithwatermangotoh(text, text) returns float8 | `~!~` | no | pg_similarity.swg_threshold (float8), pg_similarity.swg_is_normalized (bool) |
| Soundex distance | soundex(text, text) returns float8 | `~*~` | no | |
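One way to see why a token-based measure can benefit from an inverted index such as GIN is that a row can only score above a positive threshold if it shares at least one token with the query. The index can therefore narrow the search to a small candidate set before any score is computed. The sketch below is a toy inverted index in Python, offered under that assumption. It is not PostgreSQL's GIN implementation or pg_similarity's operator class, and `TokenIndex` and `trigrams` are hypothetical names.

```python
from collections import defaultdict


def trigrams(text):
    """Lowercased, unpadded character trigrams; short strings become one token."""
    s = text.lower()
    if len(s) < 3:
        return {s}
    return {s[i:i + 3] for i in range(len(s) - 2)}


class TokenIndex:
    """Toy inverted index: token -> ids of the rows containing that token."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.rows = {}

    def add(self, row_id, text):
        self.rows[row_id] = text
        for tok in trigrams(text):
            self.postings[tok].add(row_id)

    def search(self, query, threshold):
        """Return (row_id, jaccard) pairs scoring >= threshold (threshold > 0)."""
        q_tokens = trigrams(query)
        # Step 1: use the index to collect rows sharing at least one token.
        candidates = set()
        for tok in q_tokens:
            candidates |= self.postings.get(tok, set())
        # Step 2: compute the exact score only for those candidates.
        hits = []
        for row_id in sorted(candidates):
            r_tokens = trigrams(self.rows[row_id])
            score = len(q_tokens & r_tokens) / len(q_tokens | r_tokens)
            if score >= threshold:
                hits.append((row_id, score))
        return hits


idx = TokenIndex()
idx.add(1, "postgresql")
idx.add(2, "postgres")
idx.add(3, "mysql")
print(idx.search("postgres", 0.7))
```

In this run the row "mysql" is never scored, because it shares no trigram with "postgres". Measures that depend on character order or alignment, such as edit distances, do not decompose into independent tokens this way, which is one plausible reading of the "no" entries in the index column.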

References

http://pgsimilarity.projects.pgfoundry.org/

https://github.com/eulerto/pg_similarity

https://www.pgcon.org/2009/schedule/attachments/108_pg_similarity.pdf

http://www.sigaev.ru/git/gitweb.cgi?p=smlar.git;a=summary

https://github.com/postgrespro/rum

https://www.postgresql.org/docs/9.6/static/fuzzystrmatch.html

https://www.postgresql.org/docs/9.6/static/pgtrgm.html
