How to Solve Spelling Correction in Database Word Segmentation - Restoring Plural, Adjective, and Verb Inflections with PostgreSQL Hunspell Dictionaries

2021-11-07 15:50:11

postgresql, hunspell, word segmentation, plural normalization, dictionary

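In English, nouns usually have plural forms, adjectives have comparative and superlative forms, and verbs have past-tense and participle forms. Examples include large, larger, largest, stories, eating, did, doing, and hacked.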

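These inflected forms can cause trouble for full-text search, because the same word appears in several surface forms. Let's look at how PostgreSQL's default text search configuration handles them.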

Take the built-in english text search configuration as an example.
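The behaviour can be inspected with `ts_debug`, which reports the token type, the dictionaries consulted, and the resulting lexemes for each word. This is a minimal sketch using the sample words from above; the exact output depends on the PostgreSQL version and is not reproduced here.

```sql
-- Show how the built-in "english" configuration tokenizes and
-- normalizes each sample word.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('english',
              'large, larger, largest, stories, eating, did, doing, hacked');
```

The `lexemes` column is the one to compare: it shows what each word is reduced to before it is stored in a `tsvector`.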

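Clearly, it does not handle large, larger, largest, and stories well: the default stemmer cuts words down to stems rather than mapping the different forms back to a single dictionary word.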

The token types supported by the default parser can also be listed.
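One way to list them is the built-in `ts_token_type` function, shown here as a short sketch against the parser named `default`:

```sql
-- List every token type the default parser can emit
-- (columns: tokid, alias, description).
SELECT * FROM ts_token_type('default');
```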

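In fact, PostgreSQL ships an Ispell dictionary template for exactly this kind of word-form normalization, and the 9.6 documentation linked below also covers Hunspell-format files. See: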

<a href="https://www.postgresql.org/docs/9.6/static/textsearch-dictionaries.html#textsearch-ispell-dictionary">https://www.postgresql.org/docs/9.6/static/textsearch-dictionaries.html#textsearch-ispell-dictionary</a>

The normalization is driven by two files: an affix file and a dict file.

Example
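A dictionary of this kind is created from the `ispell` template. The sketch below follows the pattern in the PostgreSQL documentation and rests on several assumptions: files named `en_us.dict` and `en_us.affix` have already been placed in the server's `tsearch_data` directory in the database encoding, and the dictionary name `my_english_ispell` is purely illustrative.

```sql
-- Assumption: en_us.dict and en_us.affix exist under $SHAREDIR/tsearch_data.
-- "my_english_ispell" is a hypothetical name chosen for this example.
CREATE TEXT SEARCH DICTIONARY my_english_ispell (
    TEMPLATE  = ispell,
    DictFile  = en_us,     -- resolves to en_us.dict
    AffFile   = en_us,     -- resolves to en_us.affix
    StopWords = english    -- built-in english.stop list
);

-- Ask the new dictionary to normalize a single word.
SELECT ts_lexize('my_english_ispell', 'stories');
```

`ts_lexize` returns NULL when the dictionary does not recognize the word, so an Ispell dictionary is normally followed by a stemmer in the configuration's dictionary list.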

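Postgres Professional has open-sourced an extension that packages affix and dict files for several languages, which can be used for this kind of correction.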

<a href="https://github.com/postgrespro/hunspell_dicts">https://github.com/postgrespro/hunspell_dicts</a>

The dictionaries currently supported are listed below.

module            dictionary / configuration
hunspell_de_de    german_hunspell
hunspell_en_us    english_hunspell
hunspell_fr       french_hunspell
hunspell_nl_nl    dutch_hunspell
hunspell_nn_no    norwegian_hunspell
hunspell_ru_ru    russian_hunspell

The dictionaries are installed through these modules.
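Each module is a regular PostgreSQL extension, so once its files are on the server it can be enabled per database. The sketch below assumes the `hunspell_en_us` module from the repository above has already been built and installed following that repository's instructions.

```sql
-- Assumption: the hunspell_en_us module is already installed on the server.
CREATE EXTENSION hunspell_en_us;

-- Confirm that the dictionary listed in the table above now exists.
SELECT dictname FROM pg_ts_dict WHERE dictname = 'english_hunspell';
```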

With the dictionary installed, the plural and adjective problem can be solved.
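To check the effect, the same sample words can be run through the Hunspell-backed dictionary and configuration. This sketch assumes the `hunspell_en_us` extension has been created in the current database; results are not reproduced here because they depend on the dictionary files shipped with the module.

```sql
-- Normalize a single word with the Hunspell dictionary.
SELECT ts_lexize('english_hunspell', 'stories');

-- Re-run the earlier sample sentence with the english_hunspell configuration
-- and compare the lexemes column with the built-in "english" configuration.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('english_hunspell',
              'large, larger, largest, stories, eating, did, doing, hacked');

-- Check whether a query for the singular form matches the plural form.
SELECT to_tsvector('english_hunspell', 'stories')
       @@ to_tsquery('english_hunspell', 'story') AS matches;
```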

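A small extension like this reflects the PostgreSQL community ecosystem and the community's enthusiasm for contributing. There are many more examples like it: a problem that would take considerable time to solve in application code may already have an extension in the PostgreSQL community that solves it for you. Come and use PostgreSQL.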
