Configuring IK in Elasticsearch to flexibly match both single Chinese characters and phrases

1. Environment

elasticsearch7.9.3
elasticsearch-analysis-ik-7.9.3
kibana7.9.3 (not relevant to this requirement)

2. Analysis

Suppose Elasticsearch stores data with the IK analyzer, using an index configuration such as:

{
  "settings": {
    "number_of_replicas": 1,
    "number_of_shards": 5
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
        "index": true,
        "store": false
      }
    }
  }
}

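With this mapping, the IK analyzer only performs word segmentation by default: a single character that is part of a recognized word is not stored as its own term in the inverted index.
So to support full-text search on a single character, an extra dictionary has to be added.
The search side is optimized as well: the query string is segmented at the coarsest granularity (ik_smart). For example, a search for "手機殼" (phone case) is not split into "手機" (phone) and "手機殼", which keeps results for "手機" from showing up when searching for "手機殼".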
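
The effect of the extra dictionary can be illustrated with a toy model. The sketch below is not IK's actual algorithm: `max_word_tokens` and `smart_tokens` are hypothetical stand-ins (all dictionary matches versus greedy longest match) that only mimic the fine-grained and coarse-grained behaviour described above, and the two small dictionaries are made up for the example.

```python
def max_word_tokens(text, dictionary):
    """Toy fine-grained segmentation: emit every dictionary entry found in text.

    Characters not covered by any entry fall back to single-character tokens.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    matches = []                      # (start, end) spans of dictionary hits
    covered = [False] * len(text)
    for start in range(len(text)):
        for end in range(min(len(text), start + max_len), start, -1):
            if text[start:end] in dictionary:
                matches.append((start, end))
                for i in range(start, end):
                    covered[i] = True
    for i, is_covered in enumerate(covered):
        if not is_covered:
            matches.append((i, i + 1))
    # Order by start offset, longer spans first.
    matches.sort(key=lambda span: (span[0], -span[1]))
    return [text[start:end] for start, end in matches]


def smart_tokens(text, dictionary):
    """Toy coarse-grained segmentation: greedy longest match, left to right."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens, pos = [], 0
    while pos < len(text):
        for end in range(min(len(text), pos + max_len), pos, -1):
            # Fall back to a single character when nothing longer matches.
            if text[pos:end] in dictionary or end == pos + 1:
                tokens.append(text[pos:end])
                pos = end
                break
    return tokens


base_dict = {"手機", "手機殼"}                      # words only
with_single_chars = base_dict | {"手", "機", "殼"}  # words plus single characters

indexed_before = max_word_tokens("手機殼", base_dict)
indexed_after = max_word_tokens("手機殼", with_single_chars)
query_terms = smart_tokens("手機殼", with_single_chars)

print(indexed_before)  # terms indexed without the single-character dictionary
print(indexed_after)   # terms indexed with it
print(query_terms)     # terms produced for the search string
```

In this toy model, a query for "手" can only hit the document once the single characters are in the dictionary, while a query for "手機殼" still produces one coarse term and therefore does not match documents that only contain "手機".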

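3. analysis-ik configuration

Edit the configuration file (IKAnalyzer.cfg.xml) under elasticsearch-7.9.3\plugins\elasticsearch-analysis-ik-7.9.3\config
Relative paths in the configuration are resolved against this config directory.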

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
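    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionaries here -->
    <entry key="ext_dict">extra_single_word.dic</entry>
    <!-- Users can configure their own extension stopword dictionaries here -->
    <entry key="ext_stopwords"></entry>
    <!-- Users can configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Users can configure a remote extension stopword dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->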
</properties>
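
Before restarting, it can help to confirm that the dictionary named in ext_dict really exists next to the configuration file. The following is a minimal sketch under stated assumptions: `read_entry` and `missing_dictionaries` are hypothetical helper names, `SAMPLE_CONFIG` is a shortened copy of the file above, and multiple dictionaries are assumed to be separated by semicolons as in the IK plugin's README.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Shortened copy of the configuration shown above (an assumption for the demo).
SAMPLE_CONFIG = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">extra_single_word.dic</entry>
    <entry key="ext_stopwords"></entry>
</properties>
"""


def read_entry(xml_bytes, key):
    """Return the file names listed under the given <entry key="...">."""
    root = ET.fromstring(xml_bytes)
    for entry in root.findall("entry"):
        if entry.get("key") == key:
            value = entry.text or ""
            # Several dictionaries may be listed, separated by ';'.
            return [part.strip() for part in value.split(";") if part.strip()]
    return []


def missing_dictionaries(config_dir, xml_bytes):
    """List ext_dict files that do not exist relative to the config directory."""
    return [
        name
        for name in read_entry(xml_bytes, "ext_dict")
        if not (Path(config_dir) / name).is_file()
    ]


ext_dicts = read_entry(SAMPLE_CONFIG, "ext_dict")
stop_dicts = read_entry(SAMPLE_CONFIG, "ext_stopwords")
print(ext_dicts, stop_dicts)
```

To check a real installation, read the bytes of the actual configuration file and pass its directory to `missing_dictionaries`; an empty list means every configured dictionary was found.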

4. Restart ES and rebuild the index

Create the following index again:

{
  "settings": {
    "number_of_replicas": 1,
    "number_of_shards": 5
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
        "index": true,
        "store": false
      }
    }
  }
}

From this point on, newly indexed data is segmented by the IK analyzer together with the extra (single-character) dictionary.
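
Documents indexed before the dictionary change keep their old terms, so they have to be written again. One possible way, sketched here as an assumption rather than as the original author's procedure, is to create a new index with the same mapping and copy the data with the _reindex API. The index names `articles_v1` and `articles_v2` and the helper `console_request` are hypothetical; the script only prints requests that could be pasted into the Kibana console.

```python
import json

OLD_INDEX = "articles_v1"  # hypothetical name of the existing index
NEW_INDEX = "articles_v2"  # hypothetical name of the rebuilt index

# Same settings and mapping as shown above.
create_body = {
    "settings": {"number_of_replicas": 1, "number_of_shards": 5},
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart",
                "index": True,
                "store": False,
            }
        }
    },
}

# Copy every document from the old index into the new one.
reindex_body = {"source": {"index": OLD_INDEX}, "dest": {"index": NEW_INDEX}}


def console_request(method, path, body):
    """Format a request in the style used by the Kibana console."""
    return f"{method} {path}\n{json.dumps(body, ensure_ascii=False, indent=2)}"


create_request = console_request("PUT", f"/{NEW_INDEX}", create_body)
reindex_request = console_request("POST", "/_reindex", reindex_body)
print(create_request)
print(reindex_request)
```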

Testing the analyzer in Kibana

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": ["我們是共産主義接班人。"]
}

Analysis result

{
  "tokens" : [
    {
      "token" : "我們",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "們",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "共産主義",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "共産",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "共",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "産",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "主義",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "主",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "義",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "接班人",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 11
    },
    {
      "token" : "接班",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 12
    },
    {
      "token" : "接",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 13
    },
    {
      "token" : "班",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 14
    },
    {
      "token" : "人",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 15
    }
  ]
}
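
The same check can be scripted outside Kibana. The sketch below uses only the Python standard library and assumes an Elasticsearch node at http://localhost:9200 without authentication; `build_analyze_request`, `analyze`, and `single_char_tokens` are hypothetical helper names, and `sample_response` is an excerpt of the result above.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    """Build the POST _analyze request without sending it."""
    body = json.dumps({"analyzer": analyzer, "text": [text]}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(base_url, analyzer, text):
    """Send the request and return the decoded JSON response (needs a running node)."""
    request = build_analyze_request(base_url, analyzer, text)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def single_char_tokens(response):
    """Return the tokens of an _analyze response that are exactly one character long."""
    return [item["token"] for item in response["tokens"] if len(item["token"]) == 1]


# Excerpt of the analysis result shown above.
sample_response = {
    "tokens": [
        {"token": "我們", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
        {"token": "我", "start_offset": 0, "end_offset": 1, "type": "CN_WORD", "position": 1},
        {"token": "們", "start_offset": 1, "end_offset": 2, "type": "CN_WORD", "position": 2},
    ]
}

request = build_analyze_request("http://localhost:9200", "ik_max_word", "我們是共産主義接班人。")
print(request.get_method(), request.full_url)
print(single_char_tokens(sample_response))
```

If the single-character dictionary was not loaded, `single_char_tokens(analyze(...))` would be expected to miss characters such as "我" that sit inside a longer word.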

Reference:

https://blog.csdn.net/nazeniwaresakini/article/details/104220237
