
Configuring IK in Elasticsearch to Match Both Single Chinese Characters and Phrases

1. Environment

  • Elasticsearch 7.9.3
  • elasticsearch-analysis-ik 7.9.3
  • Kibana 7.9.3 (not required for this setup)

2. Analysis

  • Suppose Elasticsearch indexes data with the IK analyzer, using a mapping such as the following:
{
  "settings": {
    "number_of_replicas": 1,
    "number_of_shards": 5
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
        "index": true,
        "store": false
      }
    }
  }
}
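  • By default, the IK analyzer segments text using only its dictionaries, so a character that belongs to a dictionary word is generally not stored as a separate term in the inverted index.
  • Full-text search on single characters therefore requires an additional dictionary of single characters.
  • Searching is tuned separately: the query text is segmented at the coarsest granularity (ik_smart). For example, a search for "手機殼" (phone case) is not split into "手機" (phone) and "手機殼", so searching for phone cases does not return documents that only match "phone".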
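The effect of pairing ik_max_word at index time with ik_smart at search time can be illustrated with a toy inverted index. This is only a sketch: the token sets below are hand-written assumptions standing in for analyzer output, not results produced by IK, and search is a hypothetical helper that matches a document when it shares at least one term with the query.

```python
# Toy model of an inverted index. The token sets are hand-written
# assumptions that stand in for analyzer output; they are NOT produced by IK.

# Terms assumed to be stored at index time (fine-grained, ik_max_word-like).
INDEXED_DOCS = {
    "doc_phone": {"手機", "手", "機"},
    "doc_phone_case": {"手機殼", "手機", "手", "機", "殼"},
}

# Query terms assumed for the search text "手機殼" under two strategies.
QUERY_COARSE = {"手機殼"}                          # ik_smart-like: one coarse term
QUERY_FINE = {"手機殼", "手機", "手", "機", "殼"}  # ik_max_word-like: all sub-terms


def search(query_terms, docs):
    """Return the sorted ids of documents sharing at least one term with the query."""
    return sorted(doc_id for doc_id, terms in docs.items() if terms & query_terms)


if __name__ == "__main__":
    print(search(QUERY_COARSE, INDEXED_DOCS))  # only the phone-case document
    print(search(QUERY_FINE, INDEXED_DOCS))    # both documents
```

With the coarse query only the phone-case document matches, while the fine-grained query also pulls in the plain phone document, which is the behaviour the search_analyzer setting is meant to avoid.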

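3. analysis-ik configuration

  • Edit the IK configuration file (IKAnalyzer.cfg.xml) in elasticsearch-7.9.3\plugins\elasticsearch-analysis-ik-7.9.3\config
  • Relative paths in the configuration are resolved against this config directory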
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
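    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionaries here -->
    <entry key="ext_dict">extra_single_word.dic</entry>
    <!-- Users can configure their own extension stopword dictionaries here -->
    <entry key="ext_stopwords"></entry>
    <!-- Users can configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Users can configure a remote extension stopword dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->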
</properties>      
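The file named in ext_dict is a plain-text dictionary; IK expects UTF-8 text with one entry per line, so a single-character dictionary is simply a list of characters, one per line. As a rough sketch of how such a file could be generated and checked (the sample text and the helper names build_single_char_dict and invalid_entries are illustrative assumptions, not part of the plugin):

```python
from pathlib import Path


def build_single_char_dict(text):
    """Return dictionary content: each distinct CJK character of `text` on its own line."""
    seen = []
    for ch in text:
        # Keep only characters in the CJK Unified Ideographs block, once each.
        if "\u4e00" <= ch <= "\u9fff" and ch not in seen:
            seen.append(ch)
    return "\n".join(seen) + "\n"


def invalid_entries(dict_content):
    """Return the non-empty lines that are not exactly one character long."""
    return [line for line in dict_content.splitlines() if line and len(line) != 1]


if __name__ == "__main__":
    content = build_single_char_dict("我們是共産主義接班人。")
    # Hypothetical output location (current directory); in practice the file
    # sits in the plugin's config directory referenced by ext_dict.
    Path("extra_single_word.dic").write_text(content, encoding="utf-8")
    # An empty list means every entry is a single character.
    print(invalid_entries(content))
```

A real dictionary would cover far more characters than this sample sentence; the point is only the file format.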

4. Restart ES and rebuild the index

After restarting, create the index again with the following body:

{
  "settings": {
    "number_of_replicas": 1,
    "number_of_shards": 5
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
        "index": true,
        "store": false
      }
    }
  }
}

From this point on, newly indexed data is tokenized by the IK analyzer together with the extra single-character dictionary.
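The article shows only the request body for creating the index. As a hedged sketch of how it might be sent, the snippet below builds a PUT request with Python's standard library. The host http://localhost:9200 and the index name articles are assumptions for illustration; nothing is sent unless urlopen is called explicitly.

```python
import json
import urllib.request

# Request body matching the index definition above.
INDEX_BODY = {
    "settings": {"number_of_replicas": 1, "number_of_shards": 5},
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "ik_max_word",      # fine-grained tokens at index time
                "search_analyzer": "ik_smart",  # coarse tokens at search time
                "index": True,
                "store": False,
            }
        }
    },
}


def build_create_index_request(base_url, index_name):
    """Build (but do not send) the PUT request that creates the index."""
    return urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=json.dumps(INDEX_BODY).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Assumed host and index name; adjust to the actual cluster.
    request = build_create_index_request("http://localhost:9200", "articles")
    print(request.get_method(), request.full_url)
    # To actually create the index: urllib.request.urlopen(request)
```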

Testing the tokenization in Kibana

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": ["我們是共産主義接班人。"]
}      

Tokenization result

{
  "tokens" : [
    {
      "token" : "我們",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "們",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "是",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "共産主義",
      "start_offset" : 3,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "共産",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "共",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "産",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "主義",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "主",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "義",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "接班人",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 11
    },
    {
      "token" : "接班",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 12
    },
    {
      "token" : "接",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 13
    },
    {
      "token" : "班",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 14
    },
    {
      "token" : "人",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 15
    }
  ]
}      
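One way to confirm that the single-character dictionary took effect is to check that every Chinese character of the input appears as a one-character token in the _analyze response. The helper below is a minimal sketch; missing_single_chars is a hypothetical name, and the sample response is abbreviated from the result above.

```python
def missing_single_chars(text, tokens):
    """Return the CJK characters of `text` that have no one-character token."""
    single = {t["token"] for t in tokens if len(t["token"]) == 1}
    missing = []
    for ch in text:
        # Punctuation such as "。" falls outside this range and is ignored.
        if "\u4e00" <= ch <= "\u9fff" and ch not in single and ch not in missing:
            missing.append(ch)
    return missing


if __name__ == "__main__":
    # Abbreviated from the response above: only the tokens covering "我們是".
    response = {
        "tokens": [
            {"token": "我們", "start_offset": 0, "end_offset": 2},
            {"token": "我", "start_offset": 0, "end_offset": 1},
            {"token": "們", "start_offset": 1, "end_offset": 2},
            {"token": "是", "start_offset": 2, "end_offset": 3},
        ]
    }
    # An empty list means every character is searchable on its own.
    print(missing_single_chars("我們是", response["tokens"]))
```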

Reference:

https://blog.csdn.net/nazeniwaresakini/article/details/104220237
