Elasticsearch match and match_phrase queries

2023-03-19 04:55:35

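Problem:

The index contains the field value 『第十人民醫院』 (Tenth People's Hospital). Analyzing it with the IK analyzer produces the following tokens: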

POST http://localhost:9200/development_hospitals/_analyze?pretty&field=hospital.names&analyzer=ik

{
  "tokens": [
    {
      "token": "第十",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "十人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "十",
      "start_offset": 1,
      "end_offset": 2,
      "type": "TYPE_CNUM",
      "position": 2
    },
    {
      "token": "人民醫院",
      "start_offset": 2,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "人",
      "start_offset": 2,
      "end_offset": 3,
      "type": "COUNT",
      "position": 5
    },
    {
      "token": "民醫院",
      "start_offset": 3,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "醫院",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 7
    }
  ]
}

A match query built in Postman returns results, but a match_phrase query for 『第十』 returns nothing at all.

Analysis:

Reference: Elasticsearch: The Definitive Guide [2.x] | Elastic

Phrase search depends on the positions of the terms. Analyzing the query 『第十』 with ik_max_word gives:

{
  "tokens": [
    {
      "token": "第十",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "十",
      "start_offset": 1,
      "end_offset": 2,
      "type": "TYPE_CNUM",
      "position": 1
    }
  ]
}

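Both 『第十』 and 『十』 exist as terms in the index, but match_phrase also requires the relative positions of the analyzed terms to match exactly. In the query, 『第十』 and 『十』 sit at adjacent positions 0 and 1. In the ik_max_word analysis of 『第十人民醫院』, the token 『十人』 falls between them, which puts them at positions 0 and 2, so the phrase does not match.

Solution:

Using ik_smart avoids this problem. The ik_smart results for 『第十人民醫院』 and 『第十』 are, respectively: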

{
  "tokens": [
    {
      "token": "第十",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "人民醫院",
      "start_offset": 2,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

{
  "tokens": [
    {
      "token": "第十",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    }
  ]
}

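With these tokens the phrase matches reliably: the query's single term 『第十』 is at position 0 in both the document and the query.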
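
The position rule can be checked with a small model. The sketch below is a simplification of how a phrase query with no slop compares positions, not Elasticsearch's actual implementation; the function name `phrase_matches` is hypothetical, and the (token, position) pairs are copied from the analyzer outputs shown in this article.

```python
def phrase_matches(doc_tokens, query_tokens):
    """Return True if the query terms occur in the document
    with the same relative positions (simplified, slop = 0)."""
    # Map each term to the set of positions where it occurs in the document.
    positions = {}
    for term, pos in doc_tokens:
        positions.setdefault(term, set()).add(pos)

    first_term, first_pos = query_tokens[0]
    for start in positions.get(first_term, ()):
        # Every query term must sit at the same offset from the first term.
        if all(start + (pos - first_pos) in positions.get(term, ())
               for term, pos in query_tokens):
            return True
    return False


# ik_max_word output for the document and for the query
doc_max_word = [("第十", 0), ("十人", 1), ("十", 2), ("人民醫院", 3),
                ("人民", 4), ("人", 5), ("民醫院", 6), ("醫院", 7)]
query_max_word = [("第十", 0), ("十", 1)]

# ik_smart output for the document and for the query
doc_smart = [("第十", 0), ("人民醫院", 1)]
query_smart = [("第十", 0)]

print(phrase_matches(doc_max_word, query_max_word))  # False: 十 is at 2, not 1
print(phrase_matches(doc_smart, query_smart))        # True
```

The first call fails only because of the position gap introduced by 『十人』; both query terms are present in the document.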

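Best practice:

match_phrase is very strict, so it also misses relevant results. In my opinion it is better to combine both query types in a bool query and give the match_phrase clause a boost. For example, index name twice with two different analyzers, query the ik_smart field with match_phrase and the ik_max_word field with match: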

{
  "query": {
    "bool": {
      "should": [
          {"match_phrase": {"name1": {"query": "第十", "boost": 2}}},
          {"match": {"name2": "第十"}}
      ]
    }
  },
  "explain": true
}
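
As a rough sketch of how the two-field setup might look, the Python below builds a mapping and the corresponding query body as plain dictionaries. The field names `name1` and `name2` come from the query above; the mapping itself is an assumption (the article does not show it), it uses the `text` type from Elasticsearch 5 and later (on 2.x the type would be `string`), and `build_query` is a hypothetical helper.

```python
import json

# Assumed mapping: the same hospital name is indexed into two fields,
# one analyzed with ik_smart and one with ik_max_word.
mapping = {
    "properties": {
        "name1": {"type": "text", "analyzer": "ik_smart"},
        "name2": {"type": "text", "analyzer": "ik_max_word"},
    }
}


def build_query(keyword, phrase_boost=2):
    """Combine a boosted match_phrase on name1 with a plain match on name2."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Strict phrase match on the ik_smart field, weighted higher.
                    {"match_phrase": {"name1": {"query": keyword,
                                                "boost": phrase_boost}}},
                    # Looser term match on the ik_max_word field for recall.
                    {"match": {"name2": keyword}},
                ]
            }
        },
        "explain": True,
    }


print(json.dumps(build_query("第十"), ensure_ascii=False, indent=2))
```

Because both clauses are in `should`, a document that matches only the looser `match` clause is still returned, while one that also matches the phrase scores higher.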


Reposted from: https://zhuanlan.zhihu.com/p/25970549
