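Problem:
The index contains a field whose value is "第十人民醫院" (Tenth People's Hospital). Analyzing it with the IK analyzer produces the following tokens: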
POST http://localhost:9200/development_hospitals/_analyze?pretty&field=hospital.names&analyzer=ik
{
"tokens": [
{
"token": "第十",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "十人",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 1
},
{
"token": "十",
"start_offset": 1,
"end_offset": 2,
"type": "TYPE_CNUM",
"position": 2
},
{
"token": "人民醫院",
"start_offset": 2,
"end_offset": 6,
"type": "CN_WORD",
"position": 3
},
{
"token": "人民",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 4
},
{
"token": "人",
"start_offset": 2,
"end_offset": 3,
"type": "COUNT",
"position": 5
},
{
"token": "民醫院",
"start_offset": 3,
"end_offset": 6,
"type": "CN_WORD",
"position": 6
},
{
"token": "醫院",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 7
}
]
}
Building a match query for "第十" in Postman returns results,
but a match_phrase query for the same "第十" returns nothing.
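For reference, the two request bodies being compared might look like the sketch below. The field name hospital.names is an assumption taken from the _analyze call above, and the bodies would be sent to the _search endpoint of the development_hospitals index; the original requests are not shown in this post.

```python
import json

# Assumption: the queried field is the one passed to _analyze above.
FIELD = "hospital.names"

# A match query: any analyzed term of the query may match, positions are ignored.
match_body = {"query": {"match": {FIELD: "第十"}}}

# A match_phrase query: the analyzed terms must also appear at matching positions.
match_phrase_body = {"query": {"match_phrase": {FIELD: "第十"}}}

for body in (match_body, match_phrase_body):
    print(json.dumps(body, ensure_ascii=False))
```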
Analysis:
See the documentation: The Definitive Guide [2.x] | Elastic
A phrase search depends on the positions of the terms. Analyzing "第十" with ik_max_word gives:
{
"tokens": [
{
"token": "第十",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "十",
"start_offset": 1,
"end_offset": 2,
"type": "TYPE_CNUM",
"position": 1
}
]
}
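Both "第十" and "十" occur in the indexed field, but match_phrase also requires the relative positions of the analyzed terms to match exactly. In the query, "十" (position 1) immediately follows "第十" (position 0). When "第十人民醫院" is analyzed with ik_max_word, the token "十人" sits between "第十" (position 0) and "十" (position 2), so the phrase does not match.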
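The position check can be illustrated with a small simulation. This is only a simplified sketch of phrase matching with no slop, not Elasticsearch's actual implementation; the function name phrase_matches is hypothetical, and the token lists are copied from the ik_max_word output shown above.

```python
def phrase_matches(doc_tokens, query_tokens):
    """Return True if all query terms occur in the document with the same
    relative position offsets as in the query (a slop-0 phrase match)."""
    # Map each document term to the set of positions where it occurs.
    positions = {}
    for term, pos in doc_tokens:
        positions.setdefault(term, set()).add(pos)

    first_term, first_pos = query_tokens[0]
    # Try every occurrence of the first query term as a starting point.
    for start in positions.get(first_term, ()):
        if all(
            start + (pos - first_pos) in positions.get(term, ())
            for term, pos in query_tokens
        ):
            return True
    return False


# (token, position) pairs from the ik_max_word analysis of the document value.
doc_max_word = [
    ("第十", 0), ("十人", 1), ("十", 2), ("人民醫院", 3),
    ("人民", 4), ("人", 5), ("民醫院", 6), ("醫院", 7),
]
# (token, position) pairs from the ik_max_word analysis of the query "第十".
query_max_word = [("第十", 0), ("十", 1)]

# The query expects "十" one position after "第十", but the document has it two after.
print(phrase_matches(doc_max_word, query_max_word))  # False
```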
Solution:
Using ik_smart avoids this problem. The ik_smart results for "第十人民醫院" and for "第十" are, respectively:
{
"tokens": [
{
"token": "第十",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "人民醫院",
"start_offset": 2,
"end_offset": 6,
"type": "CN_WORD",
"position": 1
}
]
}
{
"tokens": [
{
"token": "第十",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
}
]
}
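A solid hit: the query's only token, "第十", is found at position 0 in the document.
Best practice:
match_phrase is very strict, so it also misses some relevant results. Personally, I think it is better to combine both approaches in a bool query and give the match_phrase clause a boost. For example, index name twice with two different analyzers, then use match_phrase on the ik_smart field and match on the ik_max_word field: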
{
"query": {
"bool": {
"should": [
{"match_phrase": {"name1": {"query": "第十", "boost": 2}}},
{"match": {"name2": "第十"}}
]
}
},
"explain": true
}
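An index mapping that supports the query above could be sketched as follows. This is an assumption-laden sketch rather than the author's actual mapping: it assumes name1 is the ik_smart field and name2 the ik_max_word field, that the application writes the same name into both fields, and that a recent Elasticsearch version with the text field type and typeless mappings is used (Elasticsearch 2.x would use the string type under a mapping type name instead). The helper build_name_mapping is hypothetical.

```python
import json


def build_name_mapping():
    """Build an index-creation body with two differently analyzed name fields."""
    return {
        "mappings": {
            "properties": {
                # Assumed target of the boosted match_phrase clause.
                "name1": {"type": "text", "analyzer": "ik_smart"},
                # Assumed target of the broader match clause.
                "name2": {"type": "text", "analyzer": "ik_max_word"},
            }
        }
    }


print(json.dumps(build_name_mapping(), ensure_ascii=False, indent=2))
```

With this layout, the strict phrase clause rewards exact matches through its boost, while the match clause on the finer-grained field still returns looser matches.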