1. Environment
- Elasticsearch 7.9.3
- elasticsearch-analysis-ik-7.9.3
- Kibana 7.9.3 (not relevant to this requirement)
2. Analysis
- When ES indexes data with the IK analyzer, the index is configured as follows:
{
  "settings": {
    "number_of_replicas": 1,
    "number_of_shards": 5
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
        "index": true,
        "store": false
      }
    }
  }
}
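- By default the IK analyzer only emits the words it finds in its dictionaries; a single character inside such a word is not stored as a separate term in the inverted index.
- So to support full-text search on single characters, an extra dictionary of single characters has to be added.
- Search is optimized by analyzing the query text at the coarsest granularity (ik_smart). For example, a search for "手機殼" (phone case) is not split into "手機" (phone) and "手機殼", so results that only match "手機" do not show up when searching for "手機殼".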
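The effect of using a fine-grained analyzer at index time and a coarse one at search time can be sketched with a toy model. The token sets below are hand-written assumptions for illustration, not real IK output, and the document ids are hypothetical; the matching rule mimics the default OR behaviour of a match query.

```python
# Toy model of index-time vs. search-time analysis.
# All token sets are illustrative assumptions, not real IK output.

# Terms the index analyzer (ik_max_word plus a single-character
# dictionary) might store for two hypothetical documents.
INDEX_TERMS = {
    "doc_phone": {"手機", "手", "機"},
    "doc_phone_case": {"手機殼", "手機", "手", "機", "殼"},
}

# Terms each search analyzer might produce for the query "手機殼".
QUERY_TERMS = {
    "ik_smart": {"手機殼"},
    "ik_max_word": {"手機殼", "手機", "手", "機", "殼"},
}


def matching_docs(query_terms):
    """Return ids of documents that share at least one term with the query."""
    return sorted(
        doc_id for doc_id, terms in INDEX_TERMS.items() if terms & query_terms
    )


for analyzer, terms in QUERY_TERMS.items():
    print(analyzer, "->", matching_docs(terms))
```

Under these assumptions, the coarse query analyzer matches only the phone-case document, while the fine-grained one would also pull in the plain phone document.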
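3. analysis-ik configuration
- Edit the IK configuration file (IKAnalyzer.cfg.xml) in elasticsearch-7.9.3\plugins\elasticsearch-analysis-ik-7.9.3\config
- Relative paths in the configuration are resolved against this config directory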
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
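<comment>IK Analyzer extension configuration</comment>
<!-- Users can configure their own extension dictionary here -->
<entry key="ext_dict">extra_single_word.dic</entry>
<!-- Users can configure their own extension stopword dictionary here -->
<entry key="ext_stopwords"></entry>
<!-- Users can configure a remote extension dictionary here -->
<!-- <entry key="remote_ext_dict">words_location</entry> -->
<!-- Users can configure a remote extension stopword dictionary here -->
<!-- <entry key="remote_ext_stopwords">words_location</entry> -->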
</properties>
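An IK dictionary file is plain UTF-8 text with one entry per line. The configuration above points at extra_single_word.dic; if a custom single-character list is wanted instead, a minimal sketch for generating one from sample text could look like the following. The function names and the output file name my_single_chars.dic are hypothetical, and the character range only covers the basic CJK Unified Ideographs block.

```python
from pathlib import Path


def single_chars(texts):
    """Collect the distinct CJK characters (U+4E00..U+9FFF) in the given
    strings, sorted by code point."""
    return sorted(
        {ch for text in texts for ch in text if "\u4e00" <= ch <= "\u9fff"}
    )


def write_dictionary(chars, path):
    """Write one entry per line, UTF-8 encoded."""
    Path(path).write_text("\n".join(chars) + "\n", encoding="utf-8")


if __name__ == "__main__":
    sample = ["我們是共産主義接班人。"]  # stand-in corpus
    # Hypothetical file name; it would then be listed under ext_dict.
    write_dictionary(single_chars(sample), "my_single_chars.dic")
```

Punctuation such as "。" falls outside the range and is skipped, so only ideographs end up in the file.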
4. Restart ES and rebuild the index
Create the index again with the following settings:
{
  "settings": {
    "number_of_replicas": 1,
    "number_of_shards": 5
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
        "index": true,
        "store": false
      }
    }
  }
}
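From this point on, newly indexed data is tokenized by the IK analyzer together with the extra (single-character) dictionary.
Test the analyzer in Kibana: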
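Documents indexed before the dictionary change keep their old terms, which is why the index is recreated. As a rough sketch, the drop-and-create step could be scripted with the Python standard library as below. The cluster URL, the index name "articles", and the helper names are assumptions for illustration; the sketch assumes a local node without authentication, and recreating an index deletes its data.

```python
import json
import urllib.error
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, no authentication
INDEX_NAME = "articles"           # hypothetical index name

INDEX_BODY = {
    "settings": {"number_of_replicas": 1, "number_of_shards": 5},
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart",
                "index": True,
                "store": False,
            }
        }
    },
}


def build_request(method, path, body=None):
    """Build (but do not send) a JSON request against the cluster."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/{path}",
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def recreate_index():
    """Drop the index if it exists, then create it again (deletes its data)."""
    try:
        urllib.request.urlopen(build_request("DELETE", INDEX_NAME)).close()
    except urllib.error.HTTPError as err:
        if err.code != 404:  # 404 only means there was nothing to delete
            raise
    with urllib.request.urlopen(
        build_request("PUT", INDEX_NAME, INDEX_BODY)
    ) as resp:
        return json.load(resp)
```

Calling recreate_index() after the restart would issue the DELETE and PUT in order; the source data then has to be indexed again.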
POST _analyze
{
"analyzer": "ik_max_word",
"text": ["我們是共産主義接班人。"]
}
Analysis result:
{
"tokens" : [
{
"token" : "我們",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "們",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "是",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "共産主義",
"start_offset" : 3,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "共産",
"start_offset" : 3,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "共",
"start_offset" : 3,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "産",
"start_offset" : 4,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "主義",
"start_offset" : 5,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 8
},
{
"token" : "主",
"start_offset" : 5,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 9
},
{
"token" : "義",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_WORD",
"position" : 10
},
{
"token" : "接班人",
"start_offset" : 7,
"end_offset" : 10,
"type" : "CN_WORD",
"position" : 11
},
{
"token" : "接班",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 12
},
{
"token" : "接",
"start_offset" : 7,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 13
},
{
"token" : "班",
"start_offset" : 8,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 14
},
{
"token" : "人",
"start_offset" : 9,
"end_offset" : 10,
"type" : "CN_WORD",
"position" : 15
}
]
}
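To check a result like the one above without reading it by eye, one option is to compare the input text against the returned tokens and list any character that did not come back as its own token. This is a minimal sketch: the helper name is hypothetical, the sample response is abbreviated to the "token" field, and only the basic CJK Unified Ideographs block is considered.

```python
def missing_single_chars(text, analyze_response):
    """Return the CJK characters of `text` that do not appear as
    standalone tokens in an _analyze response."""
    tokens = {t["token"] for t in analyze_response["tokens"]}
    return [
        ch for ch in text
        if "\u4e00" <= ch <= "\u9fff" and ch not in tokens
    ]


# Abbreviated stand-in for the JSON returned by POST _analyze.
sample_response = {
    "tokens": [
        {"token": "我們"},
        {"token": "我"},
        {"token": "們"},
        {"token": "是"},
    ]
}

print(missing_single_chars("我們是", sample_response))
```

An empty list means every character was emitted as a term; any character listed would not be searchable on its own.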
Reference:
https://blog.csdn.net/nazeniwaresakini/article/details/104220237