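IK Analyzer
Download: https://github.com/medcl/elasticsearch-analysis-ik
You can also pick a package from the releases page: https://github.com/medcl/elasticsearch-analysis-ik/releases
Packages from the releases page can be used directly after download, so they are the recommended option.
Choose the analyzer version that matches your elasticsearch version.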

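Go to the corresponding page and download the file.
Find the downloaded file, right-click it, and extract it to the current folder.
Enter the folder, open a command prompt (cmd) there, and build the package with Maven.
Run the packaging command (Maven must already be installed):
mvn package
When packaging finishes, the current directory contains a new target folder; open it.
Open the releases folder.
Right-click the archive and extract it to the current folder.
Enter the extracted folder and copy all of the files.
Find the elasticsearch installation directory, create an ik folder under plugins (any name works; pick one that is easy to remember), and paste the copied files into it.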
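The copy step can also be scripted. The following is a minimal Python sketch, assuming the plugin archive has already been extracted; the function name install_plugin and both paths are hypothetical placeholders, not part of the plugin or of elasticsearch.

```python
import shutil
from pathlib import Path


def install_plugin(extracted_dir, es_home, plugin_name="ik"):
    """Copy an extracted plugin folder into <es_home>/plugins/<plugin_name>.

    All three arguments are illustrative; adjust them to your own layout.
    """
    source = Path(extracted_dir)
    target = Path(es_home) / "plugins" / plugin_name
    # dirs_exist_ok (Python 3.8+) lets the copy run even if the folder already exists.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

Plugins are loaded when a node starts, so restart elasticsearch after copying the files.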
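Pinyin Analyzer
Download: https://github.com/medcl/elasticsearch-analysis-pinyin
You can also pick a package from the releases page: https://github.com/medcl/elasticsearch-analysis-pinyin/releases
The download and installation process is exactly the same as for the IK analyzer; follow the steps above.
Final result
Testing the analyzers
Built-in elasticsearch analyzer (the sample text means "I am a Java programmer")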
GET http://localhost:9200/_analyze?pretty=true
{
  "analyzer" : "standard",
  "text" : "我是一名java程式員"
}
Result:
{
  "tokens": [
    {"token": "我", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0},
    {"token": "是", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1},
    {"token": "一", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2},
    {"token": "名", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3},
    {"token": "java", "start_offset": 4, "end_offset": 8, "type": "<ALPHANUM>", "position": 4},
    {"token": "程", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 5},
    {"token": "序", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 6},
    {"token": "員", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 7}
  ]
}
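The same request can be issued from a script. This is a minimal sketch using only the Python standard library; the helper names are hypothetical, and it sends the body with POST (which the _analyze endpoint accepts as well as GET). The analyze function needs a running node on localhost:9200; the other two helpers work offline.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://localhost:9200"):
    """Build (but do not send) a request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze?pretty=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response):
    """Return just the token strings from a parsed _analyze response."""
    return [tok["token"] for tok in response["tokens"]]


def analyze(analyzer, text):
    """Send the request and return the token strings; needs a running node."""
    with urllib.request.urlopen(build_analyze_request(analyzer, text)) as resp:
        return token_texts(json.load(resp))
```

For example, analyze("standard", "我是一名java程式員") would return the token strings from a response like the one above.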
Using the ik_max_word analyzer
ik_max_word splits the text at the finest granularity, producing as many words as possible.
GET http://localhost:9200/_analyze?pretty=true
{
  "analyzer" : "ik_max_word",
  "text" : "我是一名java程式員"
}
Result:
{
  "tokens": [
    {"token": "我", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0},
    {"token": "是", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1},
    {"token": "一名", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2},
    {"token": "一", "start_offset": 2, "end_offset": 3, "type": "TYPE_CNUM", "position": 3},
    {"token": "名", "start_offset": 3, "end_offset": 4, "type": "COUNT", "position": 4},
    {"token": "java", "start_offset": 4, "end_offset": 8, "type": "ENGLISH", "position": 5},
    {"token": "程式員", "start_offset": 8, "end_offset": 11, "type": "CN_WORD", "position": 6},
    {"token": "程式", "start_offset": 8, "end_offset": 10, "type": "CN_WORD", "position": 7},
    {"token": "員", "start_offset": 10, "end_offset": 11, "type": "CN_CHAR", "position": 8}
  ]
}
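The finest-granularity behaviour shows up in the offsets: several tokens cover the same characters. The sketch below is only an illustration of how to read the output; the function name overlapping_pairs is hypothetical, and the offsets are copied from the ik_max_word result above.

```python
def overlapping_pairs(tokens):
    """Return (a, b) token-text pairs whose character ranges overlap."""
    pairs = []
    for i, a in enumerate(tokens):
        for b in tokens[i + 1:]:
            # Half-open ranges [start, end) overlap when each starts before the other ends.
            if a["start_offset"] < b["end_offset"] and b["start_offset"] < a["end_offset"]:
                pairs.append((a["token"], b["token"]))
    return pairs


# (token, start_offset, end_offset) copied from the ik_max_word output above.
rows = [
    ("我", 0, 1), ("是", 1, 2), ("一名", 2, 4), ("一", 2, 3), ("名", 3, 4),
    ("java", 4, 8), ("程式員", 8, 11), ("程式", 8, 10), ("員", 10, 11),
]
ik_max_word_tokens = [
    {"token": t, "start_offset": s, "end_offset": e} for t, s, e in rows
]

for a, b in overlapping_pairs(ik_max_word_tokens):
    print(a, "overlaps", b)
```

With this data, the two-character and three-character words each overlap the shorter tokens contained in them, which is what lets a search match on either the long word or its parts.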
Using the ik_smart analyzer
ik_smart splits the text at the coarsest granularity; text already assigned to one word is not reused by another word.
GET http://localhost:9200/_analyze?pretty=true
{
  "analyzer" : "ik_smart",
  "text" : "我是一名java程式員"
}
Result:
{
  "tokens": [
    {"token": "我", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0},
    {"token": "是", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1},
    {"token": "一名", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2},
    {"token": "java", "start_offset": 4, "end_offset": 8, "type": "ENGLISH", "position": 3},
    {"token": "程式員", "start_offset": 8, "end_offset": 11, "type": "CN_WORD", "position": 4}
  ]
}
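To see what the coarser mode leaves out, the two token lists can be compared directly. This is a small illustrative sketch; the function name extra_tokens is hypothetical, and both lists are copied from the ik_max_word and ik_smart results above.

```python
def extra_tokens(fine, coarse):
    """Tokens in the fine-grained list that the coarse list does not contain, in order."""
    coarse_set = set(coarse)
    return [tok for tok in fine if tok not in coarse_set]


ik_max_word = ["我", "是", "一名", "一", "名", "java", "程式員", "程式", "員"]
ik_smart = ["我", "是", "一名", "java", "程式員"]

print(extra_tokens(ik_max_word, ik_smart))  # ['一', '名', '程式', '員']
```

The extra tokens are exactly the shorter pieces nested inside longer words, so ik_smart yields a smaller index while ik_max_word matches more partial queries.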
Using the pinyin analyzer
GET http://localhost:9200/_analyze?pretty=true
{
  "analyzer" : "pinyin",
  "text" : "我是一名java程式員"
}
Result:
{
  "tokens": [
    {"token": "wo", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0},
    {"token": "wsymjavacxy", "start_offset": 0, "end_offset": 11, "type": "word", "position": 0},
    {"token": "shi", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1},
    {"token": "yi", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2},
    {"token": "ming", "start_offset": 3, "end_offset": 4, "type": "word", "position": 3},
    {"token": "ja", "start_offset": 4, "end_offset": 6, "type": "word", "position": 4},
    {"token": "v", "start_offset": 6, "end_offset": 7, "type": "word", "position": 5},
    {"token": "a", "start_offset": 7, "end_offset": 8, "type": "word", "position": 6},
    {"token": "cheng", "start_offset": 8, "end_offset": 9, "type": "word", "position": 7},
    {"token": "xu", "start_offset": 9, "end_offset": 10, "type": "word", "position": 8},
    {"token": "yuan", "start_offset": 10, "end_offset": 11, "type": "word", "position": 9}
  ]
}
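Note that two tokens share position 0 in this output: the syllable "wo" and "wsymjavacxy", which spans the whole text and appears to be built from the first letter of each syllable. Grouping tokens by position makes this easy to spot. The sketch below is illustrative only; the function name group_by_position is hypothetical, and the sample holds the first four tokens of the result above.

```python
from collections import defaultdict


def group_by_position(tokens):
    """Map each position to the list of token strings emitted at it."""
    groups = defaultdict(list)
    for tok in tokens:
        groups[tok["position"]].append(tok["token"])
    return dict(groups)


# First four tokens of the pinyin output above (offsets and type omitted).
sample = [
    {"token": "wo", "position": 0},
    {"token": "wsymjavacxy", "position": 0},
    {"token": "shi", "position": 1},
    {"token": "yi", "position": 2},
]

print(group_by_position(sample))
```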
Configuring IK + pinyin analysis
Create the index and type
PUT http://localhost:9200/demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_pinyin_analyzer": { // analyzer name, user-defined
          "type": "custom", // "custom" marks a user-defined analyzer
          "tokenizer": "ik_max_word", // tokenization strategy
          "filter": ["my_pinyin", "word_delimiter"] // convert tokens to pinyin and split on word delimiters
        }
      },
      "filter": {
        "my_pinyin": {
          "type": "pinyin",
          "first_letter": "prefix",
          "padding_char": " "
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "subject": {
          "type": "keyword",
          "fields": {
            "pinyin": {
              "type": "text",
              "store": false,
              "term_vector": "with_positions_offsets",
              "analyzer": "ik_pinyin_analyzer",
              "boost": 10
            }
          }
        }
      }
    }
  }
}
Index a document
POST http://localhost:9200/demo/article
{
  "subject": "我是一名java程式員"
}
Query in Chinese
POST http://localhost:9200/demo/article/_search
{
  "query": {
    "match": {
      "subject.pinyin": "程式員"
    }
  }
}
Result:
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 14.584841,
    "hits": [
      {
        "_index": "demo",
        "_type": "article",
        "_id": "AWIeeeTJ2JGj7w9eQwEK",
        "_score": 14.584841,
        "_source": {
          "subject": "我是一名java程式員"
        }
      }
    ]
  }
}
Query in pinyin
POST http://localhost:9200/demo/article/_search
{
  "query": {
    "match": {
      "subject.pinyin": "chengxuyuan"
    }
  }
}
Result:
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 4.3648314,
    "hits": [
      {
        "_index": "demo",
        "_type": "article",
        "_id": "AWIeeeTJ2JGj7w9eQwEK",
        "_score": 4.3648314,
        "_source": {
          "subject": "我是一名java程式員"
        }
      }
    ]
  }
}
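Both queries differ only in the search text, so they can share one helper. This is a minimal sketch with the Python standard library; the helper names are hypothetical, and the default field and path are taken from the requests above.

```python
import json
import urllib.request


def build_search_request(text, field="subject.pinyin",
                         base_url="http://localhost:9200",
                         path="/demo/article/_search"):
    """Build (but do not send) a match query against the given field."""
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_hits(response):
    """Return (_score, subject) for every hit in a parsed _search response."""
    return [(hit["_score"], hit["_source"]["subject"])
            for hit in response["hits"]["hits"]]
```

Calling summarize_hits on each of the two responses above gives one hit per query, with the Chinese query scoring higher (14.584841) than the pinyin query (4.3648314).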