Learning Elasticsearch (8): Installing the IK Chinese Analyzer and the Pinyin Analyzer

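IK analyzer

Download: https://github.com/medcl/elasticsearch-analysis-ik

You can also pick a build from the releases page: https://github.com/medcl/elasticsearch-analysis-ik/releases

The packages on the releases page can be used directly after download, so that is the recommended source.

Download the analyzer version that matches your Elasticsearch version.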
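The version match can be checked programmatically. The following is a minimal Python sketch under stated assumptions: it reads `version.number` from the JSON that a running node returns at its root endpoint, and it assumes the release archives are named `elasticsearch-analysis-ik-<version>.zip`, which should be verified against the releases page. The function names are hypothetical.

```python
import json
import urllib.request


def es_version(info: dict) -> str:
    """Return the version string from the root endpoint's JSON response."""
    return info["version"]["number"]


def ik_zip_name(version: str) -> str:
    """Assumed archive name of the IK release matching `version`."""
    return f"elasticsearch-analysis-ik-{version}.zip"


def fetch_info(base_url: str = "http://localhost:9200") -> dict:
    """Fetch the root endpoint of a running node (requires the node to be up)."""
    with urllib.request.urlopen(base_url) as resp:
        return json.load(resp)
```

With a node running locally, `ik_zip_name(es_version(fetch_info()))` gives the archive name to look for.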

Open the page for that version and download the file.

Find the downloaded file, right-click it, and extract it into the current folder.

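Open the extracted folder, start a command prompt (cmd) there, and build the package with Maven. This build step applies when you downloaded the source code; a prebuilt release archive only needs to be extracted into the plugins folder.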

Run the build command; this requires Maven to be installed.

Command:

mvn package      

When the build finishes, a new target folder appears in the current directory. Open it.

Open the releases folder inside it.

Right-click the archive there and extract it into the current folder.

Open the extracted folder and copy all of its files.

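Locate the Elasticsearch installation directory, create a folder named ik under plugins (any name works; pick one that is easy to remember), and paste the copied files into it.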
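The extract-and-copy steps can also be scripted. This is a minimal Python sketch, not the author's procedure: `install_plugin` is a hypothetical helper, and the paths passed to it are assumptions about your machine. It extracts a plugin archive straight into `plugins/<name>` under the Elasticsearch directory.

```python
import zipfile
from pathlib import Path


def install_plugin(zip_path: Path, es_home: Path, name: str = "ik") -> Path:
    """Extract a plugin archive into <es_home>/plugins/<name>."""
    target = es_home / "plugins" / name
    # Create plugins/<name> if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target
```

If an archive wraps its files in a single top-level folder, the contents of that folder would still have to be moved up into `plugins/<name>` by hand, as in the manual steps.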

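Pinyin analyzer

Download: https://github.com/medcl/elasticsearch-analysis-pinyin

You can also pick a build from the releases page: https://github.com/medcl/elasticsearch-analysis-pinyin/releases

Downloading and installing it works exactly as for the IK analyzer; follow the steps above.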

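Final result: the plugins folder contains one subfolder for each analyzer. Restart Elasticsearch so that it loads the newly installed plugins.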

Testing the analyzers

Result with the standard analyzer built into Elasticsearch

GET http://localhost:9200/_analyze?pretty=true
{
  "analyzer" : "standard",
  "text" : "我是一名java程序员"
}

The standard analyzer splits the Chinese text into single characters:

{
  "tokens": [
    { "token": "我", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "是", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "一", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "名", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "java", "start_offset": 4, "end_offset": 8, "type": "<ALPHANUM>", "position": 4 },
    { "token": "程", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "序", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 6 },
    { "token": "员", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 7 }
  ]
}
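The same `_analyze` call can be made from a script, which makes it easier to compare analyzers. This is a minimal Python sketch using only the standard library; the helper names are hypothetical, and the base URL assumes a local node as in the requests shown here. The body is sent with POST, which `_analyze` accepts as well as GET.

```python
import json
import urllib.request
from typing import List


def analyze_request(analyzer: str, text: str,
                    base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build an _analyze request for the given analyzer and text."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_strings(response: dict) -> List[str]:
    """Return only the token text from an _analyze response."""
    return [t["token"] for t in response["tokens"]]
```

Against a running node, `token_strings(json.load(urllib.request.urlopen(analyze_request("standard", text))))` returns the token list for `text`.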

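Analyzing with ik_max_word

ik_max_word splits the text at the finest granularity and produces as many words as possible, so the resulting tokens can overlap.

GET http://localhost:9200/_analyze?pretty=true
{
  "analyzer" : "ik_max_word",
  "text" : "我是一名java程序员"
}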

Result:

{
  "tokens": [
    { "token": "我", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0 },
    { "token": "是", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1 },
    { "token": "一名", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2 },
    { "token": "一", "start_offset": 2, "end_offset": 3, "type": "TYPE_CNUM", "position": 3 },
    { "token": "名", "start_offset": 3, "end_offset": 4, "type": "COUNT", "position": 4 },
    { "token": "java", "start_offset": 4, "end_offset": 8, "type": "ENGLISH", "position": 5 },
    { "token": "程序员", "start_offset": 8, "end_offset": 11, "type": "CN_WORD", "position": 6 },
    { "token": "程序", "start_offset": 8, "end_offset": 10, "type": "CN_WORD", "position": 7 },
    { "token": "员", "start_offset": 10, "end_offset": 11, "type": "CN_CHAR", "position": 8 }
  ]
}

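Analyzing with ik_smart

ik_smart splits the text at the coarsest granularity; characters that already belong to one word are not reused by another word, so the tokens do not overlap.

GET http://localhost:9200/_analyze?pretty=true
{
  "analyzer" : "ik_smart",
  "text" : "我是一名java程序员"
}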

Result:

{
  "tokens": [
    { "token": "我", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0 },
    { "token": "是", "start_offset": 1, "end_offset": 2, "type": "CN_CHAR", "position": 1 },
    { "token": "一名", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 2 },
    { "token": "java", "start_offset": 4, "end_offset": 8, "type": "ENGLISH", "position": 3 },
    { "token": "程序员", "start_offset": 8, "end_offset": 11, "type": "CN_WORD", "position": 4 }
  ]
}
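The difference between the two IK modes shows up in the token offsets: in the ik_max_word output several tokens cover the same characters, while the ik_smart tokens never do. A minimal Python sketch that checks this on an `_analyze` response follows; `has_overlaps` is a hypothetical helper, not part of the plugin.

```python
from typing import Dict, List


def has_overlaps(tokens: List[Dict]) -> bool:
    """Return True if any two tokens cover a common character range."""
    # Sort spans by start offset; if any overlap exists, some pair of
    # neighbours in this order overlaps as well.
    spans = sorted((t["start_offset"], t["end_offset"]) for t in tokens)
    return any(nxt[0] < cur[1] for cur, nxt in zip(spans, spans[1:]))
```

Applied to the `tokens` arrays shown above, the function reports overlaps for ik_max_word and none for ik_smart.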

Analyzing with pinyin

GET http://localhost:9200/_analyze?pretty=true
{
  "analyzer" : "pinyin",
  "text" : "我是一名java程序员"
}

Result:

{
  "tokens": [
    { "token": "wo", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
    { "token": "wsymjavacxy", "start_offset": 0, "end_offset": 11, "type": "word", "position": 0 },
    { "token": "shi", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1 },
    { "token": "yi", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2 },
    { "token": "ming", "start_offset": 3, "end_offset": 4, "type": "word", "position": 3 },
    { "token": "ja", "start_offset": 4, "end_offset": 6, "type": "word", "position": 4 },
    { "token": "v", "start_offset": 6, "end_offset": 7, "type": "word", "position": 5 },
    { "token": "a", "start_offset": 7, "end_offset": 8, "type": "word", "position": 6 },
    { "token": "cheng", "start_offset": 8, "end_offset": 9, "type": "word", "position": 7 },
    { "token": "xu", "start_offset": 9, "end_offset": 10, "type": "word", "position": 8 },
    { "token": "yuan", "start_offset": 10, "end_offset": 11, "type": "word", "position": 9 }
  ]
}
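Besides one token per syllable, the output above contains the token wsymjavacxy, which spans the whole text. It looks like the first letter of each Chinese character's pinyin with the Latin text kept unchanged. The Python sketch below reproduces that token from the other tokens; it only illustrates what is observed in this output and is not the plugin's actual algorithm. `first_letter_token` is a hypothetical name.

```python
from typing import Dict, List


def first_letter_token(text: str, tokens: List[Dict]) -> str:
    """Rebuild the abbreviation token from per-character pinyin tokens.

    Non-ASCII characters contribute the first letter of the pinyin token
    that starts at their offset; ASCII characters are kept unchanged.
    """
    # Map each single-character span to its token, keyed by start offset.
    by_offset = {
        t["start_offset"]: t["token"]
        for t in tokens
        if t["end_offset"] - t["start_offset"] == 1
    }
    return "".join(
        ch if ch.isascii() else by_offset[i][0]
        for i, ch in enumerate(text)
    )
```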

Combining IK and pinyin in one analyzer

Create the index and the type

PUT http://localhost:9200/demo

{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_pinyin_analyzer": { // analyzer name, user-defined
          "type": "custom", // "custom" means the analyzer is assembled by the user
          "tokenizer": "ik_max_word", // tokenization strategy
          "filter": ["my_pinyin", "word_delimiter"] // token filters: pinyin conversion and word splitting
        }
      },
      "filter": {
        "my_pinyin": {
          "type": "pinyin",
          "first_letter": "prefix",
          "padding_char": " "
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "subject": {
          "type": "keyword",
          "fields": {
            "pinyin": {
              "type": "text",
              "store": false,
              "term_vector": "with_positions_offsets",
              "analyzer": "ik_pinyin_analyzer",
              "boost": 10
            }
          }
        }
      }
    }
  }
}
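For scripted setups, the same index definition can be kept as a Python dictionary and serialized to comment-free JSON. This is a sketch under assumptions: `create_index_request` is a hypothetical helper, the base URL assumes a local node, and `store` is written as a boolean. The analyzer, filter, and field names are the ones used in this post.

```python
import json
import urllib.request

INDEX_BODY = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Custom analyzer: IK tokenizer followed by the pinyin filter.
                "ik_pinyin_analyzer": {
                    "type": "custom",
                    "tokenizer": "ik_max_word",
                    "filter": ["my_pinyin", "word_delimiter"],
                }
            },
            "filter": {
                "my_pinyin": {
                    "type": "pinyin",
                    "first_letter": "prefix",
                    "padding_char": " ",
                }
            },
        }
    },
    "mappings": {
        "article": {
            "properties": {
                "subject": {
                    "type": "keyword",
                    "fields": {
                        # Sub-field analyzed with the custom analyzer.
                        "pinyin": {
                            "type": "text",
                            "store": False,
                            "term_vector": "with_positions_offsets",
                            "analyzer": "ik_pinyin_analyzer",
                            "boost": 10,
                        }
                    },
                }
            }
        }
    },
}


def create_index_request(name: str,
                         base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build the PUT request that creates index `name` with INDEX_BODY."""
    return urllib.request.Request(
        f"{base_url}/{name}",
        data=json.dumps(INDEX_BODY).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Passing `create_index_request("demo")` to `urllib.request.urlopen` sends the request to a running node.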

Index a document

POST http://localhost:9200/demo/article

{
  "subject": "我是一名java程序员"
}

Query in Chinese

POST http://localhost:9200/demo/article/_search

{
  "query": {
    "match": {
      "subject.pinyin": "程序员"
    }
  }
}

Result:

{
  "took": 7,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 14.584841,
    "hits": [
      {
        "_index": "demo",
        "_type": "article",
        "_id": "AWIeeeTJ2JGj7w9eQwEK",
        "_score": 14.584841,
        "_source": { "subject": "我是一名java程序员" }
      }
    ]
  }
}

Query in pinyin

POST http://localhost:9200/demo/article/_search

{
  "query": {
    "match": {
      "subject.pinyin": "chengxuyuan"
    }
  }
}

Result:

{
  "took": 6,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 4.3648314,
    "hits": [
      {
        "_index": "demo",
        "_type": "article",
        "_id": "AWIeeeTJ2JGj7w9eQwEK",
        "_score": 4.3648314,
        "_source": { "subject": "我是一名java程序员" }
      }
    ]
  }
}
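Both searches use the same match query on the subject.pinyin field and differ only in the query text. A minimal Python sketch that builds such a request and reads the hits out of a response follows; the helper names are hypothetical, and the base URL, index, and type are the ones assumed throughout this post.

```python
import json
import urllib.request
from typing import List, Tuple


def match_query(field: str, text: str) -> dict:
    """Build a match query body for one field."""
    return {"query": {"match": {field: text}}}


def search_request(index: str, doc_type: str, body: dict,
                   base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build the POST request for <index>/<doc_type>/_search."""
    return urllib.request.Request(
        f"{base_url}/{index}/{doc_type}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def hit_sources(response: dict) -> List[Tuple[float, dict]]:
    """Return (score, source document) pairs from a search response."""
    return [(h["_score"], h["_source"]) for h in response["hits"]["hits"]]
```

For example, `search_request("demo", "article", match_query("subject.pinyin", "chengxuyuan"))` builds the pinyin query shown above, and `hit_sources` applied to the decoded response lists each matching document with its score.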