Installing the IK Analyzer on Elasticsearch 2.2.1

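Installation

  1. Download version 1.8.1 of the source from the IK repository on GitHub, either as a zip file or with git.
  2. Unzip the file

    elasticsearch-analysis-ik-1.8.1.zip

    by running the following in the download directory:

    unzip elasticsearch-analysis-ik-1.8.1.zip -d ik

  3. Change into the ik directory:

    cd ik

  4. Build and package with Maven (Maven must already be installed):

    mvn package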

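  5. When packaging finishes, the following file appears under target/releases:

    elasticsearch-analysis-ik-1.8.1.zip

  6. Unzip this archive and copy its contents into the

    ES_HOME/plugins/ik

    directory on every Elasticsearch node.
  7. Restart every node.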

Note: to install a different version, see https://github.com/medcl/elasticsearch-analysis-ik and pick the matching version from the branch list before downloading.
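
Before restarting the nodes, it can help to confirm that step 6 left the plugin directory in the expected shape. The following is a minimal sketch, not part of the IK project: `check_plugin_dir` is a hypothetical helper, the jar file name is assumed to follow the zip name used above, and the `ES_HOME` fallback path is only a placeholder.

```python
import os


def check_plugin_dir(plugin_dir, version="1.8.1"):
    """Return the expected entries that are missing from plugin_dir.

    Elasticsearch 2.x refuses to load a plugin that has no
    plugin-descriptor.properties file. The jar name is an assumption
    based on the zip name used in this post.
    """
    expected = [
        "plugin-descriptor.properties",
        "elasticsearch-analysis-ik-%s.jar" % version,
    ]
    if not os.path.isdir(plugin_dir):
        # Nothing was copied yet, so everything is missing.
        return expected
    present = set(os.listdir(plugin_dir))
    return [name for name in expected if name not in present]


if __name__ == "__main__":
    # Placeholder default; point ES_HOME at your own installation.
    es_home = os.environ.get("ES_HOME", "/usr/share/elasticsearch")
    missing = check_plugin_dir(os.path.join(es_home, "plugins", "ik"))
    print("missing entries:", missing)
```

An empty list means the files this sketch looks for are in place; it does not prove the plugin will load, which only the node log after a restart can confirm.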

Testing

Create the index

curl -XPUT http://host:9200/iktest

Configure the mapping

curl -XPOST http://host:9200/iktest/fulltext/_mapping -d'
{
    "fulltext": {
        "_all": {
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_max_word",
            "term_vector": "no",
            "store": "false"
        },
        "properties": {
            "content": {
                "type": "string",
                "store": "no",
                "term_vector": "with_positions_offsets",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word",
                "include_in_all": "true",
                "boost": 
            }
        }
    }
}'
           

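ik_max_word: splits the text at the finest granularity and exhausts every possible combination. For example, “中華人民共和國國歌” is split into “中華人民共和國,中華人民,中華,華人,人民共和國,人民,人,民,共和國,共和,和,國國,國歌”.

ik_smart: splits the text at the coarsest granularity. For example, “中華人民共和國國歌” is split into “中華人民共和國,國歌”.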
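
To make the difference between the two granularities concrete, here is a toy sketch. It is not IK's actual algorithm: IK uses a large dictionary plus its own ambiguity handling, whereas this example uses a tiny hand-written dictionary (`TOY_DICT`, an assumption for illustration), exhaustive dictionary lookup for the fine-grained mode, and greedy longest match for the coarse mode.

```python
# Tiny illustrative dictionary; the real IK dictionary is far larger.
TOY_DICT = {
    "中華人民共和國", "中華人民", "中華", "華人",
    "人民共和國", "人民", "共和國", "共和", "國歌",
}


def fine_grained(text, dictionary):
    """Emit every dictionary word found at any position (ik_max_word-like)."""
    tokens = []
    for start in range(len(text)):
        # Try the longest candidate first so longer words are listed first.
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
    return tokens


def coarse_grained(text, dictionary):
    """Greedy longest match from left to right (a rough ik_smart stand-in)."""
    tokens, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            if text[pos:end] in dictionary:
                tokens.append(text[pos:end])
                pos = end
                break
        else:
            # No dictionary word starts here: emit the single character.
            tokens.append(text[pos])
            pos += 1
    return tokens


if __name__ == "__main__":
    sample = "中華人民共和國國歌"
    print(fine_grained(sample, TOY_DICT))
    print(coarse_grained(sample, TOY_DICT))
```

With this toy dictionary the coarse mode yields two tokens for the sample string, while the fine-grained mode also lists the overlapping shorter words. Single characters such as “人” and “民” appear in IK's real output but not here, because the toy dictionary does not contain them.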

Index documents

curl -XPOST http://host:9200/iktest/fulltext/1 -d'
{"content":"美國留給伊拉克的是個爛攤子嗎"}
'
           
curl -XPOST http://host:9200/iktest/fulltext/2 -d'
{"content":"公安部:各地校車将享最高路權"}
'
           
curl -XPOST http://host:9200/iktest/fulltext/3 -d'
{"content":"中韓漁警沖突調查:韓警平均每天扣1艘中國漁船"}
'

curl -XPOST http://host:9200/iktest/fulltext/4 -d'
{"content":"中國駐洛杉矶領事館遭亞裔男子槍擊 嫌犯已自首"}
'
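
The four requests above could also be sent in one round trip through the `_bulk` endpoint, which expects newline-delimited JSON: one action line followed by one source line per document, with a trailing newline. The sketch below only builds that request body; `build_bulk_body` is a hypothetical helper, and actually posting the body to `http://host:9200/_bulk` is left out because it needs a running cluster.

```python
import json

# The same four test documents as in the curl commands above.
DOCS = {
    "1": "美國留給伊拉克的是個爛攤子嗎",
    "2": "公安部:各地校車将享最高路權",
    "3": "中韓漁警沖突調查:韓警平均每天扣1艘中國漁船",
    "4": "中國駐洛杉矶領事館遭亞裔男子槍擊 嫌犯已自首",
}


def build_bulk_body(index, doc_type, docs):
    """Build a newline-delimited _bulk body that indexes every document."""
    lines = []
    for doc_id, content in docs.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # ensure_ascii=False keeps the Chinese text readable in the payload.
        lines.append(json.dumps({"content": content}, ensure_ascii=False))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_bulk_body("iktest", "fulltext", DOCS))
```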
           

Query

curl -XPOST http://host:9200/iktest/fulltext/_search -d'
{
    "query" : { "term" : { "content" : "中國" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}
'
           

The result is:

{
  "took": ,
  "timed_out": false,
  "_shards": {
    "total": ,
    "successful": ,
    "failed": 
  },
  "hits": {
    "total": ,
    "max_score": ,
    "hits": [
      {
        "_index": "iktest",
        "_type": "fulltext",
        "_id": "4",
        "_score": ,
        "_source": {
          "content": "中國駐洛杉矶領事館遭亞裔男子槍擊 嫌犯已自首"
        },
        "highlight": {
          "content": [
            "<tag1>中國</tag1>駐洛杉矶領事館遭亞裔男子槍擊 嫌犯已自首"
          ]
        }
      },
      {
        "_index": "iktest",
        "_type": "fulltext",
        "_id": "3",
        "_score": ,
        "_source": {
          "content": "中韓漁警沖突調查:韓警平均每天扣1艘中國漁船"
        },
        "highlight": {
          "content": [
            "中韓漁警沖突調查:韓警平均每天扣1艘<tag1>中國</tag1>漁船"
          ]
        }
      }
    ]
  }
}
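
When the response is consumed by a program rather than read by eye, the highlighted fragments sit under each hit's `highlight` key. Here is a minimal sketch of pulling them out. `extract_highlights` is a hypothetical helper, and `SAMPLE_RESPONSE` is a trimmed stand-in modelled on the result above, with the numeric fields left out.

```python
def extract_highlights(response, field="content"):
    """Return (document id, highlighted fragments) pairs from a search response."""
    results = []
    for hit in response.get("hits", {}).get("hits", []):
        # A hit has no "highlight" key when nothing matched in that field.
        fragments = hit.get("highlight", {}).get(field, [])
        results.append((hit["_id"], fragments))
    return results


# Trimmed sample shaped like the response shown above.
SAMPLE_RESPONSE = {
    "hits": {
        "hits": [
            {
                "_id": "4",
                "highlight": {
                    "content": ["<tag1>中國</tag1>駐洛杉矶領事館遭亞裔男子槍擊 嫌犯已自首"]
                },
            },
            {
                "_id": "3",
                "highlight": {
                    "content": ["中韓漁警沖突調查:韓警平均每天扣1艘<tag1>中國</tag1>漁船"]
                },
            },
        ]
    }
}

if __name__ == "__main__":
    for doc_id, fragments in extract_highlights(SAMPLE_RESPONSE):
        print(doc_id, fragments)
```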
           

Inspecting the tokenization result

curl 'http://host:9200/index/_analyze?analyzer=ik&pretty=true' -d '
{
  "text": "别說話,我想靜靜"
}'
           

Result:

{
  "tokens": [
    {
      "token": "别說",
      "start_offset": ,
      "end_offset": ,
      "type": "CN_WORD",
      "position": 
    },
    {
      "token": "說話",
      "start_offset": ,
      "end_offset": ,
      "type": "CN_WORD",
      "position": 
    },
    {
      "token": "我",
      "start_offset": ,
      "end_offset": ,
      "type": "CN_CHAR",
      "position": 
    },
    {
      "token": "想",
      "start_offset": ,
      "end_offset": ,
      "type": "CN_CHAR",
      "position": 
    },
    {
      "token": "靜靜",
      "start_offset": ,
      "end_offset": ,
      "type": "CN_WORD",
      "position": 
    },
    {
      "token": "靜",
      "start_offset": ,
      "end_offset": ,
      "type": "CN_WORD",
      "position": 
    },
    {
      "token": "靜",
      "start_offset": ,
      "end_offset": ,
      "type": "CN_WORD",
      "position": 
    }
  ]
}
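
A quick way to read an `_analyze` response like the one above is to group the tokens by their `type`, which separates dictionary words (`CN_WORD`) from single characters (`CN_CHAR`). The sketch below assumes the response has already been parsed into a dict. `tokens_by_type` is a hypothetical helper, and `SAMPLE_TOKENS` repeats the tokens shown above without their offsets and positions.

```python
from collections import defaultdict


def tokens_by_type(analyze_response):
    """Group token strings from an _analyze response by their type."""
    grouped = defaultdict(list)
    for tok in analyze_response.get("tokens", []):
        grouped[tok["type"]].append(tok["token"])
    return dict(grouped)


# The tokens from the result above; offsets and positions are omitted.
SAMPLE_TOKENS = {
    "tokens": [
        {"token": "别說", "type": "CN_WORD"},
        {"token": "說話", "type": "CN_WORD"},
        {"token": "我", "type": "CN_CHAR"},
        {"token": "想", "type": "CN_CHAR"},
        {"token": "靜靜", "type": "CN_WORD"},
        {"token": "靜", "type": "CN_WORD"},
        {"token": "靜", "type": "CN_WORD"},
    ]
}

if __name__ == "__main__":
    for token_type, tokens in tokens_by_type(SAMPLE_TOKENS).items():
        print(token_type, tokens)
```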
           
