[General Industry Development Department] Integrating the IK Analyzer with Elasticsearch

IK Analyzer

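NOTE: By default ES tokenizes text with the standard analyzer, which is a poor fit for Chinese-language sites. ES therefore needs to be configured with a Chinese-friendly analyzer to get better search results.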

Installing IK Online

Installing IK online (online installation is supported from v5.5.1 onward)

# 1. Run the following command from the ES installation directory
[es@linux elasticsearch-6.2.4]$ ./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.4/elasticsearch-analysis-ik-6.2.4.zip
-> Downloading https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.4/elasticsearch-analysis-ik-6.2.4.zip
[=================================================] 100%
-> Installed analysis-ik
[es@linux elasticsearch-6.2.4]$ ls plugins/analysis-ik
[es@linux elasticsearch-6.2.4]$ cd plugins/analysis-ik/
[es@linux analysis-ik]$ ls
commons-codec-1.9.jar    elasticsearch-analysis-ik-6.2.4.jar  httpcore-4.4.4.jar
commons-logging-1.2.jar  httpclient-4.5.2.jar                 plugin-descriptor.properties
# 2. Restart ES for the plugin to take effect
# 3. Test that IK was installed successfully
GET /_analyze
{
  "text": "中華人民共和國國歌",
  "analyzer": "ik_smart"
}
# 4. Location of the IK configuration file after an online install
- config/analysis-ik/IKAnalyzer.cfg.xml under the ES installation directory

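NOTE: The plugin version must exactly match the Elasticsearch version in use. To install for a different version, replace 6.2.4 with the version number you are running.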
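
Because the plugin and Elasticsearch versions must match exactly, it can help to derive the download URL from a single version string instead of typing the version twice. The following is a minimal sketch, assuming the release URL pattern seen in the install command above also holds for other versions. `ik_plugin_url` is a hypothetical helper name, not part of Elasticsearch or the IK plugin.

```python
import re

# URL pattern copied from the install command above; assumed to hold for other releases.
IK_RELEASE_URL = (
    "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/"
    "v{version}/elasticsearch-analysis-ik-{version}.zip"
)


def ik_plugin_url(es_version: str) -> str:
    """Return the IK plugin download URL for the given Elasticsearch version."""
    # Accept only plain x.y.z versions so a typo fails here rather than at download time.
    if not re.fullmatch(r"\d+\.\d+\.\d+", es_version):
        raise ValueError(f"expected a version like 6.2.4, got {es_version!r}")
    return IK_RELEASE_URL.format(version=es_version)
```

Calling `ik_plugin_url("6.2.4")` yields the URL that is passed to `elasticsearch-plugin install` in step 1.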

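Installing IK Locally

Alternatively, download the matching IK analyzer release to the local machine first and install it from there.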

# 1. Download the matching version
- [es@linux ~]$ wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.4/elasticsearch-analysis-ik-6.2.4.zip
# 2. Unzip
- [es@linux ~]$ unzip elasticsearch-analysis-ik-6.2.4.zip   # first install unzip: yum install -y unzip
# 3. Move the extracted directory into the plugins directory of the ES installation
- [es@linux ~]$ ls elasticsearch-6.2.4/plugins/
  [es@linux ~]$ mv elasticsearch elasticsearch-6.2.4/plugins/
  [es@linux ~]$ ls elasticsearch-6.2.4/plugins/elasticsearch
  [es@linux ~]$ ls elasticsearch-6.2.4/plugins/elasticsearch/
        commons-codec-1.9.jar    config                               httpclient-4.5.2.jar          plugin-descriptor.properties
        commons-logging-1.2.jar  elasticsearch-analysis-ik-6.2.4.jar  httpcore-4.4.4.jar
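# 4. Restart ES for the plugin to take effect
# 5. Location of the IK configuration file after a local install
- plugins/analysis-ik/config/IKAnalyzer.cfg.xml under the ES installation directory (the mv in step 3 leaves the plugin directory named elasticsearch, in which case the path is plugins/elasticsearch/config/IKAnalyzer.cfg.xml)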
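
After a manual install, the version requirement can be checked before restarting ES. The sketch below reads the `plugin-descriptor.properties` file that appears in the directory listing above and compares its `elasticsearch.version` entry with the running version. It assumes the descriptor uses plain `key=value` lines. `read_properties` and `plugin_matches_es` are hypothetical helper names.

```python
from pathlib import Path


def read_properties(path: Path) -> dict:
    """Parse a simple Java .properties file (no escapes or line continuations)."""
    props = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def plugin_matches_es(plugin_dir: str, es_version: str) -> bool:
    """True if the plugin descriptor declares exactly the given Elasticsearch version."""
    props = read_properties(Path(plugin_dir) / "plugin-descriptor.properties")
    return props.get("elasticsearch.version") == es_version
```

For the layout shown here, a call might look like `plugin_matches_es("elasticsearch-6.2.4/plugins/elasticsearch", "6.2.4")`.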

Testing the IK Analyzer

NOTE: The IK plugin provides two analyzers that can be set in a mapping to tokenize documents: ik_max_word and ik_smart.

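What is the difference between ik_max_word and ik_smart?

ik_max_word: splits the text at the finest granularity. For example, it splits “中華人民共和國國歌” into “中華人民共和國, 中華人民, 中華, 華人, 人民共和國, 人民, 人, 民, 共和國, 共和, 和, 國國, 國歌”, exhausting every possible combination.

ik_smart: splits the text at the coarsest granularity. For example, it splits “中華人民共和國國歌” into “中華人民共和國, 國歌”.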
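
The difference between the two granularities can be illustrated with a toy dictionary-based segmenter. This is not IK's actual algorithm or dictionary. It is a sketch under the assumption of a tiny hand-written word list, using exhaustive dictionary lookup for the fine-grained case and forward maximum matching for the coarse-grained case. All names in it are hypothetical.

```python
# Toy dictionary for illustration only; IK ships its own, much larger dictionaries.
DICTIONARY = {
    "中華人民共和國", "中華人民", "中華", "華人",
    "人民共和國", "人民", "共和國", "共和", "國歌",
}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)


def finest_split(text: str) -> list:
    """Emit every dictionary word found at every position, longest first per position."""
    tokens = []
    for start in range(len(text)):
        for length in range(min(MAX_WORD_LEN, len(text) - start), 0, -1):
            word = text[start:start + length]
            if word in DICTIONARY:
                tokens.append(word)
    return tokens


def coarsest_split(text: str) -> list:
    """Forward maximum matching: take the longest dictionary word, then move past it."""
    tokens, pos = [], 0
    while pos < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - pos), 0, -1):
            word = text[pos:pos + length]
            # Fall back to a single character when no dictionary word starts here.
            if word in DICTIONARY or length == 1:
                tokens.append(word)
                pos += length
                break
    return tokens
```

With this toy dictionary, `coarsest_split("中華人民共和國國歌")` returns `["中華人民共和國", "國歌"]`, while `finest_split` returns the nine overlapping dictionary words. The real ik_max_word output listed above contains additional tokens because IK's dictionary is larger.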

Test Data

DELETE /ems

PUT /ems
{
  "mappings":{
    "emp":{
      "properties":{
        "name":{
          "type":"text",
           "analyzer": "ik_max_word"
        },
        "age":{
          "type":"integer"
        },
        "bir":{
          "type":"date"
        },
        "content":{
          "type":"text",
          "analyzer": "ik_max_word"
        },
        "address":{
          "type":"keyword"
        }
      }
    }
  }
}

PUT /ems/emp/_bulk
  {"index":{}}
  {"name":"小黑","age":23,"bir":"2012-12-12","content":"為開發團隊選擇一款優秀的MVC架構是件難事兒，在衆多可行的方案中決擇需要很高的經驗和水準","address":"北京"}
  {"index":{}}
  {"name":"王小黑","age":24,"bir":"2012-12-12","content":"Spring 架構是一個分層架構，由 7 個定義良好的子產品組成。Spring 子產品建構在核心容器之上，核心容器定義了建立、配置和管理 bean 的方式","address":"上海"}
  {"index":{}}
  {"name":"張小五","age":8,"bir":"2012-12-12","content":"Spring Cloud 作為Java 語言的微服務架構，它依賴于Spring Boot，有快速開發、持續傳遞和容易部署等特點。Spring Cloud 的元件非常多，涉及微服務的方方面面，井在開源社群Spring 和Netflix 、Pivotal 兩大公司的推動下越來越完善","address":"無錫"}
  {"index":{}}
  {"name":"win7","age":9,"bir":"2012-12-12","content":"Spring的目标是緻力于全方位的簡化Java開發。 這勢必引出更多的解釋， Spring是如何簡化Java開發的？","address":"南京"}
  {"index":{}}
  {"name":"梅超風","age":43,"bir":"2012-12-12","content":"Redis是一個開源的使用ANSI C語言編寫、支援網絡、可基于記憶體亦可持久化的日志型、Key-Value資料庫，并提供多種語言的API","address":"杭州"}
  {"index":{}}
  {"name":"張無忌","age":59,"bir":"2012-12-12","content":"ElasticSearch是一個基于Lucene的搜尋伺服器。它提供了一個分布式多使用者能力的全文搜尋引擎，基于RESTful web接口","address":"北京"}


GET /ems/emp/_search
{
  "query":{
    "term":{
      "content":"架構"
    }
  },
  "highlight": {
    "pre_tags": ["<span style='color:red'>"],
    "post_tags": ["</span>"],
    "fields": {
      "*":{}
    }
  }
}
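
The _bulk request above alternates an action line with a document line, one JSON object per line. When the test documents come from a program rather than being pasted into a console, the body can be generated. The following is a minimal sketch. `bulk_index_body` is a hypothetical helper, and sending the body to ES is deliberately left out.

```python
import json


def bulk_index_body(docs: list) -> str:
    """Build an NDJSON _bulk body that indexes each document with an auto-generated ID."""
    lines = []
    for doc in docs:
        # Action line: an empty "index" action lets ES assign the document ID.
        lines.append(json.dumps({"index": {}}))
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The returned string can be used as the request body of `PUT /ems/emp/_bulk`.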

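Configuring Extension Words

IK supports a custom extension dictionary and a custom stop-word dictionary. The extension dictionary is for words that the analyzer does not recognize as terms on its own but that you still want ES to use as search keywords; add those words to the extension dictionary. The stop-word dictionary is for words that are recognized as terms but that, for business reasons, should not be searchable; add those words to the stop-word dictionary.

To define the extension and stop-word dictionaries, edit the IKAnalyzer.cfg.xml file in the config directory of the IK analyzer.

NOTE: Dictionary files must be UTF-8 encoded, otherwise they will not take effect.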

1. Edit the file: vim IKAnalyzer.cfg.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
    <properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- Configure your own extension dictionary here -->
        <entry key="ext_dict">ext_dict.dic</entry>
        <!-- Configure your own extension stop-word dictionary here -->
        <entry key="ext_stopwords">ext_stopword.dic</entry>
    </properties>

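2. Create ext_dict.dic in the config directory of the IK analyzer; the file must be UTF-8 encoded to take effect
    vim ext_dict.dic, then add the extension words

3. Create ext_stopword.dic in the config directory of the IK analyzer
    vim ext_stopword.dic, then add the stop words

4. Restart ES for the changes to take effect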
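
Since a dictionary in the wrong encoding does not take effect, the dictionary files can also be written and checked by a script instead of by hand. The sketch below assumes the one-word-per-line layout commonly used for IK dictionary files. `write_dictionary` and `is_valid_utf8` are hypothetical helper names.

```python
from pathlib import Path


def write_dictionary(path: str, words: list) -> None:
    """Write one word per line, UTF-8 encoded without a BOM."""
    # Drop blanks and duplicates while keeping the original order.
    unique = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")


def is_valid_utf8(path: str) -> bool:
    """Return True if the file decodes cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A typical use would be `write_dictionary("ext_dict.dic", words)` inside the IK config directory, followed by `is_valid_utf8("ext_dict.dic")` as a sanity check on files that were edited elsewhere.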
