[ELK] Elasticsearch analyzers: introduction and usage

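Built-in analyzers

What is an analyzer

An analyzer is a tool that breaks a piece of input text into logically meaningful terms (tokens).

Common built-in analyzers
- Standard Analyzer - the default analyzer; splits text on word boundaries and lowercases the terms
- Simple Analyzer - splits on any non-letter character (symbols are dropped) and lowercases the terms
- Stop Analyzer - lowercases the terms and removes stop words (the, a, is)
- Whitespace Analyzer - splits on whitespace; does not lowercase
- Pattern Analyzer - splits using a regular expression, \W+ by default (that is, on runs of non-word characters)
- Language - analyzers for more than 30 common languages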
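As a rough illustration of the rules listed above, the following Python sketch approximates the Stop and Pattern analyzers with regular expressions. It is not how Elasticsearch implements them. The function names are hypothetical, and the stop-word set contains only the three words mentioned above rather than Elasticsearch's full English list.

```python
import re

# Illustrative stop words only (the three named in the list above);
# Elasticsearch's built-in English stop list is longer.
STOP_WORDS = {"the", "a", "is"}

def stop_analyze(text, stop_words=STOP_WORDS):
    """Approximate the Stop analyzer: split on non-letters, lowercase, drop stop words."""
    terms = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    return [t for t in terms if t not in stop_words]

def pattern_analyze(text, pattern=r"\W+"):
    """Approximate the Pattern analyzer: split on the regex, lowercase, drop empty strings."""
    return [t.lower() for t in re.split(pattern, text) if t]

print(stop_analyze("The quick brown fox is a fox."))  # ['quick', 'brown', 'fox', 'fox']
print(pattern_analyze("The quick brown-fox."))        # ['the', 'quick', 'brown', 'fox']
```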

Standard Analyzer

The standard analyzer is the default: it is used whenever no analyzer is specified.

POST /_analyze
{
  "analyzer": "standard",
  "text":"The quick brown fox."
}
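The same request can also be sent from a script. Below is a minimal sketch using only the Python standard library. It assumes an unsecured node at http://localhost:9200 (an assumption, not something stated in this post), and the helper names are hypothetical.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed address of a local, unsecured node

def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build a POST /_analyze request equivalent to the console example."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def analyze(analyzer, text, base_url=ES_URL):
    """Send the request and return the token strings from the response."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]

if __name__ == "__main__":
    # Only builds and prints the request, so it runs without a node.
    request = build_analyze_request("standard", "The quick brown fox.")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

With a node running, calling analyze("standard", "The quick brown fox.") would return the token list produced by the analyzer.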

Simple Analyzer

Splits on non-letter characters (symbols are dropped) and lowercases the terms.

POST /_analyze
{
  "analyzer": "simple",
  "text":"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
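To see what this rule does to the sample sentence without a running cluster, here is a small Python approximation of the Simple analyzer. It is a sketch of the behaviour described above, not Elasticsearch's implementation, and simple_analyze is a hypothetical name.

```python
import re

def simple_analyze(text):
    """Approximate the Simple analyzer: keep runs of letters, lowercase them."""
    # [^\W\d_]+ matches one or more letters (word characters minus digits and underscore).
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

print(simple_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
# ['the', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

Note that the digit 2 disappears and dog's becomes two terms, dog and s.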

Whitespace Analyzer

Splits on whitespace and does not lowercase.

POST /_analyze
{
  "analyzer": "whitespace",
  "text":"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
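For comparison, the Whitespace analyzer's rule is simple enough to approximate in one line of Python. This is only a sketch of the behaviour described above, with a hypothetical function name.

```python
def whitespace_analyze(text):
    """Approximate the Whitespace analyzer: split on whitespace, keep case and punctuation."""
    return text.split()

tokens = whitespace_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")
print(tokens)
# ['The', '2', 'QUICK', 'Brown-Foxes', 'jumped', 'over', 'the', 'lazy', "dog's", 'bone.']
```

Case is preserved and the trailing period stays attached to bone.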

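Specifying an analyzer for a field

Without an explicit mapping, the title field is analyzed with the standard analyzer, which keeps dog's as a single term, so the search for dog below returns no hits.

PUT /my-index-000001/_doc/1
{
  "title": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

GET /my-index-000001/_search
{
  "query": {
    "match": {
      "title": "dog"
    }
  }
}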

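Testing search

The request above created my-index-000001 with a dynamic mapping, so delete the index before recreating it with an explicit mapping; otherwise the PUT below fails because the index already exists.

DELETE /my-index-000001

PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace",
        "search_analyzer": "simple"
      }
    }
  }
}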

GET /my-index-000001/_mapping

PUT /my-index-000001/_doc/1
{
  "title": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

GET /my-index-000001/_search
{
  "query": {
    "match": {
      "title": "dog's jumped"
    }
  }
}
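Why this query finds the document can be worked out by hand: the field is indexed with the whitespace analyzer, while the query text goes through the simple analyzer, and a match query needs only one query term to be present in the index. The Python sketch below imitates that with approximations of the two analyzers. The function names are hypothetical and this is not Elasticsearch's own code.

```python
import re

def whitespace_analyze(text):
    """Approximate the Whitespace analyzer."""
    return text.split()

def simple_analyze(text):
    """Approximate the Simple analyzer."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

def matching_terms(document, query):
    """Return the query terms that also exist among the indexed terms."""
    indexed = set(whitespace_analyze(document))  # "analyzer": "whitespace"
    searched = simple_analyze(query)             # "search_analyzer": "simple"
    return [t for t in searched if t in indexed]

doc = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(matching_terms(doc, "dog's jumped"))  # ['jumped']
print(matching_terms(doc, "quick"))         # []
```

The query hits only through jumped: the indexed term is dog's, while the query side produces dog and s. A search for quick finds nothing, because the index holds QUICK in upper case.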

IK Chinese analyzer

With the default standard analyzer, Chinese text is split into single characters:

POST /_analyze
{
  "analyzer": "standard",
  "text":"中華人民共和國國歌"
}
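The standard analyzer has no Chinese dictionary, so it emits each Han character as its own token. The tiny Python sketch below imitates that result for this sample text only; it is not a general implementation of the analyzer, and the function name is hypothetical.

```python
def per_character_tokens(text):
    """Imitate how the standard analyzer treats Han text: one token per character."""
    return [ch for ch in text if not ch.isspace()]

tokens = per_character_tokens("中華人民共和國國歌")
print(tokens)
print(len(tokens))  # 9
```

Nine single-character terms are of little use for search, which is the motivation for a dictionary-based analyzer such as IK.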

IK analyzer
- Download: https://github.com/medcl/elasticsearch-analysis-ik (the plugin version must match the Elasticsearch version, 7.8.0 here)
- Unzip into plugins/ik
  - chmod 777 elasticsearch-analysis-ik-7.8.0.zip
  - cd /usr/local/elk/plugins
  - mkdir ik
  - cd ik
  - cp /opt/soft/elasticsearch-analysis-ik-7.8.0.zip .
  - unzip elasticsearch-analysis-ik-7.8.0.zip
  - rm elasticsearch-analysis-ik-7.8.0.zip
- Restart Elasticsearch
  - kill -15 5691   # 5691 is the PID of the running Elasticsearch process; use your own
  - bin/elasticsearch -d -p fx.pid

Test

POST /_analyze
{
  "analyzer": "ik_max_word",
  "text":"中華人民共和國國歌"
}

POST /_analyze
{
  "analyzer": "ik_smart",
  "text":"中華人民共和國國歌"
}

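What is the difference between ik_max_word and ik_smart
- ik_max_word: splits the text at the finest granularity. For example, “中華人民共和國國歌” is split into “中華人民共和國,中華人民,中華,華人,人民共和國,人民,人,民,共和國,共和,和,國國,國歌”, exhausting every possible combination.
- ik_smart: splits the text at the coarsest granularity. For example, “中華人民共和國國歌” is split into “中華人民共和國,國歌”.

Viewing the dictionary
```
head config/main.dic
```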
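To make the difference between the two modes, and the role of the dictionary, more concrete, here is a toy Python sketch. It is not IK's actual algorithm: the fine-grained function simply lists every dictionary word found in the text, and the coarse-grained one uses greedy forward maximum matching. The small dictionary and both function names are assumptions made for illustration, so the fine-grained output is shorter than the real ik_max_word result quoted above.

```python
# Toy dictionary (assumed for illustration; the real main.dic is far larger).
DICTIONARY = {
    "中華人民共和國", "中華人民", "中華", "華人",
    "人民共和國", "人民", "共和國", "共和", "國歌",
}

def fine_grained(text, dictionary=DICTIONARY):
    """List every dictionary word in the text, by start position, longest first."""
    tokens = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
    return tokens

def coarse_grained(text, dictionary=DICTIONARY):
    """Greedy forward maximum matching: take the longest word at each position."""
    tokens = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            if text[pos:end] in dictionary:
                break
        else:
            end = pos + 1  # no dictionary word starts here: emit a single character
        tokens.append(text[pos:end])
        pos = end
    return tokens

print(fine_grained("中華人民共和國國歌"))
# ['中華人民共和國', '中華人民', '中華', '華人', '人民共和國', '人民', '共和國', '共和', '國歌']
print(coarse_grained("中華人民共和國國歌"))
# ['中華人民共和國', '國歌']
```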

Custom dictionary

Edit the configuration file config/IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- Configure your own extension dictionary here -->
        <entry key="ext_dict">fx.dic</entry>
        <!-- Configure your own extension stop-word dictionary here -->
        <entry key="ext_stopwords"></entry>
        <!-- Configure a remote extension dictionary here -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!-- Configure a remote extension stop-word dictionary here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

cat fx.dic
網紅
社畜

Elasticsearch must be restarted for the change to take effect.

POST /my-index-000001/_analyze
{
  "analyzer": "ik_max_word",
  "text":"社畜"
}

POST /my-index-000001/_analyze
{
  "analyzer": "ik_smart",
  "text":"網紅"
}
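The effect of an extension dictionary can be sketched in Python with a toy forward-maximum-matching segmenter. This is an illustration under assumptions, not IK's implementation: the function names are hypothetical, and the empty base set stands in for a main dictionary that is assumed not to contain either word.

```python
def load_dictionary(lines):
    """Parse dictionary text in the one-word-per-line layout of fx.dic."""
    return {line.strip() for line in lines if line.strip()}

def segment(text, dictionary):
    """Forward maximum matching; unknown characters become single-character tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        # Longest dictionary word starting at pos, or one character if none matches.
        end = next(
            (e for e in range(len(text), pos, -1) if text[pos:e] in dictionary),
            pos + 1,
        )
        tokens.append(text[pos:end])
        pos = end
    return tokens

base = set()  # stand-in for main.dic, assumed not to contain either word
extended = base | load_dictionary("網紅\n社畜\n".splitlines())

print(segment("社畜", base))      # ['社', '畜']
print(segment("社畜", extended))  # ['社畜']
print(segment("網紅", extended))  # ['網紅']
```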

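The IK analyzer supports hot updates of its dictionary, but the feature is not stable. It can be achieved by modifying the source code, which is worth exploring if you are interested.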
