Elasticsearch Analyzers and Integration with Spring Boot

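Table of contents

- 1 Analysis and analyzers
  - 1.1 Built-in analyzers
  - 1.2 Built-in analyzer examples
  - 1.3 Chinese word segmentation
    - 1.3.1 IK analyzer
    - 1.3.2 HanLP
    - 1.3.3 pinyin analyzer
  - 1.4 Chinese segmentation demo
  - 1.5 Using analyzers in practice
    - 1.5.1 Setting the mapping
    - 1.5.2 Inserting data
    - 1.5.3 Querying
  - 1.6 The pinyin analyzer
    - 1.6.1 Configuring settings
    - 1.6.2 Setting the mapping
    - 1.6.3 Inserting data
    - 1.6.4 Querying
  - 1.7 A custom mixed Chinese/pinyin analyzer
    - 1.7.1 Configuring settings
    - 1.7.2 Setting the mappings
    - 1.7.3 Adding data
    - 1.7.4 Querying
- 2 Integrating Spring Boot with Elasticsearch
  - 2.1 Adding the dependency
  - 2.2 Obtaining an ElasticsearchTemplate
  - 2.3 Defining the Movie entity class
  - 2.4 Querying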

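1 Analysis and analyzers

Analysis is a concept rather than a component: text analysis is the process of converting full text into a series of terms, also called tokenization. Analysis is carried out by an analyzer. You can use the analyzers built into Elasticsearch or define your own. Terms are transformed when data is written, and at query time the query string must be analyzed with the same analyzer.

An analyzer consists of three parts: Character Filters, a Tokenizer, and Token Filters. Take the following text as an example:

Hello a World, the world is beautiful

It passes through these stages:

Character Filter: preprocesses the raw text, for example stripping HTML tags.
Tokenizer: splits the text into tokens according to a rule; for English, on whitespace.
Token Filter: removes stop words (a, an, the, is, are, and so on) and then lowercases the tokens.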
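The three stages can be sketched in plain Java. This is only a toy model, not Elasticsearch code: the class name, the method names, and the stop-word list are assumptions made for illustration, and the sketch lowercases before the stop-word check so that the comparison is case-insensitive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of the three analysis stages; not Elasticsearch code.
class AnalyzerPipelineSketch {

    // Hypothetical stop-word list taken from the examples in the text.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "is", "are");

    // Character filter: strip anything that looks like an HTML tag.
    static String charFilter(String text) {
        return text.replaceAll("<[^>]*>", "");
    }

    // Tokenizer: split on whitespace and drop leading/trailing punctuation.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Token filter: lowercase each token, then drop stop words.
    static List<String> tokenFilter(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                result.add(lower);
            }
        }
        return result;
    }

    // Run the three stages in order.
    static List<String> analyze(String text) {
        return tokenFilter(tokenize(charFilter(text)));
    }

    public static void main(String[] args) {
        // prints [hello, world, world, beautiful]
        System.out.println(analyze("<p>Hello a World, the world is beautiful</p>"));
    }
}
```

The same chain has to run on the query string, which is why a search for World can match a document that was indexed as world.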

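1.1 Built-in analyzers

Analyzer	Processing
Standard Analyzer	The default analyzer; splits on word boundaries and lowercases
Simple Analyzer	Splits on non-letter characters (symbols are discarded) and lowercases
Stop Analyzer	Lowercases and removes stop words (the, a, this)
Whitespace Analyzer	Splits on whitespace; does not lowercase
Keyword Analyzer	No tokenization; the whole input is emitted as a single token
Pattern Analyzer	Splits with a regular expression, \W+ by default (runs of non-word characters)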

1.2 Built-in analyzer examples

For example:

A. Standard Analyzer

GET _analyze
{
  "analyzer": "standard",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}

B. Simple Analyzer

GET _analyze
{
  "analyzer": "simple",
  "text": "2 Running quick brown-foxes leap over lazy dog in the summer evening"
}
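
The splitting rules of some built-in analyzers can be approximated with plain regular expressions. The sketch below is not Elasticsearch code; the class and method names are made up for illustration, and the standard analyzer is left out because its Unicode word-boundary rules do not reduce to a single regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Regex approximations of three built-in analyzers; not Elasticsearch code.
class BuiltInAnalyzerSketch {

    // Split on the given regex, optionally lowercase, and skip empty tokens.
    static List<String> split(String text, String regex, boolean lowercase) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split(regex)) {
            if (!token.isEmpty()) {
                tokens.add(lowercase ? token.toLowerCase(Locale.ROOT) : token);
            }
        }
        return tokens;
    }

    // Simple analyzer: split on anything that is not a letter, then lowercase.
    static List<String> simple(String text) {
        return split(text, "[^\\p{L}]+", true);
    }

    // Whitespace analyzer: split on whitespace only and keep the case.
    static List<String> whitespace(String text) {
        return split(text, "\\s+", false);
    }

    // Pattern analyzer with its default \W+ pattern (it also lowercases by default).
    static List<String> pattern(String text) {
        return split(text, "\\W+", true);
    }

    public static void main(String[] args) {
        String text = "2 Running quick brown-foxes";
        System.out.println(simple(text));     // [running, quick, brown, foxes]
        System.out.println(whitespace(text)); // [2, Running, quick, brown-foxes]
        System.out.println(pattern(text));    // [2, running, quick, brown, foxes]
    }
}
```

For the sample sentence, the simple rule drops the leading 2 because it is not a letter, while the whitespace rule keeps brown-foxes as one token.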

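1.3 Chinese word segmentation

Chinese word segmentation is a major difficulty for every search engine. A Chinese sentence has to be split into individual words, yet the same sentence can be read differently depending on context. For example, the following sentence has two possible readings:

這個蘋果，不大好吃 (this apple is not very tasty) / 這個蘋果不大，好吃 (this apple is not big, and it is tasty)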

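1.3.1 IK analyzer

The IK analyzer supports custom dictionaries and hot reloading of its dictionary. Its repository is https://github.com/medcl/elasticsearch-analysis-ik

The plugin can be installed from the Elasticsearch bin directory with a single command (the plugin version should match the Elasticsearch version):

elasticsearch-plugin.bat install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip

Alternatively, install it manually:

1. Download the zip package from https://github.com/medcl/elasticsearch-analysis-ik/releases
2. Create a directory named analysis-ik under the Elasticsearch plugins directory and unzip the downloaded package into it
3. From a command prompt, go to the Elasticsearch bin directory and run elasticsearch-plugin.bat list to confirm that the plugin is listed

The IK plugin provides the following analyzers: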

ik_smart
ik_max_word

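1.3.2 HanLP

Installation steps:

1. Download the zip package from https://pan.baidu.com/s/1mFPNJXgiTPzZeqEjH_zifw#list/path=%2F (password: i0o7)
2. Create a directory named analysis-hanlp under the Elasticsearch plugins directory and unzip the downloaded package into it.
3. Download the dictionary data from https://github.com/hankcs/HanLP/releases
4. Delete the data directory under analysis-hanlp, then unzip the dictionary archive data-for-1.7.5.zip into the analysis-hanlp directory.
5. Copy the two files hanlp.properties and hanlp-remote.xml from the config folder of the directory unpacked in step 2 into an analysis-hanlp folder under the config directory of the Elasticsearch home directory (this analysis-hanlp folder has to be created manually).
6. Copy the six files provided in the hanlp folder of the course materials into $ES_HOME\plugins\analysis-hanlp\data\dictionary\custom.

HanLP provides the following analyzers:

hanlp: the default analyzer
hanlp_standard: standard segmentation
hanlp_index: index segmentation
hanlp_nlp: NLP segmentation
hanlp_n_short: N-shortest-path segmentation
hanlp_dijkstra: shortest-path segmentation
hanlp_speed: high-speed dictionary segmentation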

1.3.3 pinyin analyzer

Installation steps:

1. Download the zip package from https://github.com/medcl/elasticsearch-analysis-pinyin/releases
2. Create a directory named analyzer-pinyin under the Elasticsearch plugins directory and unzip the downloaded package into it.

1.4 Chinese segmentation demo

ik_smart

GET _analyze
{
  "analyzer": "ik_smart",
  "text": ["劍橋分析公司多位高管對卧底記者說，他們確定了唐納德·特朗普在總統大選中獲勝"]
}

hanlp

GET _analyze
{
  "analyzer": "hanlp",
  "text": ["劍橋分析公司多位高管對卧底記者說，他們確定了唐納德·特朗普在總統大選中獲勝"]
}

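1.5 Using analyzers in practice

Many analyzers have been listed above. How are they used in practice?

1.5.1 Setting the mapping

To use an analyzer, first specify which analyzer applies to which field, as shown below: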

PUT customers
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "hanlp_index"
      }
    }
  }
}

1.5.2 Inserting data

POST customers/_bulk
{"index":{}}
{"content":"如不能登入，請在百端登入百度首頁，點選【登入遇到問題】，進行找回密碼操作"}
{"index":{}}
{"content":"網盤用戶端通路隐藏空間需要輸入密碼方可進入。"}
{"index":{}}
{"content":"劍橋的網盤不好用"}

1.5.3 Querying

GET customers/_search
{
  "query": {
    "match": {
      "content": "密碼"
    }
  }
}

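1.6 The pinyin analyzer

When searching, we may want to query by pinyin. The pinyin analyzer was introduced in the section on Chinese word segmentation. How is it used in practice?

1.6.1 Configuring settings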

PUT /medcl 
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                 }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true
                }
            }
        }
    }
}

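As shown above, we defined a custom analyzer named pinyin_analyzer on top of the existing pinyin tokenizer. The available parameters are documented at https://github.com/medcl/elasticsearch-analysis-pinyin

Pinyin analyzer options:

Option	Description
keep_first_letter	When true, the first letters of the pinyin of all characters are joined into one token: 李小璐 -> lxl
keep_full_pinyin	When true, the full pinyin of each character appears in the output: 李小璐 -> li, xiao, lu
keep_none_chinese	When true, non-Chinese text is kept; for example, for java程式員, java appears as a separate token in the output
keep_separate_first_letter	When true, the first letter of each character is emitted as a separate token: 李小璐 -> l, x, l
keep_joined_full_pinyin	When true, the full pinyin of all characters is joined into one token: 李小璐 -> lixiaolu
keep_none_chinese_in_joined_full_pinyin	When true, non-Chinese text is joined together with the pinyin of the Chinese characters
none_chinese_pinyin_tokenize	When true, non-Chinese letters are split into possible pinyin syllables: wvwoxvlu -> w, v, wo, x, v, lu
keep_original	When true, the original input is kept
remove_duplicated_term	When true, duplicate terms are removed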
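
To make the interaction of a few of these options concrete, here is a minimal Java sketch. It is not the plugin's implementation: the class name, the method signature, and the three-entry lookup table (covering only the characters of 劉德華) are assumptions for illustration, and the order of the emitted terms is not meant to match the real plugin. The characters are written as Unicode escapes so that the source stays ASCII-only.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model of keep_full_pinyin, keep_first_letter and keep_original.
class PinyinOptionsSketch {

    // Hypothetical lookup table: U+5289 -> liu, U+5FB7 -> de, U+83EF -> hua.
    // The real plugin ships a full pinyin dictionary.
    static final Map<Character, String> PINYIN = Map.of(
            '\u5289', "liu",
            '\u5fb7', "de",
            '\u83ef', "hua");

    static Set<String> analyze(String text, boolean keepFirstLetter,
                               boolean keepFullPinyin, boolean keepOriginal) {
        // A set drops repeated terms, similar to remove_duplicated_term.
        Set<String> terms = new LinkedHashSet<>();
        StringBuilder firstLetters = new StringBuilder();
        for (char c : text.toCharArray()) {
            String pinyin = PINYIN.get(c);
            if (pinyin == null) {
                continue; // characters outside the toy table are ignored
            }
            if (keepFullPinyin) {
                terms.add(pinyin); // one term per character
            }
            firstLetters.append(pinyin.charAt(0));
        }
        if (keepOriginal) {
            terms.add(text);
        }
        if (keepFirstLetter && firstLetters.length() > 0) {
            terms.add(firstLetters.toString()); // joined initials
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [liu, de, hua, ldh]
        System.out.println(analyze("\u5289\u5fb7\u83ef", true, true, false));
    }
}
```

With the joined initials among the indexed terms, a query for ldh can match the name; with keepFirstLetter set to false in this sketch, that term is never produced.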

1.6.2 Setting the mapping

PUT medcl/_mapping
{
        "properties": {
            "name": {
                "type": "keyword",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "analyzer": "pinyin_analyzer",
                        "boost": 10
                    }
                }
            }
        }
}

1.6.3 Inserting data

POST medcl/_bulk
{"index":{}}
{"name": "劉德華"}
{"index":{}}
{"name": "張學友"}
{"index":{}}
{"name": "四大天王"}
{"index":{}}
{"name": "柳岩"}
{"index":{}}
{"name": "angel baby"}

1.6.4 Querying

GET medcl/_search
{
  "query": {
    "match": {
      "name.pinyin": "ldh"
    }
  }
}

1.7 A custom mixed Chinese/pinyin analyzer

1.7.1 Configuring settings

PUT goods
{
  "settings": {
    "analysis": {
      "analyzer": {
        "hanlp_standard_pinyin":{
          "type": "custom",
          "tokenizer": "hanlp_standard",
          "filter": ["my_pinyin"]
        }
      },
      "filter": {
        "my_pinyin": {
          "type" : "pinyin",
          "keep_separate_first_letter" : false,
          "keep_full_pinyin" : true,
          "keep_original" : true,
          "limit_first_letter_length" : 16,
          "lowercase" : true,
          "remove_duplicated_term" : true
        }
      }
    }
  }
}

1.7.2 Setting the mappings

PUT goods/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "hanlp_standard_pinyin"
    }
  }
}

1.7.3 Adding data

POST goods/_bulk
{"index":{}}
{"content":"如不能登入，請在百端登入百度首頁，點選【登入遇到問題】，進行找回密碼操作"}
{"index":{}}
{"content":"網盤用戶端通路隐藏空間需要輸入密碼方可進入。"}
{"index":{}}
{"content":"劍橋的網盤不好用"}

1.7.4 Querying

GET goods/_search
{
  "query": {
    "match": {
      "content": "caozuo"
    }
  },
  "highlight": {
    "pre_tags": "<em>",
    "post_tags": "</em>",
    "fields": {
      "content": {}
    }
  }
}

2 Integrating Spring Boot with Elasticsearch

2.1 Adding the dependency

<dependency>
	<groupId>org.springframework.boot</groupId>
	<artifactId>spring-boot-starter-data-elasticsearch</artifactId>
</dependency>

2.2 Obtaining an ElasticsearchTemplate

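// Imports assume Spring Data Elasticsearch 3.2.x and the Elasticsearch transport client
import java.net.InetAddress;
import java.net.UnknownHostException;

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.convert.support.DefaultConversionService;
import org.springframework.data.elasticsearch.config.ElasticsearchConfigurationSupport;
import org.springframework.data.elasticsearch.core.ElasticsearchEntityMapper;
import org.springframework.data.elasticsearch.core.ElasticsearchTemplate;
import org.springframework.data.elasticsearch.core.EntityMapper;

@Configuration
public class ElasticsearchConfig extends ElasticsearchConfigurationSupport {

    // Transport client connected to a local node on port 9300
    @Bean
    public Client elasticsearchClient() throws UnknownHostException {
        Settings settings = Settings.builder().put("cluster.name", "my-application").build();
        TransportClient client = new PreBuiltTransportClient(settings);
        client.addTransportAddress(new TransportAddress(InetAddress.getByName("127.0.0.1"), 9300));
        return client;
    }

    @Bean(name = {"elasticsearchTemplate"})
    public ElasticsearchTemplate elasticsearchTemplate() throws UnknownHostException {
        return new ElasticsearchTemplate(elasticsearchClient(), entityMapper());
    }

    // use the ElasticsearchEntityMapper
    @Bean
    @Override
    public EntityMapper entityMapper() {
        ElasticsearchEntityMapper entityMapper = new ElasticsearchEntityMapper(elasticsearchMappingContext(),
                new DefaultConversionService());
        entityMapper.setConversions(elasticsearchCustomConversions());
        return entityMapper;
    }
}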

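2.3 Defining the Movie entity class

import java.util.List;

import org.springframework.data.elasticsearch.annotations.Document;

@Document(indexName = "movies", type = "_doc") // movies is the Elasticsearch index name
public class Movie {
    private String id;
    private String title;
    private Integer year;
    private List<String> genre;
    // setters and getters
}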

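2.4 Querying

import java.util.List;

import org.elasticsearch.index.query.RangeQueryBuilder;
import org.springframework.data.elasticsearch.core.ElasticsearchTemplate;
import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;
import org.springframework.data.elasticsearch.core.query.SearchQuery;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/movie")
public class MovieController {

    private final ElasticsearchTemplate elasticsearchTemplate;

    // Constructor injection of the template bean defined in ElasticsearchConfig
    public MovieController(ElasticsearchTemplate elasticsearchTemplate) {
        this.elasticsearchTemplate = elasticsearchTemplate;
    }

    @GetMapping
    public List<Movie> getMovies() {
        // Range query on the year field, from 2016 to 2017
        SearchQuery searchQuery = new NativeSearchQueryBuilder()
                .withQuery(new RangeQueryBuilder("year").from(2016).to(2017))
                .build();
        return elasticsearchTemplate.queryForList(searchQuery, Movie.class);
    }
}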
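
The range query in the controller matches documents whose year lies between the two bounds; in the Elasticsearch Java API, from and to are inclusive unless configured otherwise. The following is a minimal in-memory sketch of that predicate, with hypothetical sample years and no Elasticsearch involved; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory illustration of an inclusive range filter; not Elasticsearch code.
class RangeFilterSketch {

    // Keeps the years inside [from, to], both bounds inclusive.
    static List<Integer> inRange(List<Integer> years, int from, int to) {
        List<Integer> matches = new ArrayList<>();
        for (Integer year : years) {
            if (year != null && year >= from && year <= to) {
                matches.add(year);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // prints [2016, 2017]
        System.out.println(inRange(List.of(2015, 2016, 2017, 2018), 2016, 2017));
    }
}
```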