[Elasticsearch 2.x] Analysis and Analyzers

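1. The analysis process

Analysis is a process that consists of the following steps:

First, tokenize a block of text into individual terms suitable for use in an inverted index.
Then, normalize these terms into a standard form to improve their "searchability", or recall.

This job is performed by analyzers.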

2. Components of an analyzer

An analyzer is generally made up of three parts: character filters, a tokenizer, and token filters.

2.1 Character filters

First, the string is passed through any character filters in turn. Their job is to tidy up the string before tokenization. A character filter could be used to strip out HTML markup, or to convert "&" characters to the word "and".

2.2 Tokenizers

Next, the string is tokenized into individual terms by a tokenizer. A simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation.

2.3 Token filters

Last, each term is passed through any token filters in turn. These can change terms (for example, lowercasing "Quick"), remove terms (for example, stopwords such as "a", "and", "the"), or add terms (for example, synonyms such as "jump" and "leap").

Elasticsearch provides many character filters, tokenizers, and token filters out of the box. These can be combined to create custom analyzers suited to different purposes.
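To make the three stages concrete, here is a minimal sketch in plain Java. It is a toy model and not Elasticsearch's implementation: the class name AnalyzerPipelineSketch, the regular expressions, and the two-word stopword list are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of an analyzer pipeline; NOT how Elasticsearch implements it.
class AnalyzerPipelineSketch {

    // Hypothetical stopword list, kept tiny so that the "and" produced
    // by the character filter stays visible in the output.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "the"));

    // Character filter: strip HTML-like tags and replace "&" with "and".
    static String charFilter(String text) {
        return text.replaceAll("<[^>]*>", " ").replace("&", " and ");
    }

    // Tokenizer: split on runs of whitespace or ASCII punctuation.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[\\s\\p{Punct}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Token filters: lowercase every token, then drop stopwords.
    static List<String> tokenFilters(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                result.add(lower);
            }
        }
        return result;
    }

    // The full pipeline: character filter, then tokenizer, then token filters.
    static List<String> analyze(String text) {
        return tokenFilters(tokenize(charFilter(text)));
    }

    public static void main(String[] args) {
        // prints [quick, and, brown, fox]
        System.out.println(analyze("<p>The Quick & brown fox</p>"));
    }
}
```

The order matters: the character filter sees the raw string, the tokenizer sees the cleaned string, and the token filters only ever see individual tokens.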

3. Built-in analyzers

Elasticsearch also ships with prepackaged analyzers that you can use directly. The most important ones are listed below. To show how their behavior differs, here are the terms each one produces from this string:

Set the shape to semi-transparent by calling set_trans(5)

3.1 Standard analyzer

The standard analyzer is the default analyzer that Elasticsearch uses. It is the best general-purpose choice for analyzing text that may be in any language. It splits the text on word boundaries, as defined by the Unicode Consortium (http://www.unicode.org/reports/tr29/), and removes most punctuation. Finally, it lowercases all terms.

@Test
public void analyzeByAnalyzer() throws Exception {
    String standardAnalyzer = "standard";
    String value = "Set the shape to semi-transparent by calling set_trans(5)";
    AnalyzeAPI.analyzeByAnalyzer(client, standardAnalyzer, value);
}

The result is:

set, the, shape, to, semi, transparent, by, calling, set_trans, 5

3.2 Simple analyzer

The simple analyzer splits the text on anything that isn't a letter, and lowercases the terms.

@Test
public void analyzeByAnalyzer() throws Exception {
    String simpleAnalyzer = "simple";
    String value = "Set the shape to semi-transparent by calling set_trans(5)";
    AnalyzeAPI.analyzeByAnalyzer(client, simpleAnalyzer, value);
}

set, the, shape, to, semi, transparent, by, calling, set, trans

3.3 Whitespace analyzer

The whitespace analyzer splits the text on whitespace. It does not lowercase.

@Test
public void analyzeByAnalyzer() throws Exception {
    String whitespaceAnalyzer = "whitespace";
    String value = "Set the shape to semi-transparent by calling set_trans(5)";
    AnalyzeAPI.analyzeByAnalyzer(client, whitespaceAnalyzer, value);
}

The result is:

Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
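The splitting rules of the simple and whitespace analyzers are easy to approximate in plain Java, which makes the difference between their outputs easier to see. The sketch below only imitates the described behavior with regular expressions; the class BuiltInAnalyzerSketch and its method names are hypothetical, and it does not call Elasticsearch at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Rough imitations of two built-in analyzers, for illustration only.
class BuiltInAnalyzerSketch {

    // Like the simple analyzer: split on anything that is not a letter,
    // then lowercase each term.
    static List<String> simple(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Like the whitespace analyzer: split on whitespace only,
    // keeping case and punctuation untouched.
    static List<String> whitespace(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String value = "Set the shape to semi-transparent by calling set_trans(5)";
        // prints [set, the, shape, to, semi, transparent, by, calling, set, trans]
        System.out.println(simple(value));
        // prints [Set, the, shape, to, semi-transparent, by, calling, set_trans(5)]
        System.out.println(whitespace(value));
    }
}
```

Note that the digit 5 disappears under the simple rule because it is not a letter, while the whitespace rule keeps "set_trans(5)" as a single term.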

3.4 Language analyzers

Language-specific analyzers are available for many languages (https://www.elastic.co/guide/en/elasticsearch/reference/2.4/analysis-lang-analyzer.html). They are able to take the peculiarities of the specified language into account. For instance, the english analyzer comes with a set of English stopwords (common words like "and" or "the" that have little impact on relevance), which it removes. This analyzer is also able to stem English words because it understands the rules of English grammar.

Taking the english analyzer as an example:

@Test
public void analyzeByAnalyzer() throws Exception {
    String englishAnalyzer = "english";
    String value = "Set the shape to semi-transparent by calling set_trans(5)";
    AnalyzeAPI.analyzeByAnalyzer(client, englishAnalyzer, value);
}

set, shape, semi, transpar, call, set_tran, 5

Note how "transparent", "calling", and "set_trans" have been stemmed to their root form.

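4. When analyzers are used

When we index a document, its full-text fields are analyzed into terms that are used to create the inverted index. However, when we search on a full-text field, we need to pass the query string through the same analysis process, to ensure that we are searching for terms in the same form as those that exist in the index. Queries understand how each field is defined, so they can do the right thing:

When you query a full-text field, the query applies the same analyzer to the query string to produce the correct list of terms to search for.
When you query an exact-value field, the query does not analyze the query string, but instead searches for the exact value that you have specified.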
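The reason the query string has to go through the same analysis as the indexed text can be shown with a toy inverted index. This is a sketch under simplifying assumptions, not Elasticsearch code: the class IndexTimeQueryTimeSketch, its methods, and the lowercase-and-split analyzer are all invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index mapping each term to the IDs of documents containing it.
class IndexTimeQueryTimeSketch {

    private final Map<String, Set<Integer>> invertedIndex = new HashMap<>();

    // Hypothetical analyzer: split on non-alphanumerics, then lowercase.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Index time: the field value is analyzed and every term is recorded.
    void indexDocument(int docId, String text) {
        for (String term : analyze(text)) {
            invertedIndex.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Query time, analyzed: the query string goes through the same analyzer.
    Set<Integer> searchAnalyzed(String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : analyze(query)) {
            hits.addAll(invertedIndex.getOrDefault(term, Collections.<Integer>emptySet()));
        }
        return hits;
    }

    // Query time, not analyzed: the raw query string is looked up as one term.
    Set<Integer> searchExact(String query) {
        return new TreeSet<>(invertedIndex.getOrDefault(query, Collections.<Integer>emptySet()));
    }

    public static void main(String[] args) {
        IndexTimeQueryTimeSketch index = new IndexTimeQueryTimeSketch();
        index.indexDocument(1, "The Quick Brown Fox");
        System.out.println(index.searchAnalyzed("QUICK")); // prints [1]
        System.out.println(index.searchExact("Quick"));    // prints []
    }
}
```

The unanalyzed lookup finds nothing because the index holds "quick", not "Quick". An exact-value lookup only works when the field itself was stored without analysis.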

5. Testing analyzers

Especially when you are new to Elasticsearch, it is sometimes difficult to understand what is actually being tokenized and stored in your index. To better understand what is going on, you can use the analyze API to see how text is analyzed. Specify which analyzer to use in the request, along with the text to analyze.

/**
 * Analyze text into terms using the given analyzer.
 * @param client   Elasticsearch client
 * @param analyzer name of the analyzer to use, e.g. "standard"
 * @param value    text to analyze
 */
public static void analyzeByAnalyzer(Client client, String analyzer, String value){
    IndicesAdminClient indicesAdminClient = client.admin().indices();
    AnalyzeRequestBuilder analyzeRequestBuilder = indicesAdminClient.prepareAnalyze(value);
    analyzeRequestBuilder.setAnalyzer(analyzer);
    AnalyzeResponse response = analyzeRequestBuilder.get();
    // Print the response
    print(response);
}

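The response is printed by:

/**
 * Print the tokens contained in the analyze response.
 * @param response response returned by the analyze API
 */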
private static void print(AnalyzeResponse response){
    List<AnalyzeResponse.AnalyzeToken> tokenList = response.getTokens();
    for(AnalyzeResponse.AnalyzeToken token : tokenList){
        logger.info("-------- analyzeIndex type {}", token.getType());
        logger.info("-------- analyzeIndex term {}", token.getTerm());
        logger.info("-------- analyzeIndex position {}", token.getPosition());
        logger.info("-------- analyzeIndex startOffSet {}", token.getStartOffset());
        logger.info("-------- analyzeIndex endOffSet {}", token.getEndOffset());
        logger.info("----------------------------------");
    }
}
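The fields logged above (term, position, start offset, end offset) can be easier to read with a small standalone model of how they relate to the input string. The following sketch is not the analyze API: the class TokenOffsetSketch, the regex-based tokenizer, and the choice to start positions at 0 are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Standalone model of the per-token information an analyzer reports.
class TokenOffsetSketch {

    static final class Token {
        final String term;      // the normalized term
        final int position;     // order of the term in the text (0-based here, by assumption)
        final int startOffset;  // index of the term's first character in the original text
        final int endOffset;    // index just past the term's last character

        Token(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return term + "@" + position + "[" + startOffset + "," + endOffset + ")";
        }
    }

    // Hypothetical tokenizer: runs of letters, digits, or underscores, lowercased.
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile("[\\p{L}\\p{N}_]+").matcher(text);
        int position = 0;
        while (matcher.find()) {
            String term = matcher.group().toLowerCase(Locale.ROOT);
            tokens.add(new Token(term, position++, matcher.start(), matcher.end()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [set@0[0,3), the@1[4,7), shape@2[8,13)]
        System.out.println(tokenize("Set the shape"));
    }
}
```

Offsets always refer to the original text, which is what allows features such as highlighting to mark the matched words even after the terms have been lowercased or stemmed.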

6. Specifying analyzers

When Elasticsearch detects a new string field in your documents, it automatically configures it as a full-text string field and analyzes it with the standard analyzer.

You don't always want this. Perhaps you want to apply a different analyzer that suits the language your data is in. And sometimes you want a string field to be just a string field: to index the exact value that you pass in, without any analysis, such as a string user ID, an internal status field, or a tag. To achieve this, we have to configure these fields manually by specifying the mapping.

XContentBuilder mappingBuilder;
try {
    // Map the "club" field as an analyzed string that uses the english analyzer
    mappingBuilder = XContentFactory.jsonBuilder()
            .startObject()
            .startObject(type)
            .startObject("properties")
            .startObject("club").field("type", "string").field("index", "analyzed").field("analyzer", "english").endObject()
            .endObject()
            .endObject()
            .endObject();
} catch (Exception e) {
    logger.error("--------- putIndexMapping failed to build mapping: ", e);
    return false;
}

Reference:

https://www.elastic.co/guide/en/elasticsearch/guide/current/analysis-intro.html#analysis-intro
