No program in the world is perfect, but that is no reason to be discouraged: writing programs is a continuous pursuit of perfection.
Custom analyzers :
-
Character filters :
1. Purpose : add, remove, or transform characters
2. Count : zero or more allowed
3. Built-in character filters :
    1. HTML Strip Character Filter : strips HTML tags
    2. Mapping Character Filter : replaces by mapping
    3. Pattern Replace Character Filter : replaces by regular expression
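A character filter can be tried directly in an ad-hoc `_analyze` request, combined with any tokenizer. A minimal sketch (the sample text is illustrative): `html_strip` removes the tags before the tokenizer ever sees them :
GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "text": "<p>hello <b>world</b></p>"
}
# Tokens : hello, world - compare this with the standard-tokenizer example below,
# where without the char filter the tag names leak into the token stream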
-
Tokenizer :
1. Purpose :
    1. Splits the text into tokens
    2. Records the order and position of each token (for phrase queries)
    3. Records the start and end character offsets of each token (for highlighting)
    4. Records the type of each token (for classification)
2. Count : exactly one is required
3. Categories :
    1. Word-oriented tokenizers :
        1. Standard
        2. Letter
        3. Lowercase
        4. Whitespace
        5. UAX URL Email
        6. Classic
        7. Thai
    2. Partial-word tokenizers :
        1. N-Gram
        2. Edge N-Gram
    3. Structured-text tokenizers :
        1. Keyword
        2. Pattern
        3. Simple Pattern
        4. Char Group
        5. Simple Pattern Split
        6. Path
-
Token filters :
1. Purpose : add, remove, or transform tokens
2. Count : zero or more allowed
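The three components above are what a custom analyzer is assembled from in an index's settings. A minimal sketch (the index name `my_index` and the specific char filter / token filter choices are illustrative, not from these notes) :
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
# char_filter : 0..n entries, tokenizer : exactly 1, filter : 0..n entries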
Today we demonstrate the word-oriented tokenizers from the Tokenizer categories above :
# standard tokenizer
# Removes most punctuation and symbols
# Splits English word by word, Chinese character by character
# Options :
# max_token_length - maximum length of each token; default 255
GET /_analyze
{
"tokenizer": {
"type" : "standard",
"max_token_length" : 4
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中國人"
]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "2",
"start_offset" : 16,
"end_offset" : 17,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "3",
"start_offset" : 18,
"end_offset" : 19,
"type" : "<NUM>",
"position" : 7
},
{
"token" : "5",
"start_offset" : 20,
"end_offset" : 21,
"type" : "<NUM>",
"position" : 8
},
{
"token" : "7",
"start_offset" : 22,
"end_offset" : 23,
"type" : "<NUM>",
"position" : 9
},
{
"token" : "p",
"start_offset" : 25,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "hell",
"start_offset" : 27,
"end_offset" : 31,
"type" : "<ALPHANUM>",
"position" : 11
},
{
"token" : "o",
"start_offset" : 31,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 12
},
{
"token" : "p",
"start_offset" : 34,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 13
},
{
"token" : "span",
"start_offset" : 38,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 14
},
{
"token" : "good",
"start_offset" : 43,
"end_offset" : 47,
"type" : "<ALPHANUM>",
"position" : 15
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "<ALPHANUM>",
"position" : 16
},
{
"token" : "The",
"start_offset" : 55,
"end_offset" : 58,
"type" : "<ALPHANUM>",
"position" : 117
},
{
"token" : "2",
"start_offset" : 59,
"end_offset" : 60,
"type" : "<NUM>",
"position" : 118
},
{
"token" : "QUIC",
"start_offset" : 61,
"end_offset" : 65,
"type" : "<ALPHANUM>",
"position" : 119
},
{
"token" : "K",
"start_offset" : 65,
"end_offset" : 66,
"type" : "<ALPHANUM>",
"position" : 120
},
{
"token" : "Brow",
"start_offset" : 67,
"end_offset" : 71,
"type" : "<ALPHANUM>",
"position" : 121
},
{
"token" : "n",
"start_offset" : 71,
"end_offset" : 72,
"type" : "<ALPHANUM>",
"position" : 122
},
{
"token" : "Foxe",
"start_offset" : 73,
"end_offset" : 77,
"type" : "<ALPHANUM>",
"position" : 123
},
{
"token" : "s",
"start_offset" : 77,
"end_offset" : 78,
"type" : "<ALPHANUM>",
"position" : 124
},
{
"token" : "jump",
"start_offset" : 79,
"end_offset" : 83,
"type" : "<ALPHANUM>",
"position" : 125
},
{
"token" : "ed",
"start_offset" : 83,
"end_offset" : 85,
"type" : "<ALPHANUM>",
"position" : 126
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "<ALPHANUM>",
"position" : 127
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "<ALPHANUM>",
"position" : 128
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "<ALPHANUM>",
"position" : 129
},
{
"token" : "dog",
"start_offset" : 100,
"end_offset" : 103,
"type" : "<ALPHANUM>",
"position" : 130
},
{
"token" : "s",
"start_offset" : 104,
"end_offset" : 105,
"type" : "<ALPHANUM>",
"position" : 131
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "<ALPHANUM>",
"position" : 132
},
{
"token" : "我",
"start_offset" : 112,
"end_offset" : 113,
"type" : "<IDEOGRAPHIC>",
"position" : 233
},
{
"token" : "是",
"start_offset" : 113,
"end_offset" : 114,
"type" : "<IDEOGRAPHIC>",
"position" : 234
},
{
"token" : "中",
"start_offset" : 114,
"end_offset" : 115,
"type" : "<IDEOGRAPHIC>",
"position" : 235
},
{
"token" : "國",
"start_offset" : 115,
"end_offset" : 116,
"type" : "<IDEOGRAPHIC>",
"position" : 236
},
{
"token" : "人",
"start_offset" : 116,
"end_offset" : 117,
"type" : "<IDEOGRAPHIC>",
"position" : 237
}
]
}
# letter tokenizer
# Splits English into words; leaves Chinese runs unsplit
# Drops all non-letter characters (digits and symbols)
GET /_analyze
{
"tokenizer": {
"type" : "letter"
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中國人"]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "word",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "word",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "word",
"position" : 5
},
{
"token" : "p",
"start_offset" : 25,
"end_offset" : 26,
"type" : "word",
"position" : 6
},
{
"token" : "hello",
"start_offset" : 27,
"end_offset" : 32,
"type" : "word",
"position" : 7
},
{
"token" : "p",
"start_offset" : 34,
"end_offset" : 35,
"type" : "word",
"position" : 8
},
{
"token" : "span",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 9
},
{
"token" : "good",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 10
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "word",
"position" : 11
},
{
"token" : "The",
"start_offset" : 55,
"end_offset" : 58,
"type" : "word",
"position" : 112
},
{
"token" : "QUICK",
"start_offset" : 61,
"end_offset" : 66,
"type" : "word",
"position" : 113
},
{
"token" : "Brown",
"start_offset" : 67,
"end_offset" : 72,
"type" : "word",
"position" : 114
},
{
"token" : "Foxes",
"start_offset" : 73,
"end_offset" : 78,
"type" : "word",
"position" : 115
},
{
"token" : "jumped",
"start_offset" : 79,
"end_offset" : 85,
"type" : "word",
"position" : 116
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "word",
"position" : 117
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "word",
"position" : 118
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "word",
"position" : 119
},
{
"token" : "dog",
"start_offset" : 100,
"end_offset" : 103,
"type" : "word",
"position" : 120
},
{
"token" : "s",
"start_offset" : 104,
"end_offset" : 105,
"type" : "word",
"position" : 121
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "word",
"position" : 122
},
{
"token" : "我是中國人",
"start_offset" : 112,
"end_offset" : 117,
"type" : "word",
"position" : 223
}
]
}
# lowercase tokenizer
# Similar to the letter tokenizer
# Additionally lowercases every token
GET /_analyze
{
"tokenizer": {
"type" : "lowercase"
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中國人"]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "word",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "word",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "word",
"position" : 5
},
{
"token" : "p",
"start_offset" : 25,
"end_offset" : 26,
"type" : "word",
"position" : 6
},
{
"token" : "hello",
"start_offset" : 27,
"end_offset" : 32,
"type" : "word",
"position" : 7
},
{
"token" : "p",
"start_offset" : 34,
"end_offset" : 35,
"type" : "word",
"position" : 8
},
{
"token" : "span",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 9
},
{
"token" : "good",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 10
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "word",
"position" : 11
},
{
"token" : "the",
"start_offset" : 55,
"end_offset" : 58,
"type" : "word",
"position" : 112
},
{
"token" : "quick",
"start_offset" : 61,
"end_offset" : 66,
"type" : "word",
"position" : 113
},
{
"token" : "brown",
"start_offset" : 67,
"end_offset" : 72,
"type" : "word",
"position" : 114
},
{
"token" : "foxes",
"start_offset" : 73,
"end_offset" : 78,
"type" : "word",
"position" : 115
},
{
"token" : "jumped",
"start_offset" : 79,
"end_offset" : 85,
"type" : "word",
"position" : 116
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "word",
"position" : 117
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "word",
"position" : 118
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "word",
"position" : 119
},
{
"token" : "dog",
"start_offset" : 100,
"end_offset" : 103,
"type" : "word",
"position" : 120
},
{
"token" : "s",
"start_offset" : 104,
"end_offset" : 105,
"type" : "word",
"position" : 121
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "word",
"position" : 122
},
{
"token" : "我是中國人",
"start_offset" : 112,
"end_offset" : 117,
"type" : "word",
"position" : 223
}
]
}
# whitespace tokenizer
# Splits on whitespace only
# Options :
# max_token_length
GET /_analyze
{
"tokenizer": {
"type" : "whitespace",
"max_token_length" : 4
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中國人"]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "word",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "word",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "word",
"position" : 5
},
{
"token" : "2-3-",
"start_offset" : 16,
"end_offset" : 20,
"type" : "word",
"position" : 6
},
{
"token" : "5-7",
"start_offset" : 20,
"end_offset" : 23,
"type" : "word",
"position" : 7
},
{
"token" : "<p>h",
"start_offset" : 24,
"end_offset" : 28,
"type" : "word",
"position" : 8
},
{
"token" : "ello",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 9
},
{
"token" : "</p>",
"start_offset" : 32,
"end_offset" : 36,
"type" : "word",
"position" : 10
},
{
"token" : "<spa",
"start_offset" : 37,
"end_offset" : 41,
"type" : "word",
"position" : 11
},
{
"token" : "n>go",
"start_offset" : 41,
"end_offset" : 45,
"type" : "word",
"position" : 12
},
{
"token" : "od</",
"start_offset" : 45,
"end_offset" : 49,
"type" : "word",
"position" : 13
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "word",
"position" : 14
},
{
"token" : ">",
"start_offset" : 53,
"end_offset" : 54,
"type" : "word",
"position" : 15
},
{
"token" : "The",
"start_offset" : 55,
"end_offset" : 58,
"type" : "word",
"position" : 116
},
{
"token" : "2",
"start_offset" : 59,
"end_offset" : 60,
"type" : "word",
"position" : 117
},
{
"token" : "QUIC",
"start_offset" : 61,
"end_offset" : 65,
"type" : "word",
"position" : 118
},
{
"token" : "K",
"start_offset" : 65,
"end_offset" : 66,
"type" : "word",
"position" : 119
},
{
"token" : "Brow",
"start_offset" : 67,
"end_offset" : 71,
"type" : "word",
"position" : 120
},
{
"token" : "n-Fo",
"start_offset" : 71,
"end_offset" : 75,
"type" : "word",
"position" : 121
},
{
"token" : "xes",
"start_offset" : 75,
"end_offset" : 78,
"type" : "word",
"position" : 122
},
{
"token" : "jump",
"start_offset" : 79,
"end_offset" : 83,
"type" : "word",
"position" : 123
},
{
"token" : "ed",
"start_offset" : 83,
"end_offset" : 85,
"type" : "word",
"position" : 124
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "word",
"position" : 125
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "word",
"position" : 126
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "word",
"position" : 127
},
{
"token" : "dog'",
"start_offset" : 100,
"end_offset" : 104,
"type" : "word",
"position" : 128
},
{
"token" : "s",
"start_offset" : 104,
"end_offset" : 105,
"type" : "word",
"position" : 129
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "word",
"position" : 130
},
{
"token" : ".",
"start_offset" : 110,
"end_offset" : 111,
"type" : "word",
"position" : 131
},
{
"token" : "我是中國",
"start_offset" : 112,
"end_offset" : 116,
"type" : "word",
"position" : 232
},
{
"token" : "人",
"start_offset" : 116,
"end_offset" : 117,
"type" : "word",
"position" : 233
}
]
}
# UAX URL Email tokenizer
# Similar to the standard tokenizer
# Additionally recognizes URLs and email addresses as single tokens
# Options :
# max_token_length
GET /_analyze
{
"tokenizer": {
"type" : "uax_url_email"
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中國人",
"[email protected]",
"http://www.baidu.com"]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "2",
"start_offset" : 16,
"end_offset" : 17,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "3",
"start_offset" : 18,
"end_offset" : 19,
"type" : "<NUM>",
"position" : 7
},
{
"token" : "5",
"start_offset" : 20,
"end_offset" : 21,
"type" : "<NUM>",
"position" : 8
},
{
"token" : "7",
"start_offset" : 22,
"end_offset" : 23,
"type" : "<NUM>",
"position" : 9
},
{
"token" : "p",
"start_offset" : 25,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "hello",
"start_offset" : 27,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 11
},
{
"token" : "p",
"start_offset" : 34,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 12
},
{
"token" : "span",
"start_offset" : 38,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 13
},
{
"token" : "good",
"start_offset" : 43,
"end_offset" : 47,
"type" : "<ALPHANUM>",
"position" : 14
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "<ALPHANUM>",
"position" : 15
},
{
"token" : "The",
"start_offset" : 55,
"end_offset" : 58,
"type" : "<ALPHANUM>",
"position" : 116
},
{
"token" : "2",
"start_offset" : 59,
"end_offset" : 60,
"type" : "<NUM>",
"position" : 117
},
{
"token" : "QUICK",
"start_offset" : 61,
"end_offset" : 66,
"type" : "<ALPHANUM>",
"position" : 118
},
{
"token" : "Brown",
"start_offset" : 67,
"end_offset" : 72,
"type" : "<ALPHANUM>",
"position" : 119
},
{
"token" : "Foxes",
"start_offset" : 73,
"end_offset" : 78,
"type" : "<ALPHANUM>",
"position" : 120
},
{
"token" : "jumped",
"start_offset" : 79,
"end_offset" : 85,
"type" : "<ALPHANUM>",
"position" : 121
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "<ALPHANUM>",
"position" : 122
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "<ALPHANUM>",
"position" : 123
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "<ALPHANUM>",
"position" : 124
},
{
"token" : "dog's",
"start_offset" : 100,
"end_offset" : 105,
"type" : "<ALPHANUM>",
"position" : 125
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "<ALPHANUM>",
"position" : 126
},
{
"token" : "我",
"start_offset" : 112,
"end_offset" : 113,
"type" : "<IDEOGRAPHIC>",
"position" : 227
},
{
"token" : "是",
"start_offset" : 113,
"end_offset" : 114,
"type" : "<IDEOGRAPHIC>",
"position" : 228
},
{
"token" : "中",
"start_offset" : 114,
"end_offset" : 115,
"type" : "<IDEOGRAPHIC>",
"position" : 229
},
{
"token" : "國",
"start_offset" : 115,
"end_offset" : 116,
"type" : "<IDEOGRAPHIC>",
"position" : 230
},
{
"token" : "人",
"start_offset" : 116,
"end_offset" : 117,
"type" : "<IDEOGRAPHIC>",
"position" : 231
},
{
"token" : "[email protected]",
"start_offset" : 118,
"end_offset" : 129,
"type" : "<EMAIL>",
"position" : 332
},
{
"token" : "http://www.baidu.com",
"start_offset" : 130,
"end_offset" : 150,
"type" : "<URL>",
"position" : 433
}
]
}
# classic tokenizer
# Designed for English text
# Recognizes email addresses, hostnames, acronyms, company names, etc.
# Options :
# max_token_length
GET /_analyze
{
"tokenizer": {
"type" : "classic"
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中國人",
"[email protected]",
"http://www.baidu.com",
"127.0.0.1",
"2344232 fdgfd"]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "2-3-5-7",
"start_offset" : 16,
"end_offset" : 23,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "p",
"start_offset" : 25,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "hello",
"start_offset" : 27,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "p",
"start_offset" : 34,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "span",
"start_offset" : 38,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "good",
"start_offset" : 43,
"end_offset" : 47,
"type" : "<ALPHANUM>",
"position" : 11
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "<ALPHANUM>",
"position" : 12
},
{
"token" : "The",
"start_offset" : 55,
"end_offset" : 58,
"type" : "<ALPHANUM>",
"position" : 113
},
{
"token" : "2",
"start_offset" : 59,
"end_offset" : 60,
"type" : "<ALPHANUM>",
"position" : 114
},
{
"token" : "QUICK",
"start_offset" : 61,
"end_offset" : 66,
"type" : "<ALPHANUM>",
"position" : 115
},
{
"token" : "Brown",
"start_offset" : 67,
"end_offset" : 72,
"type" : "<ALPHANUM>",
"position" : 116
},
{
"token" : "Foxes",
"start_offset" : 73,
"end_offset" : 78,
"type" : "<ALPHANUM>",
"position" : 117
},
{
"token" : "jumped",
"start_offset" : 79,
"end_offset" : 85,
"type" : "<ALPHANUM>",
"position" : 118
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "<ALPHANUM>",
"position" : 119
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "<ALPHANUM>",
"position" : 120
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "<ALPHANUM>",
"position" : 121
},
{
"token" : "dog's",
"start_offset" : 100,
"end_offset" : 105,
"type" : "<APOSTROPHE>",
"position" : 122
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "<ALPHANUM>",
"position" : 123
},
{
"token" : "我",
"start_offset" : 112,
"end_offset" : 113,
"type" : "<CJ>",
"position" : 224
},
{
"token" : "是",
"start_offset" : 113,
"end_offset" : 114,
"type" : "<CJ>",
"position" : 225
},
{
"token" : "中",
"start_offset" : 114,
"end_offset" : 115,
"type" : "<CJ>",
"position" : 226
},
{
"token" : "國",
"start_offset" : 115,
"end_offset" : 116,
"type" : "<CJ>",
"position" : 227
},
{
"token" : "人",
"start_offset" : 116,
"end_offset" : 117,
"type" : "<CJ>",
"position" : 228
},
{
"token" : "[email protected]",
"start_offset" : 118,
"end_offset" : 129,
"type" : "<EMAIL>",
"position" : 329
},
{
"token" : "http",
"start_offset" : 130,
"end_offset" : 134,
"type" : "<ALPHANUM>",
"position" : 430
},
{
"token" : "www.baidu.com",
"start_offset" : 137,
"end_offset" : 150,
"type" : "<HOST>",
"position" : 431
},
{
"token" : "127.0.0.1",
"start_offset" : 151,
"end_offset" : 160,
"type" : "<HOST>",
"position" : 532
},
{
"token" : "2344232",
"start_offset" : 161,
"end_offset" : 168,
"type" : "<ALPHANUM>",
"position" : 633
},
{
"token" : "fdgfd",
"start_offset" : 169,
"end_offset" : 174,
"type" : "<ALPHANUM>",
"position" : 634
}
]
}
# Thai tokenizer
# Thai-language tokenizer; not needed for our use case, so only a brief demo
GET /_analyze
{
"tokenizer": "thai",
"text": "การที่ได้ต้องแสดงว่างานดี"
}
# Result
[ การ, ที่, ได้, ต้อง, แสดง, ว่า, งาน, ดี ]