
es - elasticsearch custom analyzers - built-in tokenizers - 1

No program in the world is perfect, but that does not discourage us, because writing programs is a continual pursuit of perfection.

Custom analyzer (a combined sketch follows this outline):

  1. Character filters:

        1. Purpose: add, remove, or transform characters

        2. Cardinality: zero or more allowed

        3. Built-in character filters:

            1. HTML Strip Character Filter: strips HTML tags

            2. Mapping Character Filter: replacement via a mapping table

            3. Pattern Replace Character Filter: regex-based replacement

  2. Tokenizer:

        1. Purpose:

            1. split the text into tokens

            2. record each token's order and position (used by phrase queries)

            3. record each token's start and end offsets (used for highlighting)

            4. record each token's type (used for classification)

        2. Cardinality: exactly one required

        3. Categories:

            1. Word-oriented (complete words):

                1. Standard

                2. Letter

                3. Lowercase

                4. Whitespace

                5. UAX URL Email

                6. Classic

                7. Thai

            2. Partial-word:

                1. N-Gram

                2. Edge N-Gram

            3. Structured text:

                1. Keyword

                2. Pattern

                3. Simple Pattern

                4. Char Group

                5. Simple Pattern Split

                6. Path Hierarchy

  3. Token filters:

        1. Purpose: add, remove, or transform tokens

        2. Cardinality: zero or more allowed
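
As a sketch of how the three parts fit together (the index name my_index and the specific component choices here are illustrative assumptions, not part of the original demos):

# A custom analyzer wires together 0+ character filters,
# exactly 1 tokenizer, and 0+ token filters.
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}

# Try it:
GET /my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "<p>The QUICK Brown-Foxes</p>"
}
# expected tokens: the, quick, brown, foxes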

Today we mainly demonstrate the word-oriented (complete-word) tokenizers from the categories above:

# standard tokenizer
# strips most symbols and punctuation
# splits English by word and Chinese character by character
# Options:
#   max_token_length - maximum token length, default 255; a longer
#     token is split at max_token_length intervals rather than
#     truncated (see "hello" -> "hell" + "o" below)
GET /_analyze
{
  "tokenizer": {
    "type" : "standard",
    "max_token_length" : 4
  },
  "text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
  "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
  "我是中国人"
  ]
}

# Result (positions jump by roughly 100 between the three input
# strings: _analyze leaves a position gap between the values of
# the text array)
{
  "tokens" : [
    {
      "token" : "this",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "b",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "c",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "d",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "2",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "3",
      "start_offset" : 18,
      "end_offset" : 19,
      "type" : "<NUM>",
      "position" : 7
    },
    {
      "token" : "5",
      "start_offset" : 20,
      "end_offset" : 21,
      "type" : "<NUM>",
      "position" : 8
    },
    {
      "token" : "7",
      "start_offset" : 22,
      "end_offset" : 23,
      "type" : "<NUM>",
      "position" : 9
    },
    {
      "token" : "p",
      "start_offset" : 25,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "hell",
      "start_offset" : 27,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 11
    },
    {
      "token" : "o",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 12
    },
    {
      "token" : "p",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 13
    },
    {
      "token" : "span",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 14
    },
    {
      "token" : "good",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "<ALPHANUM>",
      "position" : 15
    },
    {
      "token" : "span",
      "start_offset" : 49,
      "end_offset" : 53,
      "type" : "<ALPHANUM>",
      "position" : 16
    },
    {
      "token" : "The",
      "start_offset" : 55,
      "end_offset" : 58,
      "type" : "<ALPHANUM>",
      "position" : 117
    },
    {
      "token" : "2",
      "start_offset" : 59,
      "end_offset" : 60,
      "type" : "<NUM>",
      "position" : 118
    },
    {
      "token" : "QUIC",
      "start_offset" : 61,
      "end_offset" : 65,
      "type" : "<ALPHANUM>",
      "position" : 119
    },
    {
      "token" : "K",
      "start_offset" : 65,
      "end_offset" : 66,
      "type" : "<ALPHANUM>",
      "position" : 120
    },
    {
      "token" : "Brow",
      "start_offset" : 67,
      "end_offset" : 71,
      "type" : "<ALPHANUM>",
      "position" : 121
    },
    {
      "token" : "n",
      "start_offset" : 71,
      "end_offset" : 72,
      "type" : "<ALPHANUM>",
      "position" : 122
    },
    {
      "token" : "Foxe",
      "start_offset" : 73,
      "end_offset" : 77,
      "type" : "<ALPHANUM>",
      "position" : 123
    },
    {
      "token" : "s",
      "start_offset" : 77,
      "end_offset" : 78,
      "type" : "<ALPHANUM>",
      "position" : 124
    },
    {
      "token" : "jump",
      "start_offset" : 79,
      "end_offset" : 83,
      "type" : "<ALPHANUM>",
      "position" : 125
    },
    {
      "token" : "ed",
      "start_offset" : 83,
      "end_offset" : 85,
      "type" : "<ALPHANUM>",
      "position" : 126
    },
    {
      "token" : "over",
      "start_offset" : 86,
      "end_offset" : 90,
      "type" : "<ALPHANUM>",
      "position" : 127
    },
    {
      "token" : "the",
      "start_offset" : 91,
      "end_offset" : 94,
      "type" : "<ALPHANUM>",
      "position" : 128
    },
    {
      "token" : "lazy",
      "start_offset" : 95,
      "end_offset" : 99,
      "type" : "<ALPHANUM>",
      "position" : 129
    },
    {
      "token" : "dog",
      "start_offset" : 100,
      "end_offset" : 103,
      "type" : "<ALPHANUM>",
      "position" : 130
    },
    {
      "token" : "s",
      "start_offset" : 104,
      "end_offset" : 105,
      "type" : "<ALPHANUM>",
      "position" : 131
    },
    {
      "token" : "bone",
      "start_offset" : 106,
      "end_offset" : 110,
      "type" : "<ALPHANUM>",
      "position" : 132
    },
    {
      "token" : "我",
      "start_offset" : 112,
      "end_offset" : 113,
      "type" : "<IDEOGRAPHIC>",
      "position" : 233
    },
    {
      "token" : "是",
      "start_offset" : 113,
      "end_offset" : 114,
      "type" : "<IDEOGRAPHIC>",
      "position" : 234
    },
    {
      "token" : "中",
      "start_offset" : 114,
      "end_offset" : 115,
      "type" : "<IDEOGRAPHIC>",
      "position" : 235
    },
    {
      "token" : "国",
      "start_offset" : 115,
      "end_offset" : 116,
      "type" : "<IDEOGRAPHIC>",
      "position" : 236
    },
    {
      "token" : "人",
      "start_offset" : 116,
      "end_offset" : 117,
      "type" : "<IDEOGRAPHIC>",
      "position" : 237
    }
  ]
}
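
For contrast, with the default max_token_length the longer words survive intact; a quick check with the same sentence and default options:

GET /_analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# tokens: The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone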

           
# letter tokenizer
# splits on any non-letter character, so digits and symbols are dropped
# English is split into words; a run of Chinese characters stays as one token
GET /_analyze
{
  "tokenizer": {
    "type" : "letter"
  },
  "text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
  "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
  "我是中国人"]
}

# Result
{
  "tokens" : [
    {
      "token" : "this",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "b",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "c",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "d",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "p",
      "start_offset" : 25,
      "end_offset" : 26,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "hello",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "p",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "span",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "good",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "span",
      "start_offset" : 49,
      "end_offset" : 53,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "The",
      "start_offset" : 55,
      "end_offset" : 58,
      "type" : "word",
      "position" : 112
    },
    {
      "token" : "QUICK",
      "start_offset" : 61,
      "end_offset" : 66,
      "type" : "word",
      "position" : 113
    },
    {
      "token" : "Brown",
      "start_offset" : 67,
      "end_offset" : 72,
      "type" : "word",
      "position" : 114
    },
    {
      "token" : "Foxes",
      "start_offset" : 73,
      "end_offset" : 78,
      "type" : "word",
      "position" : 115
    },
    {
      "token" : "jumped",
      "start_offset" : 79,
      "end_offset" : 85,
      "type" : "word",
      "position" : 116
    },
    {
      "token" : "over",
      "start_offset" : 86,
      "end_offset" : 90,
      "type" : "word",
      "position" : 117
    },
    {
      "token" : "the",
      "start_offset" : 91,
      "end_offset" : 94,
      "type" : "word",
      "position" : 118
    },
    {
      "token" : "lazy",
      "start_offset" : 95,
      "end_offset" : 99,
      "type" : "word",
      "position" : 119
    },
    {
      "token" : "dog",
      "start_offset" : 100,
      "end_offset" : 103,
      "type" : "word",
      "position" : 120
    },
    {
      "token" : "s",
      "start_offset" : 104,
      "end_offset" : 105,
      "type" : "word",
      "position" : 121
    },
    {
      "token" : "bone",
      "start_offset" : 106,
      "end_offset" : 110,
      "type" : "word",
      "position" : 122
    },
    {
      "token" : "我是中国人",
      "start_offset" : 112,
      "end_offset" : 117,
      "type" : "word",
      "position" : 223
    }
  ]
}

           
# lowercase tokenizer
# like the letter tokenizer,
# but also lowercases every token
GET /_analyze
{
  "tokenizer": {
    "type" : "lowercase"
  },
  "text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
  "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
  "我是中国人"]
}

# Result
{
  "tokens" : [
    {
      "token" : "this",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "b",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "c",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "d",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "p",
      "start_offset" : 25,
      "end_offset" : 26,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "hello",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "p",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "span",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "good",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "span",
      "start_offset" : 49,
      "end_offset" : 53,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "the",
      "start_offset" : 55,
      "end_offset" : 58,
      "type" : "word",
      "position" : 112
    },
    {
      "token" : "quick",
      "start_offset" : 61,
      "end_offset" : 66,
      "type" : "word",
      "position" : 113
    },
    {
      "token" : "brown",
      "start_offset" : 67,
      "end_offset" : 72,
      "type" : "word",
      "position" : 114
    },
    {
      "token" : "foxes",
      "start_offset" : 73,
      "end_offset" : 78,
      "type" : "word",
      "position" : 115
    },
    {
      "token" : "jumped",
      "start_offset" : 79,
      "end_offset" : 85,
      "type" : "word",
      "position" : 116
    },
    {
      "token" : "over",
      "start_offset" : 86,
      "end_offset" : 90,
      "type" : "word",
      "position" : 117
    },
    {
      "token" : "the",
      "start_offset" : 91,
      "end_offset" : 94,
      "type" : "word",
      "position" : 118
    },
    {
      "token" : "lazy",
      "start_offset" : 95,
      "end_offset" : 99,
      "type" : "word",
      "position" : 119
    },
    {
      "token" : "dog",
      "start_offset" : 100,
      "end_offset" : 103,
      "type" : "word",
      "position" : 120
    },
    {
      "token" : "s",
      "start_offset" : 104,
      "end_offset" : 105,
      "type" : "word",
      "position" : 121
    },
    {
      "token" : "bone",
      "start_offset" : 106,
      "end_offset" : 110,
      "type" : "word",
      "position" : 122
    },
    {
      "token" : "我是中国人",
      "start_offset" : 112,
      "end_offset" : 117,
      "type" : "word",
      "position" : 223
    }
  ]
}
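
Per the Elasticsearch docs, the lowercase tokenizer is equivalent to the letter tokenizer combined with the lowercase token filter (just more efficient, since it does both steps in a single pass); the equivalence is easy to verify:

GET /_analyze
{
  "tokenizer": "letter",
  "filter": [ "lowercase" ],
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
# same tokens as the lowercase tokenizer output above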

           
# whitespace tokenizer
# splits on whitespace only; punctuation and tags are kept
# Options:
#   max_token_length
GET /_analyze
{
  "tokenizer": {
    "type" : "whitespace",
    "max_token_length" : 4
  },
  "text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
  "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
  "我是中国人"]
}

# Result
{
  "tokens" : [
    {
      "token" : "this",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "b",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "c",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "d",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "2-3-",
      "start_offset" : 16,
      "end_offset" : 20,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "5-7",
      "start_offset" : 20,
      "end_offset" : 23,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "<p>h",
      "start_offset" : 24,
      "end_offset" : 28,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "ello",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "</p>",
      "start_offset" : 32,
      "end_offset" : 36,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "<spa",
      "start_offset" : 37,
      "end_offset" : 41,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "n>go",
      "start_offset" : 41,
      "end_offset" : 45,
      "type" : "word",
      "position" : 12
    },
    {
      "token" : "od</",
      "start_offset" : 45,
      "end_offset" : 49,
      "type" : "word",
      "position" : 13
    },
    {
      "token" : "span",
      "start_offset" : 49,
      "end_offset" : 53,
      "type" : "word",
      "position" : 14
    },
    {
      "token" : ">",
      "start_offset" : 53,
      "end_offset" : 54,
      "type" : "word",
      "position" : 15
    },
    {
      "token" : "The",
      "start_offset" : 55,
      "end_offset" : 58,
      "type" : "word",
      "position" : 116
    },
    {
      "token" : "2",
      "start_offset" : 59,
      "end_offset" : 60,
      "type" : "word",
      "position" : 117
    },
    {
      "token" : "QUIC",
      "start_offset" : 61,
      "end_offset" : 65,
      "type" : "word",
      "position" : 118
    },
    {
      "token" : "K",
      "start_offset" : 65,
      "end_offset" : 66,
      "type" : "word",
      "position" : 119
    },
    {
      "token" : "Brow",
      "start_offset" : 67,
      "end_offset" : 71,
      "type" : "word",
      "position" : 120
    },
    {
      "token" : "n-Fo",
      "start_offset" : 71,
      "end_offset" : 75,
      "type" : "word",
      "position" : 121
    },
    {
      "token" : "xes",
      "start_offset" : 75,
      "end_offset" : 78,
      "type" : "word",
      "position" : 122
    },
    {
      "token" : "jump",
      "start_offset" : 79,
      "end_offset" : 83,
      "type" : "word",
      "position" : 123
    },
    {
      "token" : "ed",
      "start_offset" : 83,
      "end_offset" : 85,
      "type" : "word",
      "position" : 124
    },
    {
      "token" : "over",
      "start_offset" : 86,
      "end_offset" : 90,
      "type" : "word",
      "position" : 125
    },
    {
      "token" : "the",
      "start_offset" : 91,
      "end_offset" : 94,
      "type" : "word",
      "position" : 126
    },
    {
      "token" : "lazy",
      "start_offset" : 95,
      "end_offset" : 99,
      "type" : "word",
      "position" : 127
    },
    {
      "token" : "dog'",
      "start_offset" : 100,
      "end_offset" : 104,
      "type" : "word",
      "position" : 128
    },
    {
      "token" : "s",
      "start_offset" : 104,
      "end_offset" : 105,
      "type" : "word",
      "position" : 129
    },
    {
      "token" : "bone",
      "start_offset" : 106,
      "end_offset" : 110,
      "type" : "word",
      "position" : 130
    },
    {
      "token" : ".",
      "start_offset" : 110,
      "end_offset" : 111,
      "type" : "word",
      "position" : 131
    },
    {
      "token" : "我是中国",
      "start_offset" : 112,
      "end_offset" : 116,
      "type" : "word",
      "position" : 232
    },
    {
      "token" : "人",
      "start_offset" : 116,
      "end_offset" : 117,
      "type" : "word",
      "position" : 233
    }
  ]
}
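
Because the whitespace tokenizer keeps punctuation and HTML markup (note the "<p>h" and "</p>" tokens above), it pairs naturally with the html_strip character filter from the outline; a minimal sketch:

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "whitespace",
  "text": "<p>hello</p> <span>good</span>"
}
# tokens: hello, good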

           
# UAX URL Email tokenizer
# like the standard tokenizer,
# but keeps URLs and email addresses as single tokens
# Options:
#   max_token_length
GET /_analyze
{
  "tokenizer": {
    "type" : "uax_url_email"
  },
  "text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
  "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
  "我是中国人",
  "[email protected]",
  "http://www.baidu.com"]
}

# Result
{
  "tokens" : [
    {
      "token" : "this",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "b",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "c",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "d",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "2",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "3",
      "start_offset" : 18,
      "end_offset" : 19,
      "type" : "<NUM>",
      "position" : 7
    },
    {
      "token" : "5",
      "start_offset" : 20,
      "end_offset" : 21,
      "type" : "<NUM>",
      "position" : 8
    },
    {
      "token" : "7",
      "start_offset" : 22,
      "end_offset" : 23,
      "type" : "<NUM>",
      "position" : 9
    },
    {
      "token" : "p",
      "start_offset" : 25,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "hello",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 11
    },
    {
      "token" : "p",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 12
    },
    {
      "token" : "span",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 13
    },
    {
      "token" : "good",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "<ALPHANUM>",
      "position" : 14
    },
    {
      "token" : "span",
      "start_offset" : 49,
      "end_offset" : 53,
      "type" : "<ALPHANUM>",
      "position" : 15
    },
    {
      "token" : "The",
      "start_offset" : 55,
      "end_offset" : 58,
      "type" : "<ALPHANUM>",
      "position" : 116
    },
    {
      "token" : "2",
      "start_offset" : 59,
      "end_offset" : 60,
      "type" : "<NUM>",
      "position" : 117
    },
    {
      "token" : "QUICK",
      "start_offset" : 61,
      "end_offset" : 66,
      "type" : "<ALPHANUM>",
      "position" : 118
    },
    {
      "token" : "Brown",
      "start_offset" : 67,
      "end_offset" : 72,
      "type" : "<ALPHANUM>",
      "position" : 119
    },
    {
      "token" : "Foxes",
      "start_offset" : 73,
      "end_offset" : 78,
      "type" : "<ALPHANUM>",
      "position" : 120
    },
    {
      "token" : "jumped",
      "start_offset" : 79,
      "end_offset" : 85,
      "type" : "<ALPHANUM>",
      "position" : 121
    },
    {
      "token" : "over",
      "start_offset" : 86,
      "end_offset" : 90,
      "type" : "<ALPHANUM>",
      "position" : 122
    },
    {
      "token" : "the",
      "start_offset" : 91,
      "end_offset" : 94,
      "type" : "<ALPHANUM>",
      "position" : 123
    },
    {
      "token" : "lazy",
      "start_offset" : 95,
      "end_offset" : 99,
      "type" : "<ALPHANUM>",
      "position" : 124
    },
    {
      "token" : "dog's",
      "start_offset" : 100,
      "end_offset" : 105,
      "type" : "<ALPHANUM>",
      "position" : 125
    },
    {
      "token" : "bone",
      "start_offset" : 106,
      "end_offset" : 110,
      "type" : "<ALPHANUM>",
      "position" : 126
    },
    {
      "token" : "我",
      "start_offset" : 112,
      "end_offset" : 113,
      "type" : "<IDEOGRAPHIC>",
      "position" : 227
    },
    {
      "token" : "是",
      "start_offset" : 113,
      "end_offset" : 114,
      "type" : "<IDEOGRAPHIC>",
      "position" : 228
    },
    {
      "token" : "中",
      "start_offset" : 114,
      "end_offset" : 115,
      "type" : "<IDEOGRAPHIC>",
      "position" : 229
    },
    {
      "token" : "国",
      "start_offset" : 115,
      "end_offset" : 116,
      "type" : "<IDEOGRAPHIC>",
      "position" : 230
    },
    {
      "token" : "人",
      "start_offset" : 116,
      "end_offset" : 117,
      "type" : "<IDEOGRAPHIC>",
      "position" : 231
    },
    {
      "token" : "[email protected]",
      "start_offset" : 118,
      "end_offset" : 129,
      "type" : "<EMAIL>",
      "position" : 332
    },
    {
      "token" : "http://www.baidu.com",
      "start_offset" : 130,
      "end_offset" : 150,
      "type" : "<URL>",
      "position" : 433
    }
  ]
}
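
max_token_length applies here as well, but a URL or email longer than the limit is broken into fragments instead of surviving as one <URL>/<EMAIL> token, so the default is usually kept when addresses must stay whole. A quick sketch:

GET /_analyze
{
  "tokenizer": {
    "type" : "uax_url_email",
    "max_token_length" : 5
  },
  "text": "http://www.baidu.com"
}
# the URL comes back as short fragments, not a single <URL> token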

           
# classic tokenizer
# English-oriented (the classic Lucene grammar)
# recognizes emails, hostnames, acronyms, company names, possessives, etc.;
# a hyphenated token containing a digit is kept whole as a product
# code - compare "2-3-5-7" below with the standard tokenizer output above
# Options:
#   max_token_length
GET /_analyze
{
  "tokenizer": {
    "type" : "classic"
  },
  "text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
  "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
  "我是中国人",
  "[email protected]",
  "http://www.baidu.com",
  "127.0.0.1",
  "2344232 fdgfd"]
}

# Result
{
  "tokens" : [
    {
      "token" : "this",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "is",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "b",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "c",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "d",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "2-3-5-7",
      "start_offset" : 16,
      "end_offset" : 23,
      "type" : "<NUM>",
      "position" : 6
    },
    {
      "token" : "p",
      "start_offset" : 25,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "hello",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "p",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "span",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "good",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "<ALPHANUM>",
      "position" : 11
    },
    {
      "token" : "span",
      "start_offset" : 49,
      "end_offset" : 53,
      "type" : "<ALPHANUM>",
      "position" : 12
    },
    {
      "token" : "The",
      "start_offset" : 55,
      "end_offset" : 58,
      "type" : "<ALPHANUM>",
      "position" : 113
    },
    {
      "token" : "2",
      "start_offset" : 59,
      "end_offset" : 60,
      "type" : "<ALPHANUM>",
      "position" : 114
    },
    {
      "token" : "QUICK",
      "start_offset" : 61,
      "end_offset" : 66,
      "type" : "<ALPHANUM>",
      "position" : 115
    },
    {
      "token" : "Brown",
      "start_offset" : 67,
      "end_offset" : 72,
      "type" : "<ALPHANUM>",
      "position" : 116
    },
    {
      "token" : "Foxes",
      "start_offset" : 73,
      "end_offset" : 78,
      "type" : "<ALPHANUM>",
      "position" : 117
    },
    {
      "token" : "jumped",
      "start_offset" : 79,
      "end_offset" : 85,
      "type" : "<ALPHANUM>",
      "position" : 118
    },
    {
      "token" : "over",
      "start_offset" : 86,
      "end_offset" : 90,
      "type" : "<ALPHANUM>",
      "position" : 119
    },
    {
      "token" : "the",
      "start_offset" : 91,
      "end_offset" : 94,
      "type" : "<ALPHANUM>",
      "position" : 120
    },
    {
      "token" : "lazy",
      "start_offset" : 95,
      "end_offset" : 99,
      "type" : "<ALPHANUM>",
      "position" : 121
    },
    {
      "token" : "dog's",
      "start_offset" : 100,
      "end_offset" : 105,
      "type" : "<APOSTROPHE>",
      "position" : 122
    },
    {
      "token" : "bone",
      "start_offset" : 106,
      "end_offset" : 110,
      "type" : "<ALPHANUM>",
      "position" : 123
    },
    {
      "token" : "我",
      "start_offset" : 112,
      "end_offset" : 113,
      "type" : "<CJ>",
      "position" : 224
    },
    {
      "token" : "是",
      "start_offset" : 113,
      "end_offset" : 114,
      "type" : "<CJ>",
      "position" : 225
    },
    {
      "token" : "中",
      "start_offset" : 114,
      "end_offset" : 115,
      "type" : "<CJ>",
      "position" : 226
    },
    {
      "token" : "国",
      "start_offset" : 115,
      "end_offset" : 116,
      "type" : "<CJ>",
      "position" : 227
    },
    {
      "token" : "人",
      "start_offset" : 116,
      "end_offset" : 117,
      "type" : "<CJ>",
      "position" : 228
    },
    {
      "token" : "[email protected]",
      "start_offset" : 118,
      "end_offset" : 129,
      "type" : "<EMAIL>",
      "position" : 329
    },
    {
      "token" : "http",
      "start_offset" : 130,
      "end_offset" : 134,
      "type" : "<ALPHANUM>",
      "position" : 430
    },
    {
      "token" : "www.baidu.com",
      "start_offset" : 137,
      "end_offset" : 150,
      "type" : "<HOST>",
      "position" : 431
    },
    {
      "token" : "127.0.0.1",
      "start_offset" : 151,
      "end_offset" : 160,
      "type" : "<HOST>",
      "position" : 532
    },
    {
      "token" : "2344232",
      "start_offset" : 161,
      "end_offset" : 168,
      "type" : "<ALPHANUM>",
      "position" : 633
    },
    {
      "token" : "fdgfd",
      "start_offset" : 169,
      "end_offset" : 174,
      "type" : "<ALPHANUM>",
      "position" : 634
    }
  ]
}

           
# Thai tokenizer
# Thai-language tokenizer - not something we need here, so just a quick demo
GET _analyze
{
  "tokenizer": "thai",
  "text": "การที่ได้ต้องแสดงว่างานดี"
}

Result:
[ การ, ที่, ได้, ต้อง, แสดง, ว่า, งาน, ดี ]
           
