There is no perfect program in this world, but that is no reason for discouragement: writing programs is a continuous pursuit of perfection.
Custom analyzer:
-
Character filters:
1. Purpose: add, remove, or transform characters
2. Count: zero or more
3. Built-in character filters:
1. HTML Strip Character Filter: strips HTML tags
2. Mapping Character Filter: replaces characters via a mapping table
3. Pattern Replace Character Filter: replaces characters via a regular expression
-
Tokenizer:
1. Purpose:
1. Split the text into tokens
2. Record each token's order and position (used by phrase queries)
3. Record each token's start and end offsets (used for highlighting)
4. Record each token's type (used for classification)
2. Count: exactly one
3. Categories:
1. Word-oriented:
1. Standard
2. Letter
3. Lowercase
4. Whitespace
5. UAX URL Email
6. Classic
7. Thai
2. Partial word:
1. N-Gram
2. Edge N-Gram
3. Structured text:
1. Keyword
2. Pattern
3. Simple Pattern
4. Char Group
5. Simple Pattern Split
6. Path Hierarchy
-
Token filters:
1. Purpose: add, remove, or transform tokens
2. Count: zero or more
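The three-stage pipeline above (zero or more character filters, exactly one tokenizer, zero or more token filters) can be sketched in Python. This is a conceptual model only; the filter names here are local stand-ins, not the Elasticsearch implementation:

```python
import re

def html_strip(text):
    # Character filter: drop HTML tags (simplified stand-in for ES's html_strip)
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenize(text):
    # Tokenizer: exactly one per analyzer; here it splits on whitespace
    return text.split()

def lowercase(tokens):
    # Token filter: transforms each emitted token
    return [t.lower() for t in tokens]

def analyze(text, char_filters, tokenizer, token_filters):
    # 0+ char filters -> exactly 1 tokenizer -> 0+ token filters
    for f in char_filters:
        text = f(text)
    tokens = tokenizer(text)
    for f in token_filters:
        tokens = f(tokens)
    return tokens

print(analyze("<p>Hello World</p>", [html_strip], whitespace_tokenize, [lowercase]))
# -> ['hello', 'world']
```

The key constraint the model encodes is that character filters see raw text, while token filters only ever see the tokenizer's output.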
Today we demonstrate the word-oriented tokenizers:
# standard tokenizer
# Strips most symbols and punctuation
# Splits English by word and Chinese character by character
# Note: positions jump by 100 between the strings of the text array (the default position_increment_gap)
# Options:
#   max_token_length - maximum token length, default 255; longer tokens are split at this length
GET /_analyze
{
"tokenizer": {
"type" : "standard",
"max_token_length" : 4
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中国人"
]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "2",
"start_offset" : 16,
"end_offset" : 17,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "3",
"start_offset" : 18,
"end_offset" : 19,
"type" : "<NUM>",
"position" : 7
},
{
"token" : "5",
"start_offset" : 20,
"end_offset" : 21,
"type" : "<NUM>",
"position" : 8
},
{
"token" : "7",
"start_offset" : 22,
"end_offset" : 23,
"type" : "<NUM>",
"position" : 9
},
{
"token" : "p",
"start_offset" : 25,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "hell",
"start_offset" : 27,
"end_offset" : 31,
"type" : "<ALPHANUM>",
"position" : 11
},
{
"token" : "o",
"start_offset" : 31,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 12
},
{
"token" : "p",
"start_offset" : 34,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 13
},
{
"token" : "span",
"start_offset" : 38,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 14
},
{
"token" : "good",
"start_offset" : 43,
"end_offset" : 47,
"type" : "<ALPHANUM>",
"position" : 15
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "<ALPHANUM>",
"position" : 16
},
{
"token" : "The",
"start_offset" : 55,
"end_offset" : 58,
"type" : "<ALPHANUM>",
"position" : 117
},
{
"token" : "2",
"start_offset" : 59,
"end_offset" : 60,
"type" : "<NUM>",
"position" : 118
},
{
"token" : "QUIC",
"start_offset" : 61,
"end_offset" : 65,
"type" : "<ALPHANUM>",
"position" : 119
},
{
"token" : "K",
"start_offset" : 65,
"end_offset" : 66,
"type" : "<ALPHANUM>",
"position" : 120
},
{
"token" : "Brow",
"start_offset" : 67,
"end_offset" : 71,
"type" : "<ALPHANUM>",
"position" : 121
},
{
"token" : "n",
"start_offset" : 71,
"end_offset" : 72,
"type" : "<ALPHANUM>",
"position" : 122
},
{
"token" : "Foxe",
"start_offset" : 73,
"end_offset" : 77,
"type" : "<ALPHANUM>",
"position" : 123
},
{
"token" : "s",
"start_offset" : 77,
"end_offset" : 78,
"type" : "<ALPHANUM>",
"position" : 124
},
{
"token" : "jump",
"start_offset" : 79,
"end_offset" : 83,
"type" : "<ALPHANUM>",
"position" : 125
},
{
"token" : "ed",
"start_offset" : 83,
"end_offset" : 85,
"type" : "<ALPHANUM>",
"position" : 126
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "<ALPHANUM>",
"position" : 127
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "<ALPHANUM>",
"position" : 128
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "<ALPHANUM>",
"position" : 129
},
{
"token" : "dog",
"start_offset" : 100,
"end_offset" : 103,
"type" : "<ALPHANUM>",
"position" : 130
},
{
"token" : "s",
"start_offset" : 104,
"end_offset" : 105,
"type" : "<ALPHANUM>",
"position" : 131
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "<ALPHANUM>",
"position" : 132
},
{
"token" : "我",
"start_offset" : 112,
"end_offset" : 113,
"type" : "<IDEOGRAPHIC>",
"position" : 233
},
{
"token" : "是",
"start_offset" : 113,
"end_offset" : 114,
"type" : "<IDEOGRAPHIC>",
"position" : 234
},
{
"token" : "中",
"start_offset" : 114,
"end_offset" : 115,
"type" : "<IDEOGRAPHIC>",
"position" : 235
},
{
"token" : "国",
"start_offset" : 115,
"end_offset" : 116,
"type" : "<IDEOGRAPHIC>",
"position" : 236
},
{
"token" : "人",
"start_offset" : 116,
"end_offset" : 117,
"type" : "<IDEOGRAPHIC>",
"position" : 237
}
]
}
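The `hell` / `o` and `QUIC` / `K` pairs in the result come from `max_token_length: 4`: a token longer than the limit is emitted in consecutive chunks of that length. A rough Python illustration of the chunking (my own sketch, not Lucene's code):

```python
def chunk_token(token, max_len=4):
    # Emit a long token in consecutive slices of at most max_len characters,
    # mirroring how max_token_length: 4 splits "hello" into "hell" + "o"
    return [token[i:i + max_len] for i in range(0, len(token), max_len)]

print(chunk_token("hello"))  # -> ['hell', 'o']
print(chunk_token("QUICK"))  # -> ['QUIC', 'K']
print(chunk_token("this"))   # -> ['this'] (within the limit, unchanged)
```

This is why a small `max_token_length` is almost never what you want for prose; the default of 255 rarely triggers.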
# letter tokenizer
# Splits English into words; a Chinese string is kept as a single token
# Drops all symbols
GET /_analyze
{
"tokenizer": {
"type" : "letter"
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中国人"]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "word",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "word",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "word",
"position" : 5
},
{
"token" : "p",
"start_offset" : 25,
"end_offset" : 26,
"type" : "word",
"position" : 6
},
{
"token" : "hello",
"start_offset" : 27,
"end_offset" : 32,
"type" : "word",
"position" : 7
},
{
"token" : "p",
"start_offset" : 34,
"end_offset" : 35,
"type" : "word",
"position" : 8
},
{
"token" : "span",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 9
},
{
"token" : "good",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 10
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "word",
"position" : 11
},
{
"token" : "The",
"start_offset" : 55,
"end_offset" : 58,
"type" : "word",
"position" : 112
},
{
"token" : "QUICK",
"start_offset" : 61,
"end_offset" : 66,
"type" : "word",
"position" : 113
},
{
"token" : "Brown",
"start_offset" : 67,
"end_offset" : 72,
"type" : "word",
"position" : 114
},
{
"token" : "Foxes",
"start_offset" : 73,
"end_offset" : 78,
"type" : "word",
"position" : 115
},
{
"token" : "jumped",
"start_offset" : 79,
"end_offset" : 85,
"type" : "word",
"position" : 116
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "word",
"position" : 117
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "word",
"position" : 118
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "word",
"position" : 119
},
{
"token" : "dog",
"start_offset" : 100,
"end_offset" : 103,
"type" : "word",
"position" : 120
},
{
"token" : "s",
"start_offset" : 104,
"end_offset" : 105,
"type" : "word",
"position" : 121
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "word",
"position" : 122
},
{
"token" : "我是中国人",
"start_offset" : 112,
"end_offset" : 117,
"type" : "word",
"position" : 223
}
]
}
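The letter tokenizer's behavior in this result (English split into words, symbols and digits dropped, and 我是中国人 kept as a single token) can be approximated with a Unicode-letter regex. This is a sketch of the idea, not the Lucene implementation:

```python
import re

def letter_tokenize(text):
    # Runs of Unicode letters become tokens; everything else separates them.
    # [^\W\d_] matches any Unicode letter, so CJK characters count as letters
    # and an unbroken Chinese string stays one token.
    return [m.group() for m in re.finditer(r"[^\W\d_]+", text)]

print(letter_tokenize("dog's bone 2-3"))  # -> ['dog', 's', 'bone']
print(letter_tokenize("我是中国人"))       # -> ['我是中国人']
```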
# lowercase tokenizer
# Like the letter tokenizer,
# but lowercases every token
GET /_analyze
{
"tokenizer": {
"type" : "lowercase"
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中国人"]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "word",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "word",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "word",
"position" : 5
},
{
"token" : "p",
"start_offset" : 25,
"end_offset" : 26,
"type" : "word",
"position" : 6
},
{
"token" : "hello",
"start_offset" : 27,
"end_offset" : 32,
"type" : "word",
"position" : 7
},
{
"token" : "p",
"start_offset" : 34,
"end_offset" : 35,
"type" : "word",
"position" : 8
},
{
"token" : "span",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 9
},
{
"token" : "good",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 10
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "word",
"position" : 11
},
{
"token" : "the",
"start_offset" : 55,
"end_offset" : 58,
"type" : "word",
"position" : 112
},
{
"token" : "quick",
"start_offset" : 61,
"end_offset" : 66,
"type" : "word",
"position" : 113
},
{
"token" : "brown",
"start_offset" : 67,
"end_offset" : 72,
"type" : "word",
"position" : 114
},
{
"token" : "foxes",
"start_offset" : 73,
"end_offset" : 78,
"type" : "word",
"position" : 115
},
{
"token" : "jumped",
"start_offset" : 79,
"end_offset" : 85,
"type" : "word",
"position" : 116
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "word",
"position" : 117
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "word",
"position" : 118
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "word",
"position" : 119
},
{
"token" : "dog",
"start_offset" : 100,
"end_offset" : 103,
"type" : "word",
"position" : 120
},
{
"token" : "s",
"start_offset" : 104,
"end_offset" : 105,
"type" : "word",
"position" : 121
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "word",
"position" : 122
},
{
"token" : "我是中国人",
"start_offset" : 112,
"end_offset" : 117,
"type" : "word",
"position" : 223
}
]
}
# whitespace tokenizer
# Splits on whitespace only; symbols are kept
# Options:
#   max_token_length - maximum token length, default 255
GET /_analyze
{
"tokenizer": {
"type" : "whitespace",
"max_token_length" : 4
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中国人"]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "word",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "word",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "word",
"position" : 5
},
{
"token" : "2-3-",
"start_offset" : 16,
"end_offset" : 20,
"type" : "word",
"position" : 6
},
{
"token" : "5-7",
"start_offset" : 20,
"end_offset" : 23,
"type" : "word",
"position" : 7
},
{
"token" : "<p>h",
"start_offset" : 24,
"end_offset" : 28,
"type" : "word",
"position" : 8
},
{
"token" : "ello",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 9
},
{
"token" : "</p>",
"start_offset" : 32,
"end_offset" : 36,
"type" : "word",
"position" : 10
},
{
"token" : "<spa",
"start_offset" : 37,
"end_offset" : 41,
"type" : "word",
"position" : 11
},
{
"token" : "n>go",
"start_offset" : 41,
"end_offset" : 45,
"type" : "word",
"position" : 12
},
{
"token" : "od</",
"start_offset" : 45,
"end_offset" : 49,
"type" : "word",
"position" : 13
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "word",
"position" : 14
},
{
"token" : ">",
"start_offset" : 53,
"end_offset" : 54,
"type" : "word",
"position" : 15
},
{
"token" : "The",
"start_offset" : 55,
"end_offset" : 58,
"type" : "word",
"position" : 116
},
{
"token" : "2",
"start_offset" : 59,
"end_offset" : 60,
"type" : "word",
"position" : 117
},
{
"token" : "QUIC",
"start_offset" : 61,
"end_offset" : 65,
"type" : "word",
"position" : 118
},
{
"token" : "K",
"start_offset" : 65,
"end_offset" : 66,
"type" : "word",
"position" : 119
},
{
"token" : "Brow",
"start_offset" : 67,
"end_offset" : 71,
"type" : "word",
"position" : 120
},
{
"token" : "n-Fo",
"start_offset" : 71,
"end_offset" : 75,
"type" : "word",
"position" : 121
},
{
"token" : "xes",
"start_offset" : 75,
"end_offset" : 78,
"type" : "word",
"position" : 122
},
{
"token" : "jump",
"start_offset" : 79,
"end_offset" : 83,
"type" : "word",
"position" : 123
},
{
"token" : "ed",
"start_offset" : 83,
"end_offset" : 85,
"type" : "word",
"position" : 124
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "word",
"position" : 125
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "word",
"position" : 126
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "word",
"position" : 127
},
{
"token" : "dog'",
"start_offset" : 100,
"end_offset" : 104,
"type" : "word",
"position" : 128
},
{
"token" : "s",
"start_offset" : 104,
"end_offset" : 105,
"type" : "word",
"position" : 129
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "word",
"position" : 130
},
{
"token" : ".",
"start_offset" : 110,
"end_offset" : 111,
"type" : "word",
"position" : 131
},
{
"token" : "我是中国",
"start_offset" : 112,
"end_offset" : 116,
"type" : "word",
"position" : 232
},
{
"token" : "人",
"start_offset" : 116,
"end_offset" : 117,
"type" : "word",
"position" : 233
}
]
}
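Because the whitespace tokenizer splits only on whitespace and then applies `max_token_length` chunking, `<p>hello</p>` comes back as `<p>h`, `ello`, `</p>` in the result above. Both steps together, as a local sketch (my own helper, not ES code):

```python
def whitespace_tokenize(text, max_token_length=4):
    tokens = []
    for word in text.split():  # split on whitespace only; symbols survive
        # chunk any word longer than max_token_length into fixed-size slices
        for i in range(0, len(word), max_token_length):
            tokens.append(word[i:i + max_token_length])
    return tokens

print(whitespace_tokenize("<p>hello</p>"))  # -> ['<p>h', 'ello', '</p>']
print(whitespace_tokenize("2-3-5-7"))       # -> ['2-3-', '5-7']
```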
# uax_url_email tokenizer
# Like the standard tokenizer,
# but recognizes URLs and email addresses as single tokens
# Options:
#   max_token_length - maximum token length, default 255
GET /_analyze
{
"tokenizer": {
"type" : "uax_url_email"
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中国人",
"[email protected]",
"http://www.baidu.com"]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "2",
"start_offset" : 16,
"end_offset" : 17,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "3",
"start_offset" : 18,
"end_offset" : 19,
"type" : "<NUM>",
"position" : 7
},
{
"token" : "5",
"start_offset" : 20,
"end_offset" : 21,
"type" : "<NUM>",
"position" : 8
},
{
"token" : "7",
"start_offset" : 22,
"end_offset" : 23,
"type" : "<NUM>",
"position" : 9
},
{
"token" : "p",
"start_offset" : 25,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "hello",
"start_offset" : 27,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 11
},
{
"token" : "p",
"start_offset" : 34,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 12
},
{
"token" : "span",
"start_offset" : 38,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 13
},
{
"token" : "good",
"start_offset" : 43,
"end_offset" : 47,
"type" : "<ALPHANUM>",
"position" : 14
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "<ALPHANUM>",
"position" : 15
},
{
"token" : "The",
"start_offset" : 55,
"end_offset" : 58,
"type" : "<ALPHANUM>",
"position" : 116
},
{
"token" : "2",
"start_offset" : 59,
"end_offset" : 60,
"type" : "<NUM>",
"position" : 117
},
{
"token" : "QUICK",
"start_offset" : 61,
"end_offset" : 66,
"type" : "<ALPHANUM>",
"position" : 118
},
{
"token" : "Brown",
"start_offset" : 67,
"end_offset" : 72,
"type" : "<ALPHANUM>",
"position" : 119
},
{
"token" : "Foxes",
"start_offset" : 73,
"end_offset" : 78,
"type" : "<ALPHANUM>",
"position" : 120
},
{
"token" : "jumped",
"start_offset" : 79,
"end_offset" : 85,
"type" : "<ALPHANUM>",
"position" : 121
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "<ALPHANUM>",
"position" : 122
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "<ALPHANUM>",
"position" : 123
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "<ALPHANUM>",
"position" : 124
},
{
"token" : "dog's",
"start_offset" : 100,
"end_offset" : 105,
"type" : "<ALPHANUM>",
"position" : 125
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "<ALPHANUM>",
"position" : 126
},
{
"token" : "我",
"start_offset" : 112,
"end_offset" : 113,
"type" : "<IDEOGRAPHIC>",
"position" : 227
},
{
"token" : "是",
"start_offset" : 113,
"end_offset" : 114,
"type" : "<IDEOGRAPHIC>",
"position" : 228
},
{
"token" : "中",
"start_offset" : 114,
"end_offset" : 115,
"type" : "<IDEOGRAPHIC>",
"position" : 229
},
{
"token" : "国",
"start_offset" : 115,
"end_offset" : 116,
"type" : "<IDEOGRAPHIC>",
"position" : 230
},
{
"token" : "人",
"start_offset" : 116,
"end_offset" : 117,
"type" : "<IDEOGRAPHIC>",
"position" : 231
},
{
"token" : "[email protected]",
"start_offset" : 118,
"end_offset" : 129,
"type" : "<EMAIL>",
"position" : 332
},
{
"token" : "http://www.baidu.com",
"start_offset" : 130,
"end_offset" : 150,
"type" : "<URL>",
"position" : 433
}
]
}
# classic tokenizer
# A grammar-based tokenizer designed for English text
# Recognizes email addresses, hostnames, acronyms, company names, etc.
# Options:
#   max_token_length - maximum token length, default 255
GET /_analyze
{
"tokenizer": {
"type" : "classic"
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中国人",
"[email protected]",
"http://www.baidu.com",
"127.0.0.1",
"2344232 fdgfd"]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "2-3-5-7",
"start_offset" : 16,
"end_offset" : 23,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "p",
"start_offset" : 25,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "hello",
"start_offset" : 27,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "p",
"start_offset" : 34,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "span",
"start_offset" : 38,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "good",
"start_offset" : 43,
"end_offset" : 47,
"type" : "<ALPHANUM>",
"position" : 11
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "<ALPHANUM>",
"position" : 12
},
{
"token" : "The",
"start_offset" : 55,
"end_offset" : 58,
"type" : "<ALPHANUM>",
"position" : 113
},
{
"token" : "2",
"start_offset" : 59,
"end_offset" : 60,
"type" : "<ALPHANUM>",
"position" : 114
},
{
"token" : "QUICK",
"start_offset" : 61,
"end_offset" : 66,
"type" : "<ALPHANUM>",
"position" : 115
},
{
"token" : "Brown",
"start_offset" : 67,
"end_offset" : 72,
"type" : "<ALPHANUM>",
"position" : 116
},
{
"token" : "Foxes",
"start_offset" : 73,
"end_offset" : 78,
"type" : "<ALPHANUM>",
"position" : 117
},
{
"token" : "jumped",
"start_offset" : 79,
"end_offset" : 85,
"type" : "<ALPHANUM>",
"position" : 118
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "<ALPHANUM>",
"position" : 119
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "<ALPHANUM>",
"position" : 120
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "<ALPHANUM>",
"position" : 121
},
{
"token" : "dog's",
"start_offset" : 100,
"end_offset" : 105,
"type" : "<APOSTROPHE>",
"position" : 122
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "<ALPHANUM>",
"position" : 123
},
{
"token" : "我",
"start_offset" : 112,
"end_offset" : 113,
"type" : "<CJ>",
"position" : 224
},
{
"token" : "是",
"start_offset" : 113,
"end_offset" : 114,
"type" : "<CJ>",
"position" : 225
},
{
"token" : "中",
"start_offset" : 114,
"end_offset" : 115,
"type" : "<CJ>",
"position" : 226
},
{
"token" : "国",
"start_offset" : 115,
"end_offset" : 116,
"type" : "<CJ>",
"position" : 227
},
{
"token" : "人",
"start_offset" : 116,
"end_offset" : 117,
"type" : "<CJ>",
"position" : 228
},
{
"token" : "[email protected]",
"start_offset" : 118,
"end_offset" : 129,
"type" : "<EMAIL>",
"position" : 329
},
{
"token" : "http",
"start_offset" : 130,
"end_offset" : 134,
"type" : "<ALPHANUM>",
"position" : 430
},
{
"token" : "www.baidu.com",
"start_offset" : 137,
"end_offset" : 150,
"type" : "<HOST>",
"position" : 431
},
{
"token" : "127.0.0.1",
"start_offset" : 151,
"end_offset" : 160,
"type" : "<HOST>",
"position" : 532
},
{
"token" : "2344232",
"start_offset" : 161,
"end_offset" : 168,
"type" : "<ALPHANUM>",
"position" : 633
},
{
"token" : "fdgfd",
"start_offset" : 169,
"end_offset" : 174,
"type" : "<ALPHANUM>",
"position" : 634
}
]
}
# thai tokenizer
# Tokenizer for Thai text - we won't need it, so it is not covered in detail
GET _analyze
{
"tokenizer": "thai",
"text": "การที่ได้ต้องแสดงว่างานดี"
}
# Result
[ การ, ที่, ได้, ต้อง, แสดง, ว่า, งาน, ดี ]