No program in the world is perfect, but that is no reason to be discouraged: writing programs is a continuous pursuit of perfection.
Custom analyzers :
-
Character filters :
1. Purpose : add, remove, or transform characters
2. Count : zero or more allowed
3. Built-in character filters :
    1. HTML Strip Character Filter : strips HTML tags
    2. Mapping Character Filter : replaces by mapping
    3. Pattern Replace Character Filter : replaces by regular expression
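A character filter can be tried directly in an ad-hoc `_analyze` request, combined with any tokenizer. A minimal sketch (the sample text is illustrative): `html_strip` removes the tags before the tokenizer ever sees them :
GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "text": "<p>hello <b>world</b></p>"
}
# Tokens : hello, world - compare this with the standard-tokenizer example below,
# where without the char filter the tag names leak into the token stream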
-
Tokenizer :
1. Purpose :
    1. Splits the text into tokens
    2. Records the order and position of each token (for phrase queries)
    3. Records the start and end character offsets of each token (for highlighting)
    4. Records the type of each token (for classification)
2. Count : exactly one is required
3. Categories :
    1. Word-oriented tokenizers :
        1. Standard
        2. Letter
        3. Lowercase
        4. Whitespace
        5. UAX URL Email
        6. Classic
        7. Thai
    2. Partial-word tokenizers :
        1. N-Gram
        2. Edge N-Gram
    3. Structured-text tokenizers :
        1. Keyword
        2. Pattern
        3. Simple Pattern
        4. Char Group
        5. Simple Pattern Split
        6. Path
-
Token filters :
1. Purpose : add, remove, or transform tokens
2. Count : zero or more allowed
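The three components above are what a custom analyzer is assembled from in an index's settings. A minimal sketch (the index name `my_index` and the specific char filter / token filter choices are illustrative, not from these notes) :
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
# char_filter : 0..n entries, tokenizer : exactly 1, filter : 0..n entries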
Today we demonstrate the word-oriented tokenizers from the Tokenizer categories above :
# standard tokenizer
# Removes most punctuation and symbols
# Splits English word by word, Chinese character by character
# Options :
# max_token_length - maximum length of each token; default 255
GET /_analyze
{
"tokenizer": {
"type" : "standard",
"max_token_length" : 4
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中國人"
]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "2",
"start_offset" : 16,
"end_offset" : 17,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "3",
"start_offset" : 18,
"end_offset" : 19,
"type" : "<NUM>",
"position" : 7
},
{
"token" : "5",
"start_offset" : 20,
"end_offset" : 21,
"type" : "<NUM>",
"position" : 8
},
{
"token" : "7",
"start_offset" : 22,
"end_offset" : 23,
"type" : "<NUM>",
"position" : 9
},
{
"token" : "p",
"start_offset" : 25,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "hell",
"start_offset" : 27,
"end_offset" : 31,
"type" : "<ALPHANUM>",
"position" : 11
},
{
"token" : "o",
"start_offset" : 31,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 12
},
{
"token" : "p",
"start_offset" : 34,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 13
},
{
"token" : "span",
"start_offset" : 38,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 14
},
{
"token" : "good",
"start_offset" : 43,
"end_offset" : 47,
"type" : "<ALPHANUM>",
"position" : 15
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "<ALPHANUM>",
"position" : 16
},
{
"token" : "The",
"start_offset" : 55,
"end_offset" : 58,
"type" : "<ALPHANUM>",
"position" : 117
},
{
"token" : "2",
"start_offset" : 59,
"end_offset" : 60,
"type" : "<NUM>",
"position" : 118
},
{
"token" : "QUIC",
"start_offset" : 61,
"end_offset" : 65,
"type" : "<ALPHANUM>",
"position" : 119
},
{
"token" : "K",
"start_offset" : 65,
"end_offset" : 66,
"type" : "<ALPHANUM>",
"position" : 120
},
{
"token" : "Brow",
"start_offset" : 67,
"end_offset" : 71,
"type" : "<ALPHANUM>",
"position" : 121
},
{
"token" : "n",
"start_offset" : 71,
"end_offset" : 72,
"type" : "<ALPHANUM>",
"position" : 122
},
{
"token" : "Foxe",
"start_offset" : 73,
"end_offset" : 77,
"type" : "<ALPHANUM>",
"position" : 123
},
{
"token" : "s",
"start_offset" : 77,
"end_offset" : 78,
"type" : "<ALPHANUM>",
"position" : 124
},
{
"token" : "jump",
"start_offset" : 79,
"end_offset" : 83,
"type" : "<ALPHANUM>",
"position" : 125
},
{
"token" : "ed",
"start_offset" : 83,
"end_offset" : 85,
"type" : "<ALPHANUM>",
"position" : 126
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "<ALPHANUM>",
"position" : 127
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "<ALPHANUM>",
"position" : 128
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "<ALPHANUM>",
"position" : 129
},
{
"token" : "dog",
"start_offset" : 100,
"end_offset" : 103,
"type" : "<ALPHANUM>",
"position" : 130
},
{
"token" : "s",
"start_offset" : 104,
"end_offset" : 105,
"type" : "<ALPHANUM>",
"position" : 131
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "<ALPHANUM>",
"position" : 132
},
{
"token" : "我",
"start_offset" : 112,
"end_offset" : 113,
"type" : "<IDEOGRAPHIC>",
"position" : 233
},
{
"token" : "是",
"start_offset" : 113,
"end_offset" : 114,
"type" : "<IDEOGRAPHIC>",
"position" : 234
},
{
"token" : "中",
"start_offset" : 114,
"end_offset" : 115,
"type" : "<IDEOGRAPHIC>",
"position" : 235
},
{
"token" : "國",
"start_offset" : 115,
"end_offset" : 116,
"type" : "<IDEOGRAPHIC>",
"position" : 236
},
{
"token" : "人",
"start_offset" : 116,
"end_offset" : 117,
"type" : "<IDEOGRAPHIC>",
"position" : 237
}
]
}
# letter tokenizer
# Splits English into words; leaves Chinese runs unsplit
# Drops all non-letter characters (digits and symbols)
GET /_analyze
{
"tokenizer": {
"type" : "letter"
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中國人"]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "word",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "word",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "word",
"position" : 5
},
{
"token" : "p",
"start_offset" : 25,
"end_offset" : 26,
"type" : "word",
"position" : 6
},
{
"token" : "hello",
"start_offset" : 27,
"end_offset" : 32,
"type" : "word",
"position" : 7
},
{
"token" : "p",
"start_offset" : 34,
"end_offset" : 35,
"type" : "word",
"position" : 8
},
{
"token" : "span",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 9
},
{
"token" : "good",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 10
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "word",
"position" : 11
},
{
"token" : "The",
"start_offset" : 55,
"end_offset" : 58,
"type" : "word",
"position" : 112
},
{
"token" : "QUICK",
"start_offset" : 61,
"end_offset" : 66,
"type" : "word",
"position" : 113
},
{
"token" : "Brown",
"start_offset" : 67,
"end_offset" : 72,
"type" : "word",
"position" : 114
},
{
"token" : "Foxes",
"start_offset" : 73,
"end_offset" : 78,
"type" : "word",
"position" : 115
},
{
"token" : "jumped",
"start_offset" : 79,
"end_offset" : 85,
"type" : "word",
"position" : 116
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "word",
"position" : 117
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "word",
"position" : 118
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "word",
"position" : 119
},
{
"token" : "dog",
"start_offset" : 100,
"end_offset" : 103,
"type" : "word",
"position" : 120
},
{
"token" : "s",
"start_offset" : 104,
"end_offset" : 105,
"type" : "word",
"position" : 121
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "word",
"position" : 122
},
{
"token" : "我是中國人",
"start_offset" : 112,
"end_offset" : 117,
"type" : "word",
"position" : 223
}
]
}
# lowercase tokenizer
# Similar to the letter tokenizer
# Additionally lowercases every token
GET /_analyze
{
"tokenizer": {
"type" : "lowercase"
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中國人"]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "word",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "word",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "word",
"position" : 5
},
{
"token" : "p",
"start_offset" : 25,
"end_offset" : 26,
"type" : "word",
"position" : 6
},
{
"token" : "hello",
"start_offset" : 27,
"end_offset" : 32,
"type" : "word",
"position" : 7
},
{
"token" : "p",
"start_offset" : 34,
"end_offset" : 35,
"type" : "word",
"position" : 8
},
{
"token" : "span",
"start_offset" : 38,
"end_offset" : 42,
"type" : "word",
"position" : 9
},
{
"token" : "good",
"start_offset" : 43,
"end_offset" : 47,
"type" : "word",
"position" : 10
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "word",
"position" : 11
},
{
"token" : "the",
"start_offset" : 55,
"end_offset" : 58,
"type" : "word",
"position" : 112
},
{
"token" : "quick",
"start_offset" : 61,
"end_offset" : 66,
"type" : "word",
"position" : 113
},
{
"token" : "brown",
"start_offset" : 67,
"end_offset" : 72,
"type" : "word",
"position" : 114
},
{
"token" : "foxes",
"start_offset" : 73,
"end_offset" : 78,
"type" : "word",
"position" : 115
},
{
"token" : "jumped",
"start_offset" : 79,
"end_offset" : 85,
"type" : "word",
"position" : 116
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "word",
"position" : 117
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "word",
"position" : 118
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "word",
"position" : 119
},
{
"token" : "dog",
"start_offset" : 100,
"end_offset" : 103,
"type" : "word",
"position" : 120
},
{
"token" : "s",
"start_offset" : 104,
"end_offset" : 105,
"type" : "word",
"position" : 121
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "word",
"position" : 122
},
{
"token" : "我是中國人",
"start_offset" : 112,
"end_offset" : 117,
"type" : "word",
"position" : 223
}
]
}
# whitespace tokenizer
# Splits on whitespace only
# Options :
# max_token_length
GET /_analyze
{
"tokenizer": {
"type" : "whitespace",
"max_token_length" : 4
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中國人"]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "word",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "word",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "word",
"position" : 5
},
{
"token" : "2-3-",
"start_offset" : 16,
"end_offset" : 20,
"type" : "word",
"position" : 6
},
{
"token" : "5-7",
"start_offset" : 20,
"end_offset" : 23,
"type" : "word",
"position" : 7
},
{
"token" : "<p>h",
"start_offset" : 24,
"end_offset" : 28,
"type" : "word",
"position" : 8
},
{
"token" : "ello",
"start_offset" : 28,
"end_offset" : 32,
"type" : "word",
"position" : 9
},
{
"token" : "</p>",
"start_offset" : 32,
"end_offset" : 36,
"type" : "word",
"position" : 10
},
{
"token" : "<spa",
"start_offset" : 37,
"end_offset" : 41,
"type" : "word",
"position" : 11
},
{
"token" : "n>go",
"start_offset" : 41,
"end_offset" : 45,
"type" : "word",
"position" : 12
},
{
"token" : "od</",
"start_offset" : 45,
"end_offset" : 49,
"type" : "word",
"position" : 13
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "word",
"position" : 14
},
{
"token" : ">",
"start_offset" : 53,
"end_offset" : 54,
"type" : "word",
"position" : 15
},
{
"token" : "The",
"start_offset" : 55,
"end_offset" : 58,
"type" : "word",
"position" : 116
},
{
"token" : "2",
"start_offset" : 59,
"end_offset" : 60,
"type" : "word",
"position" : 117
},
{
"token" : "QUIC",
"start_offset" : 61,
"end_offset" : 65,
"type" : "word",
"position" : 118
},
{
"token" : "K",
"start_offset" : 65,
"end_offset" : 66,
"type" : "word",
"position" : 119
},
{
"token" : "Brow",
"start_offset" : 67,
"end_offset" : 71,
"type" : "word",
"position" : 120
},
{
"token" : "n-Fo",
"start_offset" : 71,
"end_offset" : 75,
"type" : "word",
"position" : 121
},
{
"token" : "xes",
"start_offset" : 75,
"end_offset" : 78,
"type" : "word",
"position" : 122
},
{
"token" : "jump",
"start_offset" : 79,
"end_offset" : 83,
"type" : "word",
"position" : 123
},
{
"token" : "ed",
"start_offset" : 83,
"end_offset" : 85,
"type" : "word",
"position" : 124
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "word",
"position" : 125
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "word",
"position" : 126
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "word",
"position" : 127
},
{
"token" : "dog'",
"start_offset" : 100,
"end_offset" : 104,
"type" : "word",
"position" : 128
},
{
"token" : "s",
"start_offset" : 104,
"end_offset" : 105,
"type" : "word",
"position" : 129
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "word",
"position" : 130
},
{
"token" : ".",
"start_offset" : 110,
"end_offset" : 111,
"type" : "word",
"position" : 131
},
{
"token" : "我是中國",
"start_offset" : 112,
"end_offset" : 116,
"type" : "word",
"position" : 232
},
{
"token" : "人",
"start_offset" : 116,
"end_offset" : 117,
"type" : "word",
"position" : 233
}
]
}
# UAX URL Email tokenizer
# Similar to the standard tokenizer
# Additionally recognizes URLs and email addresses as single tokens
# Options :
# max_token_length
GET /_analyze
{
"tokenizer": {
"type" : "uax_url_email"
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中國人",
"[email protected]",
"http://www.baidu.com"]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "2",
"start_offset" : 16,
"end_offset" : 17,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "3",
"start_offset" : 18,
"end_offset" : 19,
"type" : "<NUM>",
"position" : 7
},
{
"token" : "5",
"start_offset" : 20,
"end_offset" : 21,
"type" : "<NUM>",
"position" : 8
},
{
"token" : "7",
"start_offset" : 22,
"end_offset" : 23,
"type" : "<NUM>",
"position" : 9
},
{
"token" : "p",
"start_offset" : 25,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "hello",
"start_offset" : 27,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 11
},
{
"token" : "p",
"start_offset" : 34,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 12
},
{
"token" : "span",
"start_offset" : 38,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 13
},
{
"token" : "good",
"start_offset" : 43,
"end_offset" : 47,
"type" : "<ALPHANUM>",
"position" : 14
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "<ALPHANUM>",
"position" : 15
},
{
"token" : "The",
"start_offset" : 55,
"end_offset" : 58,
"type" : "<ALPHANUM>",
"position" : 116
},
{
"token" : "2",
"start_offset" : 59,
"end_offset" : 60,
"type" : "<NUM>",
"position" : 117
},
{
"token" : "QUICK",
"start_offset" : 61,
"end_offset" : 66,
"type" : "<ALPHANUM>",
"position" : 118
},
{
"token" : "Brown",
"start_offset" : 67,
"end_offset" : 72,
"type" : "<ALPHANUM>",
"position" : 119
},
{
"token" : "Foxes",
"start_offset" : 73,
"end_offset" : 78,
"type" : "<ALPHANUM>",
"position" : 120
},
{
"token" : "jumped",
"start_offset" : 79,
"end_offset" : 85,
"type" : "<ALPHANUM>",
"position" : 121
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "<ALPHANUM>",
"position" : 122
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "<ALPHANUM>",
"position" : 123
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "<ALPHANUM>",
"position" : 124
},
{
"token" : "dog's",
"start_offset" : 100,
"end_offset" : 105,
"type" : "<ALPHANUM>",
"position" : 125
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "<ALPHANUM>",
"position" : 126
},
{
"token" : "我",
"start_offset" : 112,
"end_offset" : 113,
"type" : "<IDEOGRAPHIC>",
"position" : 227
},
{
"token" : "是",
"start_offset" : 113,
"end_offset" : 114,
"type" : "<IDEOGRAPHIC>",
"position" : 228
},
{
"token" : "中",
"start_offset" : 114,
"end_offset" : 115,
"type" : "<IDEOGRAPHIC>",
"position" : 229
},
{
"token" : "國",
"start_offset" : 115,
"end_offset" : 116,
"type" : "<IDEOGRAPHIC>",
"position" : 230
},
{
"token" : "人",
"start_offset" : 116,
"end_offset" : 117,
"type" : "<IDEOGRAPHIC>",
"position" : 231
},
{
"token" : "[email protected]",
"start_offset" : 118,
"end_offset" : 129,
"type" : "<EMAIL>",
"position" : 332
},
{
"token" : "http://www.baidu.com",
"start_offset" : 130,
"end_offset" : 150,
"type" : "<URL>",
"position" : 433
}
]
}
# classic tokenizer
# Designed for English text
# Recognizes email addresses, hostnames, acronyms, company names, etc.
# Options :
# max_token_length
GET /_analyze
{
"tokenizer": {
"type" : "classic"
},
"text": ["this is a b c d 2-3-5-7 <p>hello</p> <span>good</span>",
"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
"我是中國人",
"[email protected]",
"http://www.baidu.com",
"127.0.0.1",
"2344232 fdgfd"]
}
# Result
{
"tokens" : [
{
"token" : "this",
"start_offset" : 0,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "a",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "b",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "c",
"start_offset" : 12,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "d",
"start_offset" : 14,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "2-3-5-7",
"start_offset" : 16,
"end_offset" : 23,
"type" : "<NUM>",
"position" : 6
},
{
"token" : "p",
"start_offset" : 25,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "hello",
"start_offset" : 27,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "p",
"start_offset" : 34,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "span",
"start_offset" : 38,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "good",
"start_offset" : 43,
"end_offset" : 47,
"type" : "<ALPHANUM>",
"position" : 11
},
{
"token" : "span",
"start_offset" : 49,
"end_offset" : 53,
"type" : "<ALPHANUM>",
"position" : 12
},
{
"token" : "The",
"start_offset" : 55,
"end_offset" : 58,
"type" : "<ALPHANUM>",
"position" : 113
},
{
"token" : "2",
"start_offset" : 59,
"end_offset" : 60,
"type" : "<ALPHANUM>",
"position" : 114
},
{
"token" : "QUICK",
"start_offset" : 61,
"end_offset" : 66,
"type" : "<ALPHANUM>",
"position" : 115
},
{
"token" : "Brown",
"start_offset" : 67,
"end_offset" : 72,
"type" : "<ALPHANUM>",
"position" : 116
},
{
"token" : "Foxes",
"start_offset" : 73,
"end_offset" : 78,
"type" : "<ALPHANUM>",
"position" : 117
},
{
"token" : "jumped",
"start_offset" : 79,
"end_offset" : 85,
"type" : "<ALPHANUM>",
"position" : 118
},
{
"token" : "over",
"start_offset" : 86,
"end_offset" : 90,
"type" : "<ALPHANUM>",
"position" : 119
},
{
"token" : "the",
"start_offset" : 91,
"end_offset" : 94,
"type" : "<ALPHANUM>",
"position" : 120
},
{
"token" : "lazy",
"start_offset" : 95,
"end_offset" : 99,
"type" : "<ALPHANUM>",
"position" : 121
},
{
"token" : "dog's",
"start_offset" : 100,
"end_offset" : 105,
"type" : "<APOSTROPHE>",
"position" : 122
},
{
"token" : "bone",
"start_offset" : 106,
"end_offset" : 110,
"type" : "<ALPHANUM>",
"position" : 123
},
{
"token" : "我",
"start_offset" : 112,
"end_offset" : 113,
"type" : "<CJ>",
"position" : 224
},
{
"token" : "是",
"start_offset" : 113,
"end_offset" : 114,
"type" : "<CJ>",
"position" : 225
},
{
"token" : "中",
"start_offset" : 114,
"end_offset" : 115,
"type" : "<CJ>",
"position" : 226
},
{
"token" : "國",
"start_offset" : 115,
"end_offset" : 116,
"type" : "<CJ>",
"position" : 227
},
{
"token" : "人",
"start_offset" : 116,
"end_offset" : 117,
"type" : "<CJ>",
"position" : 228
},
{
"token" : "[email protected]",
"start_offset" : 118,
"end_offset" : 129,
"type" : "<EMAIL>",
"position" : 329
},
{
"token" : "http",
"start_offset" : 130,
"end_offset" : 134,
"type" : "<ALPHANUM>",
"position" : 430
},
{
"token" : "www.baidu.com",
"start_offset" : 137,
"end_offset" : 150,
"type" : "<HOST>",
"position" : 431
},
{
"token" : "127.0.0.1",
"start_offset" : 151,
"end_offset" : 160,
"type" : "<HOST>",
"position" : 532
},
{
"token" : "2344232",
"start_offset" : 161,
"end_offset" : 168,
"type" : "<ALPHANUM>",
"position" : 633
},
{
"token" : "fdgfd",
"start_offset" : 169,
"end_offset" : 174,
"type" : "<ALPHANUM>",
"position" : 634
}
]
}
# Thai tokenizer
# Thai-language tokenizer; not needed for our use case, so only a brief demo
GET /_analyze
{
"tokenizer": "thai",
"text": "การที่ได้ต้องแสดงว่างานดี"
}
# Result
[ การ, ที่, ได้, ต้อง, แสดง, ว่า, งาน, ดี ]