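Aggregation in Elasticsearch is similar to GROUP BY in MySQL: Elasticsearch supports statistical analysis through aggregate functions such as count, sum, max, min, and avg.
So how do you run aggregations and nested aggregations in ES? This post walks through a series of examples in some detail. Let's learn together ^_^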
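Contents
- 1 Basic aggregation analysis
- 1.1 Aggregating directly
- 1.2 Searching first, then aggregating
- 1.3 Extension: comparing fielddata and keyword aggregations
- 2 Nested aggregations
- 2.1 Group first, then aggregate
- 2.2 Group first, then aggregate, then sort
- 2.3 Group first, group again within each group, then aggregate and sort
- Copyright notice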
1 Basic aggregation analysis
1.1 Aggregating directly
(1) Count the number of documents under each tag. Request syntax:
GET book_shop/it_book/_search
{
"size": 0, // do not return the matching documents (hits)
"aggs": {
"group_by_tags": { // user-defined name of the aggregation result (remove this comment when copying)
"terms": {
"field": "tags"
}
}
}
}
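The request above carries // comments, which are not valid JSON, so they have to be removed before the body is sent from a script. The following is a minimal sketch of one way to do that in Python; the helper name strip_line_comments and the ANNOTATED_BODY constant are made up for this illustration and are not part of Elasticsearch or any client library.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that sit outside JSON string literals."""
    cleaned = []
    for line in text.splitlines():
        in_string = False
        escaped = False
        cut = len(line)
        for i, ch in enumerate(line):
            if escaped:
                escaped = False          # character after a backslash
            elif ch == "\\" and in_string:
                escaped = True
            elif ch == '"':
                in_string = not in_string
            elif not in_string and line.startswith("//", i):
                cut = i                  # comment starts here
                break
        cleaned.append(line[:cut].rstrip())
    return "\n".join(cleaned)


# Hypothetical annotated body, mirroring the request shown above.
ANNOTATED_BODY = '''
{
  "size": 0, // do not return hits
  "aggs": {
    "group_by_tags": { // user-defined name
      "terms": { "field": "tags" }
    }
  }
}
'''

body = json.loads(strip_line_comments(ANNOTATED_BODY))
print(json.dumps(body, indent=2))
```

Tracking whether the scanner is inside a string literal keeps values such as URLs, which contain //, intact.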
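(2) An error occurs:
Note: the mapping of the book_shop index was created automatically by ES, which mapped the tags field as text. A terms aggregation on tags therefore fails with the following error: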
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [tags] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [......]
},
"status": 400
}
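(3) Error analysis:
Error message:
Set fielddata=true on [xxxx] ......
Analysis: by default, Elasticsearch disables fielddata on text fields.
A text field is analyzed (tokenized) at index time and stored in the inverted index, which maps terms to documents; an aggregation needs the opposite lookup, from each document to its values.
So to aggregate on a text field, set fielddata=true on it in the mapping. Elasticsearch then uninverts the inverted index (in effect building a forward index) and loads that data into memory.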
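To make "uninverting the inverted index" concrete, here is a small Python sketch. It is only a toy model, not how Elasticsearch stores data internally; the document ids, tokens, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical documents: id -> tokens an analyzer might emit for a text field.
analyzed_docs = {
    1: ["java", "virtual", "machine"],
    2: ["java", "programming"],
    3: ["python", "programming"],
}


def build_inverted_index(docs):
    """term -> set of document ids (what search uses)."""
    inverted = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            inverted[token].add(doc_id)
    return inverted


def uninvert(inverted):
    """document id -> set of terms (what an aggregation needs)."""
    forward = defaultdict(set)
    for term, doc_ids in inverted.items():
        for doc_id in doc_ids:
            forward[doc_id].add(term)
    return forward


inverted = build_inverted_index(analyzed_docs)
fielddata = uninvert(inverted)
print(sorted(inverted["java"]))   # documents containing the term "java"
print(sorted(fielddata[2]))       # terms held by document 2
```

The forward structure has to cover every document and every term of the field, which is why the error message warns about memory use.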
(4) Solution 1: enable fielddata on the text field:
- Set fielddata to true on the text field to be grouped (tags):
PUT book_shop/_mapping/it_book
{
  "properties": {
    "tags": { "type": "text", "fielddata": true }
  }
}
- See the official documentation for details: https://www.elastic.co/guide/en/elasticsearch/reference/6.6/fielddata.html. A successful update returns:
{ "acknowledged": true }
- Run the aggregation again; the result is:
{
  "took": 153,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": { "total": 4, "max_score": 0.0, "hits": [] },
  "aggregations": {
    "group_by_tags": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 6,
      "buckets": [
        { "key": "java", "doc_count": 3 },
        { "key": "程", "doc_count": 2 },
        ......
      ]
    }
  }
}
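(5) Solution 2: use the keyword sub-field:
- Enabling fielddata can consume a large amount of memory.
- Since Elasticsearch 5.x, dynamic mapping gives each text field a keyword sub-field, which supports exact-match queries and aggregations:
GET book_shop/it_book/_search
{
  "size": 0,
  "aggs": {
    "group_by_tags": {
      "terms": {
        "field": "tags.keyword" // use the keyword sub-field of the text field
      }
    }
  }
}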
1.2 Searching first, then aggregating
(1) Among the books whose name contains "jvm", count the documents under each tag. Request syntax:
GET book_shop/it_book/_search
{
"query": {
"match": { "name": "jvm" }
},
"aggs": {
"group_by_tags": { // user-defined name of the aggregation result; the keyword sub-field is used below
"terms": { "field": "tags.keyword" }
}
}
}
(2) Response:
{
"took" : 7,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.64072424,
"hits" : [
{
"_index" : "book_shop",
"_type" : "it_book",
"_id" : "2",
"_score" : 0.64072424,
"_source" : {
"name" : "深入了解Java虛拟機:JVM進階特性與最佳實踐",
"author" : "周志明",
"category" : "程式設計語言",
"desc" : "Java圖書領域公認的經典著作",
"price" : 79.0,
"date" : "2013-10-01",
"publisher" : "機械工業出版社",
"tags" : [
"Java",
"虛拟機",
"最佳實踐"
]
}
}
]
},
"aggregations" : {
"group_by_tags" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Java",
"doc_count" : 1
},
{
"key" : "最佳實踐",
"doc_count" : 1
},
{
"key" : "虛拟機",
"doc_count" : 1
}
]
}
}
}
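The response shows that the aggregation only sees documents matched by the query. A rough Python equivalent of "search first, then aggregate" might look like the sketch below; the two book records are invented, and the substring test merely stands in for a real match query.

```python
from collections import Counter

# Hypothetical records, not the author's index data.
books = [
    {"name": "Understanding the JVM", "tags": ["Java", "virtual machine"]},
    {"name": "Java Basics", "tags": ["Java", "programming"]},
]


def search_then_aggregate(books, keyword):
    """Filter by name first, then count documents per tag among the hits."""
    hits = [b for b in books if keyword.lower() in b["name"].lower()]
    buckets = Counter(tag for b in hits for tag in set(b["tags"]))
    return hits, buckets


hits, buckets = search_then_aggregate(books, "jvm")
print(len(hits), dict(buckets))
```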
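1.3 Extension: comparing fielddata and keyword aggregations
- With fielddata enabled on a text field, an aggregation runs on every analyzed token of that field separately, so the result is usually not what we want.
- With the keyword sub-field, values are not tokenized; each whole value forms one bucket, which is usually more useful.
Recommendation: aggregate on the keyword sub-field of a text field.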
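The difference between the two approaches can be sketched in a few lines of Python. The tag values are hypothetical, and lowercasing plus whitespace splitting is only a stand-in for a real analyzer; the "java" and "程" keys in the earlier fielddata result are consistent with that kind of token-level bucketing.

```python
from collections import Counter

# Hypothetical tag values.
docs = [
    {"tags": ["Java", "best practice"]},
    {"tags": ["Java", "virtual machine"]},
]


def terms_on_tokens(docs):
    """Bucket by analyzed token, as with fielddata on a text field."""
    counts = Counter()
    for doc in docs:
        tokens = set()
        for value in doc["tags"]:
            tokens.update(value.lower().split())
        counts.update(tokens)        # a document counts once per token
    return counts


def terms_on_keyword(docs):
    """Bucket by the whole, unanalyzed value, as with tags.keyword."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc["tags"]))
    return counts


print(dict(terms_on_tokens(docs)))
print(dict(terms_on_keyword(docs)))
```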
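2 Nested aggregations
2.1 Group first, then aggregate
(1) Group by tags, then compute the average book price under each tag. Request syntax: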
GET book_shop/it_book/_search
{
"size": 0,
"aggs": {
"group_by_tags": {
"terms": { "field": "tags.keyword" },
"aggs": {
"avg_price": {
"avg": { "field": "price" }
}
}
}
}
}
(2) Response (excerpt):
"hits" : {
"total" : 3,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"group_by_tags" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Java",
"doc_count" : 3,
"avg_price" : {
"value" : 102.33333333333333
}
},
{
"key" : "程式設計語言",
"doc_count" : 2,
"avg_price" : {
"value" : 114.0
}
},
......
]
}
}
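As a rough illustration of what the terms aggregation with an avg sub-aggregation computes, here is a Python sketch. The three records are hypothetical: only the price 79.0 appears in the article, and the other prices and tags were picked so that the averages resemble the figures shown.

```python
from collections import defaultdict

# Hypothetical records; not the author's index data.
books = [
    {"tags": ["Java"], "price": 79.0},
    {"tags": ["Java", "programming language"], "price": 100.0},
    {"tags": ["Java", "programming language"], "price": 128.0},
]


def group_then_avg(books):
    """Group prices by tag, then report doc_count and avg_price per tag."""
    prices_by_tag = defaultdict(list)
    for book in books:
        for tag in set(book["tags"]):
            prices_by_tag[tag].append(book["price"])
    return {
        tag: {"doc_count": len(prices), "avg_price": sum(prices) / len(prices)}
        for tag, prices in prices_by_tag.items()
    }


result = group_then_avg(books)
for tag, stats in result.items():
    print(tag, stats)
```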
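2.2 Group first, then aggregate, then sort
(1) Compute the average book price under each tag, then sort by average price in descending order. Query syntax:
GET book_shop/it_book/_search
{
"size": 0,
"aggs": {
"all_tags": {
"terms": {
"field": "tags.keyword",
"order": { "avg_price": "desc" } // sort the buckets by the avg_price sub-aggregation defined below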
},
"aggs": {
"avg_price": {
"avg": { "field": "price" }
}
}
}
}
}
The response resembles the one in section 2.1, except that the buckets are sorted by average price in descending order.
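The effect of the order option can be pictured as sorting the buckets by the value of a sub-aggregation. The sketch below assumes buckets shaped like those in a terms response; the keys and prices are invented, and order_buckets is a hypothetical helper, not an Elasticsearch API.

```python
# Hypothetical buckets in the shape of a terms aggregation response.
buckets = [
    {"key": "Java", "doc_count": 3, "avg_price": {"value": 95.0}},
    {"key": "databases", "doc_count": 1, "avg_price": {"value": 120.0}},
    {"key": "linux", "doc_count": 2, "avg_price": {"value": 60.0}},
]


def order_buckets(buckets, metric, direction="desc"):
    """Sort buckets by the value of the named single-value sub-aggregation."""
    return sorted(
        buckets,
        key=lambda bucket: bucket[metric]["value"],
        reverse=(direction == "desc"),
    )


ordered = order_buckets(buckets, "avg_price", "desc")
print([b["key"] for b in ordered])
```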
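2.3 Group first, group again within each group, then aggregate and sort
(1) Group by price range first, then by tags within each range, and compute the average price of each tags group. Query syntax: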
GET book_shop/it_book/_search
{
"size": 0,
"aggs": {
"group_by_price": {
"range": {
"field": "price",
"ranges": [
{ "from": 0, "to": 100 },
{ "from": 100, "to": 150 }
]
},
"aggs": {
"group_by_tags": {
"terms": { "field": "tags.keyword" },
"aggs": {
"avg_price": {
"avg": { "field": "price" }
}
}
}
}
}
}
}
(2) Response (excerpt):
"hits" : {
"total" : 3,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"group_by_price" : {
"buckets" : [
{
"key" : "0.0-100.0", // price range 0.0-100.0
"from" : 0.0,
"to" : 100.0,
"doc_count" : 1, // 1 document falls into this range
"group_by_tags" : { // terms aggregation on tags within this range
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Java",
"doc_count" : 1,
"avg_price" : {
"value" : 79.0
}
},
......
]
}
},
{
"key" : "100.0-150.0",
"from" : 100.0,
"to" : 150.0,
"doc_count" : 2,
"group_by_tags" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "Java",
"doc_count" : 2,
"avg_price" : {
"value" : 114.0
}
},
......
]
}
}
]
}
}
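The three-level nesting (range, then terms, then avg) can be sketched in Python as follows. The records are hypothetical, with prices and tags picked so that the averages resemble the response above. The sketch relies on one fact about the range aggregation: each range includes its from value and excludes its to value.

```python
from collections import defaultdict

# Hypothetical records; not the author's index data.
books = [
    {"tags": ["Java"], "price": 79.0},
    {"tags": ["Java", "programming language"], "price": 100.0},
    {"tags": ["Java", "programming language"], "price": 128.0},
]


def range_then_terms_then_avg(books, ranges):
    """Bucket by price range, then by tag, then average the price."""
    result = []
    for low, high in ranges:
        # from is inclusive, to is exclusive
        in_range = [b for b in books if low <= b["price"] < high]
        prices_by_tag = defaultdict(list)
        for book in in_range:
            for tag in set(book["tags"]):
                prices_by_tag[tag].append(book["price"])
        result.append({
            "key": f"{float(low)}-{float(high)}",
            "doc_count": len(in_range),
            "group_by_tags": {
                tag: {"doc_count": len(p), "avg_price": sum(p) / len(p)}
                for tag, p in prices_by_tag.items()
            },
        })
    return result


nested = range_then_terms_then_avg(books, [(0, 100), (100, 150)])
for bucket in nested:
    print(bucket)
```

A book priced at exactly 100 lands in the second range only, so the two ranges never count the same document twice.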
Copyright notice
Author: 馬瘦風 (https://healchow.com)
Source: 馬瘦風's blog on cnblogs (https://www.cnblogs.com/shoufeng)
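Copyright of this article belongs to the author. Reposting is welcome, provided that a link to the original article is shown prominently on the page; otherwise the author reserves the right to pursue legal liability.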