
Precision issues in Elasticsearch aggregations

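Table of contents

  • 1. Approximate statistical algorithms in distributed systems
  • 2. How a Min aggregation is executed
  • 3. Values returned by a terms aggregation
  • 4. How a Terms aggregation is executed
  • 5. A case where Terms is inaccurate
  • 6. How to fix inaccurate Terms results: increase the shard_size parameter
  • 7. Enabling show_term_doc_count_error
  • 8. Setting shard_size
  • 9. Demo
  • 9.1 Create the index
  • 9.2 Import data from the kibana_sample_data_flights index into my_flights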

1. Approximate statistical algorithms in distributed systems


2. How a Min aggregation is executed


3. Values returned by a terms aggregation

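The response of a Terms aggregation contains two special values:

  • doc_count_error_upper_bound: the maximum number of documents that a term bucket left out of the result could contain
  • sum_other_doc_count: the total number of documents belonging to terms other than those in the returned buckets (total document count minus the count covered by the returned buckets)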
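The "total minus returned" relationship for sum_other_doc_count can be checked against the responses in the demo section. The sketch below is only an illustration: the helper name is hypothetical, and it assumes every document has exactly one value for the aggregated field, as appears to be the case for OriginWeather.

```python
def sum_other_doc_count(total_docs, returned_bucket_counts):
    """Documents belonging to terms that are not in the returned buckets.

    Hypothetical helper: assumes one field value per document.
    """
    return total_docs - sum(returned_bucket_counts)


# Figures copied from the demo responses (13059 documents in each index).
single_shard = sum_other_doc_count(13059, [2324, 2319, 2214, 2209, 1061])
twenty_shards = sum_other_doc_count(13059, [1037])

print(single_shard)   # 2932, as reported for kibana_sample_data_flights
print(twenty_shards)  # 12022, as reported for my_flights
```

Both values match the sum_other_doc_count fields in the two search responses of the demo.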

4. How a Terms aggregation is executed


5. A case where Terms is inaccurate


6. How to fix inaccurate Terms results: increase the shard_size parameter

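  • Why Terms aggregations are inaccurate: the data is spread across multiple shards, so the Coordinating Node cannot see the full picture of the data
  • Solution 1: when the data volume is small, set the number of Primary Shards to 1, which makes the result exact
  • Solution 2: when the data is distributed, set the shard_size parameter to improve precision
  • How it works: more terms than size are fetched from each shard, which improves accuracy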
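The effect described above can be reproduced with a small simulation. This is a simplified model and not Elasticsearch's actual implementation: the function name and the per-shard counts are hypothetical, and the error bound is approximated as the sum of the last count each shard returned, counted only for shards that left terms out.

```python
from collections import Counter


def terms_agg(shards, size, shard_size):
    """Simplified terms aggregation over per-shard {term: doc_count} maps."""
    merged = Counter()
    error_upper_bound = 0
    for shard in shards:
        # Each shard only reports its own top `shard_size` terms.
        top = Counter(shard).most_common(shard_size)
        for term, count in top:
            merged[term] += count
        # A term this shard left out has at most as many documents as the
        # last term it did return.
        if top and len(shard) > len(top):
            error_upper_bound += top[-1][1]
    # The coordinating node sums what it received and keeps the top `size`.
    return merged.most_common(size), error_upper_bound


# Hypothetical data: "C" is the largest term overall (13 documents),
# but it is never the top term on any single shard.
shards = [
    {"A": 6, "B": 4, "C": 4},
    {"A": 6, "B": 2, "C": 4},
    {"B": 6, "C": 5},
]

print(terms_agg(shards, size=1, shard_size=1))  # ([('A', 12)], 18)
print(terms_agg(shards, size=1, shard_size=3))  # ([('C', 13)], 0)
```

With shard_size equal to 1 the coordinating node never sees "C" and returns "A" with a large error bound. Once every shard returns all of its terms, the result is exact and the bound drops to 0.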

7. Enabling show_term_doc_count_error


8. Setting shard_size

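  • A larger shard_size increases the overall amount of computation and improves accuracy, but lengthens response time
  • Default: shard_size = size * 1.5 + 10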
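As a quick illustration of the default formula, the sketch below evaluates it for a few values of size. The function name is hypothetical, and truncating the result to an integer is an assumption about how a fractional value is handled.

```python
def default_shard_size(size: int) -> int:
    """Default shard_size derived from size (size * 1.5 + 10).

    Assumption: a fractional result is truncated to an integer.
    """
    return int(size * 1.5 + 10)


for size in (1, 5, 10):
    print(size, default_shard_size(size))
# 1 11
# 5 17
# 10 25
```

Under this formula a request with size 1 would still ask each shard for 11 terms by default, which is why the demo sets shard_size to 1 explicitly to make the error visible.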

9. Demo

9.1 Create the index

DELETE my_flights
PUT my_flights
{
  "settings": {
    "number_of_shards": 20
  },
  "mappings" : {
      "properties" : {
        "AvgTicketPrice" : {
          "type" : "float"
        },
        "Cancelled" : {
          "type" : "boolean"
        },
        "Carrier" : {
          "type" : "keyword"
        },
        "Dest" : {
          "type" : "keyword"
        },
        "DestAirportID" : {
          "type" : "keyword"
        },
        "DestCityName" : {
          "type" : "keyword"
        },
        "DestCountry" : {
          "type" : "keyword"
        },
        "DestLocation" : {
          "type" : "geo_point"
        },
        "DestRegion" : {
          "type" : "keyword"
        },
        "DestWeather" : {
          "type" : "keyword"
        },
        "DistanceKilometers" : {
          "type" : "float"
        },
        "DistanceMiles" : {
          "type" : "float"
        },
        "FlightDelay" : {
          "type" : "boolean"
        },
        "FlightDelayMin" : {
          "type" : "integer"
        },
        "FlightDelayType" : {
          "type" : "keyword"
        },
        "FlightNum" : {
          "type" : "keyword"
        },
        "FlightTimeHour" : {
          "type" : "keyword"
        },
        "FlightTimeMin" : {
          "type" : "float"
        },
        "Origin" : {
          "type" : "keyword"
        },
        "OriginAirportID" : {
          "type" : "keyword"
        },
        "OriginCityName" : {
          "type" : "keyword"
        },
        "OriginCountry" : {
          "type" : "keyword"
        },
        "OriginLocation" : {
          "type" : "geo_point"
        },
        "OriginRegion" : {
          "type" : "keyword"
        },
        "OriginWeather" : {
          "type" : "keyword"
        },
        "dayOfWeek" : {
          "type" : "integer"
        },
        "timestamp" : {
          "type" : "date"
        }
      }
    }
}

9.2 Import data from the kibana_sample_data_flights index into my_flights

POST _reindex
{
  "source": {
    "index": "kibana_sample_data_flights"
  },
  "dest": {
    "index": "my_flights"
  }
}

Response:
{
  "took" : 3221,
  "timed_out" : false,
  "total" : 13059,
  "updated" : 0,
  "created" : 13059,
  "deleted" : 0,
  "batches" : 14,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}

GET kibana_sample_data_flights/_count
GET my_flights/_count

Response (my_flights):
{
  "count" : 13059,
  "_shards" : {
    "total" : 20,
    "successful" : 20,
    "skipped" : 0,
    "failed" : 0
  }
}


GET kibana_sample_data_flights/_search
{
  "size": 0,
  "aggs": {
    "weather": {
      "terms": {
        "field":"OriginWeather",
        "size":5,
        "show_term_doc_count_error":true
      }
    }
  }
}

Response:
{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "weather" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 2932,
      "buckets" : [
        {
          "key" : "Clear",
          "doc_count" : 2324,
          "doc_count_error_upper_bound" : 0
        },
        {
          "key" : "Cloudy",
          "doc_count" : 2319,
          "doc_count_error_upper_bound" : 0
        },
        {
          "key" : "Rain",
          "doc_count" : 2214,
          "doc_count_error_upper_bound" : 0
        },
        {
          "key" : "Sunny",
          "doc_count" : 2209,
          "doc_count_error_upper_bound" : 0
        },
        {
          "key" : "Thunder & Lightning",
          "doc_count" : 1061,
          "doc_count_error_upper_bound" : 0
        }
      ]
    }
  }
}

GET my_flights/_search
{
  "size": 0,
  "aggs": {
    "weather": {
      "terms": {
        "field":"OriginWeather",
        "size":1,
        "shard_size":1,
        "show_term_doc_count_error":true
      }
    }
  }
}

Response:
{
  "took" : 18,
  "timed_out" : false,
  "_shards" : {
    "total" : 20,
    "successful" : 20,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "weather" : {
      "doc_count_error_upper_bound" : 2511,
      "sum_other_doc_count" : 12022,
      "buckets" : [
        {
          "key" : "Clear",
          "doc_count" : 1037,
          "doc_count_error_upper_bound" : 1474
        }
      ]
    }
  }
}
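The numbers in this last response can be cross-checked against the single-shard result. The sketch below is a hedged reading of the output, not a statement of how Elasticsearch computes it: the helper name is hypothetical, and it assumes the reindexed data is identical, so the exact count of "Clear" is the 2324 reported by kibana_sample_data_flights.

```python
def count_bounds(doc_count, error_upper_bound):
    """Range in which the true document count of a bucket should lie."""
    return doc_count, doc_count + error_upper_bound


reported = 1037        # doc_count of "Clear" on my_flights (20 shards)
bucket_error = 1474    # doc_count_error_upper_bound of that bucket
overall_error = 2511   # top-level doc_count_error_upper_bound
exact = 2324           # doc_count of "Clear" on the single-shard index

lower, upper = count_bounds(reported, bucket_error)
print(lower, upper)              # 1037 2511
print(lower <= exact <= upper)   # True
print(upper == overall_error)    # True
```

The exact count falls inside the reported range. The upper end equals the top-level bound, which is consistent with each of the 20 shards returning exactly one term when shard_size is 1.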
