Monitoring Core Elasticsearch Metrics with Prometheus

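This article describes how to monitor Elasticsearch (ES) with Prometheus. It identifies the core metrics to watch and builds a dashboard from them, so that when the cluster misbehaves or a node fails, the performance graphs make diagnosis efficient. The most important of those metrics are then selected for alerting.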

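Elasticsearch itself exposes a large number of metrics that help detect trouble early and take action when problems such as unavailable nodes, JVM OutOfMemoryError, or long garbage collection pauses occur. The key areas to monitor are usually:

Query and indexing performance

Memory allocation and garbage collection

Host-level system and network metrics

Cluster health and node availability

Resource saturation and related errors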

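| Returned field | Notes | Metric name |
| --- | --- | --- |
| status | Cluster status: green (all primary and replica shards are allocated), yellow (all primary shards are allocated, but not all replica shards), red (at least one primary shard is not allocated) | elasticsearch_cluster_health_status |
| number_of_nodes / number_of_data_nodes | Number of nodes in the cluster / number of data nodes | elasticsearch_cluster_health_number_of_nodes / elasticsearch_cluster_health_number_of_data_nodes |
| active_primary_shards | Total number of active primary shards | elasticsearch_cluster_health_active_primary_shards |
| active_shards | Total number of active shards, including replicas | elasticsearch_cluster_health_active_shards |
| relocating_shards | Number of shards currently being moved from one node to another. Normally 0; it rises when a node joins or leaves the cluster | elasticsearch_cluster_health_relocating_shards |
| initializing_shards | Number of shards that are being initialized | elasticsearch_cluster_health_initializing_shards |
| unassigned_shards | Number of unassigned shards. Normally 0; it rises when replica shards are lost on a node | elasticsearch_cluster_health_unassigned_shards |
| number_of_pending_tasks | Only the master node can process cluster-level metadata changes (creating indices, updating mappings, allocating shards, and so on). The pending-tasks API shows the tasks waiting in the queue; in the vast majority of cases this queue stays at or near zero | elasticsearch_cluster_health_number_of_pending_tasks |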

Based on the metrics above, configure a Singlestat panel for the cluster status so that cluster health can be seen at a glance.
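A single-value panel needs one number per cluster. The following is a minimal sketch of one way to derive it, assuming the exporter publishes `elasticsearch_cluster_health_status` as one series per `color` label, with value 1 on the series for the current state (check your exporter's actual output). The sample scrape text, the cluster name `demo`, and the numeric severity mapping are all made up for illustration.

```python
import re

# Hypothetical scrape output; the cluster name and values are illustrative only.
SAMPLE = '''
elasticsearch_cluster_health_status{cluster="demo",color="green"} 0
elasticsearch_cluster_health_status{cluster="demo",color="red"} 0
elasticsearch_cluster_health_status{cluster="demo",color="yellow"} 1
'''

# Illustrative mapping for a single-value panel: higher means worse.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

# Matches one status sample line and captures the color label and the value.
LINE = re.compile(
    r'^elasticsearch_cluster_health_status\{[^}]*color="(\w+)"[^}]*\}\s+(\S+)$'
)


def current_color(exposition_text):
    """Return the color whose series has value 1, or None if no series is set."""
    for line in exposition_text.splitlines():
        match = LINE.match(line.strip())
        if match and float(match.group(2)) == 1.0:
            return match.group(1)
    return None


color = current_color(SAMPLE)
print(color, SEVERITY.get(color))  # yellow 1
```

Mapping the colors to numbers makes it possible to drive thresholds and panel colors from a single value; the particular numbers chosen here carry no meaning beyond their order.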

| Metric name | Description |
| --- | --- |
| elasticsearch_process_cpu_percent | Percent CPU used by the process (CPU usage) |
| elasticsearch_filesystem_data_free_bytes | Free space on block device in bytes (available disk space) |
| elasticsearch_process_open_files_count | Open file descriptors (file descriptors opened by the ES process) |
| elasticsearch_transport_rx_packets_total | Count of packets received (inbound network traffic between ES nodes) |
| elasticsearch_transport_tx_packets_total | Count of packets sent (outbound network traffic between ES nodes) |

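If CPU usage keeps climbing, the cause is usually load from heavy search or indexing work. It may be necessary to add nodes so the load can be redistributed.

File descriptors are used for node-to-node communication, client connections, and file operations. If the number of open file descriptors reaches the system limit (Linux typically allows 1024 file descriptors per process by default; raising this to 65535 is recommended in production), new connections and file operations will fail until old descriptors are closed.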
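The check described above amounts to comparing the open descriptor count against the limit. A minimal sketch follows, assuming the open count comes from `elasticsearch_process_open_files_count` and the limit is known from the host configuration. The function name `fd_usage` and the 0.8 warning ratio are illustrative choices, not values from the article.

```python
def fd_usage(open_files, max_files, warn_ratio=0.8):
    """Return (ratio, should_warn) for a process's file descriptor usage."""
    if max_files <= 0:
        raise ValueError("max_files must be positive")
    ratio = open_files / max_files
    return ratio, ratio >= warn_ratio


# Illustrative numbers: 900 descriptors open against the common default limit
# of 1024, and the same count against a raised limit of 65535.
print(fd_usage(900, 1024))   # ratio is about 0.88, warning is True
print(fd_usage(900, 65535))  # ratio is about 0.014, warning is False
```

Alerting on the ratio rather than on the raw count keeps the same rule valid after the limit has been raised.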

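If the ES cluster is write-heavy, SSDs are recommended, and disk space usage needs close attention. Elasticsearch performs a large amount of disk reads and writes when segments are created, queried, and merged.

Communication between nodes is one of the key indicators of whether the cluster is balanced. The rate of bytes sent and received shows how much traffic the cluster network is handling.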
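The transport metrics are cumulative counters, so a send or receive rate has to be derived from two samples. The sketch below shows the basic idea under simple assumptions: two samples taken a known number of seconds apart, and a counter that restarts from zero when the process restarts. It is similar in spirit to the Prometheus `rate()` function but much simpler (no extrapolation, only two samples). The name `counter_rate` is hypothetical.

```python
def counter_rate(prev_value, curr_value, interval_seconds):
    """Per-second increase of a monotonically increasing counter between two samples.

    If the counter went backwards, the process is assumed to have restarted and
    the counter to have started again from zero, so the current value is taken
    as the increase.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    increase = curr_value - prev_value
    if increase < 0:
        increase = curr_value
    return increase / interval_seconds


print(counter_rate(1_000, 4_000, 30))  # 100.0
print(counter_rate(4_000, 600, 30))    # 20.0 (counter reset between samples)
```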

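| Metric name | Description |
| --- | --- |
| elasticsearch_jvm_gc_collection_seconds_count | Count of JVM GC runs (number of garbage collections) |
| elasticsearch_jvm_gc_collection_seconds_sum | GC run time in seconds (time spent in garbage collection) |
| elasticsearch_jvm_memory_committed_bytes | JVM memory currently committed, by area (memory reserved for the JVM) |
| elasticsearch_jvm_memory_used_bytes | JVM memory currently used, by area (memory in use) |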

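Focus on the memory used by the JVM heap and on the proportion of time the JVM spends in GC, to determine whether there is a GC problem. Elasticsearch relies on garbage collection to free heap memory; by default, collection is triggered when JVM heap usage reaches 75%. An alert on heap usage shows whether garbage is being collected faster than it is produced. If collection cannot keep up, adjust the heap size or add nodes.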
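Both quantities mentioned above, heap usage and the share of time spent in GC, are simple ratios. A minimal sketch follows, assuming the used bytes come from `elasticsearch_jvm_memory_used_bytes` for the heap area, the heap limit is supplied by the caller (for example the configured heap size), and the GC time comes from two samples of `elasticsearch_jvm_gc_collection_seconds_sum`. The function names are hypothetical.

```python
def heap_used_percent(used_bytes, limit_bytes):
    """Heap usage as a percentage of the given heap limit."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return 100.0 * used_bytes / limit_bytes


def gc_time_fraction(gc_seconds_prev, gc_seconds_curr, interval_seconds):
    """Fraction of wall-clock time spent in GC between two counter samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (gc_seconds_curr - gc_seconds_prev) / interval_seconds


GIB = 1024 ** 3
# Illustrative values: 6 GiB used of an 8 GiB heap, and 1.5 s of GC in a 30 s window.
print(heap_used_percent(6 * GIB, 8 * GIB))  # 75.0
print(gc_time_fraction(120.0, 121.5, 30))   # 0.05
```

A heap percentage that stays high while the GC time fraction keeps rising suggests that collections are running often but reclaiming little.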

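Search requests

| Metric name | Description |
| --- | --- |
| elasticsearch_indices_search_query_total | Total number of queries |
| elasticsearch_indices_search_query_time_seconds | Time spent on queries |
| elasticsearch_indices_search_fetch_total | Total number of fetches |
| elasticsearch_indices_search_fetch_time_seconds | Time spent on fetches |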

Indexing requests

| Metric name | Description |
| --- | --- |
| elasticsearch_indices_indexing_index_total | Total index calls (number of index operations) |
| elasticsearch_indices_indexing_index_time_seconds_total | Cumulative index time in seconds |
| elasticsearch_indices_refresh_total | Total refreshes (number of refresh operations) |
| elasticsearch_indices_refresh_time_seconds_total | Total time spent refreshing in seconds |
| elasticsearch_indices_flush_total | Total flushes (number of flush operations) |
| elasticsearch_indices_flush_time_seconds | Cumulative flush time in seconds |

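Plot time and operation count on the same graph, with time on the left y-axis and the corresponding operation count on the right y-axis. Dividing time by the number of operations (time/ops) gives the average time per operation, which shows whether performance is abnormal. The average indexing latency is computed the same way; if it keeps growing, too many documents may be being sent in a single bulk request.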
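The average latency calculation can be sketched as follows, assuming two samples of a cumulative time counter and its matching operation counter (for example the index time and index count metrics). The function name and the sample values are illustrative.

```python
def average_latency_seconds(time_prev, time_curr, ops_prev, ops_curr):
    """Average seconds per operation between two samples of a (time, count) counter pair.

    Returns None when the window contains no operations or when either counter
    went backwards (for example after a node restart), because the average is
    undefined in those cases.
    """
    ops = ops_curr - ops_prev
    elapsed = time_curr - time_prev
    if ops <= 0 or elapsed < 0:
        return None
    return elapsed / ops


# Illustrative samples: 12.5 s of cumulative index time spent on 5000 index operations.
print(average_latency_seconds(100.0, 112.5, 20_000, 25_000))  # 0.0025
print(average_latency_seconds(100.0, 100.0, 20_000, 20_000))  # None
```

Computing the average over the window between two samples, rather than over the counters' lifetime totals, is what makes a recent slowdown visible.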

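Elasticsearch persists data to disk through flush operations. If flush latency keeps growing, disk I/O capacity may be insufficient; if this continues, the cluster will eventually be unable to index data.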

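| Metric name | Description |
| --- | --- |
| elasticsearch_thread_pool_queue_count | Thread pool operations queued (operations waiting in the thread pool queue) |
| elasticsearch_thread_pool_rejected_count | Thread pool operations rejected (operations rejected by the thread pool) |
| elasticsearch_indices_fielddata_memory_size_bytes | Field data cache memory usage in bytes (size of the fielddata cache) |
| elasticsearch_indices_fielddata_evictions | Evictions from field data (number of fielddata cache evictions) |
| elasticsearch_indices_filter_cache_memory_size_bytes | Filter cache memory usage in bytes (size of the filter cache) |
| elasticsearch_indices_filter_cache_evictions | Evictions from filter cache (number of filter cache evictions) |
| elasticsearch_cluster_health_number_of_pending_tasks | Cluster level changes which have not yet been executed (number of pending tasks) |
| elasticsearch_indices_get_missing_total | Total get missing (number of get requests for documents that do not exist) |
| elasticsearch_indices_get_missing_time_seconds | Total time of get missing in seconds (time spent on get requests for missing documents) |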

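Build dashboard views from the metrics collected above. Elasticsearch nodes use thread pools to manage how threads consume memory and CPU, so the queue lengths and rejection counts indicate whether the nodes have enough capacity.

Every Elasticsearch node maintains many types of thread pools. Generally, the most important ones are search, index, merge, and bulk.

The queue size of each thread pool shows how many requests are currently waiting for service on that node. Once a thread pool reaches its maximum queue size (the default differs between thread pool types), further requests are rejected by the pool.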
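The queue-then-reject behaviour can be illustrated with a toy model. This is only a sketch using Python's standard `queue` module, not how Elasticsearch implements its thread pools: no worker drains the queue here, and the function name `submit_all` and the queue size of 10 are made up.

```python
import queue


def submit_all(requests, max_queue_size):
    """Try to enqueue every request into a bounded queue; return (queued, rejected) counts.

    max_queue_size must be positive, because queue.Queue treats 0 as "unbounded".
    """
    if max_queue_size <= 0:
        raise ValueError("max_queue_size must be positive")
    pending = queue.Queue(maxsize=max_queue_size)
    rejected = 0
    for request in requests:
        try:
            pending.put_nowait(request)  # raises queue.Full once the queue is at capacity
        except queue.Full:
            rejected += 1
    return pending.qsize(), rejected


# Twelve requests arrive while nothing is being processed: ten wait, two are rejected.
print(submit_all(range(12), max_queue_size=10))  # (10, 2)
```

In monitoring terms, the first number corresponds to what the queue metric reports at that moment and the second to the increase in the rejected counter. This is why a rising rejected count is the stronger signal that a node is saturated.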
