Part 1: Overview
Data makes everything traceable and lets everything be traced back to its source. ELK is a log analysis platform that combines distributed data storage, visual querying, and log parsing. ELK = Elasticsearch + Logstash + Kibana; the three components each have their own role and work together to process log data. Their main functions are as follows:
- Elasticsearch: data storage and full-text search;
- Logstash: log processing, the "porter" of the pipeline;
- Kibana: data visualization and operations management.
When building the platform we also used the Filebeat plugin. Filebeat is a log data shipper for local files: it can monitor log directories or specific log files (tail file) and forward the data to Elasticsearch or Logstash.
In this case study, ELK is used to collect, manage, and search the slow query logs and error logs of MySQL instances.
The simple data flow is: Filebeat → Logstash → Elasticsearch → Kibana. (The original flow diagram is omitted.)
Part 2: Elasticsearch
2.1 ES features and advantages
- Distributed real-time document storage: every field can be indexed and made searchable.
- A distributed search engine with real-time analytics. Distributed: an index is split into multiple shards, and each shard can have zero or more replicas; rebalancing and routing are handled automatically in most cases.
- Scales out to hundreds of servers and petabytes of structured or unstructured data, yet can also run on a single PC.
- Supports a plugin mechanism: analysis (tokenizer) plugins, synchronization plugins, Hadoop plugins, visualization plugins, and so on.
2.2 Main concepts in ES
ES | MySQL
Index | Database
Type (fixed value _doc since 7.0) | Table
Document | Row
Field | Column
Mapping | Schema
Everything is indexed | Index
Query DSL (Domain Specific Language) | SQL
GET http://... | SELECT * FROM table ...
PUT http://... | UPDATE table SET ...
- A database (Database) in a relational database corresponds to an index (Index) in ES;
- One relational database holds N tables (Table), just as one Index holds N types (Type);
- The data in a table consists of rows (Row) and columns (Column); likewise, one Type consists of multiple documents (Document), each with multiple fields (Field);
- In a relational database, the schema defines the tables, each table's columns, and the relationships between them. Correspondingly, in ES the Mapping defines how the fields of a Type under an index are handled: how the index is built, the field types, whether the original JSON document is stored, whether it is compressed, whether and how fields are analyzed (tokenized), and so on;
- insert/delete/update/select in a relational database correspond to PUT(POST)/DELETE/_update/GET in ES, as illustrated below.
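To make the last point concrete, the following Kibana Dev Tools requests sketch the four operations; the index name test-index, the document id, and the fields are hypothetical examples:

# insert ≈ PUT: index a document (creates the index on first write)
PUT /test-index/_doc/1
{
  "user": "tom",
  "message": "hello"
}

# select ≈ GET: fetch the document by id
GET /test-index/_doc/1

# update ≈ _update: partially update the document
POST /test-index/_update/1
{
  "doc": { "message": "hello world" }
}

# delete ≈ DELETE: remove the document
DELETE /test-index/_doc/1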
Problem 2.3: execution permissions
Error message:
[usernimei@testes01 bin]$ Exception in thread "main" org.elasticsearch.bootstrap.BootstrapException: java.nio.file.AccessDeniedException: /data/elasticsearch/elasticsearch-7.4.2/config/elasticsearch.keystore
Likely root cause: java.nio.file.AccessDeniedException: /data/elasticsearch/elasticsearch-7.4.2/config/elasticsearch.keystore
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:219)
at java.base/java.nio.file.Files.newByteChannel(Files.java:374)
at java.base/java.nio.file.Files.newByteChannel(Files.java:425)
at org.apache.lucene.store.SimpleFSDirectory.openInput(SimpleFSDirectory.java:77)
at org.elasticsearch.common.settings.KeyStoreWrapper.load(KeyStoreWrapper.java:219)
at org.elasticsearch.bootstrap.Bootstrap.loadSecureSettings(Bootstrap.java:234)
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:305)
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159)
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150)
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:125)
at org.elasticsearch.cli.Command.main(Command.java:90)
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:115)
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92)
Refer to the log for complete error details
Analysis
ES was mistakenly started as root the first time, so elasticsearch.keystore under the config directory became owned by root:
-rw-rw---- 1 root root 199 Mar 24 17:36 elasticsearch.keystore
Solution: switch to the root user and change the ownership of elasticsearch.keystore back to the user and group that run ES:
chown -R <es_user>:<es_group> elasticsearch.keystore
Problem 2.4: maximum shards open
According to the official documentation, starting with Elasticsearch v7.0.0 each node in the cluster is limited to 1,000 shards by default; with 3 data nodes a cluster can hold at most 3,000 shards. Here there is only a single ES node, so the limit is 1,000. Exceeding it produces errors like the following:
[2019-05-11T11:05:24,650][WARN ][logstash.outputs.elasticsearch][main] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://qqelastic:[email protected]:55944/][Manticore::SocketTimeout] Read timed out {:url=>http://qqelastic:[email protected]:55944/, :error_message=>"Elasticsearch Unreachable: [http://qqelastic:[email protected]:55944/][Manticore::SocketTimeout] Read timed out", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}
[2019-05-11T11:05:24,754][ERROR][logstash.outputs.elasticsearch][main] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://qqelastic:[email protected]:55944/][Manticore::SocketTimeout] Read timed out", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}
[2019-05-11T11:05:25,158][WARN ][logstash.outputs.elasticsearch][main] Restored connection to ES instance {:url=>"http://qqelastic:[email protected]:55944/"}
[2019-05-11T11:05:26,763][WARN ][logstash.outputs.elasticsearch][main] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"mysql-error-testqq-2019.05.11", :routing=>nil, :_type=>"_doc"}, #<LogStash::Event:0x65416fce>], :response=>{"index"=>{"_index"=>"mysql-error-qqweixin-2020.05.11", "_type"=>"_doc", "_id"=>nil, "status"=>400, "error"=>{"type"=>"validation_exception", "reason"=>"Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [1000]/[1000] maximum shards open;"}}}}
This can be set from Kibana (Dev Tools). The main command:
PUT /_cluster/settings
{
  "transient": {
    "cluster": {
      "max_shards_per_node": 10000
    }
  }
}
Note: it is recommended to restart the Logstash service after applying this setting.
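Also note that transient settings do not survive a full cluster restart; to make the new limit permanent, the same change can be applied as a persistent setting (a sketch):

PUT /_cluster/settings
{
  "persistent": {
    "cluster.max_shards_per_node": 10000
  }
}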
Problem 2.5: Too Many Requests / circuit_breaking_exception
As the amount of data stored in ES grew, queries run from Kibana started to fail with the following error:
{
"statusCode":429,
"error":"Too Many Requests",
"message":"[circuit_breaking_exception] [parent] Data too large, data for [<http_request>] would be [987817048/942mb], which is larger than the limit of [986061209/940.3mb], real usage: [987817048/942mb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=966440/943.7kb, in_flight_requests=0/0b, accounting=47842100/45.6mb], with { bytes_wanted=987817048 & bytes_limit=986061209 & durability="PERMANENT" }"
}
The fix found was to adjust the instance's config/jvm.options as follows:
-Xms2g
-Xmx2g
#-XX:+UseConcMarkSweepGC
-XX:+UseG1GC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
Note: before the change both -Xms and -Xmx were 1g; -XX:+UseG1GC is a newly added option. The parent circuit breaker's limit is a percentage of the JVM heap, which is why enlarging the heap from 1g to 2g raises the ~940mb limit seen in the error above.
However, ES then failed to start, reporting:
Exception in thread "main" java.lang.RuntimeException: starting java failed with [1]
output:
Error occurred during initialization of VM
Multiple garbage collectors selected
error:
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
at org.elasticsearch.tools.launchers.JvmErgonomics.flagsFinal(JvmErgonomics.java:111)
at org.elasticsearch.tools.launchers.JvmErgonomics.finalJvmOptions(JvmErgonomics.java:79)
at org.elasticsearch.tools.launchers.JvmErgonomics.choose(JvmErgonomics.java:57)
at org.elasticsearch.tools.launchers.JvmOptionsParser.main(JvmOptionsParser.java:89)
Solution: remove the newly added -XX:+UseG1GC option. "Multiple garbage collectors selected" indicates that both G1 and the CMS collector ended up enabled, and the JVM refuses to start with two collectors configured.
The final settings are:
################################################################
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms2g
-Xmx2g
################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################
## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
## G1GC Configuration
Problem 2.6: index has exceeded [1000000] - maximum allowed to be analyzed for highlighting
Error message:
The length of [sql_stmt] field of [uMF8g3kBM_phpUO8CEYw] doc of [mysql-XXXXXXX] index has exceeded [1000000] - maximum allowed to be analyzed for highlighting. This maximum can be set by changing the [index.highlight.max_analyzed_offset] index level setting. For large texts, indexing with offsets or term vectors is recommended!
Solution:
PUT /mysql-*/_settings
{
  "index": {
    "highlight.max_analyzed_offset": 10000000
  }
}
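Because the indices here are created per day by date, the command above only affects indices that already exist. To apply the setting to future mysql-* indices as well, one option is a (legacy) index template; a sketch, assuming the template name mysql-highlight is free to use:

PUT /_template/mysql-highlight
{
  "index_patterns": ["mysql-*"],
  "settings": {
    "index.highlight.max_analyzed_offset": 10000000
  }
}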
Part 3: Filebeat
Problem 3.1: Filebeat does not read data from the log file
2019-03-23T19:24:41.772+0800 INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s
{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":30,"time":{"ms":2}},"total":{"ticks":80,"time":{"ms":4},"value":80},"user":{"ticks":50,"time":{"ms":2}}},"handles":{"limit":{"hard":1000000,"soft":1000000},"open":6},"info":{"ephemeral_id":"a4c61321-ad02-2c64-9624-49fe4356a4e9","uptime":{"ms":210031}},"memstats":{"gc_next":7265376,"memory_alloc":4652416,"memory_total":12084992},"runtime":{"goroutines":16}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":0,"events":{"active":0}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":0,"15":0.05,"5":0.01,"norm":{"1":0,"15":0.0125,"5":0.0025}}}}}}
Fix: adjust the configuration parameters in filebeat.yml. The metrics above show "harvester":{"open_files":0,"running":0}, i.e. no file is being harvested at all, so verify that the input is enabled and that paths actually matches the log files; a sketch follows.
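A minimal sketch of the relevant filebeat.yml section under those assumptions; the log path is a hypothetical example:

filebeat.inputs:
- type: log
  enabled: true            # inputs are disabled by default in the sample file
  paths:
    - /data/mysql/log/mysql-slow.log    # hypothetical path; must match the real log file
output.logstash:
  hosts: ["127.0.0.1:5044"]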
Problem 3.2: multiple service processes
2019-03-27T20:13:22.985+0800 ERROR logstash/async.go:256 Failed to publish events caused by: write tcp [::1]:48338->[::1]:5044: write: connection reset by peer
2019-03-27T20:13:23.985+0800 INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":130,"time":{"ms":11}},"total":{"ticks":280,"time":{"ms":20},"value":280},"user":{"ticks":150,"time":{"ms":9}}},"handles":{"limit":{"hard":65536,"soft":65536},"open":7},"info":{"ephemeral_id":"a02ed909-a7a0-49ee-aff9-5fdab26ecf70","uptime":{"ms":150065}},"memstats":{"gc_next":10532480,"memory_alloc":7439504,"memory_total":19313416,"rss":806912},"runtime":{"goroutines":27}},"filebeat":{"events":{"active":1,"added":1},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"batches":1,"failed":1,"total":1},"write":{"errors":1}},"pipeline":{"clients":1,"events":{"active":1,"published":1,"total":1}}},"registrar":{"states":{"current":1}},"system":{"load":{"1":0.05,"15":0.11,"5":0.06,"norm":{"1":0.0063,"15":0.0138,"5":0.0075}}}}}}
2019-03-27T20:13:24.575+0800 ERROR pipeline/output.go:121 Failed to publish events: write tcp [::1]:48338->[::1]:5044: write: connection reset by peer
The cause is that multiple Logstash processes were running at the same time, so Filebeat's connections to port 5044 kept being reset; stop the extra processes and restart.
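One way to confirm and clean this up (a sketch; kill whatever extra PIDs the grep reports, and use the restart only if Logstash is installed as a service):

ps -ef | grep logstash | grep -v grep    # list all running Logstash processes
kill <pid>                               # stop the extra process(es)
systemctl restart logstash               # bring up a single clean instance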
Problem 3.3: managing Filebeat as a service
The systemd unit file for the filebeat service lives under:
/etc/systemd/system
Create and edit filebeat.service there:
[Unit]
Description=filebeat.service
[Service]
User=root
ExecStart=/data/filebeat/filebeat-7.4.2-linux-x86_64/filebeat -e -c /data/filebeat/filebeat-7.4.2-linux-x86_64/filebeat.yml
[Install]
WantedBy=multi-user.target
Commands for managing the service:
systemctl start filebeat                 # start the filebeat service
systemctl enable filebeat                # enable start at boot
systemctl disable filebeat               # disable start at boot
systemctl status filebeat                # check the current service status
systemctl restart filebeat               # restart the service
systemctl list-units --type=service     # list all loaded services
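Note: after creating or modifying the unit file, make systemd pick up the change before starting the service:

systemctl daemon-reload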
Problem 3.4: the Filebeat service fails to start
The error:
Exiting: error loading config file: yaml: line 29: did not find expected key
The root cause is broken formatting in filebeat.yml. Pay special attention to modified or newly added lines, and compare them against the surrounding context to check that the indentation and structure are still valid YAML.
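Filebeat can also validate the configuration directly, which pinpoints this kind of mistake before a restart:

./filebeat test config -c /data/filebeat/filebeat-7.4.2-linux-x86_64/filebeat.yml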
Problem 3.5: the Linux version is too old to manage Filebeat with systemctl
In that case we can manage it as a SysV init service instead: create a filebeat.service file under /etc/init.d. The main script is as follows:
#!/bin/bash

# Path to the filebeat binary and its arguments
agent="/data/filebeat/filebeat-7.4.2-linux-x86_64/filebeat"
args="-e -c /data/filebeat/filebeat-7.4.2-linux-x86_64/filebeat.yml"

start() {
    # Look for a running filebeat process
    pid=`ps -ef | grep /data/filebeat/filebeat-7.4.2-linux-x86_64/filebeat | grep -v grep | awk '{print $2}'`
    if [ ! "$pid" ]; then
        echo "Starting filebeat: "
        nohup $agent $args >/dev/null 2>&1 &
        if [ $? == '0' ]; then
            echo "start filebeat ok"
        else
            echo "start filebeat failed"
        fi
    else
        echo "filebeat is still running!"
        exit
    fi
}

stop() {
    echo -n $"Stopping filebeat: "
    pid=`ps -ef | grep /data/filebeat/filebeat-7.4.2-linux-x86_64/filebeat | grep -v grep | awk '{print $2}'`
    if [ ! "$pid" ]; then
        echo "filebeat is not running"
    else
        kill $pid
        echo "stop filebeat ok"
    fi
}

restart() {
    stop
    start
}

status() {
    pid=`ps -ef | grep /data/filebeat/filebeat-7.4.2-linux-x86_64/filebeat | grep -v grep | awk '{print $2}'`
    if [ ! "$pid" ]; then
        echo "filebeat is not running"
    else
        echo "filebeat is running"
    fi
}

case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    restart)
        restart
        ;;
    status)
        status
        ;;
    *)
        echo $"Usage: $0 {start|stop|restart|status}"
        exit 1
esac
Notes
1. Grant the script execute permission:
chmod 755 filebeat.service
2. Register it to start at boot:
chkconfig --add filebeat.service
As written above, this registration fails, because the script lacks the chkconfig metadata header.
Solution: add the following two lines at the top of the service file, so the corrected script begins as follows:
#!/bin/bash
# chkconfig: 2345 10 80
# description: filebeat is a tool for collecting log data

(The rest of the script is identical to the version above.)
Part 4: Logstash
Problem 4.1: running Logstash as a service
The most common way to run Logstash is from the command line, ./bin/logstash -f logstash.conf, stopped with Ctrl+C. That is convenient for ad-hoc runs but inconvenient to manage, and a server reboot raises the maintenance cost further. In production it is recommended to run Logstash as a service, using systemctl to get start-at-boot as well.
(1) Modify startup.options under the config directory of the installation. The main changes (see the sketch after this list):
1. The service's default user and group are logstash; they can be changed to root.
2. Set LS_HOME to the Logstash installation directory, e.g. /data/logstash/logstash-7.6.0.
3. Set LS_SETTINGS_DIR to the directory containing logstash.yml, e.g. /data/logstash/logstash-7.6.0/config.
4. In LS_OPTS, add the -f option pointing at logstash.conf, e.g. LS_OPTS="--path.settings ${LS_SETTINGS_DIR} -f /data/logstash/logstash-7.6.0/config/logstash.conf".
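Putting those items together, the modified lines in startup.options would look roughly like this (paths follow the examples above; keep the logstash user unless running as root is really required):

LS_HOME=/data/logstash/logstash-7.6.0
LS_SETTINGS_DIR=/data/logstash/logstash-7.6.0/config
LS_OPTS="--path.settings ${LS_SETTINGS_DIR} -f /data/logstash/logstash-7.6.0/config/logstash.conf"
LS_USER=root
LS_GROUP=root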
(2) As root, run the Logstash command that creates the service:
<install_dir>/bin/system-install
After the command completes, a logstash.service file is generated under /etc/systemd/system/.
(3) Managing the logstash service:
systemctl enable logstash    # enable start at boot
systemctl start logstash     # start the service
systemctl stop logstash      # stop the service
systemctl restart logstash   # restart the service
systemctl status logstash    # check the service status
Problem 4.2: the JDK must be installed before the Logstash service
The error output from the install step is omitted here. To verify whether Java is installed, check its version.
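For example:

java -version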
If the shell reports that the command is not found, Java is not installed. Download (or upload) the JDK package to the server and install it:
yum localinstall jdk-8u211-linux-x64.rpm
After installation completes, run the version check again to confirm.
Problem 4.3: the Linux version is too old, so the installed Logstash service does not take effect
Symptom: systemctl cannot manage the newly installed service (error output omitted). Check the Linux version:
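For example:

cat /etc/redhat-release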
Cause: CentOS 6.5 does not support managing services with systemctl; it uses Upstart instead.
Workaround (verified): manage the service with initctl.
Relevant commands:
1. Start the service:
initctl start logstash
2. Check its status:
initctl status logstash
Note that the command that generates the service must still be run first:
./system-install
Otherwise initctl reports:
initctl: Unknown job: logstash
Problem 4.4: index names defined in the config file must be lowercase
"Invalid index name [mysql-error-Test-2019.05.13], must be lowercase", "index_uuid"=>"_na_", "index"=>"mysql-error-Test-2019.05.13"}}}}
May 13 13:36:33 hzvm1996 logstash[123194]: [2019-05-13T13:36:33,907][ERROR][logstash.outputs.elasticsearch][main] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"mysql-slow-Test-2020.05.13", :routing=>nil, :_type=>"_doc"}, #<LogStash::Event:0x1f0aedbc>], :response=>{"index"=>{"_index"=>"mysql-slow-Test-2019.05.13", "_type"=>"_doc", "_id"=>nil, "status"=>400, "error"=>{"type"=>"invalid_index_name_exception", "reason"=>"Invalid index name [mysql-slow-Test-2019.05.13], must be lowercase", "index_uuid"=>"_na_", "index"=>"mysql-slow-Test-2019.05.13"}}}}
May 13 13:38:50 hzvm1996 logstash[123194]: [2019-05-13T13:38:50,765][ERROR][logstash.outputs.elasticsearch][main] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"mysql-error-Test-2020.05.13", :routing=>nil, :_type=>"_doc"}, #<LogStash::Event:0x4bdce1db>], :response=>{"index"=>{"_index"=>"mysql-error-Test-2019.05.13", "_type"=>"_doc", "_id"=>nil, "status"=>400, "error"=>{"type"=>"invalid_index_name_exception", "reason"=>"Invalid index name [mysql-error-Test-2019.05.13], must be lowercase", "index_uuid"=>"_na_", "index"=>"mysql-error-Test-2019.05.13"}}}}
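The fix is to use only lowercase in the index option of the elasticsearch output in logstash.conf; a sketch (host and index prefix are placeholders):

output {
  elasticsearch {
    hosts => ["http://127.0.0.1:9200"]           # placeholder host
    index => "mysql-error-test-%{+YYYY.MM.dd}"   # must be all lowercase
  }
}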
Problem 4.5: Logstash uses too much memory
jvm.options holds the JVM configuration: the initial and maximum heap size, garbage collection settings, and so on.
## JVM configuration
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms1g
-Xmx1g
The default is 1g; for light workloads it is advisable to lower it to 512m or 256m, as shown below.
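For example, to run Logstash with a 512 MB heap, set both values in jvm.options:

-Xms512m
-Xmx512m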
Part 5: Kibana
Problem 5.1: enabling password authentication
[root@testkibaba bin]# ./kibana-plugin install x-pack
Plugin installation was unsuccessful due to error "Kibana now contains X-Pack by default, there is no longer any need to install it as it is already present.
Note: recent versions of Elasticsearch and Kibana ship with X-Pack built in, so it no longer needs to be installed explicitly; only older versions require the install step.
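For reference, enabling basic password authentication on a 7.x stack roughly involves the following steps; this is a sketch, and the password value is a placeholder:

xpack.security.enabled: true                     # elasticsearch.yml: turn security on

bin/elasticsearch-setup-passwords interactive    # set passwords for the built-in users (elastic, kibana, ...)

elasticsearch.username: "kibana"                 # kibana.yml: credentials Kibana uses to reach ES
elasticsearch.password: "<password set above>"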
Problem 5.2: the application fails to start
[root@testkibana bin]# ./kibana
The error:
Kibana should not be run as root. Use --allow-root to continue.
添加個專門的賬号
useradd qqweixinkibaba --添加賬号
chown -R qqweixinkibaba:hzdbakibaba kibana-7.4.2-linux-x86_64 --為新增賬号賦予文檔目錄的權限
su qqweixinkibaba ---切換賬号,讓後再啟動
Problem 5.3: error logging in to Kibana
{"statusCode":403,"error":"Forbidden","message":"Forbidden"}
Cause: the login used the built-in kibana account, which is meant for the Kibana server's own connection to Elasticsearch rather than for interactive logins; logging in as the elastic user resolves it.
Problem 5.4: implementing multi-tenancy
A company usually has multiple business lines and multiple development teams, so how can the collected data be exposed only to the team that owns it, with each team seeing nothing but its own data? One approach is to build one ELK stack per business line, but that wastes resources and increases the operations workload; the alternative is multi-tenancy within a single stack.
Points to note during implementation:
- Under the elastic account, switch to the target space first, and only then create the index pattern.
- Create the role first (taking care to associate it with the space), and create the user last.