檢視yarn任務日志 任務運作中和結束時檢視container的日志 GC日志的簡單分析
文章目錄
- 一、shell和yarn基礎指令準備
- 二、app日志檢視
-
- 1. 直接使用yarn logs,可選擇指定日志類型,也可"手動"重定向至本地檔案
- 2. 使用yarn logs檢視,并使用-out參數"自動"落盤本地檔案
- 3. 檢視running狀态app的containerId及其日志
- 4. 如何查詢已結束運作的container資訊
- 5. 檢視gc日志
- 總結
一、shell和yarn基礎指令準備
1. 别名alias和錯誤輸出重定向
$ alias lg='yarn logs -applicationId application_1652362266025_4019 [email protected] 2>/dev/null'
# 這樣後面測試的時候就不用每次書寫前年那固定的一長串了 -log_files <args> 等參數将填充至"[email protected]"位置
# 2>/dev/null 錯誤日志重定向至黑洞,更友善檢視結果
$ lg -log_files stdout | head
Container: container_e105_1652362266025_4019_01_000022 on ****-bg-w03_45454_1652490636096
LogAggregationType: AGGREGATED
=========================================================================================
LogType:stdout
LogLastModifiedTime:Sat May 14 09:10:36 +0800 2022
LogLength:6368
LogContents:
2022-05-14 08:11:37 Starting to run new task attempt: attempt_1652362266025_4019_1_10_000002_0
1.783: [GC pause (Metadata GC Threshold) (young) (initial-mark), 0.0371225 secs]
[Parallel Time: 13.8 ms, GC Workers: 23]
2. yarn的幾個基礎指令
app|application prints application(s) report/kill application/manage long running application
applicationattempt prints applicationattempt(s) report
container prints container(s) report
logs dump container logs
3. find的-type參數 及 grep的-o -e參數
find [options]
-type c
d 檔案夾:directory
p named pipe (FIFO)
f 正常檔案:regular file
s 套接字:socket
grep [options]
-e, --regexp=PATTERN 正則:use PATTERN for matching
-i, --ignore-case 忽略大小寫:ignore case distinctions
-w, --word-regexp 比對完整單詞:force PATTERN to match only whole words
-v, --invert-match 反向選擇比對内容:select non-matching lines
-n, --line-number 列印所在行号:print line number with output lines
-h, --no-filename 結果首列不輸出檔案名:suppress the file name prefix on output
-o, --only-matching 隻輸出比對到的部分:show only the part of a line matching PATTERN
-r, --recursive 檔案夾遞歸:like --directories=recurse
[注]:
egrep 等同 grep -e
二、app日志檢視
1. 直接使用yarn logs,可選擇指定日志類型,也可"手動"重定向至本地檔案
代碼如下(示例):
$ yarn logs -applicationId application_1652362266025_4019 -log_files stderr 2>/dev/null | head
Container: container_e105_1652362266025_4019_01_000022 on ***-bg-w03_45454_1652490636096
LogAggregationType: AGGREGATED
=========================================================================================
LogType:stderr
LogLastModifiedTime:Sat May 14 09:10:36 +0800 2022
LogLength:188
LogContents:
2022-05-14 08:11:37 Starting to run new task attempt: attempt_1652362266025_4019_1_10_000002_0
2022-05-14 08:11:41 Completed running task attempt: attempt_1652362266025_4019_1_10_000002_0
...
2. 使用yarn logs檢視,并使用-out參數"自動"落盤本地檔案
# 沒權限安裝tree 此處用的另一叢集
# 将日志落盤到目前檔案夾logs中
$ yarn logs -applicationId application_1650527019982_0697 -out ./logs
# 檢視日志目錄結構 子目錄為節點名稱 内部的檔案為節點中運作此app的container的日志
$ tree logs/
logs/
├── node-group-***1.mrs-lcnd.com_8041
│ ├── container_e03_1650527019982_0697_01_000001
│ ├── container_e03_1650527019982_0697_01_000003
│ ├── container_e03_1650527019982_0697_01_000012
...
├── node-group-***2.mrs-lcnd.com_8041
│ ├── container_e03_1650527019982_0697_01_000002
│ ├── container_e03_1650527019982_0697_01_000004
│ ├── container_e03_1650527019982_0697_01_000006
...
│ └── container_e03_1650527019982_0697_01_000044
└── node-group-***3.mrs-lcnd.com_8041
├── container_e03_1650527019982_0697_01_000005
├── container_e03_1650527019982_0697_01_000009
├── container_e03_1650527019982_0697_01_000010
...
# 因為沒有使用-log_files進行過濾,所有每個檔案都包含所屬container的所有日志類型,可使用以下指令驗證:
$ cat logs/node-group-***1.mrs-lcnd.com_8041/container_e03_1650527019982_0697_01_000001 | egrep -i '^LogType'
LogType:container-localizer-syslog
LogType:directory.info
LogType:launch_container.sh
LogType:prelaunch.out
LogType:stderr
LogType:stdout
LogType:syslog
3. 檢視running狀态app的containerId及其日志
# 1. 擷取application的attemptId
$ yarn applicationattempt -list application_1652362266025_4832 2>/dev/null
Total number of application attempts :1
ApplicationAttempt-Id State AM-Container-Id Tracking-URL
appattempt_1652362266025_4832_000001 RUNNING container_e105_1652362266025_4832_01_000001 http://****-bg-w01:8088/proxy/application_1652362266025_4832/
# 2. 擷取此attempt對應的container
$ yarn container -list appattempt_1652362266025_4832_000001 2>/dev/null
# 輸出格式同上 為友善檢視 此處轉換一下格式
Total number of containers :1
# 表頭 # 内容
Container-Id container_e105_1652362266025_4832_01_000001
Start Time Sat May 14 16:19:38 +0800 2022
Finish Time N/A
State RUNNING
Host ****-bg-w18:45454
Node Http Address http://****-bg-w18:8042
LOG-URL http://****-bg-w01:8188/applicationhistory/logs/****-bg-w18:45454/container_e105_1652362266025_4832_01_000001/container_e105_1652362266025_4832_01_000001/dmp_operator1
# 此為一個hive應用,目前沒有查詢任務,清單中隻有1個running狀态的container,對比發現,此container就是am所在的container,當有查詢任務運作時,此處會出現多個container
# 3. 檢視container的日志
$ yarn logs -containerId container_e105_1652362266025_4835_01_000001 -log_files stderr 2>/dev/null | head
Container: container_e105_1652362266025_4835_01_000001 on ****-bg-w19:45454
LogAggregationType: LOCAL
===========================================================================
LogType:stderr
LogLastModifiedTime:Sat May 14 16:32:26 +0800 2022
LogLength:472
LogContents:
2022-05-14 16:23:24 Running Dag: dag_1652362266025_4835_1
2022-05-14 16:23:59 Completed Dag: dag_1652362266025_4835_1
2022-05-14 16:25:45 Running Dag: dag_1652362266025_4835_2
...
4. 如何查詢已結束運作的container資訊
通過3發現,通過attemptId隻能查詢處于運作狀态的container資訊,如果想要檢視曆史的資訊,可以使用yarn logs
$ lg -show_application_log_info
Application State: Completed.
Container: container_e105_1652362266025_4019_01_000022 on ****-bg-w03_45454_1652490636096
Container: container_e105_1652362266025_4019_01_000009 on ****-bg-w04_45454_1652490635977
Container: container_e105_1652362266025_4019_01_000014 on ****-bg-w04_45454_1652490635977
Container: container_e105_1652362266025_4019_01_000020 on ****-bg-w06_45454_1652490636083
Container: container_e105_1652362266025_4019_01_000012 on ****-bg-w07_45454_1652490636059
...
# 使用egrep -o 從中切分出container的id
$ lg -show_application_log_info | egrep -o 'container_e[_0-9]*'
container_e105_1652362266025_4019_01_000022
container_e105_1652362266025_4019_01_000009
container_e105_1652362266025_4019_01_000014
container_e105_1652362266025_4019_01_000020
container_e105_1652362266025_4019_01_000012
...
# 同理也可切分出node節點,不過node節點也可以使用指令檢視
$ lg -list_nodes
****-bg-w03_45454_1652490636096
****-bg-w04_45454_1652490635977
****-bg-w06_45454_1652490636083
****-bg-w07_45454_1652490636059
****-bg-w10_45454_1652490636542
...
5. 檢視gc日志
萬字長文教你看懂java G1垃圾回收日志
gc日志在yarn中的類型為:LogType:stdout
通過下圖,我們主要從日志提取 時間offset 和 gc實際用時real 兩行,對gc情況有個大緻的了解
1. 使用指令檢視
# app級别: yarn logs -applicationId <application ID>
# container級别: yarn logs -applicationId <application ID> -containerId <Container ID>
# 實操中可能後者更有意義,因為gc一般出現在個别container中
$ lg -containerId container_e105_1652362266025_4019_01_000014 -log_files stdout | egrep '\[GC pause|real='
1.529: [GC pause (Metadata GC Threshold) (young) (initial-mark), 0.0286059 secs]
[Times: user=0.24 sys=0.02, real=0.03 secs]
[Times: user=0.09 sys=0.00, real=0.01 secs]
[Times: user=0.03 sys=0.01, real=0.01 secs]
...
273.086: [GC pause (G1 Evacuation Pause) (young), 0.0094619 secs]
[Times: user=0.07 sys=0.01, real=0.01 secs]
400.869: [GC pause (G1 Humongous Allocation) (young) (initial-mark), 0.0097788 secs]
[Times: user=0.07 sys=0.02, real=0.01 secs]
[Times: user=0.15 sys=0.00, real=0.01 secs]
[Times: user=0.02 sys=0.01, real=0.01 secs]
# 輸出格式化
$ lg -containerId container_e105_1652362266025_4019_01_000014 -log_files stdout | egrep '\[GC pause|real=' | column -s, -t
1.529: [GC pause (Metadata GC Threshold) (young) (initial-mark) 0.0286059 secs]
[Times: user=0.24 sys=0.02 real=0.03 secs]
[Times: user=0.09 sys=0.00 real=0.01 secs]
[Times: user=0.03 sys=0.01 real=0.01 secs]
2.617: [GC pause (Metadata GC Threshold) (young) (initial-mark) 0.0365129 secs]
[Times: user=0.37 sys=0.04 real=0.04 secs]
[Times: user=0.15 sys=0.00 real=0.01 secs]
[Times: user=0.03 sys=0.00 real=0.01 secs]
...
2. 使用-out落盤到本地,使用shell指令檢視
# 使用指令找出絕對路徑
$ find logs/ -type f
logs/****-bg-w03_45454_1652490636096/container_e105_1652362266025_4019_01_000022
logs/****-bg-w17_45454_1652490635505/container_e105_1652362266025_4019_01_000005
logs/****-bg-w21_45454_1652490636022/container_e105_1652362266025_4019_01_000011
logs/****-bg-w10_45454_1652490636542/container_e105_1652362266025_4019_01_000023
...
# 同上,cat後使用egrep進行檢視
# 這裡示範一下gc次數統計
$ find logs/ -type f | while read line
> do
> echo $line
> egrep '\[GC pause' $line | wc -l
> done
logs/****-bg-w03_45454_1652490636096/container_e105_1652362266025_4019_01_000022
3
logs/****-bg-w17_45454_1652490635505/container_e105_1652362266025_4019_01_000005
5
logs/****-bg-w21_45454_1652490636022/container_e105_1652362266025_4019_01_000011
2
logs/****-bg-w10_45454_1652490636542/container_e105_1652362266025_4019_01_000023
1304
logs/****-bg-w06_45454_1652490636083/container_e105_1652362266025_4019_01_000020
5
...
# 很顯然,其中一個container的gc次數過于頻繁
總結
其實yarn開啟日志聚合功能以後,很多都可以網頁端檢視,但是使用指令其實可以實作更加強大和精細化的功能,可以通過官方的–help文檔多多學習和測試。
上述指令也可以寫一個簡單的腳本進行封裝,以友善出現異常時進行快速定位。
如有錯誤或者更好的方法還請大家不吝指正,大家一起變得更強!