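Symptom:
Over the past few days a Huawei RH2285 server has been rebooting on its own at irregular intervals, roughly once or twice a day. The system log shows the error below, with one entry written every second.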
OS:OEL 6.5
$ more /var/log/messages
Jul 21 08:54:32 customer kernel: EDAC MC1: 5486 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Jul 21 08:54:33 customer kernel: EDAC MC1: 11480 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Jul 21 08:54:34 customer kernel: EDAC MC1: 11330 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Jul 21 08:54:35 customer kernel: EDAC MC1: 6584 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Jul 21 08:54:36 customer kernel: EDAC MC1: 27428 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Jul 21 08:54:37 customer kernel: EDAC MC1: 30113 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Jul 21 08:54:38 customer kernel: EDAC MC1: 4453 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Jul 21 08:54:39 customer kernel: EDAC MC1: 6269 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Jul 21 08:54:40 customer kernel: EDAC MC1: 15720 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)
Jul 21 08:54:41 customer kernel: EDAC MC1: 16107 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1 page:0x0 offset:0x0 grain:8 syndrome:0x0)
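With one entry per second the log gets long quickly, so it can help to total the corrected errors per DIMM label instead of reading it line by line. The following is only a rough sketch: the function name `summarize_ce_errors` is made up for illustration, and the regular expression assumes the message format shown in the excerpt above.

```python
import re
from collections import Counter

# Matches the kernel message format seen in the log excerpt above,
# e.g. "EDAC MC1: 5486 CE error on CPU#1Channel#2_DIMM#1 (...)".
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+):\s*(?P<count>\d+) CE error on (?P<label>\S+)"
)


def summarize_ce_errors(lines):
    """Return a Counter mapping (mc index, DIMM label) to total CE errors."""
    totals = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            key = (int(match.group("mc")), match.group("label"))
            totals[key] += int(match.group("count"))
    return totals
```

It accepts any iterable of lines, so an open handle on `/var/log/messages` can be passed in directly; `summarize_ce_errors(f).most_common()` then lists the noisiest DIMM first.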
Analysis:
These entries come from [EDAC (Error Detection And Correction)](https://www.kernel.org/doc/Documentation/edac.txt).
"CE error" is short for Correctable Error. There is also UE (Uncorrectable Error).
Following the documentation above, locate the faulty DIMM:
[root@customer log]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch2_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch2_ce_count:554836518
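The grep above prints every counter, including the ones that are zero. As a sketch only, the same check could be scripted to return just the non-zero counters. It assumes the legacy `mc*/csrow*/ch*_ce_count` sysfs layout seen in the output above; `nonzero_ce_counters` and its `root` parameter are illustrative names, not part of any EDAC tool.

```python
import glob
import os
import re

# Default sysfs location, the same one used by the grep command above.
EDAC_ROOT = "/sys/devices/system/edac/mc"

# Pulls the three indices out of ".../mcX/csrowY/chZ_ce_count".
COUNTER_PATH = re.compile(r"mc(\d+)[/\\]csrow(\d+)[/\\]ch(\d+)_ce_count$")


def nonzero_ce_counters(root=EDAC_ROOT):
    """Return {(mc, csrow, channel): count} for every non-zero CE counter."""
    pattern = os.path.join(root, "mc*", "csrow*", "ch*_ce_count")
    result = {}
    for path in sorted(glob.glob(pattern)):
        match = COUNTER_PATH.search(path)
        if not match:
            continue
        with open(path) as counter:
            count = int(counter.read().strip())
        if count:
            result[tuple(int(g) for g in match.groups())] = count
    return result
```

On the server described here this would be expected to report a single entry for mc1, csrow1, channel 2.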
The non-zero counter is /mc1/csrow1/ch2. For reference, the layout diagram from the EDAC documentation:
csrow  | Channel 0 | Channel 1 |
================================
csrow0 | DIMM_A0   | DIMM_B0   |
csrow1 | DIMM_A0   | DIMM_B0   |
csrow2 | DIMM_A1   | DIMM_B1   |
csrow3 | DIMM_A1   | DIMM_B1   |
Then list the DIMM slots with dmidecode:
[root@customer log]# dmidecode -t memory |grep 'Locator: DIMM'
Locator: DIMM_D0
Locator: DIMM_D1
Locator: DIMM_E0
Locator: DIMM_E1
Locator: DIMM_F0
Locator: DIMM_F1
Locator: DIMM_A0
Locator: DIMM_A1
Locator: DIMM_B0
Locator: DIMM_B1
Locator: DIMM_C0
Locator: DIMM_C1
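The grep filter keeps only the slot names. If the size or bank of each module is also wanted, the full `dmidecode -t memory` output can be split into one record per "Memory Device" block. This is a minimal sketch under the assumption that the output uses the usual `Key: Value` lines under each "Memory Device" heading and reports empty slots as `Size: No Module Installed`; both function names are hypothetical.

```python
def parse_memory_devices(dmidecode_text):
    """Split `dmidecode -t memory` output into one dict per "Memory Device" block."""
    devices = []
    current = None
    for raw in dmidecode_text.splitlines():
        line = raw.strip()
        if line == "Memory Device":
            current = {}
            devices.append(current)
        elif not line:
            current = None  # a blank line ends the current block
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return devices


def populated_slots(devices):
    """Return the Locator of every slot that reports an installed module."""
    return [
        device.get("Locator")
        for device in devices
        if not device.get("Size", "No Module Installed").startswith("No Module")
    ]
```

The input text could come from `subprocess.run(["dmidecode", "-t", "memory"], capture_output=True, text=True, check=True).stdout`, run as root.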
Check the memory in the server's management console:
![](https://img.laitimes.com/img/9ZDMuAjOiMmIsIjOiQnIsISPrdEZwZ1Rh5WNXp1bwNjW1ZUba9VZwlHdsATOfd3bkFGazxCMx8VesATMfhHLlN3XnxCMwEzX0xiRGZkRGZ0Xy9GbvNGLpZTY1EmMZVDUSFTU4VFRR9Fd4VGdsYTMfVmepNHLrJXYtJXZ0F2dvwVZnFWbp1zczV2YvJHctM3cv1Ce-cmbw5yM4kjYzcTYyUjN2YzN3QGN5Q2YzATNlZ2NkJGZ4gTMz8CXyAzLchDMxIDMy8CXn9Gbi9CXzV2Zh1WavwVbvNmLvR3YxUjL1M3Lc9CX6MHc0RHaiojIsJye.png)
Layout of the memory slots on the motherboard:
Combining this with the error log entry: kernel: EDAC MC1: 16107 CE error on CPU#1Channel#2_DIMM#1 (channel:2 slot:1
the problem should be in memory slot DIMM_F1.
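The step from the EDAC label to the slot name can be written down explicitly. The mapping below is an assumption, not something the log or dmidecode confirms: it supposes three channels per CPU, with CPU#0 driving DIMM_A to DIMM_C and CPU#1 driving DIMM_D to DIMM_F, and the trailing digit being the slot within the channel. The function name is hypothetical, and the result should be checked against the board's silkscreen or the vendor manual before pulling a module.

```python
import re

# Label format as it appears in the kernel log above.
LABEL_RE = re.compile(r"CPU#(\d+)Channel#(\d+)_DIMM#(\d+)")

# ASSUMPTION (not confirmed by the source): three channels per CPU,
# CPU#0 -> DIMM_A..DIMM_C and CPU#1 -> DIMM_D..DIMM_F.
CHANNELS_PER_CPU = 3
CHANNEL_LETTERS = "ABCDEF"


def edac_label_to_locator(label):
    """Translate an EDAC label into a dmidecode-style Locator (assumed mapping)."""
    match = LABEL_RE.fullmatch(label)
    if not match:
        raise ValueError(f"unrecognized EDAC label: {label!r}")
    cpu, channel, slot = (int(group) for group in match.groups())
    index = cpu * CHANNELS_PER_CPU + channel
    if channel >= CHANNELS_PER_CPU or index >= len(CHANNEL_LETTERS):
        raise ValueError(f"label outside the assumed layout: {label!r}")
    return f"DIMM_{CHANNEL_LETTERS[index]}{slot}"
```

Under that assumption `edac_label_to_locator("CPU#1Channel#2_DIMM#1")` gives `DIMM_F1`, which agrees with the slot identified above.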
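Resolution:
The final step is to remove the memory module from the suspect F1 slot, or move it to a different slot. After doing so, the system booted without reporting the error again.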
References:
http://blog.tankywoo.com/2014/12/02/edac-dimm-ce-error.html
http://serverfault.com/questions/648240/how-can-i-find-which-memory-have-ce-error