Troubleshooting a NIC-state problem in the gipcd process
Problem: a customer contacted me because one cluster node would not start. After logging in, I found that an earlier heartbeat (private) network problem had caused GRID on cluster node 2 to restart itself automatically; during that restart it still reported a heartbeat-network error, so startup failed. By this point the heartbeat network had recovered, yet manually starting the clusterware still did not help.

Takeaway after the final fix: handling this problem mainly requires an understanding of the 11g RAC architecture (several agents managing different resources), of which resources each cluster process manages, and of how they interact. Since the OCSSD process believed the heartbeat network was down, and the heartbeat-network state is monitored by the GIPCD process, the GIPCD log is the place to look. Once the log reveals the problem, attempt the fix; when the normal remedies fail, keep investigating, including in the direction of known bugs. This case turned out to be a bug in an old clusterware release (the affected cluster was Linux + 11.2.0.3 with no PSU); I have never hit this problem on 11.2.0.4 clusters. The handling process was as follows:
1. Check the cluster alert log and ocssd.log
=== This is the classic clusterware restart caused by a heartbeat-network failure (with the Oracle 11g Rebootless Restart feature, by default only GRID is restarted, not the host)
[crsd(7657)]CRS-2772:Server 'test1' has been assigned to pool 'Generic'.
2020-12-04 10:34:10.572
[crsd(7657)]CRS-2772:Server 'test1' has been assigned to pool 'ora.ipcc'.
2021-01-16 23:34:20.127
[cssd(7161)]CRS-1612:Network communication with node test1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.940 seconds
2021-01-16 23:34:28.128
[cssd(7161)]CRS-1611:Network communication with node test1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.940 seconds
2021-01-16 23:34:32.129
[cssd(7161)]CRS-1610:Network communication with node test1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.940 seconds
2021-01-16 23:34:35.071
[cssd(7161)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /opt/11.2.0/grid/log/test2/cssd/ocssd.log.
2021-01-16 23:34:35.071
[cssd(7161)]CRS-1656:The CSS daemon is terminating due to a fatal error;
2. Check the operating-system log
The heartbeat network had indeed failed at the time, but it had since recovered.
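What that OS-log check looks like in practice can be sketched as follows; the kernel messages here are fabricated for illustration (the real evidence would be in /var/log/messages on node 2):

```shell
# Illustrative sample only: two made-up e1000e link-flap messages, written to
# a scratch file so the same grep can be demonstrated; on a live node you
# would run the grep against /var/log/messages instead.
cat <<'EOF' > /tmp/messages.sample
Jan 16 23:33:58 test2 kernel: e1000e: eth1 NIC Link is Down
Jan 16 23:35:12 test2 kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex
EOF

# count link state changes for the private-interconnect NIC (eth1 here)
grep -icE 'eth1.*link is (up|down)' /tmp/messages.sample   # prints 2
```

A nonzero count around the eviction timestamp confirms the interconnect really did flap, which matches what CSSD reported.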
3. Restarting the clusterware still reports no network heartbeat
The log keeps reporting "has a disk HB, but no network HB":
2021-01-20 18:18:43.082: [ CSSD][2730108672]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2021-01-20 18:18:43.236: [ CSSD][2734855936]clssnmvDHBValidateNCopy: node 1, test1, has a disk HB, but no network HB, DHB has rcfg 375467028, wrtcnt, 130988652, LATS 4294435040, lastSeqNo 130988651, uniqueness 1607049396, timestamp 1611137741/4086845544
2021-01-20 18:18:44.082: [ CSSD][2730108672]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2021-01-20 18:18:44.236: [ CSSD][2734855936]clssnmvDHBValidateNCopy: node 1, test1, has a disk HB, but no network HB, DHB has rcfg 375467028, wrtcnt, 130988653, LATS 4294436040, lastSeqNo 130988652, uniqueness 1607049396, timestamp 1611137742/4086846554
2021-01-20 18:18:45.082: [ CSSD][2730108672]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2021-01-20 18:18:45.236: [ CSSD][2734855936]clssnmvDHBValidateNCopy: node 1, test1, has a disk HB, but no network HB, DHB has rcfg 375467028, wrtcnt, 130988654, LATS 4294437040, lastSeqNo 130988653, uniqueness 1607049396, timestamp 1611137743/4086847554
2021-01-20 18:18:45.238: [ CSSD][2745972480]clssscSelect: cookie accept request 0xc39eb0
4. Analysis
The heartbeat network had recovered after the outage, yet the cluster still considered it down.
So we walked through the 11g RAC components, their agents, and the startup flow: the GIPCD process is responsible for monitoring the availability of the heartbeat network.
Looking at the GIPCD log, we can see that gipcd still considered the interface down (a rank of 0 or -1 means the network is bad; rank 99 means it is healthy):
[grid@test2 gipcd]$ tail -n 500 gipcd.log | grep rank
2021-01-20 18:16:22.480: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 33 / 0 / 0 ]
2021-01-20 18:16:52.489: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:17:22.498: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:17:52.505: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:18:22.515: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 33 / 0 / 0 ]
2021-01-20 18:18:52.524: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:19:22.532: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:19:52.541: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 33 / 0 / 0 ]
2021-01-20 18:20:22.541: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:20:52.551: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 33 / 0 / 0 ]
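The verdict hinges on the rank field buried in these lines. A minimal sketch of pulling it out, using a hypothetical `rank_of` helper (my own illustration, not part of any Oracle tooling):

```shell
# rank_of: extract the integer after "rank" from a gipcdMonitorSaveInfMetrics
# log line. rank 99 = gipcd considers the interface healthy; 0 or -1 = bad.
rank_of() {
  echo "$1" | sed -n 's/.*rank *\(-\{0,1\}[0-9]\{1,\}\),.*/\1/p'
}

line='2021-01-20 18:20:52.551: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0]  eth1 - rank    0, avgms 30000000000.000000 [ 33 / 0 / 0 ]'
rank_of "$line"   # prints 0: gipcd still flags eth1 as bad
```

On a healthy node the same extraction would yield 99; here it stays at 0 even though the interconnect has long since recovered, which is the core symptom.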
5. Resolution
Under the 11g RAC agent model, the gipcd process is managed by an agent of ohasd: if the GIPCD process is killed, ohasd will start a new gipcd daemon.
So we tried:
1. Killing the gipcd process on node 2; the problem remained.
2. Restarting the clusterware on the failed node 2; the problem remained.
3. Rebooting the node 2 host; the problem remained.
Searching MOS turned up the following notes:
11gR2 GI Node May not Join the Cluster After Private Network is Functional After Eviction due to Private Network Problem (Doc ID 1479380.1)
Bug 13653178 - OCSSD from a rebooted node cannot rejoin the cluster (Doc ID 13653178.8)
The fix is to kill the GIPCD process on the SURVIVING node during an off-peak window. Following this method resolved the problem.
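A sketch of that workaround, assuming the 11.2 gipcd daemon runs as `<GRID_HOME>/bin/gipcd.bin` (typical for Grid Infrastructure homes); the dry-run structure is my own illustration, not a command from the MOS note:

```shell
# Run on the SURVIVING node, in an off-peak window.
# Assumption: the gipcd daemon's command line is <GRID_HOME>/bin/gipcd.bin.
gipcd_pids() {
  # -x: the whole command line must match, to avoid false positives
  pgrep -fx '.*/gipcd\.bin'
}

pids=$(gipcd_pids || true)
if [ -n "$pids" ]; then
  # ohasd's agent respawns gipcd.bin within seconds of the kill
  kill -9 $pids && echo "killed gipcd.bin: $pids"
else
  echo "no gipcd.bin process found"
fi

# Afterwards, verify that a new gipcd.bin pid exists and that the rank
# reported in gipcd.log climbs back to 99.
```

Killing the daemon is safe here precisely because ohasd supervises it; the point of the workaround is that the freshly started gipcd re-evaluates the interface state instead of carrying over the stale "bad" rank.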