
Troubleshooting a case where an 11g RAC cluster could not start because the gipc process failed to recognize the NIC status


Problem: a customer contacted me because one cluster node would not start. After logging in, I found that node 2's GRID stack had earlier restarted itself because of a heartbeat (private interconnect) network problem; during the restart it kept reporting a heartbeat network error, so startup failed. By that point the heartbeat network had already recovered, but manually starting the clusterware still did not help.

Summary after resolution: handling this problem mainly requires understanding the 11g RAC architecture (several agents managing different resources), which resources each cluster daemon manages, and how the daemons interact. Knowing that OCSSD considered the heartbeat network abnormal, and that the heartbeat network state is monitored by the GIPCD process, points you to the GIPCD log. Once the problem is found in the log, try to fix it; when the usual fixes fail, keep digging, including in the direction of known bugs. This problem turned out to be a bug on low-version clusters (the affected cluster was Linux + 11.2.0.3 with no PSU applied). I have never hit it on an 11.2.0.4 cluster. The handling process was as follows:

1. Check the cluster alert log and ocssd.log

=== This is a classic clusterware restart caused by a heartbeat network failure. (With the Oracle 11g Rebootless Restart feature, by default only GRID is restarted, not the host.)
[crsd(7657)]CRS-2772:Server 'test1' has been assigned to pool 'Generic'.
2020-12-04 10:34:10.572
[crsd(7657)]CRS-2772:Server 'test1' has been assigned to pool 'ora.ipcc'.
2021-01-16 23:34:20.127
[cssd(7161)]CRS-1612:Network communication with node test1 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 14.940 seconds
2021-01-16 23:34:28.128
[cssd(7161)]CRS-1611:Network communication with node test1 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 6.940 seconds
2021-01-16 23:34:32.129
[cssd(7161)]CRS-1610:Network communication with node test1 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.940 seconds
2021-01-16 23:34:35.071
[cssd(7161)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /opt/11.2.0/grid/log/test2/cssd/ocssd.log.
2021-01-16 23:34:35.071
[cssd(7161)]CRS-1656:The CSS daemon is terminating due to a fatal error; 
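The countdown in those CRS-1612/1611/1610 messages is consistent with the default CSS misscount of 30 seconds (on a real cluster, confirm the actual value with `crsctl get css misscount`). A quick sketch of the arithmetic, assuming that 30-second default:

```shell
# Assumption: the default CSS misscount of 30 seconds (verify on a live
# cluster with: crsctl get css misscount).
# CRS-1612/1611/1610 fire when the network heartbeat has been missing for
# 50%, 75%, and 90% of misscount; the remainder is the eviction countdown.
misscount=30
summary=$(for pct in 50 75 90; do
    elapsed=$(( misscount * pct / 100 ))
    echo "at ${pct}%: about $(( misscount - elapsed ))s until node removal"
done)
echo "$summary"
```

The 14.940/6.940/2.940 seconds in the log are these values minus the fraction of the interval already elapsed when the message was printed.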

2. Check the operating system logs

The OS logs confirmed that the heartbeat network was indeed abnormal at the time, but it had since recovered.
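A minimal sketch of an OS-level link check, assuming eth1 is the private interconnect NIC on this cluster; it reads the operational state the Linux kernel reports via sysfs ("up" means the link has recovered):

```shell
# Minimal sketch, assuming eth1 is the private interconnect NIC.
# Reads the link state the Linux kernel exposes via sysfs.
check_nic() {
    iface=$1
    if [ -e "/sys/class/net/$iface/operstate" ]; then
        printf '%s: %s\n' "$iface" "$(cat "/sys/class/net/$iface/operstate")"
    else
        printf '%s: no such interface\n' "$iface"
    fi
}

check_nic eth1
# Also worth checking: "ethtool eth1" (Link detected?) and the outage
# window in /var/log/messages.
```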

3. Restarting the clusterware still reports no network heartbeat

The log keeps reporting "has a disk HB, but no network HB":
2021-01-20 18:18:43.082: [    CSSD][2730108672]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2021-01-20 18:18:43.236: [    CSSD][2734855936]clssnmvDHBValidateNCopy: node 1, test1, has a disk HB, but no network HB, DHB has rcfg 375467028, wrtcnt, 130988652, LATS 4294435040, lastSeqNo 130988651, uniqueness 1607049396, timestamp 1611137741/4086845544
2021-01-20 18:18:44.082: [    CSSD][2730108672]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2021-01-20 18:18:44.236: [    CSSD][2734855936]clssnmvDHBValidateNCopy: node 1, test1, has a disk HB, but no network HB, DHB has rcfg 375467028, wrtcnt, 130988653, LATS 4294436040, lastSeqNo 130988652, uniqueness 1607049396, timestamp 1611137742/4086846554
2021-01-20 18:18:45.082: [    CSSD][2730108672]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2021-01-20 18:18:45.236: [    CSSD][2734855936]clssnmvDHBValidateNCopy: node 1, test1, has a disk HB, but no network HB, DHB has rcfg 375467028, wrtcnt, 130988654, LATS 4294437040, lastSeqNo 130988653, uniqueness 1607049396, timestamp 1611137743/4086847554
2021-01-20 18:18:45.238: [    CSSD][2745972480]clssscSelect: cookie accept request 0xc39eb0
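One detail worth reading out of those clssnmvDHBValidateNCopy lines: the wrtcnt field is node 1's disk-heartbeat write counter, and it increments on every line. Node 1 is therefore alive and writing to the voting disk; only the network heartbeat is missing. A small sketch of that check, using the wrtcnt values copied from the excerpt above:

```shell
# wrtcnt values copied from the ocssd.log excerpt above. If wrtcnt keeps
# increasing, the remote node's disk heartbeat is alive.
wrtcnts='130988652
130988653
130988654'

verdict=$(echo "$wrtcnts" | awk '
NR > 1 && $1 + 0 > prev + 0 { rising++ }
{ prev = $1; total++ }
END {
    if (rising == total - 1) print "disk HB alive (wrtcnt increasing)"
    else                     print "wrtcnt stalled"
}')
echo "$verdict"
```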

4. Analysis

The heartbeat network had recovered after the outage, yet the cluster still considered it down.
So we worked through the 11g RAC startup flow and the agents of the various components: the GIPCD process is responsible for monitoring the availability of the heartbeat network.
Looking at the GIPCD log, we can see that GIPCD still considered the NIC abnormal (a rank of 0 or -1 means the network is bad; rank 99 means healthy):
2021-01-20 18:16:52.489: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0]  eth1                 - rank    0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:16:52.634: [ CLSINET][2970949376] Returning NETDATA: 1 interfaces

$ tail -n 500 gipcd.log | grep rank
2021-01-20 18:16:22.480: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0]  eth1                 - rank    0, avgms 30000000000.000000 [ 33 / 0 / 0 ]
2021-01-20 18:16:52.489: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0]  eth1                 - rank    0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:17:22.498: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0]  eth1                 - rank    0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:17:52.505: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0]  eth1                 - rank    0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:18:22.515: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0]  eth1                 - rank    0, avgms 30000000000.000000 [ 33 / 0 / 0 ]
2021-01-20 18:18:52.524: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0]  eth1                 - rank    0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:19:22.532: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0]  eth1                 - rank    0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:19:52.541: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0]  eth1                 - rank    0, avgms 30000000000.000000 [ 33 / 0 / 0 ]
2021-01-20 18:20:22.541: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0]  eth1                 - rank    0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:20:52.551: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0]  eth1                 - rank    0, avgms 30000000000.000000 [ 33 / 0 / 0 ]
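To scan a gipcd.log for this state in one pass, here is a small sketch that pulls the interface name and rank out of the gipcdMonitorSaveInfMetrics lines (the sample line is copied from the output above):

```shell
# Sample line copied from the gipcd.log output above.
sample='2021-01-20 18:20:52.551: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0]  eth1                 - rank    0, avgms 30000000000.000000 [ 33 / 0 / 0 ]'

# rank 99 = GIPCD considers the interface healthy; 0 or -1 = down.
result=$(echo "$sample" | awk '
/gipcdMonitorSaveInfMetrics/ {
    for (i = 1; i <= NF; i++)
        if ($i == "rank") {
            rank = $(i + 1); sub(/,/, "", rank)
            iface = $(i - 2)
            if (rank + 0 == 99) print iface " rank=" rank " HEALTHY"
            else                print iface " rank=" rank " DOWN"
        }
}')
echo "$result"
```

On a live system the same awk body can be fed from `tail -f gipcd.log` to watch the rank change after a fix is applied.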

5. Resolution

In the 11g RAC agent/resource-management model, the gipcd daemon is managed by the agent process belonging to ohasd; if the GIPCD process is killed, ohasd will start a new gipcd daemon.

So we tried:

        1. Killing the process; the problem was not solved.

        2. Restarting the clusterware on the faulty node 2; the problem was not solved.

        3. Rebooting the node 2 host; the problem was not solved.

A search on MOS turned up the following documents:

11gR2 GI Node May not Join the Cluster After Private Network is Functional After Eviction due to Private Network Problem (Doc ID 1479380.1)

Bug 13653178 - OCSSD from a rebooted node cannot rejoin the cluster (Doc ID 13653178.8)

The fix is to kill the GIPCD process on the surviving node during an off-peak window. Applying this method resolved the problem.
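A hedged sketch of that workaround (per Doc ID 1479380.1), to be run on the surviving node during off-peak hours; ohasd's agent respawns gipcd automatically, and the fresh daemon re-evaluates the interface state so the evicted node can rejoin:

```shell
# Sketch of the workaround: kill gipcd.bin on the SURVIVING node.
# ohasd (via its agent) restarts the daemon automatically.
kill_gipcd() {
    # pgrep -x matches the exact process name, so we cannot accidentally
    # match our own command line; "|| true" tolerates no-match.
    pid=$(pgrep -x gipcd.bin || true)
    if [ -n "$pid" ]; then
        kill -9 $pid
        echo "killed gipcd (pid $pid); ohasd will respawn it"
    else
        echo "gipcd.bin not running on this node"
    fi
}

msg=$(kill_gipcd)
echo "$msg"
```

After the respawn, watch gipcd.log on the surviving node: once the interface is seen as healthy again the rank should report 99, and the waiting node's CSSD can complete the join.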