Troubleshooting a NIC-state problem in the gipcd process
Problem: a customer contacted me because one cluster node would not start. After logging in, I found that an earlier heartbeat (private) network problem had caused GRID on cluster node 2 to restart itself automatically; during that restart it still reported a heartbeat-network error, so startup failed. By this point the heartbeat network had recovered, yet manually starting the clusterware still did not help.

Takeaway after the final fix: handling this problem mainly requires an understanding of the 11g RAC architecture (several agents managing different resources), of which resources each cluster process manages, and of how they interact. Since the OCSSD process believed the heartbeat network was down, and the heartbeat-network state is monitored by the GIPCD process, the GIPCD log is the place to look. Once the log reveals the problem, attempt the fix; when the normal remedies fail, keep investigating, including in the direction of known bugs. This case turned out to be a bug in an old clusterware release (the affected cluster was Linux + 11.2.0.3 with no PSU); I have never hit this problem on 11.2.0.4 clusters. The handling process was as follows:
1. Check the cluster alert log and ocssd.log
=== This is the classic clusterware restart caused by a heartbeat-network failure (with the Oracle 11g Rebootless Restart feature, by default only GRID is restarted, not the host)
[crsd(7657)]CRS-2772:Server 'test1' has been assigned to pool 'Generic'.
2020-12-04 10:34:10.572
[crsd(7657)]CRS-2772:Server 'test1' has been assigned to pool 'ora.ipcc'.
2021-01-16 23:34:20.127
[cssd(7161)]CRS-1612:Network communication with node test1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.940 seconds
2021-01-16 23:34:28.128
[cssd(7161)]CRS-1611:Network communication with node test1 (1) missing for 75% of timeout interval. Removal of this node from cluster in 6.940 seconds
2021-01-16 23:34:32.129
[cssd(7161)]CRS-1610:Network communication with node test1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.940 seconds
2021-01-16 23:34:35.071
[cssd(7161)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /opt/11.2.0/grid/log/test2/cssd/ocssd.log.
2021-01-16 23:34:35.071
[cssd(7161)]CRS-1656:The CSS daemon is terminating due to a fatal error;
2. Check the operating-system log
The heartbeat network had indeed failed at the time, but it had since recovered.
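What that OS-log check looks like in practice can be sketched as follows; the kernel messages here are fabricated for illustration (the real evidence would be in /var/log/messages on node 2):

```shell
# Illustrative sample only: two made-up e1000e link-flap messages, written to
# a scratch file so the same grep can be demonstrated; on a live node you
# would run the grep against /var/log/messages instead.
cat <<'EOF' > /tmp/messages.sample
Jan 16 23:33:58 test2 kernel: e1000e: eth1 NIC Link is Down
Jan 16 23:35:12 test2 kernel: e1000e: eth1 NIC Link is Up 1000 Mbps Full Duplex
EOF

# count link state changes for the private-interconnect NIC (eth1 here)
grep -icE 'eth1.*link is (up|down)' /tmp/messages.sample   # prints 2
```

A nonzero count around the eviction timestamp confirms the interconnect really did flap, which matches what CSSD reported.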
3. Restarting the clusterware still reports no network heartbeat
The log keeps reporting "has a disk HB, but no network HB":
2021-01-20 18:18:43.082: [ CSSD][2730108672]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2021-01-20 18:18:43.236: [ CSSD][2734855936]clssnmvDHBValidateNCopy: node 1, test1, has a disk HB, but no network HB, DHB has rcfg 375467028, wrtcnt, 130988652, LATS 4294435040, lastSeqNo 130988651, uniqueness 1607049396, timestamp 1611137741/4086845544
2021-01-20 18:18:44.082: [ CSSD][2730108672]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2021-01-20 18:18:44.236: [ CSSD][2734855936]clssnmvDHBValidateNCopy: node 1, test1, has a disk HB, but no network HB, DHB has rcfg 375467028, wrtcnt, 130988653, LATS 4294436040, lastSeqNo 130988652, uniqueness 1607049396, timestamp 1611137742/4086846554
2021-01-20 18:18:45.082: [ CSSD][2730108672]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2021-01-20 18:18:45.236: [ CSSD][2734855936]clssnmvDHBValidateNCopy: node 1, test1, has a disk HB, but no network HB, DHB has rcfg 375467028, wrtcnt, 130988654, LATS 4294437040, lastSeqNo 130988653, uniqueness 1607049396, timestamp 1611137743/4086847554
2021-01-20 18:18:45.238: [ CSSD][2745972480]clssscSelect: cookie accept request 0xc39eb0
4. Analysis
The heartbeat network had recovered after the outage, yet the cluster still considered it down.
So we walked through the 11g RAC components, their agents, and the startup flow: the GIPCD process is responsible for monitoring the availability of the heartbeat network.
Looking at the GIPCD log, we can see that gipcd still considered the interface down (a rank of 0 or -1 means the network is bad; rank 99 means it is healthy):
[grid@test2 gipcd]$ tail -n 500 gipcd.log | grep rank
2021-01-20 18:16:22.480: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 33 / 0 / 0 ]
2021-01-20 18:16:52.489: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:17:22.498: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:17:52.505: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:18:22.515: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 33 / 0 / 0 ]
2021-01-20 18:18:52.524: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:19:22.532: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:19:52.541: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 33 / 0 / 0 ]
2021-01-20 18:20:22.541: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 32 / 0 / 0 ]
2021-01-20 18:20:52.551: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0] eth1 - rank 0, avgms 30000000000.000000 [ 33 / 0 / 0 ]
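The verdict hinges on the rank field buried in these lines. A minimal sketch of pulling it out, using a hypothetical `rank_of` helper (my own illustration, not part of any Oracle tooling):

```shell
# rank_of: extract the integer after "rank" from a gipcdMonitorSaveInfMetrics
# log line. rank 99 = gipcd considers the interface healthy; 0 or -1 = bad.
rank_of() {
  echo "$1" | sed -n 's/.*rank *\(-\{0,1\}[0-9]\{1,\}\),.*/\1/p'
}

line='2021-01-20 18:20:52.551: [GIPCDMON][2970949376] gipcdMonitorSaveInfMetrics: inf[ 0]  eth1 - rank    0, avgms 30000000000.000000 [ 33 / 0 / 0 ]'
rank_of "$line"   # prints 0: gipcd still flags eth1 as bad
```

On a healthy node the same extraction would yield 99; here it stays at 0 even though the interconnect has long since recovered, which is the core symptom.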
5. Resolution
Under the 11g RAC agent model, the gipcd process is managed by an agent of ohasd: if the GIPCD process is killed, ohasd will start a new gipcd daemon.
So we tried:
1. Killing the gipcd process on node 2; the problem remained.
2. Restarting the clusterware on the failed node 2; the problem remained.
3. Rebooting the node 2 host; the problem remained.
Searching MOS turned up the following notes:
11gR2 GI Node May not Join the Cluster After Private Network is Functional After Eviction due to Private Network Problem (Doc ID 1479380.1)
Bug 13653178 - OCSSD from a rebooted node cannot rejoin the cluster (Doc ID 13653178.8)
The fix is to kill the GIPCD process on the SURVIVING node during an off-peak window. Following this method resolved the problem.
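A sketch of that workaround, assuming the 11.2 gipcd daemon runs as `<GRID_HOME>/bin/gipcd.bin` (typical for Grid Infrastructure homes); the dry-run structure is my own illustration, not a command from the MOS note:

```shell
# Run on the SURVIVING node, in an off-peak window.
# Assumption: the gipcd daemon's command line is <GRID_HOME>/bin/gipcd.bin.
gipcd_pids() {
  # -x: the whole command line must match, to avoid false positives
  pgrep -fx '.*/gipcd\.bin'
}

pids=$(gipcd_pids || true)
if [ -n "$pids" ]; then
  # ohasd's agent respawns gipcd.bin within seconds of the kill
  kill -9 $pids && echo "killed gipcd.bin: $pids"
else
  echo "no gipcd.bin process found"
fi

# Afterwards, verify that a new gipcd.bin pid exists and that the rank
# reported in gipcd.log climbs back to 99.
```

Killing the daemon is safe here precisely because ohasd supervises it; the point of the workaround is that the freshly started gipcd re-evaluates the interface state instead of carrying over the stale "bad" rank.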