
Brain Split?

A genuine split brain is not something you run into very often, but we did. Two months ago I was doing failure testing on a two-node 10.2.0.4 RAC running on AIX 6.1; the main scenario was pulling the RAC interconnect cable to verify that CRS could correctly handle the loss of the private network. During the formal test neither host rebooted, and the CSS on each side concluded that the other node was down. As a result both nodes believed they were the sole survivor, which is exactly what we call a split brain (the author's "brain split"); in this state the instances usually hang because of the LMON process. Let's compare the logs from the two nodes and analyze further:


STEP 1: At 20:41:19 the network cable was physically pulled; the nodes could no longer communicate, and the 600-second misscount countdown began

Node 1:

[    CSSD]2010-06-22 20:41:21.465 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(2) checkin(s)
[    CSSD]2010-06-22 20:41:22.465 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(3) checkin(s)
.........
[    CSSD]2010-06-22 20:51:17.956 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(598) checkin(s)
[    CSSD]2010-06-22 20:51:18.963 [3342] >TRACE:   clssnmPollingThread: node gis2 (2) missed(599) checkin(s)
[    CSSD]2010-06-22 20:51:19.963 [3342] >TRACE:   clssnmPollingThread: Eviction started for node gis2 (2), flags 0x0001, state 3, wt4c 0

/* After the countdown completed, node 1 started evicting node 2 */

Node 2:

[    CSSD]2010-06-22 20:41:19.598 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(2) checkin(s)
[    CSSD]2010-06-22 20:41:20.599 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(3) checkin(s)
......................
[    CSSD]2010-06-22 20:51:15.871 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(598) checkin(s)
[    CSSD]2010-06-22 20:51:16.871 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(599) checkin(s)
[    CSSD]2010-06-22 20:51:17.878 [3342] >TRACE:   clssnmPollingThread: Eviction started for node gis1 (1), flags 0x0001, state 3, wt4c 0

/* Likewise, node 2 started evicting node 1 at 20:51, ten minutes after the cable was pulled */
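Incidentally, the 600-second window counted down above is the CSS misscount. If you want to confirm what it is set to in your own environment, a minimal check might look like the following (a sketch assuming a 10.2-style CRS home in $ORA_CRS_HOME; changing the value should only ever be done following Oracle's documented procedure):

# network heartbeat timeout in seconds; 600 here matches the missed(599) countdown in the log
$ORA_CRS_HOME/bin/crsctl get css misscount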

STEP 2: Both nodes read the disk heartbeat information (clssnmReadDskHeartbeat) and each concluded that the other node was down

Node 1:

[    CSSD]2010-06-22 20:51:20.964 [3856] >TRACE:   clssnmSetupAckWait: node(1) is ACTIVE
[    CSSD]2010-06-22 20:51:20.964 [3856] >TRACE:   clssnmSendVote: syncSeqNo(3)
[    CSSD]2010-06-22 20:51:20.964 [3856] >TRACE:   clssnmWaitForAcks: Ack message type(13), ackCount(1)
[    CSSD]2010-06-22 20:51:20.965 [2057] >TRACE:   clssnmSendVoteInfo: node(1) syncSeqNo(3)
[    CSSD]2010-06-22 20:51:21.714 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4185) LATS(1628594178) Disk lastSeqNo(4185)
[    CSSD]2010-06-22 20:51:21.965 [3856] >TRACE:   clssnmWaitForAcks: done, msg type(13)
[    CSSD]2010-06-22 20:51:21.965 [3856] >TRACE:   clssnmCheckDskInfo: Checking disk info...
[    CSSD]2010-06-22 20:51:22.718 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4186) LATS(1628595183) Disk lastSeqNo(4186)
[    CSSD]2010-06-22 20:51:22.964 [3342] >TRACE:   clssnmPollingThread: node gis1 (1) missed(2) checkin(s)
[    CSSD]2010-06-22 20:51:23.722 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4187) LATS(1628596186) Disk lastSeqNo(4187)
[    CSSD]2010-06-22 20:51:24.724 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4188) LATS(1628597189) Disk lastSeqNo(4188)
.............................
[    CSSD]2010-06-22 20:59:49.953 [1543] >TRACE:   clssnmReadDskHeartbeat: node(2) is down. rcfg(3) wrtcnt(4692) LATS(1629102418) Disk lastSeqNo(4692)
[    CSSD]2010-06-22 20:59:50.057 [3085] >TRACE:   clssgmPeerDeactivate: node 2 (gis2), death 0, state 0x80000001 connstate 0xf
[    CSSD]2010-06-22 20:59:50.104 [1029] >TRACE:   clssnm_skgxncheck: CSS daemon failed on node 2
[    CSSD]2010-06-22 20:59:50.382 [2314] >TRACE:   clssgmClientConnectMsg: Connect from con(112a6c5b0) proc(112a5a190) pid() proto(10:2:1:1)
[    CSSD]2010-06-22 20:59:51.231 [3856] >TRACE:   clssnmEvict: Start
[    CSSD]2010-06-22 20:59:51.231 [3856] >TRACE:   clssnmEvict: Evicting node 2, birth 1, death 3, killme 1
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmWaitOnEvictions: Start
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmWaitOnEvictions: Node(0) down, LATS(0),timeout(1629103696)
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmWaitOnEvictions: Node(2) down, LATS(1629102418),timeout(1278)
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmSetupAckWait: Ack message type (15)
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmSetupAckWait: node(1) is ACTIVE
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmSendUpdate: syncSeqNo(3)
[    CSSD]2010-06-22 20:59:51.232 [3856] >TRACE:   clssnmWaitForAcks: Ack message type(15), ackCount(1)
[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmDeactivateNode: node 0 () left cluster
[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmUpdateNodeState: node 1, state (3/3) unique (1277207505/1277207505) prevConuni(0) birth (2/2) (old/new)
[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmUpdateNodeState: node 2, state (0/0) unique (1277206874/1277206874) prevConuni(1277206874) birth (1/0) (old/new)
[    CSSD]2010-06-22 20:59:51.232 [2057] >TRACE:   clssnmDeactivateNode: node 2 (gis2) left cluster
[    CSSD]2010-06-22 20:59:51.233 [2057] >USER:    clssnmHandleUpdate: SYNC(3) from node(1) completed
[    CSSD]2010-06-22 20:59:51.233 [2057] >USER:    clssnmHandleUpdate: NODE 1 (gis1) IS ACTIVE MEMBER OF CLUSTER
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmReconfigThread:  started for reconfig (3)
[    CSSD]2010-06-22 20:59:51.310 [4114] >USER:    NMEVENT_RECONFIG [00][00][00][02]
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock crs_version type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_1_gisdb type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_1_gisdb type 3
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRD_2_gisdb type 3
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupOrphanMembers: cleaning up remote mbr(0) grock(ORA_CLSRD_2_gisdb) birth(1/0)
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock DBGISDB type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock DGGISDB type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock DAALL_DB type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock EVMDMAIN type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock CRSDMAIN type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock IGGISDBALL type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ocr_crs type 2
[    CSSD]2010-06-22 20:59:51.310 [4114] >TRACE:   clssgmCleanupGrocks: cleaning up grock ORA_CLSRCSN_SRV_gisdb1 type 3
[    CSSD]2010-06-22 20:59:51.311 [4114] >TRACE:   clssgmEstablishConnections: 1 nodes in cluster incarn 3
[    CSSD]2010-06-22 20:59:51.311 [3085] >TRACE:   clssgmPeerListener: connects done (1/1)
[    CSSD]2010-06-22 20:59:51.311 [4114] >TRACE:   clssgmEstablishMasterNode: MASTER for 3 is node(1) birth(2)
[    CSSD]2010-06-22 20:59:51.311 [4114] >TRACE:   clssgmChangeMasterNode: requeued 1 RPCs
[    CSSD]2010-06-22 20:59:51.311 [4114] >TRACE:   clssgmMasterCMSync: Synchronizing group/lock status
[    CSSD]2010-06-22 20:59:51.312 [4114] >TRACE:   clssgmMasterSendDBDone: group/lock status synchronization complete
[    CSSD]CLSS-3000: reconfiguration successful, incarnation 3 with 1 nodes
[    CSSD]CLSS-3001: local node number 1, master node number 1

/* After roughly 8 minutes of disk heartbeats, node 1 decided "CSS daemon failed on node 2", formally considered node 2 to have left the cluster, and completed the reconfiguration */
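At this point each node is relying on the voting disk rather than the interconnect, so it can be useful to see the voting disk configuration and pull the key eviction markers out of ocssd.log. A minimal sketch, assuming 10.2-style commands and log paths with $ORA_CRS_HOME set:

# voting disk(s) that CSS heartbeats against
$ORA_CRS_HOME/bin/crsctl query css votedisk

# nodes registered in the cluster, with node numbers
$ORA_CRS_HOME/bin/olsnodes -n

# key eviction / reconfiguration markers in the CSS log
grep -E "Eviction started|clssnmEvict|clssnmDeactivateNode|CLSS-300" \
     $ORA_CRS_HOME/log/`hostname`/cssd/ocssd.log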

Node 2:

[    CSSD]2010-06-22 20:51:18.892 [3856] >TRACE:   clssnmSendVote: syncSeqNo(3)
[    CSSD]2010-06-22 20:51:18.892 [3856] >TRACE:   clssnmWaitForAcks: Ack message type(13), ackCount(1)
[    CSSD]2010-06-22 20:51:18.892 [2057] >TRACE:   clssnmSendVoteInfo: node(2) syncSeqNo(3)
[    CSSD]2010-06-22 20:51:19.287 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(3548) LATS(351788040) Disk lastSeqNo(3548)
[    CSSD]2010-06-22 20:51:19.892 [3856] >TRACE:   clssnmWaitForAcks: done, msg type(13)
[    CSSD]2010-06-22 20:51:19.892 [3856] >TRACE:   clssnmCheckDskInfo: Checking disk info...
[    CSSD]2010-06-22 20:51:20.288 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(3549) LATS(351789041) Disk lastSeqNo(3549)
[    CSSD]2010-06-22 20:51:21.308 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(3550) LATS(351790062) Disk lastSeqNo(3550)
...........................
[    CSSD]2010-06-22 20:59:46.122 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(4051) LATS(352294875) Disk lastSeqNo(4051)
[    CSSD]2010-06-22 20:59:46.341 [2314] >TRACE:   clssgmClientConnectMsg: Connect from con(112947c70) proc(112946f90) pid() proto(10:2:1:1)
[    CSSD]2010-06-22 20:59:46.355 [2314] >WARNING: clssgmShutDown: Received explicit shutdown request from client.
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a50210) proc (112a4e3d0)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a50cd0) proc (112a4e3d0)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a536f0) proc (112a4e3d0)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a4eb90) proc (112a4eef0)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112a69250) proc (112a67e10)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: Aborting client (112946050) proc (112945e50)
[    CSSD]2010-06-22 20:59:46.355 [2314] >TRACE:   clssgmClientShutdown: waited 0 seconds on 6 IO capable clients
[    CSSD]2010-06-22 20:59:46.494 [2314] >WARNING: clssgmClientShutdown: graceful shutdown completed.
[    CSSD]2010-06-22 20:59:47.130 [1543] >TRACE:   clssnmReadDskHeartbeat: node(1) is down. rcfg(3) wrtcnt(4052) LATS(352295883) Disk lastSeqNo(4052)
[    CSSD]2010-06-22 21:34:40.167 >USER:    Oracle Database 10g CSS Release 10.2.0.1.0 Production Copyright 1996, 2004 Oracle.  All rights reserved.

/* Node 2 likewise kept up its disk heartbeat and concluded node(1) is down; in the end it was shut down manually. After the network fault was repaired, CSS was restarted at 21:34 */
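The "explicit shutdown request" above and the CSS startup at 21:34 correspond to the stack being stopped and later started by hand. If you need to do the same during this kind of test, the commands on one node are roughly as follows (run as root; a sketch, not a substitute for the platform-specific procedure):

# stop the clusterware stack on the local node
$ORA_CRS_HOME/bin/crsctl stop crs

# after the interconnect is repaired, bring it back and verify the daemons
$ORA_CRS_HOME/bin/crsctl start crs
$ORA_CRS_HOME/bin/crsctl check crs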

If you read the logs above carefully, you probably noticed the line "Oracle Database 10g CSS Release 10.2.0.1.0". Wasn't this RAC supposed to be 10.2.0.4? Why was CSS still at 10.2.0.1? The post-mortem revealed that the engineer from the domestic contractor (let's call them "X") who installed the system had forgotten to run the final root102.sh script when patching CRS. That script updates the OCR/voting disk and the actual CRS binaries; if it is not run after the patch installation, the upgrade has no effect at all and only the information in oraInventory is updated. At first the engineer flatly refused to admit he had forgotten to run the script, but in fact applying the 10.2.0.4 CRS patch on AIX 6.1 requires granting the oracle user special capabilities, unlike on AIX 5.3, namely:

chuser capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE,CAP_NUMA_ATTACH oracle

/* to verify: */

lsuser -f oracle | grep capabilities
        capabilities=CAP_BYPASS_RAC_VMM,CAP_PROPAGATE,CAP_NUMA_ATTACH
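With the capabilities granted, the step that was skipped is simply the root script shipped with the 10.2.0.4 patch set. Roughly, assuming a default 10.2 CRS install layout under $ORA_CRS_HOME:

# run as root on each node in turn, with the stack down, following the patch set README
$ORA_CRS_HOME/install/root102.sh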

If the oracle user has not been granted the capabilities above, root102.sh will fail with errors when run. Another telltale sign is the pre10204/pre10205 directory: once root102.sh has been run, a directory of the form pre$VERSION appears under $ORA_CRS_HOME/install; if it is absent, the script has normally never been run (though it may also have been deleted afterwards, which is not recommended). With this information, tracing the root cause of this split brain became much simpler. The CRS in 10.2.0.1 is riddled with bugs, and quite a few new parameters and mechanisms were added to CRS between 10.2.0.1 and 10.2.0.4. Searching MOS for the keywords "brain split CSS" turns up the following case:

Hdr: 8293652 10.2.0.3 PCW 10.2.0.3 OSD PRODID-5 PORTID-46
Abstract: CSS CANNOT HANDLE SPLIT-BRAIN AND DB INSTANCE RECEIVES ORA-29740

PROBLEM:
--------
config: 2-node RAC: Node1 (pdb01) and Node2 (pdb02)
There's no time difference between two nodes.
pdb02 got ORA-29740 and terminated at "Tue Feb 17 12:13:06 2009"
ORA-29740 occured with reason 1.
After ORA-29740 happened, the instance won't be able to start until rebooting OS.
After rebooting OS, everything was fine and instances were up.

DIAGNOSTIC ANALYSIS:
--------------------
clssnmReadDskHeartbeat: node(2) is down. rcfg(8) wrtcnt(2494425) LATS(1205488794) Disk lastSeqNo(2494425)
clssgmMasterSendDBDone: group/lock status synchronization complete

WORKAROUND:
-----------
none

RELATED BUGS:
-------------

REPRODUCIBILITY:
----------------
once at ct's env.

TEST CASE:
----------

STACK TRACE:
------------

SUPPORTING INFORMATION:
-----------------------

24 HOUR CONTACT INFORMATION FOR P1 BUGS:
----------------------------------------

DIAL-IN INFORMATION:
--------------------

IMPACT DATE:
------------

Does ct apply any CRS(bundle) patch ?
When problem happen, cssd can't connect each other via interconnect, but both cssd can do heartbeat to voting disk. However, both cssd consider that "I'm survivor".

Looking into node 1 cssd.
* 12:02:30.856 - Initiated sync
[ CSSD]2009-02-17 12:02:30.856 [1262557536] >TRACE: clssnmDoSyncUpdate: Initiating sync 7
[ CSSD]2009-02-17 12:02:30.856 [1262557536] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (57000)ms
* Checking voting disk, and find node2 is still voting and living.
[ CSSD]2009-02-17 12:02:30.874 [1262557536] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2009-02-17 12:02:30.874 [1262557536] >TRACE: clssnmCheckDskInfo: node(2) timeout(20) state_network(5) state_disk(3) misstime(0)
[ CSSD]2009-02-17 12:02:31.878 [1262557536] >TRACE: clssnmCheckDskInfo: node(2) disk HB found, network state 5, disk state(3) misstime(1010)
* Compared cluster size and confirmed it can survive.
[ CSSD]2009-02-17 12:02:34.885 [1262557536] >TRACE: clssnmCheckDskInfo: node 2, iz-pdb02, state 5 with leader 2 has smaller cluster size 1; my cluster size 1 with leader 1
* Then finished
[ CSSD]2009-02-17 12:02:34.886 [1262557536] >TRACE: clssnmDoSyncUpdate: Sync Complete!

*** 03/08/09 11:23 pm ***
Looking into node 2 cssd log.
* 12:02:20.647 - initiated sync protocol
[ CSSD]2009-02-17 12:02:20.647 [1262557536] >TRACE: clssnmDoSyncUpdate: Initiating sync 7
[ CSSD]2009-02-17 12:02:20.647 [1262557536] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (57000)ms
* Checking disk and find node1 does not do disk heartbeart for 59690 ms. it would have waited for misscount and considered node 1 is dead
[ CSSD]2009-02-17 12:02:22.285 [1262557536] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2009-02-17 12:02:22.285 [1262557536] >TRACE: clssnmCheckDskInfo: node(1) timeout(59690) state_network(5) state_disk(3) misstime(61000)
* node2 is the only active member of cluster, finished.
[ CSSD]2009-02-17 12:02:22.723 [1262557536] >TRACE: clssnmDoSyncUpdate: Sync Complete!

*** 03/08/09 11:45 pm ***
So, strange point is, node 2 cssd says node 1 cssd didn't do disk heartbeat for 60 seconds.
Looking into node1 cssd log just before initiating sync. We see 87sec gap.
-----------------------------------
[ CSSD]2009-02-17 12:00:19.354 [1199618400] >TRACE: clssgmClientConnectMsg: Connect from con(0x784d80) proc(0x7749b0) pid(14746) proto(10:2:1:1)
[ CSSD]2009-02-17 12:00:33.338 [1199618400] >TRACE: clssgmClientConnectMsg: Connect from con(0x7d8620) proc(0x75cfb0) pid() proto(10:2:1:1)
[ CSSD]2009-02-17 12:01:03.688 [1199618400] >TRACE: clssgmClientConnectMsg: Connect from con(0x76a390) proc(0x75cfb0) pid(13634) proto(10:2:1:1)
[ CSSD]2009-02-17 12:02:30.855 [1168148832] >WARNING: clssnmDiskPMT: sltscvtimewait timeout (69200)
[ CSSD]2009-02-17 12:02:30.855 [1189128544] >WARNING: clssnmeventhndlr: Receive failure with node 2 (iz-pdb02), state 3, con(0x72b980), probe((nil)), rc=10
[ CSSD]2009-02-17 12:02:30.855 [1189128544] >TRACE: clssnmDiscHelper: iz-pdb02, node(2) connection failed, con (0x72b980), probe((nil))
[ CSSD]2009-02-17 12:02:30.856 [1262557536] >TRACE: clssnmDoSyncUpdate: Initiating sync 7

As an interesting point, clssgmClientConnectMsg does not show message, but nm polling thread/disk ping thread does not warn timeout. (Usually it should write message first at 50% of timeout = 30 sec)
And "sltscvtimewait timeout (69200)" message means, DiskPingMonitor thread does not run for a 69200 ms whereas it just wants to sleep 1 second.

These suggest, cssd does not scheduled about 70 seconds on node 1. I don't see any log from DiskPingThread, but I assume it is suspended at some point also, and back to work after 70 seconds.

Please check OS message file to see any interesting error is recorded.
To prevent this issue, keep watching OS performance to see if any extreme high load does not happen.
Recommended solution it to go to 10.2.0.4 and use oprocd so that we can expect oprocd kill node1 in such case.
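The closing advice in that bug is to check the OS message file and watch for extreme load, since ocssd apparently was not scheduled for about 70 seconds. On AIX that translates into checks along these lines (a rough sketch, not a full health check):

# AIX error report, the closest equivalent of the "OS message file"
errpt | head -20

# look for CPU starvation or paging pressure that could keep ocssd off the CPU
vmstat 5 5
sar -q 5 5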

The case above is the same kind of "I'm survivor" split brain occurring while "cssd can't connect each other via interconnect"; MOS's advice is that after upgrading to 10.2.0.4 the oprocd process can prevent this kind of disaster. Our problem was ultimately resolved by upgrading to 10.2.0.5. What this case tells us is that, in the broader IT environment here in China, we sometimes have to attend personally to every detail. In this instance I did not notice at first that the upgrade was incomplete; a colleague pointed it out to me later. It was a very elementary mistake: had the contractor's engineer carefully followed the documentation he was given step by step, or even just paid attention to the messages in the installer's graphical windows during the upgrade, this problem would never have happened. And it was not only this system; two other systems had been installed and upgraded by the same engineer, and the same issue was found on them during their later failure tests. The facts tell us: details decide success or failure!
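Tying this back to the skipped root102.sh: after any such upgrade it is worth a minute to confirm that the patch actually took effect and that the fencing daemon is present. A rough verification sketch, assuming 10.2-style crsctl and the pre$VERSION marker directory described earlier:

# after a completed upgrade the active clusterware version should show the new patch level;
# a value still at the base release is a red flag that the root script was never run
$ORA_CRS_HOME/bin/crsctl query crs activeversion
$ORA_CRS_HOME/bin/crsctl query crs softwareversion

# root102.sh leaves a pre<version> directory behind under $ORA_CRS_HOME/install
ls -d $ORA_CRS_HOME/install/pre* 2>/dev/null

# check whether the oprocd fencing daemon is running
# (whether it runs at all depends on the platform and clusterware configuration)
ps -ef | grep '[o]procd'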

This article is reposted from maclean_007's 51CTO blog; original link: http://blog.51cto.com/maclean/1277131