
A Record of RAC Node Failure Handling and Recovery (Part 1)

A production Oracle RAC database has two nodes. The local disk on node 2 failed and the /u01 directory could not be opened, so all Grid Infrastructure and Oracle software on node 2 was lost. The following records the troubleshooting and recovery of node 2. The procedure also applies to deleting and adding a node.

Overview

The production RAC has two nodes. The overall procedure follows the Oracle official documentation: https://docs.oracle.com/cd/e11882_01/rac.112/e41960/adddelunix.htm#racad7358. The environment is as follows:

Node names: wbtdb1 / wbtdb2

Database instance names: wbtdb1 / wbtdb2

Operating system: Linux 6.x

Database version: Oracle 11.2.0.4

Failure: the local disk on node 2 failed and the /u01 directory (all Grid and Oracle software) was lost

grid: the grid home (set as the grid user's ORACLE_HOME), path /u01/11.2.0/grid

oracle: ORACLE_HOME, path /u01/app/oracle/product/11.2.0/db_1

The grid user's ORACLE_BASE and ORACLE_HOME:

<code>[root@wbtdb1 ~]# su - grid
[grid@wbtdb1 ~]$ echo $ORACLE_HOME
/u01/11.2.0/grid
[grid@wbtdb1 ~]$ echo $ORACLE_BASE
/u01/app/oracle</code>

The oracle user's ORACLE_BASE and ORACLE_HOME:

<code>[root@wbtdb2 ~]# su - oracle
[oracle@wbtdb2 ~]$ echo $ORACLE_HOME
/u01/app/oracle/product/11.2.0/db_1
[oracle@wbtdb2 ~]$ echo $ORACLE_BASE
/u01/app/oracle
[oracle@wbtdb2 ~]</code>

檢視節點2狀态

From node 2 itself the software is gone: no Oracle-related command can be run any more, because the Oracle software directories are damaged.

<code>[grid@wbtdb2 ~]$ crsctl stat res -t
-bash: crsctl: command not found
[grid@wbtdb2 ~]$</code>

At this point, however, everything still looks normal from node 1:

<code>[grid@wbtdb1 ~]$ crsctl stat res -t
--------------------------------------------------------------------------------
name           target  state        server                   state_details
--------------------------------------------------------------------------------
local resources
--------------------------------------------------------------------------------
ora.archdg.dg
               online  online       wbtdb1
               online  online       wbtdb2
ora.datadg.dg
               online  online       wbtdb1
               online  online       wbtdb2
ora.listener.lsnr
               online  online       wbtdb1
               online  intermediate wbtdb2
ora.ocrvoting.dg
               online  online       wbtdb1
               online  online       wbtdb2
ora.asm
               online  online       wbtdb1                   started
               online  online       wbtdb2                   started
ora.gsd
               offline offline      wbtdb1
               offline offline      wbtdb2
ora.net1.network
               online  online       wbtdb1
               online  online       wbtdb2
ora.ons
               online  online       wbtdb1
               online  intermediate wbtdb2
ora.registry.acfs
               online  online       wbtdb1
               online  online       wbtdb2
--------------------------------------------------------------------------------
cluster resources
--------------------------------------------------------------------------------
ora.listener_scan1.lsnr
      1        online  online       wbtdb1
ora.cvu
      1        online  online       wbtdb2
ora.oc4j
      1        online  intermediate wbtdb2
ora.scan1.vip
      1        online  online       wbtdb1
ora.wbtdb.db
      1        online  online       wbtdb1                   open
      2        online  online       wbtdb2                   open
ora.wbtdb1.vip
      1        online  online       wbtdb1
ora.wbtdb2.vip
      1        online  online       wbtdb2
[grid@wbtdb1 ~]$</code>

General steps

Delete the node 2 Oracle instance and update the ORACLE_HOME database inventory

Update the GRID_HOME cluster inventory

Stop and delete the listener, and remove node 2 from the cluster

Clear the VIP information and delete node 2

Add node 2 back from node 1

Method 1: the node 2 server is down, so run dbca from node 1 as the oracle user:

<code>[root@wbtdb1 ~]# xhost +
access control disabled, clients can connect from any host
[root@wbtdb1 ~]# export DISPLAY=192.168.1.234:0.0
[root@wbtdb1 ~]# su - oracle
[oracle@wbtdb1 ~]$ xhost +
access control disabled, clients can connect from any host
xhost: must be on local machine to enable or disable access control.
[oracle@wbtdb1 ~]$
[oracle@wbtdb1 ~]$ dbca</code>

The rough steps in the DBCA GUI are as follows:

(Screenshots of the DBCA instance-deletion steps omitted.)

Method 2: delete the node 2 Oracle instance silently

<code># syntax:
dbca -silent -deleteinstance [-nodelist node_name] -gdbname gdb_name -instancename instance_name -sysdbausername sysdba -sysdbapassword password

# node_name       name of the node whose instance is being deleted
# gdb_name        global database name (check it with: select * from global_name;)
# instance_name   name of the instance to delete
# sysdba          name of an oracle user with sysdba privilege
# password        password of that sysdba user

[oracle@wbtdb1 ~]$ dbca -silent -deleteinstance -nodelist wbtdb2 -gdbname wbtdb -instancename wbtdb2 -sysdbausername sys -sysdbapassword oracle
deleting instance
1% complete
2% complete
6% complete
13% complete
20% complete
26% complete
33% complete
40% complete
46% complete
53% complete
60% complete
66% complete
completing instance management.
100% complete
look at the log file "/u01/app/oracle/cfgtoollogs/dbca/wbtdb.log" for further details.</code>

On node 1, switch to the oracle user:

Check the active redo threads / instances:

<code>set line 200
select thread#, status, instance from v$thread;

   thread# status      instance
---------- ----------- ---------------
         1 open        wbtdb1</code>

If node 2's redo thread still exists, disable it with the following command:

<code>alter database disable thread 2;</code>
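If dbca did not already remove thread 2's online redo log groups and undo tablespace, they can be dropped after disabling the thread. This is only a minimal sketch: the group numbers (3 and 4) and the undo tablespace name undotbs2 are assumptions that must first be confirmed with the queries below.

<code>-- confirm which redo log groups belong to thread 2
select group#, thread#, status from v$log where thread# = 2;

-- drop each thread 2 group found above (group numbers are examples)
alter database drop logfile group 3;
alter database drop logfile group 4;

-- confirm and drop node 2's undo tablespace (the name is an assumption)
select tablespace_name from dba_tablespaces where contents = 'UNDO';
drop tablespace undotbs2 including contents and datafiles;</code>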

Verify the database information in the OCR; the syntax is:

<code>srvctl config database -d db_unique_name</code>

For example:

<code>[oracle@wbtdb1 ~]$ srvctl config database -d wbtdb
database unique name: wbtdb
database name: wbtdb
oracle home: /u01/app/oracle/product/11.2.0/db_1
oracle user: oracle
spfile: +datadg/wbtdb/spfilewbtdb.ora
domain:
start options: open
stop options: immediate
database role: primary
management policy: automatic
server pools: wbtdb
database instances: wbtdb1
disk groups: datadg
mount point paths:
services:
type: rac
database is administrator managed</code>
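The "database instances" line above now lists only wbtdb1. As an extra, optional check from node 1 (a small sketch using a standard srvctl command):

<code>[oracle@wbtdb1 ~]$ srvctl status database -d wbtdb
# expected to report only instance wbtdb1 running on node wbtdb1</code>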

2.1  Stop and delete the listener

On node 1, as the grid user:

<code>[grid@wbtdb1 ~]$ srvctl status listener -l listener -n wbtdb2
listener listener is enabled on node(s): wbtdb2
listener listener is running on node(s): wbtdb2</code>

Then execute the following:

<code>[grid@wbtdb1 ~]$ srvctl disable listener -l listener -n wbtdb2
[grid@wbtdb1 ~]$ srvctl stop listener -l listener -n wbtdb2
prcr-1014 : failed to stop resource ora.listener.lsnr
prcr-1065 : failed to stop resource ora.listener.lsnr
crs-2675: stop of 'ora.listener.lsnr' on 'wbtdb2' failed
crs-2678: 'ora.listener.lsnr' on 'wbtdb2' has experienced an unrecoverable failure
crs-0267: human intervention required to resume its availability.
[grid@wbtdb1 ~]$ srvctl status listener -l listener -n wbtdb2
listener listener is disabled on node(s): wbtdb2
listener listener is not running on node(s): wbtdb2</code>

2.2  Update the GRID_HOME cluster inventory

On every remaining node, run the following command from the software owner's $ORACLE_HOME/oui/bin to update those nodes' inventories, giving a comma-separated list of the remaining node names (run it on every remaining node, healthy or not). The point of this step is to refresh the retained nodes' cluster inventory: although node 2's software directories have already been deleted, querying from node 1 still shows node 2's ASM instance and other resources as normal, and this command clears that stale information. The official command follows.

All remaining nodes should run it; here only one node is left, so it is executed on node 1 (as the grid user):

<code>cd $ORACLE_HOME/oui/bin
[grid@wbtdb1 bin]$ ./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={wbtdb1}"
starting oracle universal installer...

checking swap space: must be greater than 500 mb.   actual 4095 mb    passed
the inventory pointer is located at /etc/orainst.loc
the inventory is located at /u01/app/orainventory
'updatenodelist' was successful.</code>
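To see what this changed, the node list recorded in the central inventory can be inspected directly (a quick sketch; the inventory location is taken from the installer output above, and the directory is normally spelled /u01/app/oraInventory):

<code>[grid@wbtdb1 ~]$ grep -i "node name" /u01/app/oraInventory/ContentsXML/inventory.xml
# after the update, only <NODE NAME="wbtdb1"/> should remain under this home's <NODE_LIST></code>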

From the official documentation: https://docs.oracle.com/cd/e11882_01/rac.112/e41959/adddelclusterware.htm#cwadd90992

3.1  确認節點狀态是否是unpinned

Run as root or grid:

<code>[grid@wbtdb1 bin]$ olsnodes -s -t
wbtdb1 active unpinned
wbtdb2 active unpinned
[grid@wbtdb1 bin]$</code>

If the node to be deleted is in the pinned state, run the unpin command manually as root.

The official documentation reads as follows:

(Screenshot of the official documentation omitted.)

Special note: much of the material found online is incorrect. If the node is already unpinned, there is no need to run the unpin command at all.

The following commands were not needed for this recovery.

<code>crsctl unpin css -n wbtdb2
/u01/11.2.0/grid/bin/crsctl unpin css -n wbtdb2
crsctl status res -t</code>

First, stop node 2's VIP (vip_name is the name defined in /etc/hosts, here wbtdb2-vip):

As the root user:

<code>[root@wbtdb1 ~]# /u01/11.2.0/grid/bin/srvctl stop vip -i wbtdb2-vip -f
[root@wbtdb1 ~]#

# clear the VIP information: srvctl remove vip -i vip_name -f
[root@wbtdb1 ~]# /u01/11.2.0/grid/bin/srvctl remove vip -i wbtdb2-vip -f

# check the VIPs:
/u01/11.2.0/grid/bin/crsctl status res -t</code>

Only node 1's VIP is left.

<code>[grid@wbtdb1 ~]$ olsnodes -s -t
wbtdb1 active unpinned
wbtdb2 active unpinned</code>

On the healthy node 1, run the delete-node command as root:

<code>[root@wbtdb1 ~]# /u01/11.2.0/grid/bin/crsctl delete node -n wbtdb2
crs-4661: node wbtdb2 successfully deleted.</code>

Verify:

<code>[grid@wbtdb1 ~]$ olsnodes -s -t
wbtdb1 active unpinned</code>

Note: if deleting node 2 fails with CRS-4658 and CRS-4000 errors, simply kill the CRS-related processes on node 2.

<code>[root@wbtdb1 ~]# /u01/11.2.0/grid/bin/crsctl delete node -n wbtdb2
crs-4658: the clusterware stack on node wbtdb2 is not completely down.
crs-4000: command delete failed, or completed with errors.</code>

The deletion failed, so check the CRS status:

<code>[root@wbtdb1 bin]# ./crsctl stat res -t
--------------------------------------------------------------------------------
name           target  state        server                   state_details
--------------------------------------------------------------------------------
local resources
--------------------------------------------------------------------------------
ora.archdg.dg
               online  online       wbtdb1
               online  online       wbtdb2
ora.datadg.dg
               online  online       wbtdb1
               online  online       wbtdb2
ora.listener.lsnr
               online  online       wbtdb1
               offline unknown      wbtdb2
ora.ocrvoting.dg
               online  online       wbtdb1
               online  online       wbtdb2
ora.asm
               online  online       wbtdb1                   started
               online  online       wbtdb2                   started
ora.gsd
               offline offline      wbtdb1
               offline offline      wbtdb2
ora.net1.network
               online  online       wbtdb1
               online  online       wbtdb2
ora.ons
               online  online       wbtdb1
               offline unknown      wbtdb2
ora.registry.acfs
               online  online       wbtdb1
               online  online       wbtdb2
--------------------------------------------------------------------------------
cluster resources
--------------------------------------------------------------------------------
ora.listener_scan1.lsnr
      1        online  online       wbtdb1
ora.cvu
      1        online  online       wbtdb2
ora.oc4j
      1        online  intermediate wbtdb2
ora.scan1.vip
      1        online  online       wbtdb1
ora.wbtdb.db
      1        online  online       wbtdb1                   open
ora.wbtdb1.vip
      1        online  online       wbtdb1
[root@wbtdb1 bin]#</code>

Node 2's resources have all been stopped, but the CRS-related processes on node 2 are still running: the software directories were deleted while CRS was up, so although the software is gone, the processes were never cleaned up. After killing the CRS processes, node 2 can be deleted.

<code>[root@wbtdb2 ~]# ps -ef|grep d.bin
root      2546     1  4 13:10 ?        00:03:22 /u01/11.2.0/grid/bin/ohasd.bin reboot
grid      3632     1  0 13:11 ?        00:00:06 /u01/11.2.0/grid/bin/oraagent.bin
grid      3643     1  0 13:11 ?        00:00:00 /u01/11.2.0/grid/bin/mdnsd.bin
grid      3673     1  0 13:11 ?        00:00:00 /u01/11.2.0/grid/bin/gpnpd.bin
grid      3683     1  0 13:11 ?        00:00:07 /u01/11.2.0/grid/bin/gipcd.bin
root      3685     1  0 13:11 ?        00:00:06 /u01/11.2.0/grid/bin/orarootagent.bin
root      3698     1  1 13:11 ?        00:00:46 /u01/11.2.0/grid/bin/osysmond.bin
root      3717     1  0 13:11 ?        00:00:02 /u01/11.2.0/grid/bin/cssdmonitor
root      3740     1  0 13:11 ?        00:00:02 /u01/11.2.0/grid/bin/cssdagent
grid      3751     1  0 13:11 ?        00:00:09 /u01/11.2.0/grid/bin/ocssd.bin
root      3974     1  0 13:11 ?        00:00:07 /u01/11.2.0/grid/bin/octssd.bin reboot
grid      3997     1  0 13:11 ?        00:00:07 /u01/11.2.0/grid/bin/evmd.bin
root      4408     1 30 13:12 ?        00:22:08 /u01/11.2.0/grid/bin/crsd.bin reboot
grid      4484  3997  0 13:12 ?        00:00:00 /u01/11.2.0/grid/bin/evmlogger.bin -o /u01/11.2.0/grid/evm/log/evmlogger.info -l /u01/11.2.0/grid/evm/log/evmlogger.log
grid      4519     1 27 13:12 ?        00:19:39 /u01/11.2.0/grid/bin/oraagent.bin
root      4525     1 22 13:12 ?        00:16:20 /u01/11.2.0/grid/bin/orarootagent.bin
grid      4712     1  0 13:12 ?        00:00:00 /u01/11.2.0/grid/bin/scriptagent.bin
grid      4814     1  0 13:12 ?        00:00:00 /u01/11.2.0/grid/bin/tnslsnr listener -inherit
root     23792  5354  0 14:24 pts/0    00:00:00 grep d.bin
[root@wbtdb2 ~]# kill -9 2546 3632 3643 3673 3683 3685 3698 3717 3740 3751 3997 3974 4519 4525 4712 4814
[root@wbtdb2 ~]# ps -ef|grep d.bin
root     30270  5354  0 14:25 pts/0    00:00:00 grep d.bin</code>
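With the leftover daemons on node 2 killed, the delete is simply retried from node 1; this is the same crsctl command shown earlier, and it now succeeds:

<code>[root@wbtdb1 ~]# /u01/11.2.0/grid/bin/crsctl delete node -n wbtdb2
crs-4661: node wbtdb2 successfully deleted.
[root@wbtdb1 ~]# /u01/11.2.0/grid/bin/olsnodes -s -t
wbtdb1 active unpinned</code>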

3.4  Update the cluster node information in the inventory

As the grid user, run the following command on every remaining, properly running node:

<code>[grid@rac11g1 ~]$ cd $ORACLE_HOME/oui/bin
[grid@rac11g1 bin]$ echo $ORACLE_HOME
/u01/11.2.0/grid
$ ./runInstaller -updateNodeList ORACLE_HOME=Grid_home "CLUSTER_NODES={remaining_nodes_list}" CRS=TRUE -silent</code>

For example:

The actual operation:

<code>[grid@wbtdb1 bin]$ /u01/11.2.0/grid/oui/bin/runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={wbtdb1}" CRS=TRUE -silent
starting oracle universal installer...

checking swap space: must be greater than 500 mb.   actual 4095 mb    passed
the inventory pointer is located at /etc/orainst.loc
the inventory is located at /u01/app/orainventory
'updatenodelist' was successful.</code>

Check:

<code>[grid@wbtdb1 bin]$ olsnodes -s -t
wbtdb1 active unpinned
[grid@wbtdb1 bin]$</code>

<code>[grid@wbtdb1 bin]$ cluvfy stage -post nodedel -n wbtdb2 -verbose

performing post-checks for node removal

checking crs integrity...
clusterware version consistency passed
the oracle clusterware is healthy on node "wbtdb1"
crs integrity check passed
result: node removal check passed
post-check for node removal was successful.

[grid@wbtdb1 ~]$ crsctl status resource -t
--------------------------------------------------------------------------------
name           target  state        server                   state_details
--------------------------------------------------------------------------------
local resources
--------------------------------------------------------------------------------
ora.archdg.dg
               online  online       wbtdb1
ora.datadg.dg
               online  online       wbtdb1
ora.listener.lsnr
               online  online       wbtdb1
ora.ocrvoting.dg
               online  online       wbtdb1
ora.asm
               online  online       wbtdb1                   started
ora.gsd
               offline offline      wbtdb1
ora.net1.network
               online  online       wbtdb1
ora.ons
               online  online       wbtdb1
ora.registry.acfs
               online  online       wbtdb1
--------------------------------------------------------------------------------
cluster resources
--------------------------------------------------------------------------------
ora.listener_scan1.lsnr
      1        online  online       wbtdb1
ora.cvu
      1        online  online       wbtdb1
ora.oc4j
      1        online  online       wbtdb1
ora.scan1.vip
      1        online  online       wbtdb1
ora.wbtdb.db
      1        online  online       wbtdb1                   open
ora.wbtdb1.vip
      1        online  online       wbtdb1
[grid@wbtdb1 ~]$</code>

At this point, all information about node 2 has been completely removed from the cluster!

You can then verify on your own that the remaining cluster resources and the instance are in a normal state.
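For example, a minimal verification sketch on node 1 (standard commands only; the database name wbtdb comes from this environment):

<code># cluster resources on the remaining node (grid user)
crsctl stat res -t

# database and instance status registered in the OCR (oracle user)
srvctl status database -d wbtdb
srvctl config database -d wbtdb

# instance status from inside the database
sqlplus -s / as sysdba <<EOF
select instance_name, status from gv\$instance;
EOF</code>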