
A Record of RAC Node Failure Handling and Recovery (Part 1)

A production Oracle RAC database has two nodes. The local disk on node 2 failed and the /u01 directory could not be opened, so all Grid Infrastructure and Oracle software on node 2 was lost. The following records the troubleshooting and recovery of node 2. The procedure also applies to deleting and adding a node.

Overview

The production RAC has two nodes. The overall procedure follows the Oracle official documentation: https://docs.oracle.com/cd/e11882_01/rac.112/e41960/adddelunix.htm#racad7358. The environment is as follows:

Node names: wbtdb1 / wbtdb2

Database instance names: wbtdb1 / wbtdb2

Operating system: Linux 6.x

Database version: Oracle 11.2.0.4

Failure: the local disk on node 2 failed and the /u01 directory (all Grid and Oracle software) was lost

grid: the grid home (set as the grid user's ORACLE_HOME), path /u01/11.2.0/grid

oracle: ORACLE_HOME, path /u01/app/oracle/product/11.2.0/db_1

The grid user's ORACLE_BASE and ORACLE_HOME:

<code>[root@wbtdb1 ~]# su - grid
[grid@wbtdb1 ~]$ echo $ORACLE_HOME
/u01/11.2.0/grid
[grid@wbtdb1 ~]$ echo $ORACLE_BASE
/u01/app/oracle</code>

The oracle user's ORACLE_BASE and ORACLE_HOME:

<code>[root@wbtdb2 ~]# su - oracle
[oracle@wbtdb2 ~]$ echo $ORACLE_HOME
/u01/app/oracle/product/11.2.0/db_1
[oracle@wbtdb2 ~]$ echo $ORACLE_BASE
/u01/app/oracle
[oracle@wbtdb2 ~]</code>

檢視節點2狀态

From node 2 itself the software is gone: no Oracle-related command can be run any more, because the Oracle software directories are damaged.

<code>[grid@wbtdb2 ~]$ crsctl stat res -t
-bash: crsctl: command not found
[grid@wbtdb2 ~]$</code>

At this point, however, everything still looks normal from node 1:

<code>[grid@wbtdb1 ~]$ crsctl stat res -t
--------------------------------------------------------------------------------
name           target  state        server                   state_details
--------------------------------------------------------------------------------
local resources
--------------------------------------------------------------------------------
ora.archdg.dg
               online  online       wbtdb1
               online  online       wbtdb2
ora.datadg.dg
               online  online       wbtdb1
               online  online       wbtdb2
ora.listener.lsnr
               online  online       wbtdb1
               online  intermediate wbtdb2
ora.ocrvoting.dg
               online  online       wbtdb1
               online  online       wbtdb2
ora.asm
               online  online       wbtdb1                   started
               online  online       wbtdb2                   started
ora.gsd
               offline offline      wbtdb1
               offline offline      wbtdb2
ora.net1.network
               online  online       wbtdb1
               online  online       wbtdb2
ora.ons
               online  online       wbtdb1
               online  intermediate wbtdb2
ora.registry.acfs
               online  online       wbtdb1
               online  online       wbtdb2
--------------------------------------------------------------------------------
cluster resources
--------------------------------------------------------------------------------
ora.listener_scan1.lsnr
      1        online  online       wbtdb1
ora.cvu
      1        online  online       wbtdb2
ora.oc4j
      1        online  intermediate wbtdb2
ora.scan1.vip
      1        online  online       wbtdb1
ora.wbtdb.db
      1        online  online       wbtdb1                   open
      2        online  online       wbtdb2                   open
ora.wbtdb1.vip
      1        online  online       wbtdb1
ora.wbtdb2.vip
      1        online  online       wbtdb2
[grid@wbtdb1 ~]$</code>

General steps

Delete the node 2 Oracle instance and update the ORACLE_HOME database inventory

Update the GRID_HOME cluster inventory

Stop and delete the listener, and remove node 2 from the cluster

Clear the VIP information and delete node 2

Add node 2 back from node 1

Method 1: the node 2 server is down, so run dbca from node 1 as the oracle user:

<code>[root@wbtdb1 ~]# xhost +
access control disabled, clients can connect from any host
[root@wbtdb1 ~]# export DISPLAY=192.168.1.234:0.0
[root@wbtdb1 ~]# su - oracle
[oracle@wbtdb1 ~]$ xhost +
access control disabled, clients can connect from any host
xhost: must be on local machine to enable or disable access control.
[oracle@wbtdb1 ~]$
[oracle@wbtdb1 ~]$ dbca</code>

The rough steps in the DBCA GUI are as follows:

(Screenshots of the DBCA instance-deletion steps omitted.)

Method 2: delete the node 2 Oracle instance silently

<code># syntax:
dbca -silent -deleteinstance [-nodelist node_name] -gdbname gdb_name -instancename instance_name -sysdbausername sysdba -sysdbapassword password

# node_name       name of the node whose instance is being deleted
# gdb_name        global database name (check it with: select * from global_name;)
# instance_name   name of the instance to delete
# sysdba          name of an oracle user with sysdba privilege
# password        password of that sysdba user

[oracle@wbtdb1 ~]$ dbca -silent -deleteinstance -nodelist wbtdb2 -gdbname wbtdb -instancename wbtdb2 -sysdbausername sys -sysdbapassword oracle
deleting instance
1% complete
2% complete
6% complete
13% complete
20% complete
26% complete
33% complete
40% complete
46% complete
53% complete
60% complete
66% complete
completing instance management.
100% complete
look at the log file "/u01/app/oracle/cfgtoollogs/dbca/wbtdb.log" for further details.</code>

On node 1, switch to the oracle user:

Check the active redo threads / instances:

<code>set line 200
select thread#, status, instance from v$thread;

   thread# status      instance
---------- ----------- ---------------
         1 open        wbtdb1</code>

If node 2's redo thread still exists, disable it with the following command:

<code>alter database disable thread 2;</code>
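If dbca did not already remove thread 2's online redo log groups and undo tablespace, they can be dropped after disabling the thread. This is only a minimal sketch: the group numbers (3 and 4) and the undo tablespace name undotbs2 are assumptions that must first be confirmed with the queries below.

<code>-- confirm which redo log groups belong to thread 2
select group#, thread#, status from v$log where thread# = 2;

-- drop each thread 2 group found above (group numbers are examples)
alter database drop logfile group 3;
alter database drop logfile group 4;

-- confirm and drop node 2's undo tablespace (the name is an assumption)
select tablespace_name from dba_tablespaces where contents = 'UNDO';
drop tablespace undotbs2 including contents and datafiles;</code>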

Verify the database information in the OCR; the syntax is:

<code>srvctl config database -d db_unique_name</code>

For example:

<code>[oracle@wbtdb1 ~]$ srvctl config database -d wbtdb
database unique name: wbtdb
database name: wbtdb
oracle home: /u01/app/oracle/product/11.2.0/db_1
oracle user: oracle
spfile: +datadg/wbtdb/spfilewbtdb.ora
domain:
start options: open
stop options: immediate
database role: primary
management policy: automatic
server pools: wbtdb
database instances: wbtdb1
disk groups: datadg
mount point paths:
services:
type: rac
database is administrator managed</code>
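The "database instances" line above now lists only wbtdb1. As an extra, optional check from node 1 (a small sketch using a standard srvctl command):

<code>[oracle@wbtdb1 ~]$ srvctl status database -d wbtdb
# expected to report only instance wbtdb1 running on node wbtdb1</code>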

2.1  Stop and delete the listener

On node 1, as the grid user:

<code>[grid@wbtdb1 ~]$ srvctl status listener -l listener -n wbtdb2
listener listener is enabled on node(s): wbtdb2
listener listener is running on node(s): wbtdb2</code>

Then execute the following:

<code>[grid@wbtdb1 ~]$ srvctl disable listener -l listener -n wbtdb2
[grid@wbtdb1 ~]$ srvctl stop listener -l listener -n wbtdb2
prcr-1014 : failed to stop resource ora.listener.lsnr
prcr-1065 : failed to stop resource ora.listener.lsnr
crs-2675: stop of 'ora.listener.lsnr' on 'wbtdb2' failed
crs-2678: 'ora.listener.lsnr' on 'wbtdb2' has experienced an unrecoverable failure
crs-0267: human intervention required to resume its availability.
[grid@wbtdb1 ~]$ srvctl status listener -l listener -n wbtdb2
listener listener is disabled on node(s): wbtdb2
listener listener is not running on node(s): wbtdb2</code>

2.2  Update the GRID_HOME cluster inventory

On every remaining node, run the following command from the software owner's $ORACLE_HOME/oui/bin to update those nodes' inventories, giving a comma-separated list of the remaining node names (run it on every remaining node, healthy or not). The point of this step is to refresh the retained nodes' cluster inventory: although node 2's software directories have already been deleted, querying from node 1 still shows node 2's ASM instance and other resources as normal, and this command clears that stale information. The official command follows.

All remaining nodes should run it; here only one node is left, so it is executed on node 1 (as the grid user):

<code>cd $ORACLE_HOME/oui/bin
[grid@wbtdb1 bin]$ ./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={wbtdb1}"
starting oracle universal installer...

checking swap space: must be greater than 500 mb.   actual 4095 mb    passed
the inventory pointer is located at /etc/orainst.loc
the inventory is located at /u01/app/orainventory
'updatenodelist' was successful.</code>
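To see what this changed, the node list recorded in the central inventory can be inspected directly (a quick sketch; the inventory location is taken from the installer output above, and the directory is normally spelled /u01/app/oraInventory):

<code>[grid@wbtdb1 ~]$ grep -i "node name" /u01/app/oraInventory/ContentsXML/inventory.xml
# after the update, only <NODE NAME="wbtdb1"/> should remain under this home's <NODE_LIST></code>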

From the official documentation: https://docs.oracle.com/cd/e11882_01/rac.112/e41959/adddelclusterware.htm#cwadd90992

3.1  确認節點狀态是否是unpinned

Run as root or grid:

<code>[grid@wbtdb1 bin]$ olsnodes -s -t
wbtdb1 active unpinned
wbtdb2 active unpinned
[grid@wbtdb1 bin]$</code>

If the node to be deleted is in the pinned state, run the unpin command manually as root.

The official documentation reads as follows:

(Screenshot of the official documentation omitted.)

Special note: much of the material found online is incorrect. If the node is already unpinned, there is no need to run the unpin command at all.

The following commands were not needed for this recovery.

<code>crsctl unpin css -n wbtdb2
/u01/11.2.0/grid/bin/crsctl unpin css -n wbtdb2
crsctl status res -t</code>

First, stop node 2's VIP (vip_name is the name defined in /etc/hosts, here wbtdb2-vip):

As the root user:

<code>[root@wbtdb1 ~]# /u01/11.2.0/grid/bin/srvctl stop vip -i wbtdb2-vip -f
[root@wbtdb1 ~]#

# clear the VIP information: srvctl remove vip -i vip_name -f
[root@wbtdb1 ~]# /u01/11.2.0/grid/bin/srvctl remove vip -i wbtdb2-vip -f

# check the VIPs:
/u01/11.2.0/grid/bin/crsctl status res -t</code>

Only node 1's VIP is left.

<code>[grid@wbtdb1 ~]$ olsnodes -s -t
wbtdb1 active unpinned
wbtdb2 active unpinned</code>

On the healthy node 1, run the delete-node command as root:

<code>[root@wbtdb1 ~]# /u01/11.2.0/grid/bin/crsctl delete node -n wbtdb2
crs-4661: node wbtdb2 successfully deleted.</code>

Verify:

<code>[grid@wbtdb1 ~]$ olsnodes -s -t
wbtdb1 active unpinned</code>

Note: if deleting node 2 fails with CRS-4658 and CRS-4000 errors, simply kill the CRS-related processes on node 2.

<code>[root@wbtdb1 ~]# /u01/11.2.0/grid/bin/crsctl delete node -n wbtdb2
crs-4658: the clusterware stack on node wbtdb2 is not completely down.
crs-4000: command delete failed, or completed with errors.</code>

The deletion failed, so check the CRS status:

<code>[root@wbtdb1 bin]# ./crsctl stat res -t
--------------------------------------------------------------------------------
name           target  state        server                   state_details
--------------------------------------------------------------------------------
local resources
--------------------------------------------------------------------------------
ora.archdg.dg
               online  online       wbtdb1
               online  online       wbtdb2
ora.datadg.dg
               online  online       wbtdb1
               online  online       wbtdb2
ora.listener.lsnr
               online  online       wbtdb1
               offline unknown      wbtdb2
ora.ocrvoting.dg
               online  online       wbtdb1
               online  online       wbtdb2
ora.asm
               online  online       wbtdb1                   started
               online  online       wbtdb2                   started
ora.gsd
               offline offline      wbtdb1
               offline offline      wbtdb2
ora.net1.network
               online  online       wbtdb1
               online  online       wbtdb2
ora.ons
               online  online       wbtdb1
               offline unknown      wbtdb2
ora.registry.acfs
               online  online       wbtdb1
               online  online       wbtdb2
--------------------------------------------------------------------------------
cluster resources
--------------------------------------------------------------------------------
ora.listener_scan1.lsnr
      1        online  online       wbtdb1
ora.cvu
      1        online  online       wbtdb2
ora.oc4j
      1        online  intermediate wbtdb2
ora.scan1.vip
      1        online  online       wbtdb1
ora.wbtdb.db
      1        online  online       wbtdb1                   open
ora.wbtdb1.vip
      1        online  online       wbtdb1
[root@wbtdb1 bin]#</code>

Node 2's resources have all been stopped, but the CRS-related processes on node 2 are still running: the software directories were deleted while CRS was up, so although the software is gone, the processes were never cleaned up. After killing the CRS processes, node 2 can be deleted.

<code>[root@wbtdb2 ~]# ps -ef|grep d.bin
root      2546     1  4 13:10 ?        00:03:22 /u01/11.2.0/grid/bin/ohasd.bin reboot
grid      3632     1  0 13:11 ?        00:00:06 /u01/11.2.0/grid/bin/oraagent.bin
grid      3643     1  0 13:11 ?        00:00:00 /u01/11.2.0/grid/bin/mdnsd.bin
grid      3673     1  0 13:11 ?        00:00:00 /u01/11.2.0/grid/bin/gpnpd.bin
grid      3683     1  0 13:11 ?        00:00:07 /u01/11.2.0/grid/bin/gipcd.bin
root      3685     1  0 13:11 ?        00:00:06 /u01/11.2.0/grid/bin/orarootagent.bin
root      3698     1  1 13:11 ?        00:00:46 /u01/11.2.0/grid/bin/osysmond.bin
root      3717     1  0 13:11 ?        00:00:02 /u01/11.2.0/grid/bin/cssdmonitor
root      3740     1  0 13:11 ?        00:00:02 /u01/11.2.0/grid/bin/cssdagent
grid      3751     1  0 13:11 ?        00:00:09 /u01/11.2.0/grid/bin/ocssd.bin
root      3974     1  0 13:11 ?        00:00:07 /u01/11.2.0/grid/bin/octssd.bin reboot
grid      3997     1  0 13:11 ?        00:00:07 /u01/11.2.0/grid/bin/evmd.bin
root      4408     1 30 13:12 ?        00:22:08 /u01/11.2.0/grid/bin/crsd.bin reboot
grid      4484  3997  0 13:12 ?        00:00:00 /u01/11.2.0/grid/bin/evmlogger.bin -o /u01/11.2.0/grid/evm/log/evmlogger.info -l /u01/11.2.0/grid/evm/log/evmlogger.log
grid      4519     1 27 13:12 ?        00:19:39 /u01/11.2.0/grid/bin/oraagent.bin
root      4525     1 22 13:12 ?        00:16:20 /u01/11.2.0/grid/bin/orarootagent.bin
grid      4712     1  0 13:12 ?        00:00:00 /u01/11.2.0/grid/bin/scriptagent.bin
grid      4814     1  0 13:12 ?        00:00:00 /u01/11.2.0/grid/bin/tnslsnr listener -inherit
root     23792  5354  0 14:24 pts/0    00:00:00 grep d.bin
[root@wbtdb2 ~]# kill -9 2546 3632 3643 3673 3683 3685 3698 3717 3740 3751 3997 3974 4519 4525 4712 4814
[root@wbtdb2 ~]# ps -ef|grep d.bin
root     30270  5354  0 14:25 pts/0    00:00:00 grep d.bin</code>
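With the leftover daemons on node 2 killed, the delete is simply retried from node 1; this is the same crsctl command shown earlier, and it now succeeds:

<code>[root@wbtdb1 ~]# /u01/11.2.0/grid/bin/crsctl delete node -n wbtdb2
crs-4661: node wbtdb2 successfully deleted.
[root@wbtdb1 ~]# /u01/11.2.0/grid/bin/olsnodes -s -t
wbtdb1 active unpinned</code>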

3.4  Update the cluster node information in the inventory

As the grid user, run the following command on every remaining, properly running node:

<code>[grid@rac11g1 ~]$ cd $ORACLE_HOME/oui/bin
[grid@rac11g1 bin]$ echo $ORACLE_HOME
/u01/11.2.0/grid
$ ./runInstaller -updateNodeList ORACLE_HOME=Grid_home "CLUSTER_NODES={remaining_nodes_list}" CRS=TRUE -silent</code>

For example:

The actual operation:

<code>[grid@wbtdb1 bin]$ /u01/11.2.0/grid/oui/bin/runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={wbtdb1}" CRS=TRUE -silent
starting oracle universal installer...

checking swap space: must be greater than 500 mb.   actual 4095 mb    passed
the inventory pointer is located at /etc/orainst.loc
the inventory is located at /u01/app/orainventory
'updatenodelist' was successful.</code>

Check:

<code>[grid@wbtdb1 bin]$ olsnodes -s -t
wbtdb1 active unpinned
[grid@wbtdb1 bin]$</code>

<code>[grid@wbtdb1 bin]$ cluvfy stage -post nodedel -n wbtdb2 -verbose

performing post-checks for node removal

checking crs integrity...
clusterware version consistency passed
the oracle clusterware is healthy on node "wbtdb1"
crs integrity check passed
result: node removal check passed
post-check for node removal was successful.

[grid@wbtdb1 ~]$ crsctl status resource -t
--------------------------------------------------------------------------------
name           target  state        server                   state_details
--------------------------------------------------------------------------------
local resources
--------------------------------------------------------------------------------
ora.archdg.dg
               online  online       wbtdb1
ora.datadg.dg
               online  online       wbtdb1
ora.listener.lsnr
               online  online       wbtdb1
ora.ocrvoting.dg
               online  online       wbtdb1
ora.asm
               online  online       wbtdb1                   started
ora.gsd
               offline offline      wbtdb1
ora.net1.network
               online  online       wbtdb1
ora.ons
               online  online       wbtdb1
ora.registry.acfs
               online  online       wbtdb1
--------------------------------------------------------------------------------
cluster resources
--------------------------------------------------------------------------------
ora.listener_scan1.lsnr
      1        online  online       wbtdb1
ora.cvu
      1        online  online       wbtdb1
ora.oc4j
      1        online  online       wbtdb1
ora.scan1.vip
      1        online  online       wbtdb1
ora.wbtdb.db
      1        online  online       wbtdb1                   open
ora.wbtdb1.vip
      1        online  online       wbtdb1
[grid@wbtdb1 ~]$</code>

At this point, all information about node 2 has been completely removed from the cluster!

You can then verify on your own that the remaining cluster resources and the instance are in a normal state.
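For example, a minimal verification sketch on node 1 (standard commands only; the database name wbtdb comes from this environment):

<code># cluster resources on the remaining node (grid user)
crsctl stat res -t

# database and instance status registered in the OCR (oracle user)
srvctl status database -d wbtdb
srvctl config database -d wbtdb

# instance status from inside the database
sqlplus -s / as sysdba <<EOF
select instance_name, status from gv\$instance;
EOF</code>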