天天看點

Oracle資料庫一次ORA-00600[4194]事件處理一、問題背景二、問題排查&處理三、其他

一、問題背景

1、20191120某客戶托爾斯程式運作異常,無法正常工作

2、檢視程式日志顯示ORA-00600、ORA-02002等資料庫錯誤

3、資料庫伺服器資訊

    ①系統:CentOS 6.8(64位)

    ②資料庫:Oracle RAC 11.2.0.4

二、問題排查&處理

1、檢視資料庫資訊

    ①首先檢視了Oracle RAC叢集資訊,資料庫叢集運作正常

    ②檢視資料庫日志,節點1運作正常,節點2 alert日志中有大量ORA600資訊:

Wed Nov 20 11:28:30 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_11752.trc  (incident=51957):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Block recovery from logseq 946, block 733 to scn 2776710962
Recovery of Online Redo Log: Thread 2 Group 3 Seq 946 Reading mem 0
  Mem# 0: +DATA/racecs/onlinelog/group_3.261.989703707
Block recovery completed at rba 946.735.16, scn 0.2776710963
Block recovery from logseq 946, block 733 to scn 2776711284
Recovery of Online Redo Log: Thread 2 Group 3 Seq 946 Reading mem 0
  Mem# 0: +DATA/racecs/onlinelog/group_3.261.989703707
Block recovery completed at rba 946.849.16, scn 0.2776711285
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_11752.trc  (incident=51958):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
ORA-02002: error while writing to audit trail
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.           

    ③檢視目前資料庫叢集中每個節點上會話資訊,節點1上有程式連接配接進來的資訊,節點2基本都是系統使用者會話資訊

2、根據客戶回報的資訊,進行下一步處理

    ①客戶回報之前也遇到過目前的情況(ORA2002),之前是對system表空間進行手動resize,之後程式就會恢複正常,但是這次操作了之後也沒有恢複正常,程式報錯如下:

Oracle資料庫一次ORA-00600[4194]事件處理一、問題背景二、問題排查&處理三、其他

    ②根據客戶描述的資訊檢查資料庫審計情況

        a、發現資料庫審計參數AUDIT_TRAIL=DB,是資料庫預設的配置

        b、接下來檢查system表空間資訊,目前system表空間是3G,使用率70%(客戶自己resize後),并且system表空間是自動擴充,看來問題不是由于審計表空間不足引起的

    ③system充足的情況下,程式多次重新開機還是報ORA2002錯誤,看日志顯示ORA600後緊接着就是ORA2002,為了及時恢複業務,客戶也表示沒有使用到資料庫審計,決定關閉資料庫審計功能

3、資料庫修改了AUDIT_TRAIL=NONE,truncate了SYS.AUD$表資料,關閉了資料庫叢集,進行了重新開機,觀察程式情況:

    ①程式在關閉資料庫審計後啟動,仍然報錯,此時報錯内容隻剩下了ORA600

Oracle資料庫一次ORA-00600[4194]事件處理一、問題背景二、問題排查&處理三、其他

    ②檢視目前資料庫中session資訊,和之前一樣,節點1有程式連接配接資訊,節點2上沒有

    ③這時再檢視節點2alert日志資訊,ORA2002雖然沒有了,但是ORA600仍然不斷列印出來

Wed Nov 20 21:36:11 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_19012.trc  (incident=100413):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/racecs/racecs2/incident/incdir_100413/racecs2_ora_19012_i100413.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Wed Nov 20 21:36:16 2019
Dumping diagnostic data in directory=[cdmp_20191120213616], requested by (instance=2, osid=19012), summary=[incident=100413].
Block recovery from logseq 972, block 168 to scn 2776911378
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
  Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.173.16, scn 0.2776911845
Block recovery from logseq 972, block 168 to scn 2776914575
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
  Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.657.16, scn 0.2776914576
Wed Nov 20 21:38:25 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_18696.trc  (incident=100333):
ORA-00600: internal error code, arguments: [4194], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/racecs/racecs2/incident/incdir_100333/racecs2_ora_18696_i100333.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
...
Wed Nov 20 21:46:09 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_18732.trc  (incident=100373):
ORA-00600: Śҿխϳպë, ӎ˽: [4194], [], [], [], [], [], [], [], [], [], [], []
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Block recovery from logseq 972, block 1011 to scn 2776915928
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
  Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.1016.16, scn 0.2776915929
Block recovery from logseq 972, block 1011 to scn 2776916012
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
  Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.1031.16, scn 0.2776916013
Wed Nov 20 21:48:41 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_21433.trc  (incident=100445):
ORA-00600: Śҿխϳպë, ӎ˽: [4194], [ody "SYS.DBMS_STANDARD"
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Block recovery from logseq 972, block 1011 to scn 2776915928
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
  Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.1016.16, scn 0.2776915929
Block recovery from logseq 972, block 1011 to scn 2776916447
Recovery of Online Redo Log: Thread 2 Group 7 Seq 972 Reading mem 0
  Mem# 0: +DATA/racecs/onlinelog/group_7.log
Block recovery completed at rba 972.1073.16, scn 0.2776916448           

4、根據報錯内容在MOS檢視相關問題,發現文檔1428786.1和這次的現象類似,并且節點2伺服器在15号多次發生了異常重新開機,引起資料庫多次異常重新開機

Fri Nov 15 08:32:28 2019
Active Session History (ASH) performed an emergency flush. This may mean that ASH is undersized. If emergency flushes are a recurring issue, you may consider increasing ASH size by setting the value of _ASH_SIZE to a sufficiently large value. Currently, ASH size is 16777216 bytes. Both ASH size and the total number of emergency flushes since instance startup can be monitored by running the following query:
 select total_size,awr_flush_emergency_count from v$ash_info; 
Fri Nov 15 09:06:34 2019
IPC Send timeout detected. Receiver ospid 7546 [
Fri Nov 15 09:06:34 2019
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_ora_7546.trc:
Fri Nov 15 09:29:15 2019
minact-scn: useg scan erroring out with error e:12751
Fri Nov 15 09:53:19 2019
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_asmb_3794.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 217 Serial number: 61
Errors in file /u01/app/oracle/diag/rdbms/racecs/racecs2/trace/racecs2_asmb_3794.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 217 Serial number: 61
ASMB (ospid: 3794): terminating the instance due to error 15064
Termination issued to instance processes. Waiting for the processes to exit
Fri Nov 15 09:53:30 2019
Instance termination failed to kill one or more processes
Instance terminated by ASMB, pid = 3794
Fri Nov 15 10:16:47 2019
Starting ORACLE instance (normal)           

5、根據MOS文檔裡面的描述,很可能就是由于執行個體2伺服器異常重新開機導緻資料庫異常,MOS文檔描述原因如下:

The following error is occurring in the alert.log right before the database crashes.
ORA-00600: internal error code, arguments: [4194], [#], [#], [], [], [], [], []
This error indicates that a mismatch has been detected between redo records and rollback (undo) records.

ARGUMENTS:

Arg [a] - Maximum Undo record number in Undo block
Arg [b] - Undo record number from Redo block

Since we are adding a new undo record to our undo block, we would expect that the new record number is equal to the maximum record number in the undo block plus one. Before Oracle can add a new undo record to the undo block it validates that this is correct. If this validation fails, then an ORA-600 [4194] will be triggered.

This issue generally occurs when there is a power outage or hardware failure that initially crashes the database. On startup, the database does the normal roll forward (redo) and then rollback (undo), this is where the error is generated on the rollback.

大緻意思是:
在向undo塊添加一個新的undo記錄,會将新的記錄号等于undo塊中的最大記錄号加1。在Oracle向undo塊添加一個新的undo記錄之前,它會驗證這是正确的。如果驗證失敗,将觸發ORA-600[4194]。
這個問題通常發生在出現斷電或硬體故障時,這些故障最初會導緻資料庫崩潰。在啟動時,資料庫執行普通的前滾(重做),然後復原(撤消),這是在復原時産生錯誤的地方。
從這裡在結合執行個體2在15号多次重新開機的現象,可以看出很可能就是因為執行個體2伺服器異常重新開機導緻           

6、根據上面定位到的問題對資料庫進行修複

*建立一個新的undo表空間,使用新的undo段号,它比目前使用的段号要高。這樣,當一個事務執行清除塊時,對撤消段的引用不存在,并繼續清除塊,這樣就不會在産生上面發生的内容。
執行個體2進行操作
①、Create pfile='/tmp/initsid.ora' from spfile;
②、修改undo
a、檢視目前undo表空間位置和大小
b、執行個體2建立新的undo,替換之前的undo
  create undo tablespace undotbs3 datafile '+DATA/racecs/datafile/undotbs3.dbf' size 1500m autoextend on;(注意根據之前undo值進行替換)
  
  alter system set undo_tablespace='undotbs3' scope=spfile sid='racecs2';
③、執行個體2資料庫重新開機
  shutdown immediate
  startup           

7、操作完成之後,程式連接配接不在報錯

三、其他

1、在資料庫恢複之後,隔了一天又回報出現問題,發現執行個體2伺服器再一次發生了重新開機,這次資料庫日志中到沒有出現ORA600資訊,資料庫運作正常

2、接下來應該重點排查執行個體2伺服器異常重新開機原因