[Oracle] Using the hanganalyze command to analyze a database hang [repost]

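1. Possible causes of a database hang

An Oracle deadlock, or very high system load such as heavy CPU usage or heavy lock waits (for example a large number of DX locks), can cause the system to hang.

Generally, when we say the system is hung, we mean that the application does not respond and even an ordinary sqlplus session can barely do anything.

2. How do you run a hang analysis? What levels are there, and how do you choose one?

hanganalyze has the following levels: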

10     Dump all processes (IGN state)

5      Level 4 + Dump all processes involved in wait chains (NLEAF state)

4      Level 3 + Dump leaf nodes (blockers) in wait chains (LEAF,LEAF_NW,IGN_DMP state)

3      Level 2 + Dump only processes thought to be in a hang (IN_HANG state)

1-2    Only HANGANALYZE output, no process dump at all

How do you choose a level?

Generally, levels above 3 are not recommended, because they can produce a very large trace and may also affect system I/O.
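To make the level table easier to reason about, here is a minimal Python sketch that encodes it as a lookup. The names `LEVEL_ADDS` and `states_dumped` are made up for illustration, and the sketch assumes that levels are cumulative and that a level between two table entries behaves like the lower one; the table itself does not spell that out.

```python
# Node states whose processes get dumped at each hanganalyze level,
# transcribed from the level table above. Levels 1-2 produce only the
# HANGANALYZE output, so they add no states.
LEVEL_ADDS = {
    3: ("IN_HANG",),
    4: ("LEAF", "LEAF_NW", "IGN_DMP"),
    5: ("NLEAF",),
    10: ("IGN",),
}


def states_dumped(level):
    """Return the node states that get a process dump at the given level.

    Assumption: levels are cumulative ("Level 4 = Level 3 + ...").
    """
    dumped = []
    for threshold in sorted(LEVEL_ADDS):
        if level >= threshold:
            dumped.extend(LEVEL_ADDS[threshold])
    return dumped


for lvl in (2, 3, 4, 5, 10):
    print(lvl, states_dumped(lvl))
```

Read this way, level 3 only adds dumps for processes already flagged IN_HANG, which is why it is the usual upper bound on a busy system.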

Starting with Oracle 9i, hanganalyze supports RAC.

There are two ways to run it:

1) ALTER SESSION SET EVENTS 'immediate trace name HANGANALYZE level <level>';

2) Use the oradebug command

   ORADEBUG setmypid

   ORADEBUG setinst all

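   ORADEBUG -g def hanganalyze <level>   --- usage for RAC

   oradebug setmypid

   oradebug hanganalyze 3               --- non-RAC environment

When running a hang analysis, Oracle usually recommends taking a systemstate dump at the same time.

oradebug dump systemstate 2     --- level 2 is enough; it includes information on all sessions

      sqlplus -prelim / as sysdba       --- this login method is available from 10g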

      oradebug setospid

      oradebug unlimit

      oradebug dump systemstate 10

Additional note: sometimes we also need to trace a single process. On AIX we can use the dbx command,

as in the following example:

dbx -a PID (where PID = any oracle shadow process)       --- find the PID with ps -ef|grep xxx

dbx() print ksudss(10)

dbx() detach

3. How do you read the hang analysis trace file and extract useful information?

*** ACTION NAME:() 2010-03-12 00:04:01.497

*** MODULE NAME:(sqlplus@S7_C_YZ_YZSJK(TNS V1-V3)) 2010-03-12 00:04:01.497    --- module name, same as v$session.module

*** SERVICE NAME:(SYS$USERS) 2010-03-12 00:04:01.497

*** SESSION ID:(5184.45287) 2010-03-12 00:04:01.497           --sid (5184)   serial# (45287)

*** 2010-03-12 00:04:01.497

==============

HANG ANALYSIS:

Found 54 objects waiting for

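                     -- from this line, session 5210 is blocking 54 objects

Open chains found:

Chain 1 : :   -- from here on, the sessions listed are all blocked by session 5210 above; usually each one blocks the next

--

Other chains found:                                           -- the sessions below are also blocked by the one above, but indirectly rather than directly (as in the open chains)

Chain 2 : :

Chain 3 : :

Chain 4 : :

Cycle 1 : :        -- a cycle usually means a deadlock and is very likely the root cause of the hang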
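Why a cycle points to a deadlock can be shown with a small wait-for graph. The Python sketch below is only an illustration, not how hanganalyze itself works: the function name `find_cycle` and the session numbers are hypothetical, and it assumes each waiting session has exactly one blocker.

```python
def find_cycle(waits_for):
    """Return one cycle as a list of session ids, or None if there is none.

    waits_for maps a waiting session to the session that blocks it
    (assumption: one blocker per waiter).
    """
    for start in waits_for:
        path = []
        seen = set()
        node = start
        # Follow blocker links until the chain ends or repeats a session.
        while node in waits_for and node not in seen:
            seen.add(node)
            path.append(node)
            node = waits_for[node]
        if node in seen:
            # The chain came back to a session already on the path.
            return path[path.index(node):]
    return None


# Hypothetical sessions: 102 and 103 wait on each other, 101 and 104 queue behind them.
print(find_cycle({101: 102, 102: 103, 103: 102, 104: 101}))
# An open chain ends at a session that waits on nobody, so there is no cycle.
print(find_cycle({1: 2, 2: 3}))
```

An open chain always ends at a blocker that can still make progress; a cycle has no such session, so nothing in it can move until one of the sessions is killed or rolled back.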

4. How does hang analysis differ between versions? How do the traces compare?

The hanganalyze trace formats for Oracle 8 through 10g are as follows:

Oracle 8.x : [nodenum]/sid/sess_srno/session/state/start/finish/[adjlist]/predecessor

Oracle 9i:   [nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor

Oracle 10g: [nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor

nodenum     --> sequence number that hanganalyze generates for each session

sid         --> Session ID

sess_srno   --> Serial#

ospid       --> OS Process Id (v$process spid)

state       --> State of the node

adjlist     --> adjacent node (usually represents a blocker node)

predecessor --> predecessor node (usually represents a waiter node)

cnode       --> node number, present only from 9i onward
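The slash-separated layouts above are easy to split mechanically. The following Python sketch shows one possible way to turn a node line into named fields; `Node` and `parse_node` are illustrative names, and the handling of the 8.x layout (no cnode, no ospid) is inferred from the format strings above rather than tested against a real 8.x trace.

```python
import re
from collections import namedtuple

Node = namedtuple(
    "Node",
    "nodenum cnode sid sess_srno session ospid state start finish adjlist predecessor",
)


def _ids(text):
    """Extract node numbers from '[16]', '19', '' or 'none'."""
    return [int(n) for n in re.findall(r"\d+", text)]


def parse_node(line):
    """Parse one node line; return None for headers and unrelated lines."""
    parts = line.strip().split("/")
    if len(parts) == 9:
        # 8.x layout: pad the missing cnode and ospid with empty strings.
        parts = parts[:1] + [""] + parts[1:4] + [""] + parts[4:]
    if len(parts) != 11:
        return None
    head = parts[0].strip("[]")
    if not head.isdigit():
        return None
    _, cnode, sid, srno, session, ospid, state, start, finish, adj, pred = parts
    try:
        return Node(
            nodenum=int(head),
            cnode=int(cnode) if cnode else None,
            sid=int(sid),
            sess_srno=int(srno),
            session=session,
            ospid=int(ospid) if ospid else None,  # may be empty in the trace
            state=state,
            start=int(start),
            finish=int(finish),
            adjlist=_ids(adj),
            predecessor=_ids(pred),
        )
    except ValueError:
        return None


print(parse_node("[16]/0/17/154/0x24617be0/26800/LEAF/29/30//19"))
```

Returning `None` for anything that is not a node line lets the same function be run over a whole trace file without first cutting out the "State of nodes" section.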

The state field can take the following values:

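IN_HANG      --> A very dangerous state: it usually means the node is stuck in an endless loop or is hung. When this happens, the node's adjacent node (its adjlist) is usually in the same state.

            For example:

            [16]/0/17/154/0x24617be0/26800/IN_HANG/29/32/[185]/19      --- IN_HANG; node 185 is in the adjlist of node 16, so 16 is waiting on 185

            [185]/1/16/4966/0x24617270//IN_HANG/30/31/[16]/16          --- here the adjlist is [16], so 185 is in turn waiting on 16: the two nodes block each other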

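LEAF         --> Usually a prime blocker candidate. How do you confirm it? Generally, use the trailing predecessor field to decide whether the session is a blocker or a waiter.

             For example:

             [nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor

             [16]/0/17/154/0x24617be0/26800/LEAF/29/30//19         -- node 19 is the waiter here, so we conclude that sid 17 (node 16) is blocking sid 20 (node 19)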

             [19]/0/20/13/0x24619830/26791/NLEAF/33/34/[16]/186     

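LEAF_NW     --> Similar to LEAF, but the session may be consuming CPU.

NLEAF       --> Sessions in this state are usually considered "stuck" sessions: a resource they need is held by another session, which the adjlist identifies.

IGN         --> Sessions in this state are usually idle, unless they have an adjlist, in which case the session is waiting on another session.

IGN_DMP     --> Similar to IGN.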

[nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor

[16]/0/17/154/0x24617be0/26800/LEAF/29/30//19

[19]/0/20/13/0x24619830/26791/NLEAF/33/34/[16]/186

[189]/1/20/36/0x24619830//IGN/95/96/[19]/none

[176]/1/7/1/0x24611d80//IGN/75/76//none

---- From the above: 189 is waiting on 19, 19 is waiting on 16, and 176 is an idle session.
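The same reading can be reproduced by following adjlist entries from node to node. This Python sketch is a rough illustration using the four example lines above; `load` and `chain_from` are hypothetical helper names, and it only follows the first adjlist entry of each node.

```python
import re

LINES = """\
[16]/0/17/154/0x24617be0/26800/LEAF/29/30//19
[19]/0/20/13/0x24619830/26791/NLEAF/33/34/[16]/186
[189]/1/20/36/0x24619830//IGN/95/96/[19]/none
[176]/1/7/1/0x24611d80//IGN/75/76//none
""".splitlines()


def load(lines):
    """Map nodenum -> (state, adjlist) for every 9i/10g node line."""
    nodes = {}
    for line in lines:
        parts = line.strip().split("/")
        head = parts[0].strip("[]")
        if len(parts) != 11 or not head.isdigit():
            continue  # not a node line
        adjlist = [int(n) for n in re.findall(r"\d+", parts[9])]
        nodes[int(head)] = (parts[6], adjlist)
    return nodes


def chain_from(nodes, start):
    """Follow the first adjlist entry from start until a node has none."""
    chain = [start]
    while True:
        entry = nodes.get(chain[-1])
        # Stop on unknown nodes, empty adjlists, or a repeat (a cycle).
        if entry is None or not entry[1] or entry[1][0] in chain:
            return chain
        chain.append(entry[1][0])


nodes = load(LINES)
print(chain_from(nodes, 189))  # waiter first, final blocker last
print(chain_from(nodes, 176))  # a node with an empty adjlist stays alone
```

The last element of each chain is the node worth investigating first, since everything before it is waiting, directly or indirectly, on that node.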

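SINGLE_NODE and SINGLE_NODE_NW can be treated like LEAF and LEAF_NW, except that no other session depends on them.

In this section I open two sessions as the scott user to simulate a lock conflict between sessions (one update, one delete).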

SQL> oradebug help

HELP           [command]                 Describe one or all commands

SETMYPID                                 Debug current process

SETOSPID                          Set OS pid of process to debug

SETORAPID      ['force']        Set Oracle pid of process to debug

SHORT_STACK                              Dump abridged OS stack

DUMP           [addr]  Invoke named dump

DUMPSGA        [bytes]                   Dump fixed SGA

DUMPLIST                                 Print a list of available dumps

EVENT                              Set trace event in process

SESSION_EVENT                      Set trace event in session

DUMPVAR        [level]  Print/dump a fixed PGA/SGA/UGA variable

DUMPTYPE         Print/dump an address with type info

SETVAR           Modify a fixed PGA/SGA/UGA variable

PEEK           [level]      Print/Dump memory

POKE                 Modify memory

WAKEUP                           Wake up Oracle process

SUSPEND                                  Suspend execution

RESUME                                   Resume execution

FLUSH                                    Flush pending writes to trace file

CLOSE_TRACE                              Close trace file

TRACEFILE_NAME                           Get name of trace file

LKDEBUG                                  Invoke global enqueue service debugger

NSDBX                                    Invoke CGS name-service debugger

-G                Parallel oradebug command prefix

-R                Parallel oradebug prefix (return output

SETINST              Set instance list in double quotes

SGATOFILE               Dump SGA to file; dirname in double quotes

DMPCOWSGA      Dump & map SGA as COW; dirname in double quotes

MAPCOWSGA               Map SGA as COW; dirname in double quotes

HANGANALYZE    [level] [syslevel]        Analyze system hang

FFBEGIN                                  Flash Freeze the Instance

FFDEREGISTER                             FF deregister instance from cluster

FFTERMINST                               Call exit and terminate instance

FFRESUMEINST                             Resume the flash frozen instance

FFSTATUS                                 Flash freeze status of instance

SKDSTTPCS                Helps translate PCs to names

WATCH            Watch a region of memory

DELETE         watchpoint     Delete a watchpoint

SHOW           watchpoints        Show  watchpoints

CORE                                     Dump core without crashing process

IPC                                      Dump ipc information

UNLIMIT                                  Unlimit the size of the trace file

PROCSTAT                                 Dump process statistics

CALL           [arg1] ... [argn]  Invoke function with arguments

SQL> oradebug hanganalyze 3;

Hang Analysis in /oracle/admin/orcl/udump/orcl_ora_2622.trc

SQL> exit

Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production

With the Partitioning, OLAP and Data Mining options

-bash-3.2$ more /oracle/admin/orcl/udump/orcl_ora_2622.trc

/oracle/admin/orcl/udump/orcl_ora_2622.trc

Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production

ORACLE_HOME = /oracle/product/10.2.0/db_1

System name:    Linux

Node name:      truerhel5

Release:        2.6.18-164.el5

Version:        #1 SMP Tue Aug 18 15:51:48 EDT 2009

Machine:        x86_64

Instance name: orcl

Redo thread mounted by this instance: 1

Oracle process number: 21

Unix process pid: 2622, image:oracle@truerhel5(TNS V1-V3)

*** SERVICE NAME:(SYS$USERS) 2010-08-07 21:11:10.818

*** SESSION ID:(145.36) 2010-08-07 21:11:10.818

*** 2010-08-07 21:11:10.818

Chain 1 : : -- columns in each entry: cnode sid sess_srno proc_ptr ospid wait_event

       -- session 148 (holds the lock)

-- -- session 146 (waiting for the lock); wait event: row lock contention

Other chains found:

Chain 5 : :

Chain 6 : :

Extra information that will be dumped at higher levels:

[level  4] :   1 node dumps -- [REMOTE_WT] [LEAF] [LEAF_NW]

[level  5] :   5 node dumps -- [SINGLE_NODE] [SINGLE_NODE_NW] [IGN_DMP]

[level  6] :   1 node dumps -- [NLEAF]

[level 10] :  13 node dumps -- [IGN]

State of nodes

([nodenum]/cnode/sid/sess_srno/session/ospid/state/start/finish/[adjlist]/predecessor):

[143]/0/144/108/0x70f5dcf8/2614/SINGLE_NODE/1/2//none

[144]/0/145/36/0x70f5f130/2622/SINGLE_NODE_NW/3/4//none

[145]/0/146/84/0x70f60568/2607/NLEAF/5/8/[147]/none

[147]/0/148/27/0x70f62dd8/2543/LEAF/6/7//145

[149]/0/150/2/0x70f65648/2338/SINGLE_NODE/9/10//none

[150]/0/151/1/0x70f66a80/2319/SINGLE_NODE/11/12//none

[154]/0/155/1/0x70f6bb60/2315/IGN/13/14//none

[155]/0/156/1/0x70f6cf98/2313/IGN/15/16//none

[157]/0/158/7/0x70f6f808/2336/SINGLE_NODE/17/18//none

[159]/0/160/1/0x70f72078/2305/IGN/19/20//none

[160]/0/161/1/0x70f734b0/2303/IGN/21/22//none

[161]/0/162/1/0x70f748e8/2301/IGN/23/24//none

[162]/0/163/1/0x70f75d20/2299/IGN/25/26//none

[163]/0/164/1/0x70f77158/2297/IGN/27/28//none

[164]/0/165/1/0x70f78590/2295/IGN/29/30//none

[165]/0/166/1/0x70f799c8/2293/IGN/31/32//none

[166]/0/167/1/0x70f7ae00/2291/IGN/33/34//none

[167]/0/168/1/0x70f7c238/2289/IGN/35/36//none

[168]/0/169/1/0x70f7d670/2287/IGN/37/38//none

[169]/0/170/1/0x70f7eaa8/2285/IGN/39/40//none

====================

END OF HANG ANALYSIS
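To pull the blocker and waiter out of a "State of nodes" section like the one above without reading it line by line, something along these lines may help. It is a sketch in Python: `parse` and `blocking_pairs` are illustrative names, and `TRACE` holds only a subset of the node lines from the trace above to keep the example short.

```python
import re
from collections import Counter

# Subset of the node lines from the trace above.
TRACE = """\
[143]/0/144/108/0x70f5dcf8/2614/SINGLE_NODE/1/2//none
[144]/0/145/36/0x70f5f130/2622/SINGLE_NODE_NW/3/4//none
[145]/0/146/84/0x70f60568/2607/NLEAF/5/8/[147]/none
[147]/0/148/27/0x70f62dd8/2543/LEAF/6/7//145
[154]/0/155/1/0x70f6bb60/2315/IGN/13/14//none
"""


def parse(trace):
    """Map nodenum -> dict of the fields needed to identify a session."""
    nodes = {}
    for line in trace.splitlines():
        parts = line.strip().split("/")
        head = parts[0].strip("[]")
        if len(parts) != 11 or not head.isdigit():
            continue  # skip headers and anything that is not a node line
        nodes[int(head)] = {
            "sid": int(parts[2]),
            "serial": int(parts[3]),
            "ospid": parts[5],
            "state": parts[6],
            "adjlist": [int(n) for n in re.findall(r"\d+", parts[9])],
        }
    return nodes


def blocking_pairs(nodes):
    """Return (waiter, blocker) pairs, one per adjlist entry."""
    pairs = []
    for node in nodes.values():
        for blocker_num in node["adjlist"]:
            blocker = nodes.get(blocker_num)
            if blocker is not None:
                pairs.append((node, blocker))
    return pairs


nodes = parse(TRACE)
print(Counter(n["state"] for n in nodes.values()))
for waiter, blocker in blocking_pairs(nodes):
    print("sid %d waits on sid %d (serial# %d, ospid %s, %s)" % (
        waiter["sid"], blocker["sid"], blocker["serial"],
        blocker["ospid"], blocker["state"]))
```

For this trace the only pair is session 146 waiting on session 148, which matches the Chain 1 annotation; the sid and serial# of the blocker are the two values you would need if you decided to kill that session.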

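The fields mean roughly the following:

cnode -- node number; it is meaningful on RAC and is 0 on a single instance

sid -- the session's sid

sess_srno -- the session's serial#

proc_ptr -- address of the process (process pointer)

ospid -- OS process ID

wait_event -- the session's wait event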

Excerpt reposted from 白大師 (Master Bai)

Hanganalyze has been available since Oracle 8i r2 (8.1.6), and it is very simple to use:

ALTER SESSION SET EVENTS 'immediate trace name HANGANALYZE level <level>';

or

ORADEBUG hanganalyze <level>

For example:

sql>oradebug setmypid;

sql>oradebug hanganalyze 3;

The levels are:

      10     Dump all processes (IGN state)

      5      Level 4 + Dump all processes involved in wait chains (NLEAF state)

      4      Level 3 + Dump leaf nodes (blockers) in wait chains (LEAF,LEAF_NW,IGN_DMP state)

      3      Level 2 + Dump only processes thought to be in a hang (IN_HANG state)

    1-2    Only HANGANALYZE output, no process dump at all

-bash-3.2$ sqlplus -prelim '/as sysdba' -- the prelim option gets you into a database that is already hung (when a normal sqlplus login fails)

SQL*Plus: Release 10.2.0.1.0 - Production on Sat Aug 7 21:17:42 2010

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

SQL> show parameter sga

ORA-01012: not logged on

SQL> conn /as sysdba

Prelim connection established

SQL>