Freshly installed database fails with ORA-01102 (ORA-1102 signalled during ...)

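     A database I had installed only yesterday raised ORA-01102 at startup, even though the installation itself had gone smoothly and reported no errors. It was the first time I had run into this problem. I started by checking the alert log and looking up the error messages on Metalink. After some analysis I concluded that the cause was contention between processes for the same lock. Below is how I worked through the problem, followed by the original Metalink note and a friend's understanding of the problem and how he handles it.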

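     Why does the error below occur? When it mounts the database, the instance takes an exclusive lock on the file $ORACLE_HOME/dbs/lk<SID>. If processes left over from an earlier startup of the same instance still hold that file, the new lock request fails and the mount is refused with ORA-01102. The fix is therefore to find the processes that still hold the lock file and stop them, after which the database mounts normally.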

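sculkget: failed to lock /orasoft/product/10.2.0/db_1/dbs/lkWWL exclusive   (the exclusive lock on the lock file lkWWL could not be acquired)

sculkget: lock held by PID: 26312                                           (the lock is held by the process with PID 26312)

ORA-09968: Message 9968 not found; No message file for product=RDBMS, facility=ORA  (the message text could not be loaded; ORA-09968 is "unable to lock file")

Linux Error: 11: Resource temporarily unavailable                           (errno 11, EAGAIN: the lock is held elsewhere, so the request fails)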

Additional information: 26312

Thu Nov 17 15:51:16 2011

ORA-1102 signalled during: ALTER DATABASE   MOUNT...
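The two sculkget lines carry everything needed for the next steps: the lock file path and the PID that holds it. As a minimal sketch, assuming the alert-log text is available on standard input, they can be extracted like this (lock_holder_info is a hypothetical helper name, not an Oracle tool):

```shell
# Sketch: pull the lock file path and the holder PID out of sculkget lines.
# lock_holder_info is a hypothetical helper; it reads alert-log text on stdin.
lock_holder_info() {
    sed -n \
        -e 's/^sculkget: failed to lock \([^ ]*\) exclusive.*$/lockfile=\1/p' \
        -e 's/^sculkget: lock held by PID: \([0-9][0-9]*\).*$/pid=\1/p'
}

# Demo with the two lines from the excerpt above.
lock_holder_info <<'EOF'
sculkget: failed to lock /orasoft/product/10.2.0/db_1/dbs/lkWWL exclusive
sculkget: lock held by PID: 26312
EOF
```

The demo prints one lockfile= line and one pid= line; in practice the input would be the tail of the alert log.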

The error was resolved as follows:

1. The following command shows that the process holding the lock is ora_dbw0_wwl:

[oracle@ora10g dbs]$ ps -ef|grep 26312

oracle   26312     1  0 15:43 ?        00:00:02 ora_dbw0_wwl

oracle   26663 26574  0 17:39 pts/1    00:00:00 grep 26312

2. Connect to the database and shut down the instance first:

[oracle@ora10g ~]$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.1.0 - Production on Thu Nov 17 17:45:56 2011

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

Connected to:

Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production

With the Partitioning, OLAP and Data Mining options

SQL> shutdown immediate

ORA-01507: database not mounted

ORACLE instance shut down.

SQL> exit

Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production

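Go to $ORACLE_HOME/dbs. It contains a file named lkWWL, the lock file named in the alert log; once the instance is down, no process should still be holding it.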

[oracle@ora10g ~]$ cd $ORACLE_HOME/dbs

[oracle@ora10g dbs]$ ls

hc_wwl.dat  initdw.ora  init.ora  lkWWL  orapwwwl  spfilewwl.ora

[oracle@ora10g dbs]$ su - root

Password:

Running fuser -u lkWWL confirms that processes are indeed still holding the file:

[root@ora10g ~]# cd /orasoft/product/10.2.0/db_1/dbs

[root@ora10g dbs]# fuser -u lkWWL

lkWWL:               26306 26308 26310 26312 26314 26316 26318 26320 26322 26324 26326 26334 26336 26340 26354 26356

[root@ora10g dbs]# fuser -k lkWWL
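Because fuser -k sends SIGKILL to every process that has the file open, it can help to look at the holders first and confirm that they really are leftovers of this instance. The sketch below is one way to do that; show_holders is a hypothetical helper, and it relies on fuser writing the PIDs to standard output and the file name to standard error, as the fuser description quoted later in this article states.

```shell
# Sketch: show who holds a lock file before deciding to kill anything.
# show_holders is a hypothetical helper; run it as root or as the oracle user.
show_holders() {
    lockfile=$1
    # fuser writes the PIDs to stdout and the file name to stderr,
    # so discarding stderr leaves only the PID list.
    pids=$(fuser "$lockfile" 2>/dev/null)
    for pid in $pids; do
        # One line per holder: PID, owner and full command line.
        ps -p "$pid" -o pid= -o user= -o args=
    done
}

# Example (path taken from the alert log above):
# show_holders /orasoft/product/10.2.0/db_1/dbs/lkWWL
```

If every line shows an ora_..._wwl process owned by oracle, killing the holders as above affects only the stale instance.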

Start the database again; this time it comes up normally without errors.

[root@ora10g dbs]# su - oracle

SQL*Plus: Release 10.2.0.1.0 - Production on Thu Nov 17 17:47:50 2011

Connected to an idle instance.

SQL> startup

ORACLE instance started.

Total System Global Area  285212672 bytes

Fixed Size                  1218992 bytes

Variable Size              92276304 bytes

Database Buffers          188743680 bytes

Redo Buffers                2973696 bytes

Database mounted.

Database opened.

SQL> col host_name format a20

SQL> select host_name,instance_name,status from v$instance;

HOST_NAME            INSTANCE_NAME    STATUS

-------------------- ---------------- ------------

ora10g.localdomain   wwl              OPEN

SQL>

The original Metalink note follows:

analysis:

Problem Description:

==================== 

You are trying to startup the database and you receive the following error:

     ORA-01102:  cannot mount database in EXCLUSIVE mode

       Cause:  Some other instance has the database mounted exclusive

               or shared.

      Action: Shutdown other instance or mount in a compatible mode.

    Problem Explanation:

A database is started in EXCLUSIVE mode by default.  Therefore, the

ORA-01102 error is misleading and may have occurred due to one of the

following reasons: 

  - there is still an "sgadef<sid>.dbf" file in the "ORACLE_HOME/dbs"

    directory

  - the processes for Oracle (pmon, smon, lgwr and dbwr) still exist

  - shared memory segments and semaphores still exist even though the

    database has been shutdown

  - there is a "ORACLE_HOME/dbs/lk<sid>" file

   Search Words:

============= 

ORA-1102, crash, immediate, abort, fail, fails, migration

Solution Description:

===================== 

Verify that the database was shutdown cleanly by doing the following:

  1. Verify that there is not a "sgadef<sid>.dbf" file in the directory

   "ORACLE_HOME/dbs".   

        % ls $ORACLE_HOME/dbs/sgadef<sid>.dbf

     If this file does exist, remove it. 

        % rm $ORACLE_HOME/dbs/sgadef<sid>.dbf 

  2. Verify that there are no background processes owned by "oracle"

          % ps -ef | grep ora_ | grep $ORACLE_SID

     If background processes exist, remove them by using the Unix

   command "kill".  For example:

          % kill -9 <Process_ID_Number>

  3. Verify that no shared memory segments and semaphores that are owned

   by "oracle" still exist

          % ipcs -b

     If there are shared memory segments and semaphores owned by "oracle",

   remove the shared memory segments

          % ipcrm -m <Shared_Memory_ID_Number>

     and remove the semaphores

          % ipcrm -s <Semaphore_ID_Number>

     NOTE:  The example shown above assumes that you only have one

          database on this machine.  If you have more than one

          database, you will need to shutdown all other databases

          before proceeding with Step 4.

  4. Verify that the "$ORACLE_HOME/dbs/lk<sid>" file does not exist

  5. Startup the instance

    Solution Explanation:

The "lk<sid>" and "sgadef<sid>.dbf" files are used for locking shared memory.

It seems that even though no memory is allocated, Oracle thinks memory is

still locked.  By removing the "sgadef" and "lk" files you remove any

knowledge oracle has of shared memory that is in use. Now the database can start.
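Steps 1 and 4 of the note come down to looking for two files in $ORACLE_HOME/dbs. The following is a minimal, read-only sketch of that check; leftover_files is a hypothetical helper name, and the upper-case variant of the lk file name is an assumption based on the transcripts in this article (lkWWL for instance wwl). Note that the existence of an lk file alone does not show that a process still holds it; fuser answers that question.

```shell
# Sketch of steps 1 and 4 of the note: report leftover files, delete nothing.
# leftover_files is a hypothetical helper; arguments are the dbs directory and the SID.
leftover_files() {
    dbs_dir=$1
    sid=$2
    # Assumption: the lk file may carry the name in upper case, as in lkWWL.
    upper=$(printf '%s' "$sid" | tr '[:lower:]' '[:upper:]')
    candidates="sgadef$sid.dbf lk$sid"
    if [ "$upper" != "$sid" ]; then
        candidates="$candidates lk$upper"
    fi
    for f in $candidates; do
        if [ -e "$dbs_dir/$f" ]; then
            printf '%s\n' "$dbs_dir/$f"
        fi
    done
}

# Example:
# leftover_files "$ORACLE_HOME/dbs" "$ORACLE_SID"
```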

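My friend's understanding of the problem and his solution are as follows:

ORA-01102 can have the following causes:

I. In an HA system, another node has already started the instance and taken over the resources shared by the two machines (such as raw devices on the disk array);

II. Oracle was shut down abnormally and some resources were not released. This usually means one of the following:

1. Oracle's shared memory segments or semaphores were not released;

2. Oracle background processes (SMON, PMON, DBWn and so on) were not terminated;

3. The files used for locking memory, lk<sid> and sgadef<sid>.dbf, were not removed.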

solution:

method 1:

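First, although our system is an HA system, the instance on the standby node was shut down the whole time, which we confirmed by checking the database status on the standby node.

Second, the database went down because of a power failure, and the system was rebooted once power was restored, so we ruled out points 1 and 2 of the second cause. That left point 3 as the most likely suspect.

Check the $ORACLE_HOME/dbs directory: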

$ cd $ORACLE_HOME/dbs

$ ls sgadef*

sgadef* not found

$ ls lk*

lkORA92

Sure enough, the lk<sid> file had not been removed. Delete it:

$ rm lk*

Start the database again; it succeeds.

If you suspect that shared memory was not released, check with the following command:

$ipcs -mop

IPC status from /dev/kmem as of Thu Jul  6 14:41:43 2006

T      ID     KEY        MODE        OWNER     GROUP NATTCH  CPID  LPID

Shared Memory:

m       0 0x411c29d6 --rw-rw-rw-      root      root      0   899   899

m       1 0x4e0c0002 --rw-rw-rw-      root      root      2   899   901

m       2 0x4120007a --rw-rw-rw-      root      root      2   899   901

m  458755 0x0c6629c9 --rw-r-----      root       sys      2  9113 17065

m       4 0x06347849 --rw-rw-rw-      root      root      1  1661  9150

m   65541 0xffffffff --rw-r--r--      root      root      0  1659  1659

m  524294 0x5e100011 --rw-------      root      root      1  1811  1811

m  851975 0x5fe48aa4 --rw-r-----    oracle  oinstall     66  2017 25076

Then remove the shared memory segment by its ID:

$ipcrm -m 851975

For semaphores, check with the following command:

$ ipcs -sop

IPC status from /dev/kmem as of Thu Jul  6 14:44:16 2006

T      ID     KEY        MODE        OWNER     GROUP

Semaphores:

s       0 0x4f1c0139 --ra-------      root      root

... ...

s      14 0x6c200ad8 --ra-ra-ra-      root      root

s      15 0x6d200ad8 --ra-ra-ra-      root      root

s      16 0x6f200ad8 --ra-ra-ra-      root      root

s      17 0xffffffff --ra-r--r--      root      root

s      18 0x410c05c7 --ra-ra-ra-      root      root

s      19 0x00446f6e --ra-r--r--      root      root

s      20 0x00446f6d --ra-r--r--      root      root

s      21 0x00000001 --ra-ra-ra-      root      root

s   45078 0x67e72b58 --ra-r-----    oracle  oinstall

Remove the semaphore by its ID with the following command:

$ipcrm -s 45078
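Picking the oracle-owned IDs out of long ipcs listings by eye is error-prone. As a sketch, assuming the column layout shown in the listings above (T, ID, KEY, MODE, OWNER, GROUP), the IDs can be filtered with awk. The helper name ipc_ids_for is hypothetical. On Linux, ipcs prints a different layout (key, shmid, owner, ...), so the field numbers would need adjusting there.

```shell
# Sketch: list IPC IDs of one kind owned by a given user, from ipcs output on stdin.
# Assumes the column layout shown above: T ID KEY MODE OWNER GROUP ...
# ipc_ids_for is a hypothetical helper name.
ipc_ids_for() {
    kind=$1     # "m" for shared memory, "s" for semaphores
    owner=$2
    awk -v k="$kind" -v o="$owner" '$1 == k && $5 == o { print $2 }'
}

# Demo with rows copied from the shared memory listing above.
ipc_ids_for m oracle <<'EOF'
m  524294 0x5e100011 --rw-------      root      root      1  1811  1811
m  851975 0x5fe48aa4 --rw-r-----    oracle  oinstall     66  2017 25076
EOF
```

In the shared memory listing, the NATTCH column is the number of attached processes; a non-zero value for the oracle segment means processes are still attached, so it is worth dealing with those processes before removing the segment.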

If Oracle processes were not terminated, list the remaining oracle processes with the following command:

$ ps -ef|grep ora

  oracle 29976     1  0  Jun 22  ?         0:52 ora_dbw0_ora92

  oracle 29978     1  0  Jun 22  ?         0:51 ora_dbw1_ora92

  oracle  5128     1  0  Jul  5  ?         0:00 oracleora92 (LOCAL=NO)

Then kill the processes with kill -9:

$kill -9 <ID>
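A plain grep for "ora" also matches other instances on the same host and unrelated commands, which matters given the Metalink note's warning about machines with more than one database. The sketch below narrows the list to one instance. The name instance_pids is hypothetical, and the two name patterns (ora_<name>_<sid> for background processes, oracle<sid> for server processes) are taken from the ps output shown in this article.

```shell
# Sketch: print PIDs of processes that belong to one instance, from `ps -ef` text on stdin.
# instance_pids is a hypothetical helper; it matches background processes
# (ora_<name>_<sid>) and server processes (oracle<sid>).
instance_pids() {
    sid=$1
    awk -v sid="$sid" '
        $0 ~ ("(ora_[a-z0-9]+_" sid "|oracle" sid ")( |$)") { print $2 }
    '
}

# Demo: the first three rows come from the listing above; the last row is a
# made-up process of another instance that must not be selected.
instance_pids ora92 <<'EOF'
  oracle 29976     1  0  Jun 22  ?         0:52 ora_dbw0_ora92
  oracle 29978     1  0  Jun 22  ?         0:51 ora_dbw1_ora92
  oracle  5128     1  0  Jul  5  ?         0:00 oracleora92 (LOCAL=NO)
  oracle  6001     1  0  Jul  5  ?         0:00 ora_pmon_other
EOF
```

In practice the input would be `ps -ef | instance_pids "$ORACLE_SID"`, and the resulting list is worth reviewing before any kill.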

method 2:

[root@qa-oracle dbs]# fuser -u lkNDMSQA

lkNDMSQA:             6666(oracle)  6668(oracle)  6670(oracle)  6672(oracle)  6674(oracle)  6676(oracle)  6678(oracle)  6680(oracle)  6690(oracle)  6692(oracle)  6694(oracle)  6696(oracle)  6737(oracle)  6830(oracle)

Sure enough, the file has not been released. Kill the holders with fuser:

[root@qa-oracle dbs]# fuser -k lkNDMSQA

lkNDMSQA:             6666  6668  6670  6672  6674  6676  6678  6680  6690  6692  6694  6696  6737  6830

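Summary:

When ORA-01102 occurs, check and troubleshoot in the following order:

If it is an HA system, check whether another node has already started the instance;

Check whether Oracle processes still exist; if so, kill them;

Check whether semaphores still exist; if so, remove them;

Check whether shared memory segments still exist; if so, remove them;

Check whether the memory lock files lk<sid> and sgadef<sid>.dbf exist; if so, delete them.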

ORA-09968: unable to lock file lk$ORACLE_SID (2010-03-04 14:53)

starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...

starting up 1 shared server(s) ...

Thu Mar  4 11:48:07 2010

ALTER DATABASE   MOUNT

sculkget: failed to lock /u01/app/oracle/product/10.2.0/db_1/dbs/lkFDS exclusive

sculkget: lock held by PID: 3443

ORA-09968: unable to lock file

Linux Error: 11: Resource temporarily unavailable

Additional information: 3443

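The message says that process 3443 holds the lock on this resource. The log from the previous startup shows that this process is the Oracle background process DBWR (DBW0 in the log). According to note 236794.1, the process has probably hung, which prevents the database from running normally.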

fuser -u /u01/app/oracle/product/10.2.0/db_1/dbs/lkFDS

PMON started with pid=2, OS id=3437

MMAN started with pid=4, OS id=3441

PSP0 started with pid=3, OS id=3439

DBW0 started with pid=5, OS id=3443

LGWR started with pid=6, OS id=3445

CKPT started with pid=7, OS id=3447

SMON started with pid=8, OS id=3449

RECO started with pid=9, OS id=3451

CJQ0 started with pid=10, OS id=3453

MMON started with pid=11, OS id=3455

Tue Feb 16 11:08:17 2010

MMNL started with pid=12, OS id=3457

Tue Feb 16 11:08:18 2010

Tue Feb 16 11:08:22 2010

Setting recovery target incarnation to 2

Successful mount of redo thread 1, with mount id 1844152034

Database mounted in Exclusive Mode

Completed: ALTER DATABASE   MOUNT

ALTER DATABASE OPEN
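Matching the PID from the sculkget message against the "started with" lines of the earlier startup is how process 3443 was identified as DBW0. A small sketch of that lookup, assuming the alert-log text is supplied on standard input (process_for_os_id is a hypothetical helper name):

```shell
# Sketch: name the background process that was started with a given OS id.
# process_for_os_id is a hypothetical helper; it reads alert-log text on stdin.
process_for_os_id() {
    awk -v id="$1" '$2 == "started" && $NF == ("id=" id) { print $1 }'
}

# Demo with three lines from the startup log above.
process_for_os_id 3443 <<'EOF'
PSP0 started with pid=3, OS id=3439
DBW0 started with pid=5, OS id=3443
LGWR started with pid=6, OS id=3445
EOF
```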

Use lsof to view the processes locking the file:

# lsof |grep lkFDS                                      

oracle     4476 oracle   17uR     REG        8,7         24    2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS

oracle     4478 oracle   15uR     REG        8,7         24    2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS

oracle     4480 oracle   15uR     REG        8,7         24    2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS

oracle     4482 oracle   15uR     REG        8,7         24    2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS

oracle     4484 oracle   15uR     REG        8,7         24    2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS

oracle     4486 oracle   15uR     REG        8,7         24    2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS

oracle     4488 oracle   15uR     REG        8,7         24    2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS

oracle     4490 oracle   15uR     REG        8,7         24    2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS

oracle     4492 oracle   15uR     REG        8,7         24    2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS

oracle     4494 oracle   15uR     REG        8,7         24    2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS

oracle     4496 oracle   15uR     REG        8,7         24    2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS

oracle     4513 oracle   15u      REG        8,7         24    2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS

oracle     4531 oracle   15u      REG        8,7         24    2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS

oracle     4534 oracle   15u      REG        8,7         24    2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS

oracle     4812 oracle   15u      REG        8,7         24    2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
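In the lsof listing, the FD column distinguishes processes that hold a lock from those that merely have the file open: the digits are the descriptor number, the next letter is the open mode, and a trailing letter such as R marks a lock (R is a read lock on the whole file). The sketch below filters on that column; lock_holders is a hypothetical helper name, and the field positions assume the default lsof layout shown above.

```shell
# Sketch: from lsof output on stdin, print the PIDs whose FD column ends in a lock character.
# lock_holders is a hypothetical helper name. In the FD column the digits are the
# descriptor number, the next letter is the open mode (r, w or u) and an optional
# final letter describes a lock (for example R = read lock on the whole file).
lock_holders() {
    awk '$4 ~ /^[0-9]+[rwu][NrRwWuUxX]$/ { print $2, substr($4, length($4)) }'
}

# Demo with two rows from the listing above: 4476 holds a lock, 4513 does not.
lock_holders <<'EOF'
oracle     4476 oracle   17uR     REG        8,7         24    2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle     4513 oracle   15u      REG        8,7         24    2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
EOF
```

Typical use would be `lsof | grep lkFDS | lock_holders`, which prints each locking PID followed by its lock character.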

Use fuser to view the processes locking the file:

# fuser -u /u01/app/oracle/product/10.2.0/db_1/dbs/lkFDS

/u01/app/oracle/product/10.2.0/db_1/dbs/lkFDS:  4476(oracle)  4478(oracle)  4480(oracle)  4482(oracle)  4484(oracle)  4486(oracle)  4488(oracle)  4490(oracle)  4492(oracle)  4494(oracle)  4496(oracle)  4513(oracle)  4531(oracle)  4534(oracle)  4812(oracle)

[root@CHN-DG-3-5CE ~]#

What does fuser do, and how exactly is it used?

fuser Command

Purpose

Identifies processes using a file or file structure.

Syntax

fuser [ -c | -d | -f ] [ -k ] [ -u ] [ -x ] [ -V ] File ...

Description

The fuser command lists the process numbers of local processes that use the

local or remote files specified by the File parameter. For block special

devices, the command lists the processes that use any file on that device.

c Uses the file as the current directory.

e Uses the file as a program's executable object.

r Uses the file as the root directory.

s Uses the file as a shared library (or other loadable object).

The process numbers are written to standard output in a line with spaces between

process numbers. A new line character is written to standard error after the

last output for each file operand. All other output is written to standard

error.

The fuser command will not detect processes that have mmap regions where that

associated file descriptor has since been closed.

Flags

-c Reports on any open files in the file system containing File.

-d Implies the use of the -c and -x flags. Reports on any open files which have

been unlinked from the file system (deleted from the parent directory). When

used in conjunction with the -V flag, it also reports the inode number and size

of the deleted file.

-f Reports on open instances of File only.

-k Sends the SIGKILL signal to each local process. Only the root user can kill a

process of another user.

-u Provides the login name for local processes in parentheses after the process

number.

-V Provides verbose output.

-x Used in conjunction with -c or -f, reports on executable and loadable objects

in addition to the standard fuser output.

Examples

  1. To list the process numbers of local processes using the /etc/passwd file,

     enter:

     fuser /etc/passwd

  2. To list the process numbers and user login names of processes using the

     /etc/filesystems file, enter:

     fuser -u /etc/filesystems

  3. To terminate all of the processes using a given file system, enter:

     fuser -k -x -u /dev/hd1 -OR-

     fuser -kxuc /home

     Either command lists the process number and user name, and then terminates

     each process that is using the /dev/hd1 (/home) file system. Only the root

     user can terminate processes that belong to another user. You might want to

     use this command