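A database I had finished installing only yesterday reported ORA-01102 on startup. The installation itself had shown no errors and went smoothly, and this was the first time I had run into this problem. I first checked the alert log and looked up the relevant error messages on Metalink. After some analysis I concluded that the error was caused by contention over a resource shared between processes. Below is how I worked through the problem. At the end I have attached the original Metalink note and a friend's understanding of the problem and how he handled it.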
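Why does the error below occur? The alert log shows that the new instance could not take an exclusive lock on its lock file, $ORACLE_HOME/dbs/lkWWL, because an Oracle background process left over from an earlier startup still held that lock. Until the leftover processes are gone and the lock is released, the database cannot be mounted in exclusive mode. The relevant alert log entries are: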
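sculkget: failed to lock /orasoft/product/10.2.0/db_1/dbs/lkWWL exclusive    (the exclusive lock on the lock file could not be acquired)
sculkget: lock held by PID: 26312    (the lock is held by the process with PID 26312)
ORA-09968: Message 9968 not found; No message file for product=RDBMS, facility=ORA    (the message file could not be loaded; ORA-09968 is "unable to lock file")
Linux Error: 11: Resource temporarily unavailable    (errno 11, EAGAIN: the lock is currently held by another process)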
Additional information: 26312
Thu Nov 17 15:51:16 2011
ORA-1102 signalled during: ALTER DATABASE MOUNT...
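The two sculkget lines carry everything needed for the next step: the lock file path and the PID holding it. As a minimal sketch (not part of the original procedure), the following Python pulls both values out of alert log text. The function name `parse_lock_failure` is made up for illustration, and the patterns assume the message wording shown in the excerpt above.

```python
import re

def parse_lock_failure(alert_text):
    """Return (lock_file, holder_pid); each part is None if its line is absent."""
    path = re.search(r"sculkget: failed to lock (\S+) exclusive", alert_text)
    pid = re.search(r"sculkget: lock held by PID: (\d+)", alert_text)
    return (
        path.group(1) if path else None,
        int(pid.group(1)) if pid else None,
    )

# Sample lines copied from the alert log excerpt above.
sample = (
    "sculkget: failed to lock /orasoft/product/10.2.0/db_1/dbs/lkWWL exclusive\n"
    "sculkget: lock held by PID: 26312\n"
    "Linux Error: 11: Resource temporarily unavailable\n"
)
print(parse_lock_failure(sample))
```

The PID it returns is the one to look up with ps, and the path is the file to pass to fuser.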
The steps I took to resolve the error were as follows:
1. The following command shows that the process holding the lock is ora_dbw0_wwl:
[oracle@ora10g dbs]$ ps -ef|grep 26312
oracle 26312 1 0 15:43 ? 00:00:02 ora_dbw0_wwl
oracle 26663 26574 0 17:39 pts/1 00:00:00 grep 26312
2. Log in to the database and shut down the instance first:
[oracle@ora10g ~]$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.1.0 - Production on Thu Nov 17 17:45:56 2011
Copyright (c) 1982, 2005, Oracle. All rights reserved.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, OLAP and Data Mining options
SQL> shutdown immediate
ORA-01507: database not mounted
ORACLE instance shut down.
SQL> exit
Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
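Go to $ORACLE_HOME/dbs. There is a file named lkWWL there, the lock file named in the alert log; once the instance is shut down, no process should still be holding it.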
[oracle@ora10g ~]$ cd $ORACLE_HOME/dbs
[oracle@ora10g dbs]$ ls
hc_wwl.dat initdw.ora init.ora lkWWL orapwwwl spfilewwl.ora
[oracle@ora10g dbs]$ su - root
Password:
Running fuser -u lkWWL confirms that the processes have indeed not been released:
[root@ora10g ~]# cd /orasoft/product/10.2.0/db_1/dbs
[root@ora10g dbs]# fuser -u lkWWL
lkWWL: 26306 26308 26310 26312 26314 26316 26318 26320 26322 26324 26326 26334 26336 26340 26354 26356
[root@ora10g dbs]# fuser -k lkWWL
Start the database again. This time there is no error and it comes up normally.
[root@ora10g dbs]# su - oracle
SQL*Plus: Release 10.2.0.1.0 - Production on Thu Nov 17 17:47:50 2011
Connected to an idle instance.
SQL> startup
ORACLE instance started.
Total System Global Area 285212672 bytes
Fixed Size 1218992 bytes
Variable Size 92276304 bytes
Database Buffers 188743680 bytes
Redo Buffers 2973696 bytes
Database mounted.
Database opened.
SQL> col host_name format a20
SQL> select host_name,instance_name,status from v$instance;
HOST_NAME INSTANCE_NAME STATUS
-------------------- ---------------- ------------
ora10g.localdomain wwl OPEN
SQL>
The original Metalink note follows:
analysis:
Problem Description:
====================
You are trying to startup the database and you receive the following error:
ORA-01102: cannot mount database in EXCLUSIVE mode
Cause: Some other instance has the database mounted exclusive
or shared.
Action: Shutdown other instance or mount in a compatible mode.
Problem Explanation:
A database is started in EXCLUSIVE mode by default. Therefore, the
ORA-01102 error is misleading and may have occurred due to one of the
following reasons:
- there is still an "sgadef<sid>.dbf" file in the "ORACLE_HOME/dbs"
directory
- the processes for Oracle (pmon, smon, lgwr and dbwr) still exist
- shared memory segments and semaphores still exist even though the
database has been shutdown
- there is a "ORACLE_HOME/dbs/lk<sid>" file
Search Words:
=============
ORA-1102, crash, immediate, abort, fail, fails, migration
Solution Description:
=====================
Verify that the database was shutdown cleanly by doing the following:
1. Verify that there is not a "sgadef<sid>.dbf" file in the directory
"ORACLE_HOME/dbs".
% ls $ORACLE_HOME/dbs/sgadef<sid>.dbf
If this file does exist, remove it.
% rm $ORACLE_HOME/dbs/sgadef<sid>.dbf
2. Verify that there are no background processes owned by "oracle"
% ps -ef | grep ora_ | grep $ORACLE_SID
If background processes exist, remove them by using the Unix
command "kill". For example:
% kill -9 <Process_ID_Number>
3. Verify that no shared memory segments and semaphores that are owned
by "oracle" still exist
% ipcs -b
If there are shared memory segments and semaphores owned by "oracle",
remove the shared memory segments
% ipcrm -m <Shared_Memory_ID_Number>
and remove the semaphores
% ipcrm -s <Semaphore_ID_Number>
NOTE: The example shown above assumes that you only have one
database on this machine. If you have more than one
database, you will need to shutdown all other databases
before proceeding with Step 4.
4. Verify that the "$ORACLE_HOME/dbs/lk<sid>" file does not exist
5. Startup the instance
Solution Explanation:
The "lk<sid>" and "sgadef<sid>.dbf" files are used for locking shared memory. It seems that even though no memory is allocated, Oracle thinks memory is still locked. By removing the "sgadef" and "lk" files you remove any knowledge oracle has of shared memory
that is in use. Now the database can start.
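Step 2 of the note filters the ps listing down to the background processes of one SID. A minimal Python sketch of the same filter is shown here. It is an illustration, not something from the note: the helper name `background_pids` is hypothetical, and it assumes `ps -ef` layout (PID in the second column, process name in the last) and background process names of the form ora_<name>_<sid>.

```python
import re

def background_pids(ps_output, sid):
    """Return PIDs of processes whose command is ora_<name>_<sid>."""
    pattern = re.compile(r"ora_\w+_" + re.escape(sid))
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # fields[1] is the PID; the last field is the command for background processes.
        if len(fields) >= 2 and fields[1].isdigit() and pattern.fullmatch(fields[-1]):
            pids.append(int(fields[1]))
    return pids

# Sample text taken from the ps output shown earlier in this post.
sample = (
    "oracle 26312 1 0 15:43 ? 00:00:02 ora_dbw0_wwl\n"
    "oracle 26663 26574 0 17:39 pts/1 00:00:00 grep 26312\n"
)
print(background_pids(sample, "wwl"))
```

Matching on the whole command name keeps the grep process itself, and dedicated server processes such as oracleora92 (LOCAL=NO), out of the result.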
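My friend's understanding of the problem and his solution are as follows:
Error 1102 can have the following causes:
1. In an HA system, another node has already started the instance and is holding the resources shared by both machines (such as raw devices on the disk array).
2. Oracle was shut down abnormally and some resources were not released. The usual possibilities are:
(1) Oracle's shared memory segments or semaphores were not released;
(2) Oracle's background processes (such as SMON, PMON, DBWn) were not shut down;
(3) the files used to lock memory, lk<sid> and sgadef<sid>.dbf, were not deleted.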
solution:
method1:
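First, although our system is an HA system, the instance on the standby node is always shut down, which can be confirmed by checking the database status on the standby node.
Second, the database went down because of a power failure and the system was rebooted once power was restored, so we ruled out points (1) and (2) of the second cause. Point (3) was the most likely suspect.
Check the $ORACLE_HOME/dbs directory: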
$ cd $ORACLE_HOME/dbs
$ ls sgadef*
sgadef* not found
$ ls lk*
lkORA92
Sure enough, the lk<sid> file had not been deleted. Delete it:
$ rm lk*
Start the database again: success.
If you suspect that shared memory has not been released, check with the following command:
$ipcs -mop
IPC status from /dev/kmem as of Thu Jul 6 14:41:43 2006
T ID KEY MODE OWNER GROUP NATTCH CPID LPID
Shared Memory:
m 0 0x411c29d6 --rw-rw-rw- root root 0 899 899
m 1 0x4e0c0002 --rw-rw-rw- root root 2 899 901
m 2 0x4120007a --rw-rw-rw- root root 2 899 901
m 458755 0x0c6629c9 --rw-r----- root sys 2 9113 17065
m 4 0x06347849 --rw-rw-rw- root root 1 1661 9150
m 65541 0xffffffff --rw-r--r-- root root 0 1659 1659
m 524294 0x5e100011 --rw------- root root 1 1811 1811
m 851975 0x5fe48aa4 --rw-r----- oracle oinstall 66 2017 25076
Then remove the shared memory segment by its ID:
$ipcrm -m 851975
For semaphores, check with the following command:
$ ipcs -sop
IPC status from /dev/kmem as of Thu Jul 6 14:44:16 2006
T ID KEY MODE OWNER GROUP
Semaphores:
s 0 0x4f1c0139 --ra------- root root
... ...
s 14 0x6c200ad8 --ra-ra-ra- root root
s 15 0x6d200ad8 --ra-ra-ra- root root
s 16 0x6f200ad8 --ra-ra-ra- root root
s 17 0xffffffff --ra-r--r-- root root
s 18 0x410c05c7 --ra-ra-ra- root root
s 19 0x00446f6e --ra-r--r-- root root
s 20 0x00446f6d --ra-r--r-- root root
s 21 0x00000001 --ra-ra-ra- root root
s 45078 0x67e72b58 --ra-r----- oracle oinstall
Using the semaphore ID, remove the semaphore with the following command:
$ipcrm -s 45078
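Picking the oracle-owned IDs out of a long ipcs listing by eye is error-prone. The sketch below, which is an illustration and not part of the original write-up, collects them from captured ipcs text. The helper name `owned_ipc_ids` is hypothetical, and it assumes the column layout of the listings above (type, ID, key, mode, owner, group). The default Linux ipcs output uses a different layout and would need different column positions.

```python
def owned_ipc_ids(ipcs_output, owner="oracle"):
    """Return {"m": [shared memory IDs], "s": [semaphore IDs]} owned by `owner`."""
    ids = {"m": [], "s": []}
    for line in ipcs_output.splitlines():
        fields = line.split()
        # Data rows look like: <type> <id> <key> <mode> <owner> <group> ...
        if (len(fields) >= 5 and fields[0] in ids
                and fields[1].isdigit() and fields[4] == owner):
            ids[fields[0]].append(int(fields[1]))
    return ids

# Sample rows copied from the ipcs listings above.
sample = """T ID KEY MODE OWNER GROUP
Shared Memory:
m 524294 0x5e100011 --rw------- root root 1 1811 1811
m 851975 0x5fe48aa4 --rw-r----- oracle oinstall 66 2017 25076
Semaphores:
s 21 0x00000001 --ra-ra-ra- root root
s 45078 0x67e72b58 --ra-r----- oracle oinstall
"""
print(owned_ipc_ids(sample))
```

The IDs under "m" are candidates for ipcrm -m and those under "s" for ipcrm -s, but only after confirming that no other Oracle instance on the machine is still using them.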
If Oracle processes were not shut down, find the remaining oracle processes with the following command:
$ ps -ef|grep ora
oracle 29976 1 0 Jun 22 ? 0:52 ora_dbw0_ora92
oracle 29978 1 0 Jun 22 ? 0:51 ora_dbw1_ora92
oracle 5128 1 0 Jul 5 ? 0:00 oracleora92 (LOCAL=NO)
Then kill the processes with kill -9:
$kill -9 <ID>
method 2
[root@qa-oracle dbs]# fuser -u lkNDMSQA
lkNDMSQA: 6666(oracle) 6668(oracle) 6670(oracle) 6672(oracle) 6674(oracle) 6676(oracle) 6678(oracle) 6680(oracle) 6690(oracle) 6692(oracle) 6694(oracle) 6696(oracle) 6737(oracle) 6830(oracle)
Sure enough, the file has not been released. Kill the processes holding it with fuser:
[root@qa-oracle dbs]# fuser -k lkNDMSQA
lkNDMSQA: 6666 6668 6670 6672 6674 6676 6678 6680 6690 6692 6694 6696 6737 6830
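Since fuser -k kills every listed process at once, it can be worth turning the fuser -u listing into a list of PIDs and owners and reviewing it first. A minimal Python sketch, with the hypothetical helper name `parse_fuser_line`, assuming the one-line "file: pid(user) pid(user) ..." form shown above:

```python
import re

def parse_fuser_line(line):
    """Return (file_name, [(pid, user_or_None), ...]) from one line of fuser output."""
    name, _, rest = line.partition(":")
    holders = []
    for token in rest.split():
        # A token is a PID, optional access-type letters, and optionally "(user)".
        m = re.fullmatch(r"(\d+)[a-z]*(?:\((\w+)\))?", token)
        if m:
            holders.append((int(m.group(1)), m.group(2)))
    return name.strip(), holders

# Sample lines shortened from the fuser output above.
with_users = "lkNDMSQA: 6666(oracle) 6668(oracle) 6670(oracle)"
without_users = "lkNDMSQA: 6666 6668 6670"
print(parse_fuser_line(with_users))
print(parse_fuser_line(without_users))
```

Each PID can then be checked against ps before anything is killed.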
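Summary:
When error 1102 occurs, check and troubleshoot in the following order:
If it is an HA system, check whether another node has already started the instance;
Check whether Oracle processes still exist; if so, kill them;
Check whether semaphores still exist; if so, remove them;
Check whether shared memory segments still exist; if so, remove them;
Check whether the lock files lk<sid> and sgadef<sid>.dbf exist; if so, delete them.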
ORA-09968: unable to lock file lk$ORACLE_SID (2010-03-04 14:53)
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
starting up 1 shared server(s) ...
Thu Mar 4 11:48:07 2010
ALTER DATABASE MOUNT
sculkget: failed to lock /u01/app/oracle/product/10.2.0/db_1/dbs/lkFDS exclusive
sculkget: lock held by PID: 3443
ORA-09968: unable to lock file
Linux Error: 11: Resource temporarily unavailable
Additional information: 3443
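The log shows that process 3443 is holding the resource. The log from the previous startup shows that this is the Oracle background process DBWR. According to note 236794.1, the process may have hung, which prevents the database from running normally.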
fuser -u /u01/app/oracle/product/10.2.0/db_1/dbs/lkFDS
PMON started with pid=2, OS id=3437
MMAN started with pid=4, OS id=3441
PSP0 started with pid=3, OS id=3439
DBW0 started with pid=5, OS id=3443
LGWR started with pid=6, OS id=3445
CKPT started with pid=7, OS id=3447
SMON started with pid=8, OS id=3449
RECO started with pid=9, OS id=3451
CJQ0 started with pid=10, OS id=3453
MMON started with pid=11, OS id=3455
Tue Feb 16 11:08:17 2010
MMNL started with pid=12, OS id=3457
Tue Feb 16 11:08:18 2010
Tue Feb 16 11:08:22 2010
Setting recovery target incarnation to 2
Successful mount of redo thread 1, with mount id 1844152034
Database mounted in Exclusive Mode
Completed: ALTER DATABASE MOUNT
ALTER DATABASE OPEN
Use lsof to view the processes holding the lock:
# lsof |grep lkFDS
oracle 4476 oracle 17uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4478 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4480 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4482 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4484 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4486 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4488 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4490 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4492 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4494 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4496 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4513 oracle 15u REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4531 oracle 15u REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4534 oracle 15u REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4812 oracle 15u REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
Use fuser to view the processes holding the lock:
# fuser -u /u01/app/oracle/product/10.2.0/db_1/dbs/lkFDS
/u01/app/oracle/product/10.2.0/db_1/dbs/lkFDS: 4476(oracle) 4478(oracle) 4480(oracle) 4482(oracle) 4484(oracle) 4486(oracle) 4488(oracle) 4490(oracle) 4492(oracle) 4494(oracle) 4496(oracle) 4513(oracle) 4531(oracle) 4534(oracle) 4812(oracle)
[root@CHN-DG-3-5CE ~]#
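In the lsof listing above, the FD column differs between rows: most show 15uR or 17uR, while the last four show only 15u. As I read the lsof manual, the character after the access mode is a lock flag, and R denotes a read lock on the whole file, so only the rows with a trailing R actually hold a lock. The Python sketch below separates the two groups. It is an illustration only: the helper names are hypothetical, and it assumes the column order shown above (COMMAND, PID, USER, FD, ...).

```python
import re

def parse_lsof_line(line):
    """Return pid, fd number, access mode and lock flag for one lsof row, or None."""
    fields = line.split()
    if len(fields) < 4 or not fields[1].isdigit():
        return None
    # FD column: descriptor number, optional mode (r/w/u/-), optional lock character.
    m = re.fullmatch(r"(\d+)([rwu-]?)(.?)", fields[3])
    if m is None:
        return None
    return {"pid": int(fields[1]), "fd": int(m.group(1)),
            "mode": m.group(2), "lock": m.group(3)}

def locking_pids(lsof_output):
    """Return the PIDs whose FD column carries a lock flag."""
    rows = (parse_lsof_line(line) for line in lsof_output.splitlines())
    return [row["pid"] for row in rows if row and row["lock"]]

# Sample rows copied from the lsof listing above.
sample = (
    "oracle 4476 oracle 17uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS\n"
    "oracle 4513 oracle 15u REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS\n"
)
print(locking_pids(sample))
```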
What does fuser do, and how exactly is it used?
fuser Command
Purpose
Identifies processes using a file or file structure.
Syntax
fuser [ -c | -d | -f ] [ -k ] [ -u ] [ -x ] [ -V ]File ...
Description
The fuser command lists the process numbers of local processes that use the
local or remote files specified by the File parameter. For block special
devices, the command lists the processes that use any file on that device.
c Uses the file as the current directory.
e Uses the file as a program's executable object.
r Uses the file as the root directory.
s Uses the file as a shared library (or other loadable object).
The process numbers are written to standard output in a line with spaces between
process numbers. A new line character is written to standard error after the
last output for each file operand. All other output is written to standard
error.
The fuser command will not detect processes that have mmap regions where that
associated file descriptor has since been closed.
Flags
-c Reports on any open files in the file system containing File.
-d Implies the use of the -c and -x flags. Reports on any open files which have
been unlinked from the file system (deleted from the parent directory). When
of the deleted file.
-f Reports on open instances of File only.
-k Sends the SIGKILL signal to each local process. Only the root user can kill a
process of another user.
-u Provides the login name for local processes in parentheses after the process
number.
-V Provides verbose output.
-x Used in conjunction with -c or -f, reports on executable and loadable objects
in addition to the standard fuser output.
Examples
1. To list the process numbers of local processes using the /etc/passwd file,
enter:
fuser /etc/passwd
2. To list the process numbers and user login names of processes using the
fuser -u /etc/filesystems
3. To terminate all of the processes using a given file system, enter:
fuser -k -x -u /dev/hd1 -OR-
fuser -kxuc /home
Either command lists the process number and user name, and then terminates
each process that is using the /dev/hd1 (/home) file system. Only the root
user can terminate processes that belong to another user. You might want to
use this command