天天看点

云贝教育 |「技术文章」关闭OGG进程失败提示有长事务解决方案

作者:云贝教育

作者:炎燚小寳

原文链接:http://www.tdpub.cn/Blog/detail/id/1292.html

一、事件概述

客户反馈OGG进程无法停止,提示有大事务在运行,无法停止,最终定位到异常会话的游标没有正常关闭导致

二、故障分析

(一)故障现象

数据库2节点OGG的EXT进程无法停止。

(二)故障分析

1. 查看OGG日志。

GGSCI (xx2) 8> stop extract EJ1_FP
Sending STOP request to EXTRACT EJ1_FP ...
STOP request pending. There are open, long-running transactions.
Before you stop Extract, make the archives containing data for those transactions available for when Extract restarts.
To force Extract to stop, use the SEND EXTRACT EJ1_FP, FORCESTOP command.
Oldest redo log file necessary to restart Extract is:
Redo Log Sequence Number 616, RBA 135673360.
2022-10-21 10:55:41 WARNING OGG-01742 Command sent to EXTRACT EJ1_FP returned with an invalid response.           

2. 查看当前的长事务。

GGSCI (xx2) 8> send extract EJ1_FP,showtrans duration 20 MIN
Sending showtrans request to EXTRACT EJ1_FP ...
Oldest redo log files necessary to restart Extract are:
Redo Thread 1, Redo Log Sequence Number 47308, SCN 4585.1856782301 (19694281834461), RBA 575467024
Redo Thread 2, Redo Log Sequence Number 70563, SCN 4585.1916741249 (19694341793409), RBA 310660624
------------------------------------------------------------
XID: 214.19.198565
Items: 0
Extract: EJ1_FP
Redo Thread: 1
Start Time: 2022-10-13:09:40:20
SCN: 4585.1856782301 (19694281834461)
Redo Seq: 47308
Redo RBA: 575467024
Status: Running
------------------------------------------------------------
XID: 234.31.119068
Items: 0
Extract: EJ1_FP
Redo Thread: 1
Start Time: 2022-10-13:10:45:36
SCN: 4585.1858630286 (19694283682446)
Redo Seq: 47378
Redo RBA: 817386000
Status: Running
------------------------------------------------------------
XID: 6.19.504269
Items: 0
Extract: EJ1_FP
Redo Thread: 1
Start Time: 2022-10-18:13:22:51
SCN: 4585.1922052463 (19694347104623)
Redo Seq: 47529
Redo RBA: 246746640
Status: Running
------------------------------------------------------------
XID: 54.10.981207
Items: 0
Extract: EJ1_FP
Redo Thread: 2
Start Time: 2022-10-18:08:19:13
SCN: 4585.1916741249 (19694341793409)
Redo Seq: 70563
Redo RBA: 310660624
Status: Running           

3. 通过UNDO事务定位到SQL源头

select inst_id,saddr,sid,serial#,paddr,username,sql_id,status from gv$session where taddr in (select addr from gv$transaction where xidusn in (54,6,234,214))
登录用户为sys,程序类型为sqlplus发起
SQLID:g72kdvcacxvtf
DELETE FROM PLAN_TABLE WHERE STATEMENT_ID=:1
SQLID:9babjv8yq8ru3
BEGIN DBMS_OUTPUT.GET_LINES(:LINES, :NUMLINES); END;           

4. 综合以上日志初步分析:ORACLE在执行DELETE FROM PLAN_TABLE WHERE STATEMENT_ID=:1一直未提交,导致长事务产生,进一步影响OGG进程关闭。

三、故障定位

根据以上日志报告分析,ORACLE内部递归调用的删除SQL一直未提交,产生长事务,为此次故障的根本原因。

四、故障解决措施及建议

(一)解决措施

如果我们单纯为了解决问题,KILL对应的长事务即可。但如果想知道的问题根源,该怎么办?

五、模拟重现长事务等待

SQL中涉及的目标表是PLAN_TABLE,跟执行计划有关,同时源头用户是SYS且程序为sqlplus,是不是有会话通过操作系统认证登录数据库查看执行计划了?而DBMS_OUTPUT.GET_LINES函数只是输出执行计划附带的而已。

接下来进行模拟测试

---session1---
sqlplus / as sysdba登录数据库,生成一条SQL的执行计划,不退出会话。此操作用10046记录。
-bash-4.1$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Sat Oct 15 06:15:39 2022
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> oradebug setmypid
alter session set events '10046 trace name context forever ,level 12';
Statement processed.
SQL>
Session altered.
SQL> set autot trace
SQL> select 1 from t1;
Execution Plan
----------------------------------------------------------
Plan hash value: 2990879143
-----------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
-----------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 1 (0)| 00:00:01 |
| 1 | INDEX FULL SCAN | IDX_IDNAME | 1 | 1 (0)| 00:00:01 |
-----------------------------------------------------------------------
Note
-----
- dynamic sampling used for this statement (level=2)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
1 consistent gets
0 physical reads
0 redo size
519 bytes sent via SQL*Net to client
524 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL> alter session set events '10046 trace name context off';
Session altered.
SQL> oradebug close_trace
Statement processed.
SQL> oradebug tracefile_name
/u01/app/oracle/diag/rdbms/d11gr4/d11gr4/trace/d11gr4_ora_14810.trc
---session2---
查看是否有INACTIVE的事务
SQL> SELECT ktuxesiz FROM x$ktuxe WHERE ktuxesta != 'INACTIVE';
KTUXESIZ
----------
2
SQL> select addr,xidusn,xidslot,xidsqn,status from v$transaction ;
ADDR XIDUSN XIDSLOT XIDSQN STATUS
---------------- ---------- ---------- ---------- ----------------
00000000E6B17E38 9 24 1605 ACTIVE
SQL> select sid,username,sql_id,PREV_SQL_ID,sql_hash_value from v$session where taddr='00000000E6B17E38';
SID USERNAME SQL_ID PREV_SQL_ID SQL_HASH_VALUE
---------- ------------------------------ ------------- ------------- --------------
152 SYS g72kdvcacxvtf 0
SQL> select sql_fulltext from v$sql t where t.sql_id='g72kdvcacxvtf';
SQL_FULLTEXT
--------------------------------------------------------------------------------
DELETE FROM PLAN_TABLE WHERE STATEMENT_ID=:1
DELETE FROM PLAN_TABLE WHERE STATEMENT_ID=:1
DELETE FROM PLAN_TABLE WHERE STATEMENT_ID=:1
--session 3--
查看10046日志,果然找到上述SQL
PARSING IN CURSOR #18446604434621518272 len=45 dep=0 uid=0 oct=7 lid=0 tim=237421908196 hv=349105966 ad='e11266d0' sqlid='g72kdvcacxvtf'
DELETE FROM PLAN_TABLE WHERE STATEMENT_ID=:1
END OF STMT
PARSE #18446604434621518272:c=45,e=45,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=2637181423,tim=237421908195
BINDS #18446604434621518272:
Bind#0
oacdty=01 mxl=128(42) mxlc=00 mal=00 scl=00 pre=00
oacflg=03 fl2=1000000 frm=01 csi=873 siz=128 off=0
kxsbbbfp=ffff80ffbdc5d8b0 bln=128 avl=14 flg=05
value="PLUS4294967295"           

从以上测试结果可以验证,就是用户通过操作系统认证以sys登录数据库,查看执行计划,未关闭会话导致的长事务,如何避免该问题咧?查看完执行计划之后关闭会话即可,这样会话对应的游标会自动提交,有兴趣的读者可以验证此点。

六、处理结果

通过以上处理,长事务己消失,OGG进程可以正常关闭。

GGSCI (XX2) 9> send extract EJ1_FP,showtrans duration 20 MIN
Sending showtrans request to EXTRACT EJ1_FP ...
No transactions found
Oldest redo log files necessary to restart Extract are:
Redo Thread 1, Redo Log Sequence Number 47308, SCN 4585.1856782301 (19694281834461), RBA 575467024
Redo Thread 2, Redo Log Sequence Number 70563, SCN 4585.1916741249 (19694341793409), RBA 310660624.            

继续阅读