我們在IO問題發生時間段:
1,可以查dba_hist_system_event 來看該IO相關等待時間的平均響應時間的變化趨勢,
2,直接查dba_hist_event_histogram或者sys.WRH$_EVENT_HISTOGRAM , 來看該IO相關等待事件的等待時間直方圖分布。
3, 我們也可以從整體資料庫的IO變化量上(dba_hist_sysstat), 看出問題時段的總體IO量, 比正常時候的IO量到底是否有增加。
4, 我們也可以從某個IO function 即某個程序上面(dba_hist_iostat_function),看問題時間段某個程序的IO request次數、IO 的量。
比如說LGWR程序的寫request次數、 寫的量。
說明:
1, 如上視圖均可關聯dba_hist_snapshot
2, 所有dba_hist 相關視圖都是由對應的v$ 動态性能視圖彙聚而來, 當時的資料可以查對應的v$視圖。
福利: 後附sql。
例2:
直方圖分布可能更加精确, 因為檢視平均響應時間可能會因為粒度太粗、時間跨度長而被平均了。
正常情況下,應該<8ms的占據了90%甚至更多, > 32ms 的應該占比接近為0。
可以看到, 昨天smscp異常時候,在16 – 32 ms 之間的已經占比很多了。
END_INTERVAL_TIME SNAP_ID TIME_HISTG TOTAL_WAIT PER_OF_WAITS
1 22-10月-14 03.00.49.801 下午 20536 <1ms 89464 0.71
2 22-10月-14 03.00.49.801 下午 20536 <2ms 89464 11.32
3 22-10月-14 03.00.49.801 下午 20536 <4ms 89464 26.39
4 22-10月-14 03.00.49.801 下午 20536 <8ms 89464 18.28
5 22-10月-14 03.00.49.801 下午 20536 <16ms 89464 13.16
6 22-10月-14 03.00.49.801 下午 20536 <32ms 89464 8.21
7 22-10月-14 03.00.49.801 下午 20536 <64ms 89464 7.56
8 22-10月-14 03.00.49.801 下午 20536 <128ms 89464 6.06
9 22-10月-14 03.00.49.801 下午 20536 <256ms 89464 4.72
10 22-10月-14 03.00.49.801 下午 20536 <512ms 89464 2.72
11 22-10月-14 03.00.49.801 下午 20536 <1s 89464 0.83
12 22-10月-14 03.00.49.801 下午 20536 <2s 89464 0.03
13 22-10月-14 03.00.49.801 下午 20536 <4s 89464 0
14 22-10月-14 03.00.49.801 下午 20536 >=4s 89464 0
正常情況下是這樣的,8ms以上的占比很少:
END_INTERVAL_TIME SNAP_ID TIME_HISTG TOTAL_WAIT PER_OF_WAITS
1 22-10月-14 07.00.49.701 下午 20540 <1ms 1091640 38.28
2 22-10月-14 07.00.49.701 下午 20540 <2ms 1091640 51.09
3 22-10月-14 07.00.49.701 下午 20540 <4ms 1091640 9.03
4 22-10月-14 07.00.49.701 下午 20540 <8ms 1091640 1.37
5 22-10月-14 07.00.49.701 下午 20540 <16ms 1091640 0.17
6 22-10月-14 07.00.49.701 下午 20540 <32ms 1091640 0.04
7 22-10月-14 07.00.49.701 下午 20540 <64ms 1091640 0.01
8 22-10月-14 07.00.49.701 下午 20540 <128ms 1091640 0
9 22-10月-14 07.00.49.701 下午 20540 <256ms 1091640 0
10 22-10月-14 07.00.49.701 下午 20540 <512ms 1091640 0
11 22-10月-14 07.00.49.701 下午 20540 <1s 1091640 0
12 22-10月-14 07.00.49.701 下午 20540 <2s 1091640 0
13 22-10月-14 07.00.49.701 下午 20540 <4s 1091640 0
14 22-10月-14 07.00.49.701 下午 20540 >=4s 1091640 0
例3:
可以看到昨天下午的整庫的physical read total IO requests 并沒有明顯增加, 但是physical read total bytes 增加到了3000GB 一個小時。
其它時候也有這麼高的,但是都是在晚上, 隻有昨天是在下午15點這個生産時間的snapshot,開發有大量direct path read 是以對應用産生了較大影響。
例4:可以看到問題發生時段,10.22 14點-15點, LGWR的large_write_megabytes 和requests有明顯增加, 而 平時有如此高峰的是有晚上和淩晨。
Smll read 的bytes和request 都沒有增加。
附sql:
--- 查等待事件的等待時間直方圖分布:
select b.end_interval_time,a.*
from
(select snap_id,
decode(wait_time_milli,
1,
'<1ms',
2,
'<2ms',
4,
'<4ms',
8,
'<8ms',
16,
'<16ms',
32,
'<32ms',
64,
'<64ms',
128,
'<128ms',
256,
'<256ms',
512,
'<512ms',
1024,
'<1s',
2048,
'<2s',
4096,
'<4s',
8192,
'>=4s',
wait_time_milli) time_histg,
total_wait,
round(delta_wait_count / total_wait * 100, 2) as per_of_waits
from (select snap_id,
wait_time_milli,
delta_wait_count,
sum(delta_wait_count) over(partition by snap_id order by snap_id) as total_wait
from (select xx.snap_id,
xx.wait_time_milli,
xx.wait_count - yy.wait_count as delta_wait_count
from dba_hist_event_histogram xx,
dba_hist_event_histogram yy
where xx.event_name = 'log file parallel write'
and xx.event_id = yy.event_id
and xx.snap_id = yy.snap_id + 1
and xx.instance_number = yy.instance_number
and xx.wait_time_milli = yy.wait_time_milli
and xx.snap_id in (20540)))
order by snap_id, wait_time_milli) a , dba_hist_snapshot b
where a.snap_id=b.snap_id;
--- 查等待事件的平均等待時間的變化趨勢:
with t as (
select s.snap_id, s.instance_number, s.end_interval_time ,
total_waits - lag(total_waits, 1) over (partition by s.instance_number order by s.snap_id) waits,
(time_waited_micro - lag(time_waited_micro, 1) over (partition by s.instance_number order by s.snap_id))/1000 twt
from dba_hist_system_event ev, dba_hist_snapshot s
where ev.instance_number = s.instance_number and ev.snap_id = s.snap_id
and event_name = 'log file parallel write'
and s.end_interval_time BETWEEN TO_DATE('20140401 10:00:00', 'yyyymmdd hh24:mi:ss') AND TO_DATE('20140414 11:00:00', 'yyyymmdd hh24:mi:ss') )
select to_char(end_interval_time,'YYYYMMDD HH24:MI'), instance_number, sum(waits), sum(twt), round(sum(twt)/sum(waits), 2) wt
from t
where (to_char(end_interval_time, 'hh24') between 9 and 12 or to_char(end_interval_time, 'hh24') between 14 and 17)
group by to_char(end_interval_time,'YYYYMMDD HH24:MI'), instance_number
order by 1, instance_number;
--- 查整庫IO相關名額:
--get_stat_10g.sql
with s as (
select sn.instance_number, sn.snap_id, sn.end_interval_time stime, (s.value - lag(s.value, 1) over (partition by sn.instance_number order by sn.snap_id)) value
--(sn.end_interval_time - lag(sn.end_interval_time, 1) over (order by sn.snap_id))2460*60 elap_time,
--round((s.value - lag(s.value, 1) over (order by sn.snap_id))/((sn.end_interval_time - lag(sn.end_interval_time, 1) over (order by sn.snap_id))2460*60)) value_s
from dba_hist_snapshot sn, dba_hist_sysstat s
where s.snap_id = sn.snap_id and s.stat_name = '&1' and s.instance_number = sn.instance_number
)
select instance_number , snap_id, stime, value ,
rank() over (partition by instance_number order by value desc) ranking, min(value) over (partition by instance_number) min_phy,
max(value) over (partition by instance_number) max_phy, round(avg(value) over (partition by instance_number)) avg_phy
from s
where (to_char(stime, 'hh24') between 9 and 12 or to_char(stime, 'hh24') between 14 and 17)
and to_char(stime, 'D') not in ('1', '7')
and stime between to_date(&2, 'yyyymmdd') and to_date(&3, 'yyyymmdd')
order by instance_number, snap_id
/
---eg: @get_stat_10g.sql 'parse count (hard)' 20100716 20100726
---physical read total IO requests
---physical read total bytes
---physical write total IO requests
---physical write total bytes
---以上四個名額即IOPS(requests)、IO吞吐量(bytes)
--- 查整庫IO相關名額2:
select e.snap_id,
to_char(e.startup_time, 'yyyy-mm-dd hh24:mi:ss') instance_startup_time,
to_char(e.end_interval_time, 'yyyy-mm-dd hh24:mi:ss') snapshot_end_time,
b.value - a.value IO_requests,
round((d.value - c.value)/1024/1024/1024) IO_GB
from dba_hist_sysstat a,
dba_hist_sysstat b,
dba_hist_sysstat c,
dba_hist_sysstat d,
dba_hist_snapshot e
where a.stat_name = 'physical read total IO requests'
and b.stat_name = 'physical read total IO requests'
and c.stat_name = 'physical read total bytes'
and d.stat_name = 'physical read total bytes'
and a.snap_id = e.snap_id - 1
and b.snap_id = e.snap_id
and c.snap_id = e.snap_id - 1
and d.snap_id = e.snap_id
and e.end_interval_time BETWEEN
TO_DATE('20141021 09:00:00', 'yyyymmdd hh24:mi:ss') AND
TO_DATE('20141023 18:00:00', 'yyyymmdd hh24:mi:ss')
order by e.begin_interval_time;
--- 查某個IO 程序的讀寫名額, 例如LGWR:
SQL> select * from dba_hist_iostat_function_name;
DBID FUNCTION_ID FUNCTION_NAME
---------- ----------- ------------------------------
1154030563 0 RMAN
1154030563 1 DBWR
1154030563 2 LGWR
1154030563 3 ARCH
1154030563 4 XDB
1154030563 5 Streams AQ
1154030563 6 Data Pump
1154030563 7 Recovery
1154030563 8 Buffer Cache Reads
1154030563 9 Direct Reads
1154030563 10 Direct Writes
1154030563 11 Smart Scan
1154030563 12 Archive Manager
1154030563 13 Others
14 rows selected
SQL>
--- 查某個IO程序例如LGWR的IO requset和IO bytes dba_hist_iostat_function:
to_char(e.end_interval_time, 'yyyy-mm-dd hh24:mi:ss') snapshot_end_time,
b.SMALL_READ_MEGABYTES - a.SMALL_READ_MEGABYTES SMALL_READ_MEGABYTES ,
b.SMALL_WRITE_MEGABYTES - a.SMALL_WRITE_MEGABYTES SMALL_WRITE_MEGABYTES ,
b.LARGE_READ_MEGABYTES - a.LARGE_READ_MEGABYTES LARGE_READ_MEGABYTES ,
b.LARGE_WRITE_MEGABYTES - a.LARGE_WRITE_MEGABYTES LARGE_WRITE_MEGABYTES ,
b.SMALL_READ_REQS - a.SMALL_READ_REQS SMALL_READ_REQS ,
b.SMALL_WRITE_REQS - a.SMALL_WRITE_REQS SMALL_WRITE_REQS ,
b.LARGE_READ_REQS - a.LARGE_READ_REQS LARGE_READ_REQS ,
b.LARGE_WRITE_REQS - a.LARGE_WRITE_REQS LARGE_WRITE_REQS ,
b.NUMBER_OF_WAITS - a.NUMBER_OF_WAITS NUMBER_OF_WAITS ,
b.WAIT_TIME - a.WAIT_TIME WAIT_TIME
from dba_hist_iostat_function a,
dba_hist_iostat_function b,
dba_hist_snapshot e
where a.FUNCTION_NAME = 'LGWR'
and b.FUNCTION_NAME = 'LGWR'
TO_DATE('20141021 09:00:00', 'yyyymmdd hh24:mi:ss') AND
TO_DATE('20141023 18:00:00', 'yyyymmdd hh24:mi:ss')