天天看點

資料庫巡檢項

ORACLE

db_cache_size 無steal情況;

shared_pool_size 無steal情況;

free memory大于1024M;

pga_aggregate_target    無overallco情況;    
log_buffer    大于128M    
session_cached_cursors    使用率低于60%    
OPEN CURSOR    使用率低于60%    
db_files     使用率低于60%    
DATA  tablespace Usage    使用率低于80%    
UNDOTBS  Usage    active & unexpired使用率低于80%    
Temporary tablespace Usage    使用率低于60%    
redo log size & group        
FRA Usage    使用率低于80%    
processes    曆史峰值低于80%    
block_corruption    無壞塊    
DATA 卷空間檢查    适用于filesystem、核心DATA卷空間使用率低于80%    
TOP SQL                    

MONGODB

CG配置

檢查cg配置:

/opt/cgtools/cginfo -t perf -s cpu

/opt/cgtools/cginfo -t perf -s mem

路由、配置、資料節點CPU使用率        "檢查zabbix監控,有超過70%的cg配置,則需升配           
http://hidba.source.com.cn/hidba/minitor/mongodb

調高cpu使用率同時,需關聯調高memory配置,同時評估應用連接配接數是否需同步調整。

資料節點記憶體使用率

檢查zabbix監控,有超過70%的cg配置,則需升配

連接配接數配置        
非分片架構:           

主、從、仲裁連接配接數需設定相同

連接配接數有超70%,則需評估升配,調高連接配接數的同時,需關聯觀察節點cpu、memory同步調高。

分片架構:

Mongos端連接配接數調高,需關聯調整配置層、shard層節點的連接配接數。關聯關系:

Mongod連接配接數=單個mongos配置連接配接數 * mongos數量+200

Mongod連接配接數調整,需關聯調整memory、cpu的配置;

Mongod memory配置調整,需關聯調整db cacheSizeGB參數

replication 狀态        
1、oplogsize檢查           

rs.printReplicationInfo()

若目前oplogsize小于5—7天變化量,則需擴容oplogsize

2、lag檢查

rs.printSlaveReplicationInfo()

如果oplog lag超過2小時,則需檢查從庫狀态,是否有慢操作,是否hang。可重新開機從庫。

cluster 整體狀态        
非分片架構:           

rs.status()

有primary存在,且僅有1個primary存在

replicaSet中所有節點statemsg為PRIMARY、SECONDARY、ARBITER三态,若有其他狀态異常;health=1,否則異常

Sh.status()

配置層、分片層副本狀态正常:rs.status()

分片層未有丢失,分片均衡滿足目前設定,未有jumbo chunk存在。

rebalance狀态        
在mongos端執行:           

sh.getBalancerState() 确認均衡狀态為目前您需要控制的狀态。

停止均衡方法:

sh.stopBalancer()

檢查停止均衡結果:

sh.getBalancerState()

use config

while( sh.isBalancerRunning() ) {

print(""waiting..."");
      sleep(1000);           

}

啟動均衡方法:

sh.startBalancer()

sh.getBalancerState() 确認結果為true

MYSQL CG配置 "檢查cg配置:

/opt/cgtools/cginfo -t perf -s mem (記憶體使用率盡量低于80%)

CPU使用率        ZABBIX監控、GRAFANA監控
檢查keepalived環境        檢視/var/log/messages
檢視錯誤日志        檢視error.log日志是否有錯誤資訊
連接配接數配置        ZABBIX監控、GRAFANA監控
檢視慢sql        檢查slow.log裡面是否有活動的慢sql,如有提前進行優化
檢查連接配接數情況    SLEEP狀态 + QUERY狀态 會話數量< 使用者最大連接配接數*60%    
 MYSQL > select  db,user,command,count(*) from information_schema.processlist group by db,user,command order by 4 desc;
           

MYSQL > show variables like '%connections%';

InnoDB Buffer命中率情況    不低于90%    "mysql> show status like'innodb_buffer_pool_read%';
           

命中率 = (1 - innodb_buffer_pool_reads / innodb_buffer_pool_read_requests) * 100%"

檢視同步情況        show slave status\G           

REDIS CG配置 "檢查cg配置:

CPU使用率        ZABBIX監控、GRAFANA監控
記憶體使用率        ZABBIX監控、GRAFANA監控
連接配接數配置        ZABBIX監控、GRAFANA監控
BIG KYES提前檢查        [PRD-DMZREC:cnsz081257:6455:S ~]$r --bigkeys
資料分布情況檢查    cluster各節點資料分布均衡    
檢視GRAFANA監控各節點記憶體使用量;           

$REDIS_HOME/bin/redis-trib.rb info --password $pw IP:PORT ---檢視keys分布情況

Slowlog檢查    開發确認業務增長量    127.0.0.1:6472> slowlog get 10           

PG 執行個體CG配置設定CPU容量及使用率

$ cg

==================== Cgroup Performance: cpu ====================

DB_TYPE INSTANCE_NAME CPU_USER CPU_SYS CPU_USED CPU_ALLO ALLO_RATE CPU_GLOB GLOB_RATE

------- ------------- -------- ------- -------- -------- --------- -------- ---------

postgres eits 4.5 0.9 5.6 20 28.0% 20 28.0%

==================== Cgroup Performance: memory ====================

DB_TYPE INSTANCE_NAME MEM_OOM MEM_FILE_GB MEM_MAP_GB MEM_USED_GB MEM_ALLO_GB ALLO_RATE MEM_GLOB_GB GLOB_RATE

------- ------------- ------- ----------- ---------- ----------- ----------- --------- ----------- ---------

postgres eits 0 0.0 0.0 0.0 0.0 0.0% 251 0.0%

==================== Cgroup Performance: proc ====================

DB_TYPE INSTANCE_NAME DATA_VOLUME VOLUME_FUSER PROC_COUNT

------- ------------- ----------- ------------ ----------

postgres eits /source/pg5521/data postgres:143 142"

最大連接配接數        select setting from pg_settings  where name ~'max_connections';
檢視shared_buffers        
# show shared_buffers ;           

shared_buffers

40GB

(1 row)

Time: 1.495 ms"

緩存命中率(90%以上為OK)        
\c 切換到業務database           

select t.snap_time,round(CAST ( (100*(t.blks_hit-t.pre_hit)/((t.blks_read-t.pre_read)+(t.blks_hit-t.pre_hit+1))) AS numeric) ,2)::varchar as hitrate

from

(select snap_time,blks_hit,blks_read,lag(b.blks_hit) over (partition by 1 order by s.snap_time) pre_hit,

lag(b.blks_read) over (partition by 1 order by s.snap_time) pre_read

from pgawr_database_v b,pgawr_snap s where b.snapid = s.snapid and datname=current_database() order by s.snap_time desc) t limit 20;"

data卷容量        
$ df -TH|grep $PORT
                 vxfs    22G  373M   21G   2% /source/pg5521/app
                 vxfs    37T   23T   14T  64% /source/pg5521/data
備份卷容量        
$ df -TH|grep pgbackup
                 nfs     23T   17T  5.5T  76% /source/pgbackup
                 
事物ID檢查        
查詢database事物年齡及百分比           

Select datname,age(datfrozenxid),2^31 - age(datfrozenxid) left_age,round(age(datfrozenxid)/2^31::numeric,4)*100||'%' age_used_pct from pg_database;

查詢表事物年齡及百分比

SELECT relname, pg_size_pretty(pg_table_size(oid)) as table_size,age(relfrozenxid),2^31-age(relfrozenxid) as left_age,round(age(relfrozenxid)/2^31::numeric,4)*100||'%' age_used_pct FROM pg_class where Relkind in ('r','t') order by 2 desc;"

超過10小時長事物        select 'pg.long_transaction:'||count(*) from pg_stat_activity where state <> 'idle' and now() - xact_start > '36000 sec'::interval;
是否有阻塞        "with recursive t_wait as (select a.locktype,a.database,a.relation,a.page,a.tuple,a.classid,a.objid,a.objsubid,a.pid,a.virtualtransaction,a.virtualxid ,a.transactionid from pg_locks a where not a.granted),           

t_run as (select a.mode,a.locktype,a.database,a.relation,a.page,a.tuple,a.classid,a.objid,a.objsubid,a.pid,a.virtualtransaction,a.virtualxid,a,transactionid,b.query,b.xact_start,b.query_start

,b.usename,b.datname from pg_locks a,pg_stat_activity b

where a.pid=b.pid and a.granted),

w as(select r.pid r_pid, w.pid w_pid

from t_wait w,t_run r where

r.locktype is not distinct from w.locktype and

r.database is not distinct from w.database and

r.relation is not distinct from w.relation and

r.page is not distinct from w.page and

r.tuple is not distinct from w.tuple and

r.classid is not distinct from w.classid and

r.objid is not distinct from w.objid and

r.objsubid is not distinct from w.objsubid and

r.transactionid is not distinct from w.transactionid and

r.virtualxid is not distinct from w.virtualxid)

,c(waiter, holder, root_holder, path, deep) as(

select w_pid, r_pid, r_pid, w_pid||'->'||r_pid, 1 from w

union

select w_pid, r_pid, c.holder, w_pid||'->'||c.path, c.deep+1 from w t, c where t.r_pid = c.waiter

)

select t1.waiter, t1.holder, t1.root_holder, path, t1.deep from c t1 where

not exists(select 1 from c t2 where t2.path ~ t1.path and t1.path<>t2.path )

Order by root_holder;

表鎖        select a.locktype,a.pid,a.relation,a.mode,a.granted,b.relname from pg_locks a,pg_class b where a.relation=b.oid and a.mode='AccessExclusiveLock';
備份        $ pg_rman show           

或者$pg_probackup show

檢視主從同步狀态                   

1、 pg lag 需要在主庫執行,salve_addr為主機實體ip,如果沒有對應的行說明主從延遲過大或者無從庫

2、 檢視從庫延遲日志大小

select client_addr,pg_xlog_location_diff(pg_current_xlog_location(),replay_location)/1024/1024 from pg_stat_replication;

AUTOVACUUM情況    1、是否存在有長時間不被AUTOVACUUM的對象;           

2、是否存在DEAD TUPPLE占比接近20%的情況;

3、是否存在DEAD TUPPLE占比高的對象;

頻繁更新的表情況    大于0就需要檢查原因    "select tablename,table_updates,table_deletes,table_inserts           

from (

SELECT distinct a.table_name as tablename, 
    ((max(a.n_tup_upd) over (partition by a.table_name))-(min(a.n_tup_upd) over (partition by a.table_name))) as table_updates ,
    ((max(a.n_tup_del) over (partition by a.table_name))-(min(a.n_tup_del) over (partition by a.table_name))) as table_deletes,
    ((max(a.n_tup_ins) over (partition by a.table_name))-(min(a.n_tup_ins) over (partition by a.table_name))) as table_inserts
    FROM pgawr_tables_v a 
    join  pgawr_snap c  on a.snapid=c.snapid 
    WHERE  c.snap_time>=current_date - interval '1 D'
    AND c.snap_time<=current_date
   and a.table_name !~'pg_toast|dbmgr|information_schema|pg_catalog|pgagent' 
   group by a.table_name,a.n_tup_upd,a.n_tup_del,a.n_tup_ins
   ) tb where round(table_updates::numeric/(table_inserts+table_deletes)::numeric,2)>1 and (table_inserts+table_deletes)>0 and table_updates>100*1000 order by table_updates desc limit 20;
           
.達到vacuum觸發條件,但是7天内未完成vacuum    有記錄傳回就需要檢查原因    "select current_database(),t1.schemaname,t1.relname,
        case when last_autovacuum is null and last_vacuum is null then current_date - interval '5 D' 
             when last_autovacuum is null and last_vacuum is not null then last_vacuum 
             when last_autovacuum is not null and last_vacuum is null then last_autovacuum 
             else (case when last_autovacuum>=last_vacuum then last_autovacuum else last_vacuum end) end as last_time,
                        t1.n_live_tup,n_dead_tup,pg_size_pretty(pg_table_size(t2.oid)) as table_size,age(t2.relfrozenxid)
        from pg_stat_all_tables t1
   left join pg_class t2
             on t1.relname=t2.relname
  inner join pg_namespace t4
          on t1.schemaname=t4.nspname and t2.relnamespace=t4.oid  
       where  t1.n_live_tup>0
         and ( age(t2.relfrozenxid)>1500000000
         or t1.n_dead_tup/ ( t1.n_live_tup + t1.n_dead_tup ) *100>=10)
         and case when last_autovacuum is null and last_vacuum is null then current_date - interval '5 D' 
             when last_autovacuum is null and last_vacuum is not null then last_vacuum 
             when last_autovacuum is not null and last_vacuum is null then last_autovacuum 
             else (case when last_autovacuum>=last_vacuum then last_autovacuum else last_vacuum end) end + interval '7 D'< current_timestamp 
                         order by age(t2.relfrozenxid) desc limit 20;

BIG TABLES    大于100G的表需要安排分區改造    select current_database(),relname,pg_table_size(oid) tab_size from pg_class where relkind in ('r','t') and pg_table_size(oid)/1024/1024/1024>=100;
pg_class對象數,pathman分區表數    "PG_CLASS大于10W;           

PATHMAN分區表數大于1W;

需要安排檢查; select current_database(),(select count(1) from pg_class ) as "pg_class",(select count(1) from pg_tables) as "tables",(select count(1) from pg_indexes) as "indexes",(select count(1) from pathman_partition_list) as "partitions";