故障描述: 与客户沟通, 初步确认故障范围大概是在上午的 8:30-10:30 之间, 反应故障现象是 Tomcat 的连接数满导致应用无法连接, 数据库 alert 中无明显报错, 需要协助排查原因
1. 导入包含故障时刻的数据
2. 创建 m_ash 表, 明确故障时刻
3. 确定异常时刻的 top n event
4. 确定最终的 top holder
5. 总结
6.reference
1. 导入包含故障时刻的数据
为了便于后续分析, 我向客户索要了从昨天下午 13:00 到今天 18:00 的 awrdump, 导入到自己的实验环境进行分析
生产环境导出 awrdump:
@?/rdbms/admin/awrextr
测试环境导入 awrdump:
- SYS@jyzhao1>select * from dba_directories;
- SYS@jyzhao1>create directory jy as '/home/oracle/awrdump';
- SYS@jyzhao1>select * from dba_directories;
- SYS@jyzhao1>!mkdir -p /home/oracle/awrdump
- SYS@jyzhao1>@?/rdbms/admin/awrload
省略部分输出..
- ... Dropping AWR_STAGE user
- End of AWR Load
2. 创建 m_ash 表, 明确故障时刻
创建 m_ash 表:
- --create table
- create table m_ash20180322 as select * from dba_hist_active_sess_history where dbid=&dbid;
输入生产库对应的 dbid, 完成创建分析表
- select to_char(sample_time, 'yyyy-mm-dd hh24:mi'), count(1)
- FROM m_ash20180322
- group by to_char(sample_time, 'yyyy-mm-dd hh24:mi')
- order by 1;
根据生成的数据生成折线图如下:
可以从图中明确故障时刻, 即在 10:0012:3014:10 这三个时刻会话都明显上升(积压), 看来客户的反馈时间点并没有包含所有异常时刻
另外, 引用下 maclean 的诊断脚本, 可以看到核心意思差不多, 只是进一步将 instance_number 区分开细化:
-- 验证导出的 ASH 时间范围:
- select
- t.dbid, t.instance_number, min(sample_time), max(sample_time), count(*) session_count
- from m_ash20180322 t
- group by t.dbid, t.instance_number
- order by dbid, instance_number;
-- 确认问题发生的精确时间范围:
- select
- dbid, instance_number, sample_id, sample_time, count(*) session_count
- from m_ash20180322 t
- group by dbid, instance_number, sample_id, sample_time
- order by dbid, instance_number, sample_time;
3. 确定异常时刻的 top n event
确定每个采样点的 top n event, 下面也是参考 maclean 的脚本
比如我这里以 2018-03-22 09:59:00 - 2018-03-22 10:00:00 为例:
- select t.dbid,
- t.sample_id,
- t.sample_time,
- t.instance_number,
- t.event,
- t.session_state,
- t.c session_count
- from (select t.*,
- rank() over(partition by dbid, instance_number, sample_time order by c desc) r
- from (select /*+ parallel 8 */
- t.*,
- count(*) over(partition by dbid, instance_number, sample_time, event) c,
- row_number() over(partition by dbid, instance_number, sample_time, event order by 1) r1
- from dba_hist_active_sess_history t
- where sample_time>
- to_timestamp('2018-03-22 09:59:00',
- 'yyyy-mm-dd hh24:mi:ss')
- and sample_time <to_timestamp('2018-03-22 10:00:00',
- 'yyyy-mm-dd hh24:mi:ss')
- ) t
- where r1 = 1) t
- where r < 3
- order by dbid, instance_number, sample_time, r;
其他异常时刻, 输入对应的变量值:
- select t.dbid,
- t.sample_id,
- t.sample_time,
- t.instance_number,
- t.event,
- t.session_state,
- t.c session_count
- from (select t.*,
- rank() over(partition by dbid, instance_number, sample_time order by c desc) r
- from (select /*+ parallel 8 */
- t.*,
- count(*) over(partition by dbid, instance_number, sample_time, event) c,
- row_number() over(partition by dbid, instance_number, sample_time, event order by 1) r1
- from dba_hist_active_sess_history t
- where sample_time>
- to_timestamp('&begin_sample_time',
- 'yyyy-mm-dd hh24:mi:ss')
- and sample_time <to_timestamp('&end_sample_time',
- 'yyyy-mm-dd hh24:mi:ss')
- ) t
- where r1 = 1) t
- where r < 3
- order by dbid, instance_number, sample_time, r;
- 2018-03-22 12:29:00
- 2018-03-22 12:30:00
- 2018-03-22 14:09:00
- 2018-03-22 14:10:00
综上, 3 个连接数堆积的异常时刻 TOP event 都是 enq: TX - row lock contention
4. 确定最终的 top holder
使用 maclean 的脚本, 观察每个采样点的等待链:
- select
- level lv,
- connect_by_isleaf isleaf,
- connect_by_iscycle iscycle,
- t.dbid,
- t.sample_id,
- t.sample_time,
- t.instance_number,
- t.session_id,
- t.sql_id,
- t.session_type,
- t.event,
- t.session_state,
- t.blocking_inst_id,
- t.blocking_session,
- t.blocking_session_status
- from m_ash20180322 t
- where sample_time>
- to_timestamp('2018-03-22 09:59:00',
- 'yyyy-mm-dd hh24:mi:ss')
- and sample_time <to_timestamp('2018-03-22 10:00:00',
- 'yyyy-mm-dd hh24:mi:ss')
- start with blocking_session is not null
- connect by nocycle
- prior dbid = dbid
- and prior sample_time = sample_time
- /*and ((prior sample_time) - sample_time between interval '-1'
- second and interval '1' second)*/
- and prior blocking_inst_id = instance_number
- and prior blocking_session = session_id
- and prior blocking_session_serial# = session_serial#
- order siblings by dbid, sample_time;
结果如下:
进一步筛选, 将 isleaf=1 的叶 (top holder) 找出来:
-- 基于上一步的原理来找出每个采样点的最终 top holder:
- select t.lv,
- t.iscycle,
- t.dbid,
- t.sample_id,
- t.sample_time,
- t.instance_number,
- t.session_id,
- t.sql_id,
- t.session_type,
- t.event,
- t.seq#,
- t.session_state,
- t.blocking_inst_id,
- t.blocking_session,
- t.blocking_session_status,
- t.c blocking_session_count
- from (select t.*,
- row_number() over(partition by dbid, instance_number, sample_time order by c desc) r
- from (select t.*,
- count(*) over(partition by dbid, instance_number, sample_time, session_id) c,
- row_number() over(partition by dbid, instance_number, sample_time, session_id order by 1) r1
- from (select /*+ parallel 8 */
- level lv,
- connect_by_isleaf isleaf,
- connect_by_iscycle iscycle,
- t.*
- from m_ash20180322 t
- where sample_time>
- to_timestamp('2018-03-22 09:59:00',
- 'yyyy-mm-dd hh24:mi:ss')
- and sample_time <to_timestamp('2018-03-22 10:00:00',
- 'yyyy-mm-dd hh24:mi:ss')
- start with blocking_session is not null
- connect by nocycle
- prior dbid = dbid
- and prior sample_time = sample_time
- /*and ((prior sample_time) - sample_time between interval '-1'
- second and interval '1' second)*/
- and prior blocking_inst_id = instance_number
- and prior blocking_session = session_id
- and prior
- blocking_session_serial# = session_serial#) t
- where t.isleaf = 1) t
- where r1 = 1) t
- where r < 3
- order by dbid, sample_time, r;
对其他异常时段进行分析:
- 2018-03-22 12:29:00
- 2018-03-22 12:30:00
- 2018-03-22 14:09:00
- 2018-03-22 14:10:00
- -- top holder: DIY sample_time
- select t.lv,
- t.iscycle,
- t.dbid,
- t.sample_id,
- t.sample_time,
- t.instance_number,
- t.session_id,
- t.sql_id,
- t.session_type,
- t.event,
- t.seq#,
- t.session_state,
- t.blocking_inst_id,
- t.blocking_session,
- t.blocking_session_status,
- t.c blocking_session_count
- from (select t.*,
- row_number() over(partition by dbid, instance_number, sample_time order by c desc) r
- from (select t.*,
- count(*) over(partition by dbid, instance_number, sample_time, session_id) c,
- row_number() over(partition by dbid, instance_number, sample_time, session_id order by 1) r1
- from (select /*+ parallel 8 */
- level lv,
- connect_by_isleaf isleaf,
- connect_by_iscycle iscycle,
- t.*
- from m_ash20180322 t
- where sample_time>
- to_timestamp('&begin_sample_time',
- 'yyyy-mm-dd hh24:mi:ss')
- and sample_time <to_timestamp('&end_sample_time',
- 'yyyy-mm-dd hh24:mi:ss')
- start with blocking_session is not null
- connect by nocycle
- prior dbid = dbid
- and prior sample_time = sample_time
- /*and ((prior sample_time) - sample_time between interval '-1'
- second and interval '1' second)*/
- and prior blocking_inst_id = instance_number
- and prior blocking_session = session_id
- and prior
- blocking_session_serial# = session_serial#) t
- where t.isleaf = 1) t
- where r1 = 1) t
- where r < 3
- order by dbid, sample_time, r;
发现所有的异常时刻最终阻塞都是实例 1 的 sid 为 3548 的 session, 不再赘述
5. 总结
从第四步可以看到, top holder 都是实例 1, 会话 3548.
比如可以看到实例 1 的 481 会话被实例 2 的 6377 会话阻塞, 然后实例 2 的 6377 会话又被实例 1 的 3548 会话阻塞
通过 sql_id 可以查询到 sql 文本:
select * from dba_hist_sqltext where sql_id = '&sql_id';
可以看到实例 1 的 3548 会话当前正在执行的 SQL 只是一个查询语句, 当前会话状态是 ON CPU, 所以推测该会话之前有 DML 的事物未提交导致阻塞
去查询该会话的 DML 操作时, 也有 update 和 insert 操作, 但是 update 操作已经无法找到对应 SQL 文本
- select t.event, t.*
- from m_ash20180322 t
- where instance_number = 1
- and session_id = 3548
- and t.sql_opname <> 'SELECT';
其实从 ash 也可以看到关于 3548 阻塞的信息, 甚至从 addm 的建议中也会有类似建议:
- Rationale
- The session with ID 3548 and serial number 8795 in instance number 1 was
- the blocking session responsible for 52% of this recommendation's
- benefit.
- Rationale
- The session with ID 6377 and serial number 30023 in instance number 2
- was the blocking session responsible for 47% of this recommendation's
- benefit.
只不过我们从底层查询, 可以看到 6377 实际也是被 3548 阻塞, 找到最终阻塞者
btw, 从导入的 awrdump 中, 除了可以取 awr 外, 同样可以支持取 awrsqrpi 和 addmrpti 以及 ashrpti, 非常方便:
- SYS@jyzhao1>@?/rdbms/admin/awrrpti
- SYS@jyzhao1>@?/rdbms/admin/awrsqrpi
- SYS@jyzhao1>@?/rdbms/admin/ashrpti
- SYS@jyzhao1>@?/rdbms/admin/addmrpti
- 6.reference
- http://feed.askmaclean.com/archives/dba_hist_active_sess_history.html
来源: https://www.cnblogs.com/jyzhao/p/8628184.html