, 问题背景
探活脚本连续 8 此探测, 判断主库异常, 触发切换 (判断备机是否有延迟, kill 原主, VIP 飘到备机, 设置新主可写) 切换后, 业务还是异常, SQL 查询没返回, DB 连接数耗完, 为了恢复业务, 重启新主后业务恢复
两个问题
主库为什么会切换
切换后新主库为什么还是不可用
, 问题排查
1, 根据系统运行时间, 系统日志以及服务器带外日志可排除服务器和数据句 OOM 2, 从每分钟的县城快照中发现故障时大量线程处于 sending data 和 statistics 状态, 但前一分钟快照中未看到任何阻塞与锁等待 3, 被阻塞的 sql 都是基于主键或者业务索引访问, 理论上没问题, 提取 sql 在从库执行一遍, 很快, 且当时没有产生慢日志, 固排除 sql 执行效率慢导致阻塞 4,statistics 是为 sql 生成执行计划的, 会触发表的统计操作, 而统计操作需要对表中 page 进行采样, 会触发 io, 分析当时磁盘 iops, 吞吐量, cpu 负载等, 和前一天基本吻合, 排除系统负载导致性能下降 ----------------- 思路中断~~~
复盘, 开发反映业务切到新 db, 业务各个接口耗时变大, 询问是否新库服务器性能不如老库 会后上机器发现实例上存在 20 个处于 user sleep 状态的 sql, 大概模型都是 where id = '+(select'rbzd'where 6910=6910 and sleep(300)+)', 比较可疑, 因为开发不会在程序中调用 sleep 函数
重点分析此 sql 此 sql 不占用系统资源, 但是写法可以, 类似 sql 注入 经查, sleep 操作和 innodb_thread_concurrency 参数互斥, 这样每秒只能处理 24 个 sql 换言之当有 24 个线程进入引擎并处于 sleep 状态的话, 其他线程是不能进入 innodb 引擎层, 这里的 24 是和线上 MySQL 参数 innodb_thread_concurrency 被设置为 24 有关
, 模拟故障
2.1 准备数据
- CREATE TABLE `test` (
- `id` int(11) DEFAULT NULL
- ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
- insert into test select 1;
2.2 开 24 个线程访问此表
提前设置 innodb_thread_concurrency 设置为 24 模拟线上情况
[root@VM_42_63_centos ~]# for i in `seq 1 24`; do nohup mysql -S /tmp/mysql.sock3306 -e "select id from test.test where id=''+(select 'rbzd' where 6910=6910 and sleep(300))+''" & done
2.3 观察一波
执行上一步脚本后, 再开一个 session 访问 test 表会发现线程 hang 住, 没返回, 状态为 sending data, 如下:
- (root@localhost) [test]> show processlist;
- +----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
- | Id | User | Host | db | Command | Time | State | Info |
- +----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
- | 2 | root | localhost:38204 | NULL | Sleep | 0 | | NULL |
- | 3 | root | localhost | test | Query | 0 | starting | show processlist |
- | 4 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 5 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 6 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 7 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 8 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 9 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 10 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 11 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 12 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 13 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 14 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 15 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 16 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 17 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 18 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 19 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 20 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 21 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 22 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 23 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 24 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 25 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 26 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 27 | root | localhost | NULL | Query | 23 | User sleep | select id from test.test where id=''+(select'rbzd'where 6910=6910 and sleep(300))+'' |
- | 28 | root | localhost | NULL | Query | 11 | Sending data | select id from test.test |
- +----+------+-----------------+------+---------+------+--------------+----------------------------------------------------------------------------------------+
- 27 rows in set (0.00 sec)
, 故障回顾
1,sql 注入, user sleep 状态的 sql 累计到 24 条之后, 其他 sql 就不能进入 innodb 进行操作, 包括高可用探测程序, 由于线程快照中过滤了 sleep, 导致没抓到注入的 sql, 加大了后续排查难度
2, 主库的存活探测程序检查失败 (探测方式为 update 一个 innodb 表), 连续 8 次失败后, ha 认为实例异常, 则 kill 主, 触发切换流程
3, 切换到主库后, 注入还在继续, 所以同样的故障在新主上重演
4, 重启主库后, 仅有 20 个注入进入 innodb 且一直为 user sleep, 由于没达到 24 个触发阈值, 所以业务表现正常, 只是性能不及老主库, 原因就是已经有 20 个线程在 innodb 层一直没退出, kill 掉这些线程, 业务恢复正常
, 问题解决
由于线程快照过滤了 sleep 关键字, 导致排查过程较长, 后续要对 user sleep 状态线程保留在线程快照中
开发优化程序, 防止 sql 注入
来源: http://www.bubuko.com/infodetail-2745843.html