当前位置：

首页
/
IT
/
linux
/
记一次生产数据库 "意外" 重启的经历

记一次生产数据库 "意外" 重启的经历

前言

在一个阳光明媚的下午, 电脑右下角传来一片片邮件提醒, 同时伴随着微信钉钉的震动, 打开一看, 应用各种出错, 天兔告警, 数据库服务器内存爆红, MySQL 数据库实例挂掉了.

排查

先交代一下数据库版本:

MySQL> status
--------------
MySQL  Ver 14.14 Distrib 5.7.22-22, for Linux (x86_64) using  6.2
Connection id:          59568
Current database:
Current user:           root@localhost
SSL:                    Not in use
Current pager:          stdout
Using outfile:          ''
Using delimiter:        ;
Server version:         5.7.22-22-log Percona Server (GPL), Release 22, Revision f62d93c
Protocol version:       10

崩溃故障排除绝不是一项有趣的任务, 特别是如果 MySQL 没有报告崩溃的原因. 例如, 当 MySQL 内存不足时.

数据库邮件告警提醒发来的消息:

Type: MySQL

Tags: 生产主库

Host: 172.16.1.66:3306
Level: critical
Item: connect
Value: down
Message: MySQL server down

顺手检查一下之前的服务器邮件监控告警记录, 上一个时间点, 内存占用率 99%, 这说明了数据库连接的幅度增长, 可能是压垮服务器的最后一根稻草.

其实导致 OOM 的直接原因并不复杂, 就是因为服务器内存不足, 内核需要回收内存, 回收内存就是 kill 掉服务器上使用内存最多的程序, 而 MySQL 服务可能就是使用内存最多, 所以就 OOM 了.

Type: os

Tags: 66 数据库

Host: 172.16.1.66:
Level: critical
Item: memory
Value: 99%
Message: too more memory usage

查看系统日志

我们带着这个疑问来排查一下日志:

# 查看日志
tail -500f  /var/log/messages
# 以下是 oom-killer
Nov 27 14:55:48 itstyledb1 kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
Nov 27 14:55:48 itstyledb1 kernel: mysqld cpuset=/ mems_allowed=0-1
Nov 27 14:55:48 itstyledb1 kernel: CPU: 2 PID: 895 Comm: mysqld Kdump: loaded Not tainted 3.10.0-862.3.2.el7.x86_64 #1
Nov 27 14:55:48 itstyledb1 kernel: Hardware name: Huawei RH1288 V3/BC11HGSC0, BIOS 3.22 05/16/2016
Nov 27 14:55:48 itstyledb1 kernel: Call Trace:

小伙伴们继续往下看:

0 pages HighMem/MovableOnly
Nov 27 14:55:48 itstyledb1 kernel: 291281 pages reserved
Nov 27 14:55:48 itstyledb1 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Nov 27 14:55:48 itstyledb1 kernel: [  468]     0   468    28271     4326      62       55             0 systemd-journal
Nov 27 14:55:48 itstyledb1 kernel: [  490]     0   490    11492        2      24      553         -1000 systemd-udevd
Nov 27 14:55:48 itstyledb1 kernel: [  787]     0   787    13877       18      27       96         -1000 auditd
Nov 27 14:55:48 itstyledb1 kernel: [  810]    81   810    14552       81      34       89          -900 dbus-daemon
Nov 27 14:55:48 itstyledb1 kernel: [  815]     0   815    55956        1      60      466             0 abrtd
Nov 27 14:55:48 itstyledb1 kernel: [  816]     0   816    55327        9      64      346             0 abrt-watch-log
Nov 27 14:55:48 itstyledb1 kernel: [  818]     0   818   121607      220      90      495             0 NetworkManager
Nov 27 14:55:48 itstyledb1 kernel: [  822]     0   822     5415       49      16       33             0 irqbalance
Nov 27 14:55:48 itstyledb1 kernel: [  823]   997   823   134634       97      60     1306             0 polkitd
Nov 27 14:55:48 itstyledb1 kernel: [  825]     0   825     6594       42      20       41             0 systemd-logind
Nov 27 14:55:48 itstyledb1 kernel: [  830]     0   830    31578       28      21      139             0 crond
Nov 27 14:55:48 itstyledb1 kernel: [  839]     0   839    27522        2      10       31             0 agetty
Nov 27 14:55:48 itstyledb1 kernel: [ 1142]     0  1142   143454      114      97     2672             0 tuned
Nov 27 14:55:48 itstyledb1 kernel: [ 1144]     0  1144    28203       11      59      246         -1000 sshd
Nov 27 14:55:48 itstyledb1 kernel: [ 1145]     0  1145    97438      694     103      328             0 rsyslogd
Nov 27 14:55:48 itstyledb1 kernel: [ 1369]     0  1369    22526       20      44      256             0 master
Nov 27 14:55:48 itstyledb1 kernel: [ 1371]    89  1371    22596       32      46      251             0 qmgr
Nov 27 14:55:48 itstyledb1 kernel: [ 5140]     0  5140     5102     1617      15      239             0 mysqld_exporter
Nov 27 14:55:48 itstyledb1 kernel: [ 9430]     0  9430    55966      378      62      790             0 snmpd
Nov 27 14:55:48 itstyledb1 kernel: [30320]    27 30320 22951376 13928375   43437  8163662             0 mysqld
Nov 27 14:55:48 itstyledb1 kernel: [  688]    89   688    22552      271      46        0             0 pickup
Nov 27 14:55:48 itstyledb1 kernel: Out of memory: Kill process 30320 (mysqld) score 984 or sacrifice child
Nov 27 14:55:48 itstyledb1 kernel: Killed process 30320 (mysqld) total-vm:91805504kB, anon-rss:55713500kB, file-rss:0kB, shmem-rss:0kB
Nov 27 14:56:00 itstyledb1 systemd: mysqld.service: main process exited, code=killed, status=9/KILL
Nov 27 14:56:00 itstyledb1 systemd: Unit mysqld.service entered failed state.
Nov 27 14:56:00 itstyledb1 systemd: mysqld.service failed.
Nov 27 14:56:00 itstyledb1 systemd: mysqld.service holdoff time over, scheduling restart.
Nov 27 14:56:01 itstyledb1 systemd: Starting MySQL Server...

当 out of memory 发生时, out_of_memory 函数会选择一个内核认为犯有分配过多内存 "罪行" 的进程, 并杀死该进程. 显然 MySQL 就是哪个 "罪人".

随后 MySQL 会自动重启. 重启以后, 内存是下来了, 但是临近下班的时候, 差不多又又又占满了.

[root@itstyledb1 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          55803       54976         241          10         585         349
Swap:         32064       25036        7028

找到 MySQL 进程, 执行以下 top -p pid, 内存使用 52.4g

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
935 MySQL     20   0   79.7g  52.4g   7336 S   0.3 96.1 255:44.76 mysqld

计算内存使用

1) 查看 MySQL 全局占用多少内存

SELECT (@@innodb_buffer_pool_size
+@@innodb_log_buffer_size
+@@key_buffer_size) / 1024 /1024 AS MEMORY_MB;

查询结果为:

+----------------+
| MEMORY_MB      |
+----------------+
| 20512.00000000 |
+----------------+

2) 查看 performance_schema 占用多少内存

SELECT SUBSTRING_INDEX(event_name,'/',2) AS
       code_area, sys.format_bytes(SUM(current_alloc))
       AS current_alloc
       FROM sys.x$memory_global_by_current_bytes
       GROUP BY SUBSTRING_INDEX(event_name,'/',2)
       ORDER BY SUM(current_alloc) DESC;

查询结果为:

+---------------------------+---------------+
| code_area                 | current_alloc |
+---------------------------+---------------+
| memory/performance_schema | 349.80 MiB    |
+---------------------------+---------------+

3) 查看每个线程占用多少内存

SELECT ( ( @@read_buffer_size
+ @@read_rnd_buffer_size
+ @@sort_buffer_size
+ @@join_buffer_size
+ @@binlog_cache_size
+ @@thread_stack
+ @@max_allowed_packet
+ @@net_buffer_length )
) / (1024*1024) AS MEMORY_MB;

查询结果为:

+-----------+
| MEMORY_MB |
+-----------+
|   87.5156 |
+-----------+

查看当前线程

show full processlist

最终结果为:

+-----------+
| MEMORY_MB |
+-----------+
| 87.5156*37|
+-----------+

4) 查看 memory 存储引擎占用多少内存

SELECT SUM(max_data_length)/1024/1024 AS MEMORY_MB FROM information_schema.tables WHERE ENGINE='memory';

查询结果为:

+---------------+
| MEMORY_MB     |
+---------------+
| 3857.37713909 |
+---------------+

以上四项加起来差不多也就 27975MB, 差不错 28G 的样子, 但是 MySQL 进程显示占用了 52.4G, 那么剩下 24.4G 去哪了?

线程池

此线程池非彼连接池, 其实两者是有很大区别的, 连接池一般在客户端设置, 而线程池是在 DB 服务器上配置; 另外连接池可以取到避免了连接频繁创建和销毁, 但是无法取到控制 MySQL 活动线程数的目标, 在高并发场景下, 无法取到保护 DB 的作用. 比较好的方式是将连接池和线程池结合起来使用.

关于线程池的一些参数:

MySQL> show variables like 'thread%';
+-------------------------------+---------------------------+
| Variable_name                 | Value                     |
+-------------------------------+---------------------------+
| thread_handling               | one-thread-per-connection |
| thread_pool_high_prio_mode    | transactions              |
| thread_pool_high_prio_tickets | 4294967295                |
| thread_pool_idle_timeout      | 60                        |
| thread_pool_max_threads       | 100000                    |
| thread_pool_oversubscribe     | 3                         |
| thread_pool_size              | 12                        |
| thread_pool_stall_limit       | 500                       |
+-------------------------------+---------------------------+
thread_handling:

该参数是配置线程模型, 默认情况是 one-thread-per-connection, 也就是不启用线程池. 将该参数设置为 pool-of-threads 即启用了线程池.

thread_pool_size:

该参数是设置线程池的 Group 的数量, 默认为系统 CPU 的个数, 充分利用 CPU 资源.

thread_pool_oversubscribe:

该参数设置 group 中的最大线程数, 每个 group 的最大线程数为 thread_pool_oversubscribe+1, 注意 listener 线程不包含在内.

thread_pool_high_prio_mode:

高优先级队列的控制参数, 有三个值 (transactions/statements/none), 默认是 transactions, 三个值的含义如下:

transactions: 对于已经启动事务的语句放到高优先级队列中, 不过还取决于后面的 thread_pool_high_prio_tickets 参数

statements: 这个模式所有的语句都会放到高优先级队列中, 不会使用到低优先级队列

none: 这个模式不使用高优先级队列

thread_pool_high_prio_tickets:

该参数控制每个连接最多语序多少次被放入高优先级队列中, 默认为 4294967295, 注意这个参数只有在 thread_pool_high_prio_mode 为 transactions 的时候才有效果.

thread_pool_idle_timeout:

worker 线程最大空闲时间, 默认为 60 秒, 超过限制后会退出.

thread_pool_max_threads:

该参数用来限制线程池最大的线程数, 超过该限制后将无法再创建更多的线程, 默认为 100000.

thread_pool_stall_limit:

该参数设置 timer 线程的检测 group 是否异常的时间间隔, 默认为 500ms.

最终配置如下:

#thread pool
thread_handling=pool-of-threads
#Group 的数量, 默认为系统 CPU 的个数, 充分利用 CPU 资源
thread_pool_size=24
# 每个 group 的最大线程数为 thread_pool_oversubscribe+1
thread_pool_oversubscribe=3
performance_schema=off
#extra connection, 防止线程池满的情况下无法登录 MySQL
extra_max_connections = 8
extra_port = 33333

备注: 线程池在 Percona,MariaDB,Oracle MySQL 企业版中提供, Oracle MySQL 社区版并不提供.

线程池貌似并不会直接导致内存不回收, 网上有说同时开启 Thread pool 和 PS 会出现内存泄露, 但是

目前 Percona server 5.7.21-20 + 版本已经修复了这个问题, 显然是不存在的.

慢查询

由于是生产环境, 这个问题拖得时间有点长, 那么慢查询会不会影响内存使用问题呢? 带着这个问题, 查看了慢查询后台列表, 在数据库奔溃的前一个时间段, 的确有不少慢查询语句. 但是这并不能在一定程度上说明问题, 由于服务器的 MySQL 服务在杀死之前, 内存已经见底, 此时连接数并不多, 也就三四十来个左右, 大多处于休眠状态, 并且此时已经占用了大部分的 Swap 空间. 也就是说, 在资源有限的情况下必定会出现不少慢查询语句.

小结

其实这个 "意外" 一点也不意外, 其实已经发生了多次了. 但是还是做个小结吧, 因为最终没有确认问题出现在哪里, 所以还是发布了吧, 万一有专业的 DBA 遇到类似的问题还可以小小的解惑一下.

参考

https://bugs.mysql.com/bug.php?id=91861
https://bugs.mysql.com/bug.php?id=91710
https://dev.mysql.com/doc/refman/5.7/en/memory-use.html
https://www.percona.com/blog/2018/06/28/what-to-do-when-mysql-runs-out-of-memory-troubleshooting-guide/
https://dev.mysql.com/doc/refman/5.7/en/thread-pool-tuning.html

来源: https://www.cnblogs.com/smallSevens/p/10067324.html

与本文相关文章

暂无,快来抢沙发吧！