在日常工作中,会碰到如下的场景,如 MySQL 数据库升级,主服务器硬件升级等,这个时候就需要将写操作切换到另外一台服务器上,那么如何进行在线切换呢?同时,要求切换过程短,对业务的影响比较小。
MHA 就提供了这样一种优雅的方式,只会堵塞业务 0.5~2s 的时间,在这段时间内,业务无法读取和写入。
集群信息
角色 IP 地址 ServerID 类型
Master 192.168.244.10 1 写入
Candicate master 192.168.244.20 2 读
Slave 192.168.244.30 3 读
Monitor host 192.168.244.40 监控集群组
MHA 具体的搭建步骤和原理,可参考另外一篇文章:
MySQL 高可用方案 MHA 的部署和原理 http://www.linuxidc.com/Linux/2017-05/144086.htm
在线切换的步骤
1. 关闭 MHA 监控
# masterha_stop --conf=/etc/masterha/app1.cnf
2. 在线切换
# /usr/local/bin/masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=192.168.244.20 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
其中,
--orig_master_is_new_slave 是将原 master 切换为新主的 slave,默认情况下,是不添加的。
--running_updates_limit 默认为 1s,即如果主从延迟时间(Seconds_Behind_Master),或 master show processlist 中 dml 操作大于 1s,则不会执行切换。
在线切换的输出
Tue Apr1115:28:322017- [info] MHA::MasterRotate version0.56.Tue Apr1115:28:322017- [info] Starting online master switch..Tue Apr1115:28:322017- [info]Tue Apr1115:28:322017- [info] * Phase1: Configuration Check Phase..Tue Apr1115:28:322017- [info]Tue Apr1115:28:322017- [warning] Global configurationfile/etc/masterha_default.cnf not found. Skipping.Tue Apr1115:28:322017- [info] Reading application default configuration from /etc/masterha/app1.cnf..Tue Apr1115:28:322017- [info] Reading server configuration from /etc/masterha/app1.cnf..Tue Apr1115:28:342017- [info] GTID failover mode =0Tue Apr1115:28:342017- [info] Current Alive Master:192.168.244.10(192.168.244.10:3306)Tue Apr1115:28:342017- [info] Alive Slaves:Tue Apr1115:28:342017- [info] 192.168.244.20(192.168.244.20:3306) Version=5.6.31-log (oldest major version between slaves) log-bin:enabledTue Apr1115:28:342017- [info] Replicating from192.168.244.10(192.168.244.10:3306)Tue Apr1115:28:342017- [info] Primary candidateforthe new Master (candidate_master is set)Tue Apr1115:28:342017- [info] 192.168.244.30(192.168.244.30:3306) Version=5.6.31-log (oldest major version between slaves) log-bin:enabledTue Apr1115:28:342017- [info] Replicating from192.168.244.10(192.168.244.10:3306)It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on192.168.244.10(192.168.244.10:3306)? (YES/no): yesTue Apr1115:28:472017- [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may takelongtime..Tue Apr1115:28:472017- [info] ok.Tue Apr1115:28:472017- [info] Checking MHA is not monitoring or doing failover..Tue Apr1115:28:472017- [info] Checking replication health on192.168.244.20..Tue Apr1115:28:472017- [info] ok.Tue Apr1115:28:472017- [info] Checking replication health on192.168.244.30..Tue Apr1115:28:472017- [info] ok.Tue Apr1115:28:472017- [info]192.168.244.20can be new master.Tue Apr1115:28:472017- [info]From:192.168.244.10(192.168.244.10:3306) (current master)+--192.168.244.20(192.168.244.20:3306)+--192.168.244.30(192.168.244.30:3306)To:192.168.244.20(192.168.244.20:3306) (new master)+--192.168.244.30(192.168.244.30:3306)+--192.168.244.10(192.168.244.10:3306)Starting master switch from192.168.244.10(192.168.244.10:3306) to192.168.244.20(192.168.244.20:3306)? (yes/NO): yesTue Apr1115:29:002017- [info] Checking whether192.168.244.20(192.168.244.20:3306) is okforthe new master..Tue Apr1115:29:002017- [info] ok.Tue Apr1115:29:002017- [info]192.168.244.10(192.168.244.10:3306): SHOW SLAVE STATUS returned empty result. To check replicationfiltering rules, temporarily executing CHANGE MASTER to a dummy host.Tue Apr1115:29:002017- [info]192.168.244.10(192.168.244.10:3306): Resetting slave pointing to the dummy host.Tue Apr1115:29:002017- [info] ** Phase1: Configuration Check Phase completed.Tue Apr1115:29:002017- [info]Tue Apr1115:29:002017- [info] * Phase2: Rejecting updates Phase..Tue Apr1115:29:002017- [info]Tue Apr1115:29:002017- [info] Executing master ip online change script to disablewriteon the current master:Tue Apr1115:29:002017- [info] /usr/local/bin/master_ip_online_change --command=stop --orig_master_host=192.168.244.10--orig_master_ip=192.168.244.10--orig_master_port=3306--orig_master_user='monitor'--orig_master_password='monitor123'--new_master_host=192.168.244.20--new_master_ip=192.168.244.20--new_master_port=3306--new_master_user='monitor'--new_master_password='monitor123'--orig_master_ssh_user=root --new_master_ssh_user=root --orig_master_is_new_slaveTue Apr1115:29:002017476501Set read_only on the new master.. ok.Tue Apr1115:29:002017911951Set read_only=1on the orig master.. ok.Tue Apr1115:29:002017919517Killing all application threads..Tue Apr1115:29:002017919552done.Disabling the VIP an old master:192.168.244.10 SIOCSIFFLAGS: Cannot assign requested addressTue Apr1115:29:002017- [info] ok.Tue Apr1115:29:002017- [info] Locking all tables on the orig master to reject updates from everybody (including root):Tue Apr1115:29:002017- [info] Executing FLUSH TABLES WITH READ LOCK..Tue Apr1115:29:002017- [info] ok.Tue Apr1115:29:002017- [info] Orig master binlog:pos is mysql-bin.000016:211.Tue Apr1115:29:002017- [info] Waiting to execute all relay logs on192.168.244.20(192.168.244.20:3306)..Tue Apr1115:29:012017- [info] master_pos_wait(mysql-bin.000016:211) completed on192.168.244.20(192.168.244.20:3306). Executed0events.Tue Apr1115:29:012017- [info] done.Tue Apr1115:29:012017- [info] Getting new master's binlog name and position..Tue Apr1115:29:012017- [info] mysql-bin.000009:211Tue Apr1115:29:012017- [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.244.20', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000009', MASTER_LOG_POS=211, MASTER_USER='repl', MASTER_PASSWORD='xxx';Tue Apr1115:29:012017- [info] Executing master ip online change script to allowwriteon the new master:Tue Apr1115:29:012017- [info] /usr/local/bin/master_ip_online_change --command=start --orig_master_host=192.168.244.10--orig_master_ip=192.168.244.10--orig_master_port=3306--orig_master_user='monitor'--orig_master_password='monitor123'--new_master_host=192.168.244.20--new_master_ip=192.168.244.20--new_master_port=3306--new_master_user='monitor'--new_master_password='monitor123'--orig_master_ssh_user=root --new_master_ssh_user=root --orig_master_is_new_slaveTue Apr1115:29:012017109040Set read_only=0on the new master.Enabling the VIP192.168.244.188on the new master:192.168.244.20 Tue Apr1115:29:012017- [info] ok.Tue Apr1115:29:012017- [info]Tue Apr1115:29:012017- [info] * Switching slavesinparallel..Tue Apr1115:29:012017- [info]Tue Apr1115:29:012017- [info] -- Slave switch on host192.168.244.30(192.168.244.30:3306) started, pid:17651Tue Apr1115:29:012017- [info]Tue Apr1115:29:022017- [info] Log messages from192.168.244.30...Tue Apr1115:29:022017- [info]Tue Apr1115:29:012017- [info] Waiting to execute all relay logs on192.168.244.30(192.168.244.30:3306)..Tue Apr1115:29:012017- [info] master_pos_wait(mysql-bin.000016:211) completed on192.168.244.30(192.168.244.30:3306). Executed0events.Tue Apr1115:29:012017- [info] done.Tue Apr1115:29:012017- [info] Resetting slave192.168.244.30(192.168.244.30:3306) and starting replication from the new master192.168.244.20(192.168.244.20:3306)..Tue Apr1115:29:012017- [info] Executed CHANGE MASTER.Tue Apr1115:29:012017- [info] Slave started.Tue Apr1115:29:022017- [info] End of log messages from192.168.244.30...Tue Apr1115:29:022017- [info]Tue Apr1115:29:022017- [info] -- Slave switch on host192.168.244.30(192.168.244.30:3306) succeeded.Tue Apr1115:29:022017- [info] Unlocking all tables on the orig master:Tue Apr1115:29:022017- [info] Executing UNLOCK TABLES..Tue Apr1115:29:022017- [info] ok.Tue Apr1115:29:022017- [info] Starting orig master as a new slave..Tue Apr1115:29:022017- [info] Resetting slave192.168.244.10(192.168.244.10:3306) and starting replication from the new master192.168.244.20(192.168.244.20:3306)..Tue Apr1115:29:022017- [info] Executed CHANGE MASTER.Tue Apr1115:29:022017- [info] Slave started.Tue Apr1115:29:022017- [info] All new slave servers switched successfully.Tue Apr1115:29:022017- [info]Tue Apr1115:29:022017- [info] * Phase5: New master cleanup phase..Tue Apr1115:29:022017- [info]Tue Apr1115:29:022017- [info] 192.168.244.20: Resetting slaveinfosucceeded.Tue Apr1115:29:022017- [info] Switching master to192.168.244.20(192.168.244.20:3306) completed successfully.MHA 在线切换的原理
1. 检查当前的配置信息及主从服务器的信息
包括读取 MHA 的配置文件 / etc/masterha/app1.cnf 及检查当前 slave 的健康状态
2. 阻止对当前 master 的更新
主要通过如下步骤:
1> 等待 1.5s($time_until_kill_threads*100ms),等待当前连接断开。
2> 执行 read_only=1,阻止新的 DML 操作
3> 等待 0.5s,等待当前 DML 操作完成。
4> kill 掉所有连接。
5> FLUSH NO_WRITE_TO_BINLOG TABLES
6> FLUSH TABLES WITH READ LOCK
3. 等待新 master 执行完所有的 relay log
- Waiting to execute all relay logs on192.168.244.20(192.168.244.20:3306)..
4. 将新 master 的 read_only 设置为 off,并添加 VIP
5. slave 切换到新 master 上。
1> 等待 slave(192.168.244.30)应用完原主从复制产生的 relay log,然后执行 change master 操作切换到新 master 上。
2> 释放原 master 上加的锁。
3> 因 masterha_master_switch 命令行中带有 --orig_master_is_new_slave 参数,故原 master 也切换为新 master 的从。
6. 清理新 master 的相关信息。
主要是执行了 reset slave all 操作,清除之前的复制信息。
MHA 在线切换需满足的条件
MHA 在执行在线切换之前,会判断当前的主从复制信息,只有满足了以下条件,才能执行切换动作:
1. 所有 SLAVE 的 IO 线程和 SQL 线程都在运行。
2. 所有 slave 的 Seconds_Behind_Master 小于或等于 running_updates_limit 的值,该参数如果没有显示指定的话,则默认为 1s
3. 在 master 上,通过 show processlist 输出,没有一个 DML 操作的时间大于 running_updates_limit 的值。
在线切换时,打开 general log,各个服务器的操作信息
注:在执行 masterha_master_switch 命令时,会有两次确认操作
1. It is better to execute FLUSH NO_WRITE_TO_BINLOG TABLES on the master before switching. Is it ok to execute on 192.168.244.10(192.168.244.10:3306)? (YES/no):
2. Starting master switch from 192.168.244.10(192.168.244.10:3306) to 192.168.244.20(192.168.244.20:3306)? (yes/NO):
以下输出中间都有两次空白,其中第一次空白之前的输出对应第一次确认之前,第二次之前的输出对应第二次确认之前。
原 master 192.168.244.10
17041216:52:3823Connect monitor@node4 on23Query set autocommit=123Query SELECT CONNECTION_ID() AS Value17041216:52:3924Connect monitor@node4 on24Query set autocommit=124Query SELECT CONNECTION_ID() AS Value24Query SET wait_timeout=8640024Query SELECT @@global.server_id As Value24Query SELECT VERSION() AS Value24Query SELECT @@global.gtid_mode As Value24Query SHOW GLOBAL VARIABLES LIKE'log_bin'24Query SHOW MASTER STATUS24Query SELECT @@global.datadir AS Value24Query SELECT @@global.slave_parallel_workers AS Value24Query SHOW SLAVE STATUS24Query SELECT @@global.read_only As Value24Query SELECT @@global.relay_log_purge As Value17041216:54:0624Query FLUSH NO_WRITE_TO_BINLOG TABLES24Query SELECT GET_LOCK('MHA_Master_High_Availability_Monitor','0') AS Value24Query SHOW PROCESSLIST17041216:55:5124Query SHOW SLAVE STATUS24Query CHANGE MASTER TO MASTER_HOST='dummy_host'17041216:55:5224Query SHOW SLAVE STATUS24Query RESET SLAVE/*!50516 ALL*/24Query SELECT RELEASE_LOCK('MHA_Master_High_Availability_Monitor') As Value24Quit 25Connect monitor@node4 on25Query set autocommit=125Query SELECT CONNECTION_ID() AS Value25Query SET sql_log_bin=025Query SHOW PROCESSLIST25Query SELECT @@global.read_only As Value25Query SET GLOBAL read_only=125Query SELECT @@global.read_only As Value25Query SHOW PROCESSLIST25Query SET sql_log_bin=125Quit 26Connect monitor@node4 on26Query set autocommit=126Query SELECT CONNECTION_ID() AS Value26Query SET wait_timeout=8640026Query FLUSH TABLES WITH READ LOCK26Query SHOW MASTER STATUS17041216:55:5326Query UNLOCK TABLES26Query CHANGE MASTER TO MASTER_HOST ='192.168.244.20'MASTER_USER ='repl'MASTER_PASSWORD =slave 192.168.244.30
- 170412 16:52:38 23 Connect monitor@node4 on
- 23Query set autocommit=1
- 23 Query SELECT CONNECTION_ID() AS Value
- 170412 16:52:39 24 Connect monitor@node4 on
- 24Query set autocommit=1
- 24 Query SELECT CONNECTION_ID() AS Value
- 24Query SET wait_timeout=86400
- 24 Query SELECT @@global.server_id As Value
- 24 Query SELECT VERSION() AS Value
- 24 Query SELECT @@global.gtid_mode As Value
- 24Query SHOW GLOBAL VARIABLES LIKE'log_bin'
- 24 Query SHOW MASTER STATUS
- 24 Query SELECT @@global.datadir AS Value
- 24 Query SELECT @@global.slave_parallel_workers AS Value
- 24 Query SHOW SLAVE STATUS
- 24 Query SELECT @@global.read_only As Value
- 24 Query SELECT @@global.relay_log_purge As Value
- 24 Query SELECT @@global.relay_log_info_repository AS Value
- 24 Query SELECT @@global.datadir AS Value
- 24 Query SELECT @@global.relay_log_info_file AS Value
- 24 Query SHOW SLAVE STATUS
- 24Query SELECT Repl_slave_priv AS Value FROM mysql.user WHERE user ='repl'
- 170412 16:54:06 24Query SELECT GET_LOCK('MHA_Master_High_Availability_Failover','0') AS Value
- 24 Query SHOW SLAVE STATUS
- 24 Query SHOW SLAVE STATUS
- 170412 16:55:52 24 Query SHOW PROCESSLIST
- 25 Connect monitor@node4 on
- 25Query set autocommit=1
- 25 Query SELECT CONNECTION_ID() AS Value
- 25 Query SELECT @@global.read_only As Value
- 25 Query SELECT @@global.read_only As Value
- 25 Quit
- 24 Query SHOW SLAVE STATUS
- 24Query SELECT MASTER_POS_WAIT('mysql-bin.000017','120',0) AS Result
- 24 Query STOP SLAVE SQL_THREAD
- 24 Query SHOW SLAVE STATUS
- 24 Query SHOW MASTER STATUS
- 26 Connect monitor@node4 on
- 26Query set autocommit=1
- 26 Query SELECT CONNECTION_ID() AS Value
- 26Query SET sql_log_bin=0
- 26 Query SELECT @@global.read_only As Value
- 26Query SET GLOBAL read_only=0
- 26Query SET sql_log_bin=1
- 26 Quit
- 24 Query SELECT @@global.read_only As Value
- 27 Connect repl@node3 on
- 27 Query SELECT UNIX_TIMESTAMP()
- 27Query SHOW VARIABLES LIKE'SERVER_ID'
- 27Query SET @master_heartbeat_period=1799999979520
- 27Query SET @master_binlog_checksum= @@global.binlog_checksum
- 27 Query SELECT @master_binlog_checksum
- 27 Query SELECT @@GLOBAL.GTID_MODE
- 27Query SHOW VARIABLES LIKE'SERVER_UUID'
- 27Query SET @slave_uuid='8a1093c8-1d00-11e7-954f-000c299a5715'
- 27Binlog Dump Log:'mysql-bin.000010'Pos:120
- 170412 16:55:53 28 Connect repl@node1 on
- 28 Query SELECT UNIX_TIMESTAMP()
- 28Query SHOW VARIABLES LIKE'SERVER_ID'
- 28Query SET @master_heartbeat_period=1799999979520
- 28Query SET @master_binlog_checksum= @@global.binlog_checksum
- 28 Query SELECT @master_binlog_checksum
- 28 Query SELECT @@GLOBAL.GTID_MODE
- 28Query SHOW VARIABLES LIKE'SERVER_UUID'
- 24 Query STOP SLAVE
- 28Query SET @slave_uuid='2a6365e0-1d05-11e7-956d-000c29c64704'
- 28Binlog Dump Log:'mysql-bin.000010'Pos:120
- 24 Query SHOW SLAVE STATUS
- 24Query RESET SLAVE/*!50516 ALL */
- 24 Query SHOW SLAVE STATUS
- 24Query SELECT RELEASE_LOCK('MHA_Master_High_Availability_Failover') As Value
- 24Quit
- 170412 16:52:37 16 Connect monitor@node4 on
- 16Query set autocommit=1
- 16 Query SELECT CONNECTION_ID() AS Value
- 170412 16:52:38 17 Connect monitor@node4 on
- 17Query set autocommit=1
- 17 Query SELECT CONNECTION_ID() AS Value
- 17Query SET wait_timeout=86400
- 17 Query SELECT @@global.server_id As Value
- 17 Query SELECT VERSION() AS Value
- 17 Query SELECT @@global.gtid_mode As Value
- 17Query SHOW GLOBAL VARIABLES LIKE'log_bin'
- 17 Query SHOW MASTER STATUS
- 17 Query SELECT @@global.datadir AS Value
- 17 Query SELECT @@global.slave_parallel_workers AS Value
- 17 Query SHOW SLAVE STATUS
- 17 Query SELECT @@global.read_only As Value
- 17 Query SELECT @@global.relay_log_purge As Value
- 17 Query SELECT @@global.relay_log_info_repository AS Value
- 17 Query SELECT @@global.datadir AS Value
- 17 Query SELECT @@global.relay_log_info_file AS Value
- 17 Query SHOW SLAVE STATUS
- 17Query SELECT Repl_slave_priv AS Value FROM mysql.user WHERE user ='repl'
- 170412 16:54:05 17Query SELECT GET_LOCK('MHA_Master_High_Availability_Failover','0') AS Value
- 17 Query SHOW SLAVE STATUS
- 17 Query SHOW SLAVE STATUS
- 170412 16:55:50 17 Query SHOW SLAVE STATUS
- 170412 16:55:51 17<
来源: http://www.linuxidc.com/Linux/2017-05/144085.htm