Redis Cluster
Redis, an in-memory key-value database supporting a variety of data structures, is widely used across Ele.me's business scenarios. Its official clustering solution, Redis Cluster (https://redis.io/topics/cluster-spec), supports sharding, high availability, and smooth scaling, so we chose it as our primary distributed Redis solution.
Our main work has been building automated operations and management tooling for Redis Cluster.
An Overview of Redis Cluster
A Redis Cluster consists of multiple shards; each shard contains one master and one or more slaves for high availability, and every master and slave runs single-threaded. Each Redis request and piece of data is addressed by a key. Redis Cluster hashes every key with CRC16 into one of 16384 (2^14) slots and distributes these 16384 slots across all the shards in the cluster, so that each shard is responsible for only a subset of the requests and data. Scaling a Redis Cluster in or out means migrating slots between shards and adding or removing shard nodes.
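The key-to-slot mapping can be sketched in a few lines of Python. This is a minimal illustration of the CRC16 (XMODEM variant) mapping defined in the cluster spec; it ignores hash-tag handling (`{...}` substrings), which a real client would also implement:

```python
def crc16(data: bytes) -> int:
    # CRC16-CCITT (XMODEM) as used by Redis Cluster: poly 0x1021, init 0.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: bytes) -> int:
    # Every key lands in one of the 16384 slots.
    return crc16(key) % 16384
```

The cluster spec gives 0x31C3 as the CRC16 of the string "123456789", which makes a convenient sanity check for this implementation.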
The Automated Operations Platform
Redis Cluster is feature-complete, but creating or modifying a cluster requires driving a complex sequence of cluster commands.
For example, scaling out a cluster involves the following steps:
Send CLUSTER MEET commands to add the new nodes
Mark the migration state on the two nodes involved with CLUSTER SETSLOT
Send MIGRATE commands in a loop to move the data
Use CLUSTER SETSLOT again to transfer slot ownership
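The flow above can be sketched as a pure function that emits the command sequence rather than talking to real nodes; the node IDs and key batches are hypothetical inputs, and the angle-bracket placeholders stand in for real connection parameters:

```python
def plan_slot_migration(slot, src_id, dst_id, key_batches):
    """Emit the command sequence for migrating one slot (a sketch).

    src_id / dst_id are cluster node IDs; key_batches is what repeated
    CLUSTER GETKEYSINSLOT calls on the source node would return.
    """
    plan = [
        ("dst", f"CLUSTER SETSLOT {slot} IMPORTING {src_id}"),
        ("src", f"CLUSTER SETSLOT {slot} MIGRATING {dst_id}"),
    ]
    for batch in key_batches:
        keys = " ".join(batch)
        # MIGRATE with an empty key and the KEYS option moves a batch at once.
        plan.append(("src", f'MIGRATE <dst-host> <dst-port> "" 0 <timeout> KEYS {keys}'))
    # Transfer ownership. The order matters in practice; notifying the
    # importing node first is an assumption made here, not a rule from the text.
    plan.append(("dst", f"CLUSTER SETSLOT {slot} NODE {dst_id}"))
    plan.append(("src", f"CLUSTER SETSLOT {slot} NODE {dst_id}"))
    return plan
```

A real tool would issue each command through a Redis client connection to the named node and retry on transient errors.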
The official project provides a Ruby script to ease this work, but using it means logging onto a machine and operating by hand, which is nowhere near enough for anyone running hundreds or thousands of clusters.
So the usual path for early large-scale Redis Cluster users, and our initial approach as well, was to reimplement this script and build an operations platform on top of it.
On the platform this translated into the following features:
Pick machines and deploy Redis nodes on them
Add specified Redis nodes to a cluster
Migrate slots to the new nodes
Normalizing the Cluster Structure
Why not merge the three steps above into a single scale-out operation? Because at the time we had not figured out what changes each cluster really needed, and the small building blocks were reused in several places. Deploying a new node, for instance, might serve a scale-out, or might replace a slave lost to a machine failure. Meanwhile the structures of our clusters varied wildly: some had one slave per shard, some had several, and some were missing slaves altogether through human error. Changing or repairing clusters of different shapes required combining several of these small features.
The reason we could not pin down the required operations was that cluster structures were inconsistent. By structure we mean which machines the nodes are deployed on and how masters and slaves are distributed.
Because we managed cluster structure by hand at first, the structures of our production clusters drifted further and further apart, problems multiplied, and the operations our tooling had to offer grew ever more complex.
The problems included:
Missing slave nodes
A master and its slave on the same machine, so when that machine went down the shard had no other slave and became unavailable
More than half of the master nodes on one machine, so that machine going down took the whole cluster down
Failovers after an outage saturating the NICs of a few machines
Uneven slot distribution
Uneven load across machines
At this point we found that both building the tooling and performing the operations had become very hard: the tooling started to require complex graph algorithms, operators had to string together many small actions on the dashboard to finish a task, and production incidents kept happening.
Redis Cluster itself is not multi-threaded and cannot provide multi-tenancy within a single cluster, so we had to co-locate the clusters of different business teams across shared machines. With multiple masters and slaves on each machine, the cluster structure determines stability. Only by letting a program manage the clusters can we guarantee sound structures and reduce the complexity operators face, and for a program to manage cluster structure there must be a set of rules defining it.
The heart of both operating clusters and building operations tooling is normalizing the cluster structure.
The Rules and the Chunk Algorithm
A normalized cluster structure must satisfy the following rules:
A master and its slave must not be on the same machine
No machine may hold half or more of the master nodes (otherwise the cluster cannot hold an election after that machine fails)
When one machine is lost, its load should spread as evenly as possible across the remaining machines
Masters and slaves should be distributed evenly across machines
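Rules (1) and (2) are mechanically checkable. The sketch below is an illustration, not the production checker; a layout is modeled as a list of (master_host, [slave_hosts]) pairs with arbitrary host labels:

```python
from collections import Counter

def check_layout(shards):
    """Return descriptions of rule (1)/(2) violations in a cluster layout."""
    violations = []
    masters = Counter(master for master, _ in shards)
    # Rule 1: a master and its slave must not share a machine.
    for master_host, slave_hosts in shards:
        if master_host in slave_hosts:
            violations.append(f"master and slave share host {master_host}")
    # Rule 2: no host may hold half or more of all masters.
    total = len(shards)
    for host, count in masters.items():
        if 2 * count >= total:
            violations.append(f"host {host} holds {count}/{total} masters")
    return violations
```

Rules (3) and (4) concern the distribution as a whole and are what the chunk allocation algorithm below is designed to guarantee by construction.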
Some companies allocate individual nodes based on machine metrics (CPU, memory, NIC utilization) and rebalance machine load by migrating single Redis nodes dynamically, much like migrating containers, with the rules above constraining allocation and migration.
We wanted a simpler allocation scheme, and to avoid dynamic node migration, in order to keep complexity down.
Our approach is to first group every 4 Redis nodes into a chunk: the 4 nodes are split evenly across two machines and paired two-by-two as each other's master and slave (see the chunk diagram in the appendix).
Every cluster is composed of such chunks. A master and its slave are then never on the same machine, and the master-to-slave ratio on every machine is always 1:1, so as long as each machine gets an even share of nodes, masters and slaves are also evenly distributed per machine, and rules (1) and (4) are satisfied.
We then designed an allocation algorithm so that the clusters it produces also satisfy rules (2) and (3); the details and proof are in the appendix at the end for anyone interested.
This structure also keeps a cluster compliant with the rules above after nodes (i.e. chunks) are added or removed.
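To make the 1:1 master-to-slave claim concrete, a single chunk can be expanded into its four node placements. This is a sketch with made-up shard names; shard A's master sits on the first host with its slave on the second, and shard B is mirrored:

```python
def chunk_nodes(chunk_id, host_a, host_b):
    """Expand one chunk into its 4 nodes across 2 hosts (illustrative only)."""
    return [
        {"shard": f"{chunk_id}-A", "role": "master", "host": host_a},
        {"shard": f"{chunk_id}-A", "role": "slave",  "host": host_b},
        {"shard": f"{chunk_id}-B", "role": "master", "host": host_b},
        {"shard": f"{chunk_id}-B", "role": "slave",  "host": host_a},
    ]
```

Whichever of the two hosts fails, each shard still has its other node on the surviving host, and every host always carries exactly one master and one slave of the pair.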
With the structure normalized, whenever a cluster deviates from its initial structure (after a machine failure or a master-slave switch, for example), we simply repair it back to the original structure.
The cluster thus stays in a healthy structure at all times, and because that structure is fixed and regular, there is no need for a swarm of small operations. Routine operator work reduces to:
Creating a cluster
Replacing a machine (also used to handle outages)
Migrating a cluster (for scaling in or out)
Pitfalls of Redis Cluster
In two years of operating Redis Cluster we have run into quite a few cluster bugs, such as:
The epoch conflict-resolution mechanism failing, leaving two masters in the same shard at the same time
Elections occasionally taking several minutes instead of the usual 30 seconds or less
Handshakes between nodes that never succeed
Inconsistent cluster metadata that cannot be repaired automatically (upstream has recently started looking into fixing this)
Slots left ownerless mid-migration after a failed slot migration (e.g. the source node crashing during the move)
When these problems strike it is often hard to pin down which parts of the cluster state are wrong, and operating on an unhealthy cluster tends to trip other landmines, so repairing the bad state in place usually costs a great deal of time. Several incidents in our early years taught us not to waste time repairing clusters: today we handle every case of cluster damage by switching over to a freshly built cluster.
Even routine changes are hard to get right. Many tools (the official script included) fail to handle the cluster's implicit states and edge cases. For example, issuing the final CLUSTER SETSLOT commands of a slot migration in the wrong order triggers a flood of request redirections, and the Redis GitHub issues show plenty of clusters wrecked outright by the official script.
During scale-outs we also found that MIGRATE sometimes returns a busy loading error: a node goes through a brief initialization phase during which it rejects almost every kind of request. On top of that, slot migration needs to be parallelized to speed up scaling. We put a lot of effort into the official slot migration, only to find that any reliable scheme was both logically complex and slow. In the end we abandoned native resharding in favor of a scheme with weaker consistency but better stability: build a new cluster and copy the data over.
Future Work
We are currently working on fast cluster scaling and a new Redis Cluster Proxy. If you are interested, come join us: send your resume to tianwen.zhang@ele.me
Appendix: The Chunk Allocation Algorithm and Its Proof
by 杨瑒
- The node allocation algorithm aims to distribute chunks of nodes to achieve
- maximum balance, i.e. trying its best to spread the failover of the slaves
- of a lost host as widely as possible across the whole cluster.
- Note:
- 1. In this proof, we'll use "=" as assignment, and "==" as mathematical
- equality, to conform to Python's notations.
- 2. It is hard to write mathematical symbols and notations in plain text,
- and therefore we adopt some of python's functions to help explanation.
- Glossary:
- 1. chunk:
- A group of nodes in 4 across 2 hosts where 1 master node
- and 1 slave node each.
- Either HOST 1 or HOST 2 fails, Redis node A and B
- should work fine after failing-over to the other host.
- -------------------------------
- one chunk
- | HOST 1 | | HOST 2 |
- | master A | <--> | slave A |
- | slave B | <--> | master B |
- -------------------------------
- 2. link(ed) host:
- One chunk consists of 2 hosts, and we call them linked since the slave
- of the first host's master is on the second host, and vice versa.
- Algorithm Description:
- Assume there're n hosts, whose corresponding available node number
- is host[i], where i in range(0, n) (aka. len(host) == n)
- The available node number on each host should meet the premise of:
- 1. sum(host) % 4 == 0
- 2. host[i] is integer for i in range(0, n)
- 3. host[i] % 2 == 0 for i in range(0, n)
- 4. host[m] <= sum(<host[i]>, i in range(0, n) && i != m),
- where m = max_index(host) (the index of the max value in host)
- Build up a link table (lt) which tracks the number of links
- from host[i] to host[j], where i, j in range(0, n):
- | | HOST 0 | HOST 1 | HOST 2 | HOST 3 | HOST 4 |
- |--------|----------|--------- |----------|--------- |----------|
- | HOST 0 | 0 | lt[0, 1] | lt[0, 2] | lt[0, 3] | lt[0, 4] |
- | HOST 1 | lt[0, 1] | 0 | lt[1, 2] | lt[1, 3] | lt[1, 4] |
- | HOST 2 | lt[0, 2] | lt[1, 2] | 0 | lt[2, 3] | lt[2, 4] |
- | HOST 3 | lt[0, 3] | lt[1, 3] | lt[2, 3] | 0 | lt[3, 4] |
- | HOST 4 | lt[0, 4] | lt[1, 4] | lt[2, 4] | lt[3, 4] | 0 |
- (A link table for 5 hosts)
- Steps:
- 1. Check the input, if host does not meet up with the premise, reject.
- 2. Initialize the lt (link table) to zeros.
- 3. while any(host > 0) do the following:
- 3.1. m = max_index(host) (the index of the max value in host)
- 3.2. llh = find_least_linked_host_index(m)
- (search the link table at row m, and find the minimum excluding self)
- i.e.
- | | HOST 0 | HOST 1 | HOST 2 | HOST 3 | HOST 4 |
- |--------|----------|--------- |----------|--------- |----------|
- | HOST 0 | 0 | 2 | 4 | 4 | 4 |
- In such case, llh is 1 (HOST 1) (remember to exclude self, which is 0)
- 3.3. establish a link between m and llh and create a chunk
- 3.4 lt[m, llh] = lt[m, llh] + 1
- 3.5. host[m] = host[m] - 2
- 3.6. host[llh] = host[llh] - 2
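- The steps above translate almost directly into Python. The sketch below is an illustration of the algorithm, not the production tool; one assumption is added beyond the text: step 3.2 only considers hosts that still have free nodes, so the counters never go negative (the premises plus Lemma 1 guarantee such a host exists). The link table lt[m, llh] of the text becomes lt[m][llh], kept symmetric.

```python
def allocate_chunks(host):
    """Allocate chunks across hosts; host[i] is the free node count on host i.

    Returns a list of (i, j) host-index pairs, one per chunk.
    """
    n = len(host)
    total = sum(host)
    # Step 1: reject input that violates the premises.
    assert total % 4 == 0, "premise 1"
    assert all(h % 2 == 0 and h >= 0 for h in host), "premises 2, 3"
    assert max(host) <= total - max(host), "premise 4"
    # Step 2: initialize the link table to zeros.
    lt = [[0] * n for _ in range(n)]
    chunks = []
    host = list(host)  # work on a copy
    # Step 3: repeatedly link the fullest host with its least-linked peer.
    while any(h > 0 for h in host):
        m = max(range(n), key=lambda i: host[i])                 # 3.1
        candidates = [i for i in range(n) if i != m and host[i] > 0]
        llh = min(candidates, key=lambda i: lt[m][i])            # 3.2
        chunks.append((m, llh))                                  # 3.3
        lt[m][llh] += 1                                          # 3.4
        lt[llh][m] += 1
        host[m] -= 2                                             # 3.5
        host[llh] -= 2                                           # 3.6
    return chunks
```

- Each chunk consumes 2 nodes on each of its two hosts, so a host contributing 2k nodes ends up in exactly k chunks, which is what keeps the per-host master/slave counts balanced.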
- Proof of convergence: (Prove that sum(host) == 0 finally)
- Lemma 1. For each iteration,
- host[m] <= sum(host[i], i in range(0, n) && i != m),
- where m = max_index(host)
- is always true.
- Proof of Lemma 1:
- According to premise 4, the initial state is true.
- Let host_next list be the next state of host after one iteration,
- p be the index of host linked with m in the new chunk.
- Therefore, we have:
- 1. host_next[m] = host[m] - 2
- 2. host_next[p] = host[p] - 2
- The next iteration state diverges into 2 following conditions:
- 1) Assume m is still the max_index of host_next, which is:
- m == max_index(host_next)
- Left of inequation:
- host_next[m]
- == host[m] - 2
- Right of inequation:
- sum(host_next[i], i in range(0, n) && i != m)
- == sum(<host[i]>, i in range(0, n) && i != m) - 2
- Premise 4:
- host[m] <= sum(<host[i]>, i in range(0, n) && i != m)
- Therefore:
- host[m] - 2 <= sum(<host[i]>, i in range(0, n) && i != m) - 2
- host_next[m] <= sum(<host_next[i]>, i in range(0, n) && i != m)
- 2) Assume q = max_index(host_next), and q != m
- Suppose q == p, then, host_next[q] == host_next[p] == host[p] - 2
- Since,
- 1. host_next[m] == host[m] - 2
- 2. host[m] > host[p] (Use condition 1 for host[m] == host[p])
- Then
- host_next[m] > host_next[p] == host_next[q]
- which is in contradiction to q being the max_index of host_next, and
- therefore
- q != p
- With q != m and q != p, we have
- 1. host_next[q] == host[q] > host_next[m]
- 2. host_next[m] == host[m] - 2
- ==> host[q] > host[m] - 2
- We also know that
- host[i] % 2 == 0 for i in range(0, n) ... Premise 3
- host[m] >= host[q] ... since m == max_index(host)
- We apply the 3 conditions mentioned above together:
- 1. host[i] % 2 == 0 for i in range(0, n)
- 2. host[q] > host[m] - 2
- 3. host[q] <= host[m]
- Then,
- 1. host[m] - 2 < host[q] <= host[m]
- 2. host[q] % 2 == 0
- Therefore,
- host[q] == host[m]
- Left of inequation:
- host_next[q]
- == host[q]
- Right of inequation:
- sum(<host_next[i]>, i in range(0, n) && i != q)
- == sum(<host[j]>) + host[m] - 2 - 2
- == sum(<host[j]>) + host[q] - 4 (since host[q] == host[m])
- , where j in range(0, n) && j != q && j != m
- Therefore, we need to prove that
- sum(<host[j]>) - 4 >= 0
- , where j in range(0, n) && j != q && j != m
- According to premise 2, 3, sum of host[j] results in
- 3 following conditions:
- 2.1) sum(<host[j]>) == 0
- Since
- 1. q != m && q != p
- 2. j != q && j != m
- 3. host[p] >= 2 (host_next[p] = host[p] - 2 >= 0)
- Then,
- sum(<host[j]>) >= host[p] >= 2
- which is in contradiction to sum(<host[j]>) == 0, and therefore
- condition 2.1 is logically impossible.
- 2.2) sum(<host[j]>) == 2
- According to premise 3, non-zero values in host are
- {host[m], host[q], 2}
- Apply premise 1
- sum(host) % 4 == 0
- ==> host[m] + host[q] + 2 == 4*x
- ==> host[m] + host[q] == 4*x - 2
- ==> 2 * host[q] == 4*x - 2 (since host[q] == host[m])
- ==> host[q] == 2*x - 1
- host[q] == 2*x - 1 is in contradiction to premise 3, and therefore
- condition 2.2 is logically impossible.
- 2.3) sum(<host[j]>) >= 4, conforms.
- Therefore,
- sum(<host[j]>) - 4 >= 0
- ==> host[q] <= host[q] + sum(<host[j]>) - 4
- ==> host[q] <= host[m] + sum(<host[j]>) - 4
- ==> host_next[q] <= host_next[m] + sum(<host_next[j]>)
- ==> host_next[q] <= sum(<host_next[i]>)
- , where i in range(0, n) && i != q,
- j in range(0, n) && j != q && j != m
- Since conditions 1) and 2) are both proved, Lemma 1 is proved.
- For each iteration, sum(host_next) == sum(host) - 2 - 2
- Apply Lemma 1. host[m] <= sum(<host[i]>, i in range(0, n) && i != m),
- aka. 2*host[m] <= sum(host) is always true.
- Since sum(host) is monotonically decreasing and 2*host[m] <= sum(host),
- host[m], i.e. the maximum value of host, converges to zero.
- The algorithm converges.
- Q.E.D.
Source: https://juejin.im/entry/5be3e429e51d456fac510b93