When you run a production-grade Kubernetes cluster, regular backups are an essential precaution, however stable the cluster has been so far.
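All of a Kubernetes cluster's state is stored in ETCD, so to keep your production environment safe we recommend backing ETCD up on a regular schedule. The following shows how to back up and restore an Alibaba Cloud Container Service Kubernetes cluster.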
1. How to back up the ETCD data of an Alibaba Cloud Container Service Kubernetes cluster
ETCD runs as three members that keep the same data in sync, so it is enough to take the backup on one master.
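Before running the commands below, make sure kube-apiserver is running on that machine, because the backup command reads the ETCD endpoints from the kube-apiserver command line.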
- ps -ef|grep kube-apiserver
- root 2063 2047 1 Jan05 ? 00:41:01 kube-apiserver
Run the backup command:
- export ETCD_SERVERS=$(ps -ef | grep apiserver | grep -Eo "etcd-servers=.*2379" | awk -F= '{print $NF}')
- mkdir -p /var/lib/etcd_backup/
- ETCDCTL_API=3 etcdctl --endpoints=$ETCD_SERVERS --cacert=/var/lib/etcd/cert/ca.pem --cert=/var/lib/etcd/cert/etcd-client.pem --key=/var/lib/etcd/cert/etcd-client-key.pem snapshot save /var/lib/etcd_backup/backup_$(date "+%Y%m%d%H%M%S").db
- Snapshot saved at /var/lib/etcd_backup/backup_20180107172459.db
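The backup command above passes every endpoint listed in kube-apiserver's --etcd-servers flag to etcdctl. Newer etcdctl releases reject `snapshot save` when more than one endpoint is given, so it can help to pick a single member. The following is a minimal bash sketch. It assumes the flag is written as `--etcd-servers=<comma-separated URLs>` on a single command line. `etcd_servers_from_cmdline` and `first_endpoint` are hypothetical helper names, not part of any tool.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value of --etcd-servers from a
# kube-apiserver command line passed as a single string.
# Assumes the "--etcd-servers=<urls>" spelling matched by the grep above.
etcd_servers_from_cmdline() {
  local -a args
  local arg
  read -r -a args <<< "$1"
  for arg in "${args[@]}"; do
    case "$arg" in
      --etcd-servers=*)
        printf '%s\n' "${arg#--etcd-servers=}"
        return 0
        ;;
    esac
  done
  return 1  # flag not found
}

# Hypothetical helper: keep only the first URL of a comma-separated list.
first_endpoint() {
  printf '%s\n' "${1%%,*}"
}
```

On a master with procps you could then write `ETCD_SERVERS=$(etcd_servers_from_cmdline "$(ps -o args= -C kube-apiserver)")` and pass `--endpoints=$(first_endpoint "$ETCD_SERVERS")` to `etcdctl snapshot save`.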
When the command finishes, the snapshot is in /var/lib/etcd_backup:
- [root@iZwz95q64qi83o88y9lq4cZ etcd_backup]# cd /var/lib/etcd_backup/
- [root@iZwz95q64qi83o88y9lq4cZ etcd_backup]# ls
- backup_20180107172459.db
- [root@iZwz95q64qi83o88y9lq4cZ etcd_backup]# du -sh backup_20180107172459.db
- 8.0M backup_20180107172459.db
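`du -sh` only shows that the file is not empty. A slightly stronger sanity check is to ask `etcdctl snapshot status` to read the file. It reports the snapshot's hash, revision, total key count and size, and it works on the local file, so no endpoints or certificates are needed. The sketch below assumes etcdctl with the v3 API is on the PATH. `check_snapshot` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: fail if the snapshot file is missing or empty,
# otherwise let `etcdctl snapshot status` read it and print a summary table.
check_snapshot() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "missing or empty snapshot: $file" >&2
    return 1
  fi
  # Reads the local file only; the exit status is etcdctl's.
  ETCDCTL_API=3 etcdctl snapshot status "$file" --write-out=table
}
```

For example: `check_snapshot /var/lib/etcd_backup/backup_20180107172459.db`.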
2. Restoring the Kubernetes cluster from the ETCD backup
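2.1 First, stop kube-apiserver on each of the three masters. It runs as a static Pod, so moving its manifest out of /etc/kubernetes/manifests makes the kubelet stop it: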
- mkdir -p /etc/kubernetes/manifests-backups
- mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/manifests-backups/
2.2 Confirm that kube-apiserver has stopped. The following command should print 0:
- ps -ef|grep kube-api|grep -v grep |wc -l
- 0
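After the manifest is moved, the kubelet may need a few seconds to stop the Pod, so the check in 2.2 can still print 1 for a short time. The bash sketch below is a small polling loop that repeats the count until it reaches 0 or a timeout expires. `wait_until_zero` and `count_procs` are hypothetical helpers, and the same loop can be reused for the ETCD check in 2.4.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count processes whose command line contains the
# given pattern, ignoring the grep commands of this pipeline itself.
count_procs() {
  ps -ef | grep -v grep | grep -c -- "$1"
}

# Hypothetical helper: re-run a counting command once per second until it
# prints 0. Returns 1 if it still prints something else after <timeout> seconds.
# Usage: wait_until_zero <timeout-seconds> <command> [args...]
wait_until_zero() {
  local timeout="$1"
  shift
  local elapsed=0
  while [ "$("$@")" != "0" ]; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}
```

For example: `wait_until_zero 60 count_procs kube-apiserver || echo "kube-apiserver is still running"`. The pattern is a plain substring match, so a broad pattern such as `etcd` may also count unrelated processes.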
2.3 Stop the ETCD service on each of the three masters:
- service etcd stop
2.4 Confirm that ETCD has stopped. The following command should print 0:
- ps -ef|grep etcd|grep -v grep|wc -l
- 0
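2.5 On each master, move the existing ETCD data directory out of the way, because the restore in 2.6 refuses to write into a data directory that already exists: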
- mv /var/lib/etcd/data.etcd /var/lib/etcd/data.etcd_bak
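If data.etcd_bak already exists from an earlier attempt, the `mv` above moves data.etcd inside it instead of renaming it. The old data then ends up nested in the earlier backup, and the command does not report an error. The bash sketch below uses a timestamped target instead and refuses to overwrite anything. It assumes the data directory path used in this article, and `move_data_dir_aside` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: rename an ETCD data directory to
# <dir>_bak_<timestamp> and print the new path.
# Fails instead of nesting or overwriting if something is already there.
move_data_dir_aside() {
  local data_dir="$1"
  local target="${data_dir}_bak_$(date +%Y%m%d%H%M%S)"
  if [ ! -d "$data_dir" ]; then
    echo "no data directory at $data_dir" >&2
    return 1
  fi
  if [ -e "$target" ]; then
    echo "refusing to overwrite $target" >&2
    return 1
  fi
  mv -- "$data_dir" "$target" && printf '%s\n' "$target"
}
```

For example: `move_data_dir_aside /var/lib/etcd/data.etcd`.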
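2.6 Restore the data on each node. First copy the snapshot to every master. This example assumes the backup is at /var/lib/etcd_backup/backup_20180107172459.db. Create /var/lib/etcd_backup on any master where it does not exist yet: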
- scp /var/lib/etcd_backup/backup_20180107172459.db root@master1:/var/lib/etcd_backup/
- scp /var/lib/etcd_backup/backup_20180107172459.db root@master2:/var/lib/etcd_backup/
- scp /var/lib/etcd_backup/backup_20180107172459.db root@master3:/var/lib/etcd_backup/
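The scp commands can be wrapped in a loop that also creates the target directory on each master first. Below is a bash sketch. As in the commands above, master1 to master3 are placeholder host names. `run` and `copy_snapshot_to_masters` are hypothetical helpers, and setting `DRY_RUN=1` prints the commands instead of executing them. Pass only the masters that do not already hold the file, so the snapshot is never copied onto itself.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: copy one snapshot to the same directory on each host,
# creating that directory over ssh first. Stops at the first failure.
copy_snapshot_to_masters() {
  local snapshot="$1"
  shift
  local dir host
  dir=$(dirname -- "$snapshot")
  for host in "$@"; do
    run ssh "root@$host" mkdir -p "$dir" || return 1
    run scp "$snapshot" "root@$host:$dir/" || return 1
  done
}
```

For example, from master1: `copy_snapshot_to_masters /var/lib/etcd_backup/backup_20180107172459.db master2 master3`.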
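Then run the restore command on each master. It reads the member name, initial cluster, cluster token and peer URLs from the local etcd.service unit, restores the snapshot into /var/lib/etcd/data.etcd, and gives the etcd user ownership of the directory: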
- set -x
- export ETCD_NAME=$(cat /usr/lib/systemd/system/etcd.service|grep ExecStart|grep -Eo "name.*-name-[0-9].*--client"|awk '{print $2}')
- export ETCD_CLUSTER=$(cat /usr/lib/systemd/system/etcd.service|grep ExecStart|grep -Eo "initial-cluster.*--initial"|awk '{print $2}')
- export ETCD_INITIAL_CLUSTER_TOKEN=$(cat /usr/lib/systemd/system/etcd.service|grep ExecStart|grep -Eo "initial-cluster-token.*"|awk '{print $2}')
- export ETCD_INITIAL_ADVERTISE_PEER_URLS=$(cat /usr/lib/systemd/system/etcd.service|grep ExecStart|grep -Eo "initial-advertise-peer-urls.*--listen-peer"|awk '{print $2}')
- ETCDCTL_API=3 etcdctl snapshot --cacert=/var/lib/etcd/cert/ca.pem --cert=/var/lib/etcd/cert/etcd-client.pem --key=/var/lib/etcd/cert/etcd-client-key.pem restore /var/lib/etcd_backup/backup_20180107172459.db \
- --name $ETCD_NAME \
- --data-dir /var/lib/etcd/data.etcd \
- --initial-cluster $ETCD_CLUSTER \
- --initial-cluster-token $ETCD_INITIAL_CLUSTER_TOKEN \
- --initial-advertise-peer-urls $ETCD_INITIAL_ADVERTISE_PEER_URLS
- chown -R etcd:etcd /var/lib/etcd/data.etcd
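The grep/awk pipelines above depend on the exact order of flags in etcd.service. The bash sketch below is a slightly more tolerant way to read a single flag. It scans the words of the ExecStart line and accepts both `--flag value` and `--flag=value`. It assumes ExecStart is on one line and that flag values contain no spaces. `etcd_unit_flag` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value of one flag from the ExecStart line
# of a systemd unit (default: the etcd.service path used in this article).
# Usage: etcd_unit_flag <flag-name-without-dashes> [unit-file]
etcd_unit_flag() {
  local flag="$1" unit="${2:-/usr/lib/systemd/system/etcd.service}"
  local line i
  local -a words
  line=$(grep -m1 '^ExecStart=' "$unit") || return 1
  read -r -a words <<< "${line#ExecStart=}"
  for (( i = 0; i < ${#words[@]}; i++ )); do
    case "${words[i]}" in
      "--$flag="*) printf '%s\n' "${words[i]#--$flag=}"; return 0 ;;
      "--$flag")   printf '%s\n' "${words[i+1]}"; return 0 ;;
    esac
  done
  return 1  # flag not present
}
```

For example: `export ETCD_NAME=$(etcd_unit_flag name)` or `export ETCD_INITIAL_ADVERTISE_PEER_URLS=$(etcd_unit_flag initial-advertise-peer-urls)`.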
2.7 Start ETCD on each of the three masters and use the service command to confirm that it is running:
- # service etcd start
- # service etcd status
2.8 Check ETCD health. Because kube-apiserver is stopped, the endpoints are read from the manifest saved in 2.1:
- # export ETCD_SERVERS=$(cat /etc/kubernetes/manifests-backups/kube-apiserver.yaml |grep etcd-server|awk -F= '{print $2}')
- ETCDCTL_API=3 etcdctl endpoint health --endpoints=$ETCD_SERVERS --cacert=/var/lib/etcd/cert/ca.pem --cert=/var/lib/etcd/cert/etcd-client.pem --key=/var/lib/etcd/cert/etcd-client-key.pem
- https://192.168.250.198:2379 is healthy: successfully committed proposal: took = 2.238886ms
- https://192.168.250.196:2379 is healthy: successfully committed proposal: took = 3.390819ms
- https://192.168.250.197:2379 is healthy: successfully committed proposal: took = 2.925103ms
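Instead of reading this output by eye, a script can require every member to report healthy before moving on to 2.9. The bash sketch below counts the "is healthy:" lines and compares them with the expected member count. Depending on the etcdctl version, unhealthy members and the final error may be written to stderr, so the sketch assumes you redirect it with `2>&1` when piping. `all_endpoints_healthy` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `etcdctl endpoint health` output on stdin and
# succeed only if there are exactly <expected> non-empty lines and every one
# of them reports "is healthy:".
all_endpoints_healthy() {
  local expected="$1" healthy=0 total=0 line
  while IFS= read -r line; do
    [ -z "$line" ] && continue
    total=$((total + 1))
    case "$line" in
      *" is healthy:"*) healthy=$((healthy + 1)) ;;
    esac
  done
  [ "$total" -eq "$expected" ] && [ "$healthy" -eq "$expected" ]
}
```

For example, append `2>&1 | all_endpoints_healthy 3` to the `etcdctl endpoint health` command above.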
2.9 If ETCD is healthy, restore kube-apiserver on each master by moving its manifest back:
- # mv /etc/kubernetes/manifests-backups/kube-apiserver.yaml /etc/kubernetes/manifests/
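2.10 Check whether the cluster is back to normal. Here the cluster has started normally and the previously deployed application is still there: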
- # kubectl get cs
- NAME STATUS MESSAGE ERROR
- controller-manager Healthy ok
- scheduler Healthy ok
- etcd-0 Healthy {"health": "true"}
- etcd-2 Healthy {"health": "true"}
- etcd-1 Healthy {"health": "true"}
- # kubectl get no
- NAME STATUS ROLES AGE VERSION
- cn-shenzhen.i-wz90xxpi51k2u51t5y0p Ready master 44d v1.8.4
- cn-shenzhen.i-wz93236e8pccdscwz3ha Ready master 44d v1.8.4
- cn-shenzhen.i-wz953xx6qnlzdi6vo2aa Ready <none> 44d v1.8.4
- cn-shenzhen.i-wz953xx6qnlzdi6vo2ab Ready <none> 44d v1.8.4
- # kubectl get deploy
- NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
- nginx 1 1 1 1 23d
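The node list in 2.10 can also be checked by a script. The bash sketch below reads `kubectl get nodes` output and succeeds only when at least one node is listed and every node's STATUS column is exactly Ready. `all_nodes_ready` is a hypothetical helper name. A cordoned node shows `Ready,SchedulingDisabled` and is deliberately counted as not ready here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get nodes` output on stdin (header line
# first) and succeed only if at least one node is listed and every STATUS
# column is exactly "Ready".
all_nodes_ready() {
  awk 'NR > 1 { total++; if ($2 == "Ready") ready++ }
       END { exit !(total > 0 && ready == total) }'
}
```

For example: `kubectl get nodes | all_nodes_ready && echo "all nodes Ready"`.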
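Summary:
Backing up Kubernetes mainly means backing up ETCD. When restoring, the main thing to get right is the order: stop kube-apiserver, stop ETCD, restore the data, start ETCD, then start kube-apiserver. You are welcome to try Alibaba Cloud Container Service for Kubernetes.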
Alibaba Cloud Container Service offers managed Kubernetes clusters. To learn more about Alibaba Cloud Container Service, visit https://www.aliyun.com/product/containerservice
Source: https://yq.aliyun.com/articles/336781