TensorFlow Serving is an open-source machine-learning model serving system from Google that simplifies and accelerates the path from a trained model to a production application. It deploys trained machine-learning models online and accepts external calls over gRPC. Even better, it supports zero-downtime model updates and version management, which greatly reduces the operational complexity for model providers and lets them focus on model optimization.

TensorFlow Serving is, at its core, an online service, so we have to consider installation and configuration at deployment time, as well as load balancing, elastic scaling, high availability, and rolling upgrades at run time. Fortunately, this is exactly what Kubernetes excels at: its built-in automation greatly reduces the operational cost of running a TensorFlow Serving application.

This article shows how to use Helm, the official Kubernetes package manager, to prepare a model, deploy TensorFlow Serving, and manually scale it on Alibaba Cloud Container Service.
1. Prepare the Model
TensorFlow Serving loads prediction models from persistent storage, so the storage has to be prepared first. On Alibaba Cloud Container Service you can choose NAS, OSS, or cloud disks; see the documentation on storage management for Alibaba Cloud Kubernetes for details. This article uses NAS as an example to show how to import the model data.
1.1 Create a NAS file system and add a mount point inside your VPC (see the Alibaba Cloud NAS documentation), then look up the mount point. Here we assume it is:

3fcc94a4ec-rms76.cn-shanghai.nas.aliyuncs.com
1.2 Use an Alibaba Cloud ECS instance to prepare the model data. First mount the NAS root and create the target directory:
```bash
mkdir /nfs
mount -t nfs -o vers=4.0 3fcc94a4ec-rms76.cn-shanghai.nas.aliyuncs.com:/ /nfs
mkdir -p /nfs/serving
umount /nfs
```
1.3 Download the prediction model and save it to NAS:
```bash
mkdir /serving
mount -t nfs -o vers=4.0 3fcc94a4ec-rms76.cn-shanghai.nas.aliyuncs.com:/serving /serving
mkdir -p /serving/model
cd /serving/model
curl -O http://tensorflow-samples.oss-cn-shenzhen.aliyuncs.com/exports/mnist-export.tar.gz
tar -xzvf mnist-export.tar.gz
rm -rf mnist-export.tar.gz
cd /
```
1.4 Now you can inspect the contents of the prediction model directly (the directory named 1 is the model version number that TensorFlow Serving watches); after checking, unmount the file system:
```
tree /serving/model/mnist
/serving/model/mnist
└── 1
    ├── saved_model.pb
    └── variables
        ├── variables.data-00000-of-00001
        └── variables.index

umount /serving
```
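Optionally, if TensorFlow happens to be installed on the ECS instance, you can also inspect the exported signatures with TensorFlow's standard saved_model_cli tool (an extra check, not part of the original walkthrough):

```bash
# Show all MetaGraphs and signatures of exported model version 1
saved_model_cli show --dir /serving/model/mnist/1 --all
```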
2. Create a Persistent Volume
2.1 Below is a sample nas.yaml for creating the NAS persistent volume:
```yaml
---
apiVersion: v1
kind: PersistentVolume
metadata:
  labels:
    model: mnist
  name: pv-nas
spec:
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nas
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 5Gi
  flexVolume:
    driver: alicloud/nas
    options:
      mode: "755"
      path: /serving/model/mnist
      server: 3fcc94a4ec-rms76.cn-shanghai.nas.aliyuncs.com
      vers: "4.0"
```
Note that the label model: mnist and storageClassName: nas must both be specified; they are essential for the PVC to select and bind this PV.
For further NAS-related configuration, see the documentation on using Alibaba Cloud NAS with Kubernetes.
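For reference, here is a minimal sketch of a PersistentVolumeClaim that would bind to the PV above through the label selector. The acs-tensorflow-serving chart in section 3 generates an equivalent claim from its persistence parameters, so the claim name below is purely hypothetical:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pv-nas-claim        # hypothetical name, for illustration only
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nas     # must match the PV's storageClassName
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      model: mnist          # selects the PV labeled model: mnist
```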
2.2 In the Kubernetes management console, go to Persistent Volumes and create the volume.
2.3 After a short wait, you can see that the persistent volume has been created successfully.
Alternatively, you can create it with kubectl:
```
kubectl create -f nas.yaml
persistentvolume "pv-nas" created
```
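As a routine check (not from the original article), you can also confirm the volume's status from the command line:

```bash
kubectl get pv pv-nas
```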
3. Deploy the TensorFlow Serving Application with Helm
3.1 In the application catalog, click acs-tensorflow-serving.
3.2 Click the Parameters tab, adjust the configuration, and click Deploy.
Custom configuration parameters for a GPU-backed deployment:
```yaml
---
serviceType: LoadBalancer
## expose the service to the grpc client
port: 9090
replicas: 1
image: "registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/tensorflow-serving:1.4.0-devel-gpu"
imagePullPolicy: "IfNotPresent"
## the gpu resource to claim, for cpu, change it to 0
gpuCount: 1
## The command and args to run the pod
command: ["/usr/bin/tensorflow_model_server"]
args: ["--port=9090", "--model_name=mnist", "--model_base_path=/serving/model/mnist"]
## the mount path inside the container
mountPath: /serving/model/mnist
persistence:
  ## The request and label to select the persistent volume
  pvc:
    storage: 5Gi
    matchLabels:
      model: mnist
```
Custom configuration parameters for a CPU-only deployment:
```yaml
---
serviceType: LoadBalancer
## expose the service to the grpc client
port: 9090
replicas: 1
command:
  - /usr/bin/tensorflow_model_server
args:
  - "--port=9090"
  - "--model_name=mnist"
  - "--model_base_path=/serving/model/mnist"
image: "registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/tensorflow-serving:1.4.0-devel"
imagePullPolicy: "IfNotPresent"
mountPath: /serving/model/mnist
persistence:
  mountPath: /serving/model/mnist
  pvc:
    matchLabels:
      model: mnist
    storage: 5Gi
```
Alternatively, log in to the Kubernetes master and run the following command, where serving.yaml contains the parameters shown above (the release name matches the mnist-deploy release used in the rest of this article):

```
# helm install --values serving.yaml --name mnist-deploy incubator/acs-tensorflow-serving
```
4. Inspect the Deployed TensorFlow Serving Application
4.1 Log in to the Kubernetes master and list the deployed applications with helm:
```
# helm list
NAME          REVISION  UPDATED                   STATUS    CHART                         NAMESPACE
mnist-deploy  1         Fri Mar 16 19:24:35 2018  DEPLOYED  acs-tensorflow-serving-0.1.0  default
```
4.2 Check the application's configuration with helm status:
```
# helm status mnist-deploy
LAST DEPLOYED: Fri Mar 16 19:24:35 2018
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1/Service
NAME                                 TYPE          CLUSTER-IP    EXTERNAL-IP    PORT(S)         AGE
mnist-deploy-acs-tensorflow-serving  LoadBalancer  172.19.0.219  139.195.1.216  9090:32560/TCP  5h

==> v1beta1/Deployment
NAME                  DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
mnist-deploy-serving  1        1        1           1          5h

==> v1/Pod(related)
NAME                                   READY  STATUS   RESTARTS  AGE
mnist-deploy-serving-665fc69d84-pk9bk  1/1    Running  0         5h
```
The external service address of TensorFlow Serving is EXTERNAL-IP 139.195.1.216, on port 9090.
The corresponding Deployment is mnist-deploy-serving; you will need this name later when scaling.
4.3 Check the logs of the tensorflow-serving pod. You can see that the mnist model has been loaded into memory and that the GPU started up correctly:
```
# kubectl logs mnist-deploy-serving-665fc69d84-pk9bk
2018-03-16 11:28:08.393864: I tensorflow_serving/model_servers/main.cc:147] Building single TensorFlow model file config: model_name: mnist model_base_path: /serving/model/mnist
2018-03-16 11:28:08.394115: I tensorflow_serving/model_servers/server_core.cc:441] Adding/updating models.
2018-03-16 11:28:08.394174: I tensorflow_serving/model_servers/server_core.cc:492] (Re-)adding model: mnist
2018-03-16 11:28:08.504522: I tensorflow_serving/core/basic_manager.cc:705] Successfully reserved resources to load servable {name: mnist version: 1}
2018-03-16 11:28:08.504591: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: mnist version: 1}
2018-03-16 11:28:08.504610: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: mnist version: 1}
2018-03-16 11:28:08.504643: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:360] Attempting to load native SavedModelBundle in bundle-shim from: /serving/model/mnist/1
2018-03-16 11:28:08.504674: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:236] Loading SavedModel from: /serving/model/mnist/1
2018-03-16 11:28:08.703464: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-16 11:28:08.703865: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:08.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-03-16 11:28:08.703899: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2018-03-16 11:28:08.898765: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:155] Restoring SavedModel bundle.
2018-03-16 11:30:26.306194: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running LegacyInitOp on SavedModel bundle.
2018-03-16 11:30:26.309782: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:284] Loading SavedModel: success. Took 137805089 microseconds.
2018-03-16 11:30:26.320057: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: mnist version: 1}
E0316 11:30:26.322709112 1 ev_epoll1_linux.c:1051] grpc epoll fd: 23
2018-03-16 11:30:26.324023: I tensorflow_serving/model_servers/main.cc:288] Running ModelServer at 0.0.0.0:9090 ...
```
5. Using the external address 139.195.1.216 obtained above, run the client program locally to test the service:
```
# docker run -it --rm registry.cn-beijing.aliyuncs.com/tensorflow-samples/tf-mnist:grpcio_upgraded /serving/bazel-bin/tensorflow_serving/example/mnist_client --num_tests=1000 --server=139.195.1.216:9090
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/t10k-labels-idx1-ubyte.gz
................................................................
Inference error rate: 10.4%
```
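To make the gRPC call explicit, here is a minimal Python sketch of a prediction client. It assumes the tensorflow and tensorflow-serving-api packages are installed locally, and that the export uses the standard MNIST example signature (signature name predict_images, input tensor images, output tensor scores); if your export differs, adjust those names accordingly:

```python
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Connect to the TensorFlow Serving endpoint exposed by the LoadBalancer.
channel = grpc.insecure_channel("139.195.1.216:9090")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Build a request against the "mnist" model deployed above. The signature
# and tensor names below follow the standard MNIST export (an assumption).
request = predict_pb2.PredictRequest()
request.model_spec.name = "mnist"
request.model_spec.signature_name = "predict_images"

# A single flattened 28x28 image of zeros, just to exercise the API.
image = np.zeros((1, 784), dtype=np.float32)
request.inputs["images"].CopyFrom(tf.make_tensor_proto(image, shape=[1, 784]))

# Synchronous prediction call; "scores" holds the 10 class probabilities.
response = stub.Predict(request, timeout=10.0)
print(response.outputs["scores"])
```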
6. Scale TensorFlow Serving
Because the helm command cannot scale a release, we use the native kubectl command. It takes two inputs: the target replica count (2) and the Deployment name obtained from helm status:
```
# kubectl scale --replicas 2 deployment/mnist-deploy-serving
deployment "mnist-deploy-serving" scaled
```
Run helm status mnist-deploy again to confirm that there are now 2 TensorFlow Serving instances:
```
# helm status mnist-deploy
LAST DEPLOYED: Fri Mar 16 19:24:35 2018
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1/Service
NAME                                 TYPE          CLUSTER-IP    EXTERNAL-IP    PORT(S)         AGE
mnist-deploy-acs-tensorflow-serving  LoadBalancer  172.19.0.219  139.195.1.216  9090:32560/TCP  5h

==> v1beta1/Deployment
NAME                  DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
mnist-deploy-serving  2        2        2           2          5h

==> v1/Pod(related)
NAME                                   READY  STATUS   RESTARTS  AGE
mnist-deploy-serving-665fc69d84-7sfvn  1/1    Running  0         9m
mnist-deploy-serving-665fc69d84-pk9bk  1/1    Running  0         5h
```
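Manual scaling covers many cases, but Kubernetes can also scale automatically. As a minimal sketch (not part of the original walkthrough), a HorizontalPodAutoscaler could target the same Deployment, assuming the cluster has a metrics source such as Heapster and the pods declare CPU requests; the name and thresholds below are illustrative:

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: mnist-serving-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: extensions/v1beta1 # apps/v1 on newer clusters
    kind: Deployment
    name: mnist-deploy-serving
  minReplicas: 2
  maxReplicas: 5
  # Scale out when average CPU utilization across pods exceeds 80%.
  targetCPUUtilizationPercentage: 80
```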
Summary
This article showed how Alibaba Cloud Container Service for Kubernetes lets you use TensorFlow Serving out of the box, with one-command scaling, to unlock the full power of deep learning. At the same time, Alibaba Cloud Kubernetes provides rich infrastructure for deep learning, from elastic compute and load balancing to object storage, logging, and monitoring. Combining the two lets data scientists focus on the model itself without spending excessive effort on application operations.

The Alibaba Cloud Container Service team will continue to invest in simple, easy-to-use GPU acceleration and deep-learning solutions, further improving the efficiency of deep-learning training and inference in the cloud.
Source: https://yq.aliyun.com/articles/553659