Over the last two years, I've worked with a number of teams to deploy their applications leveraging Kubernetes. Getting developers up to speed with Kubernetes jargon can be challenging, so when a Deployment fails, I'm usually paged to figure out what went wrong.
One of my primary goals when working with a client is to automate & educate myself out of that job, so I try to give developers the tools necessary to debug failed deployments. I've catalogued the most common reasons Kubernetes Deployments fail, and I'm sharing my troubleshooting playbook with you!
Without further ado, here are the 10 most common reasons Kubernetes Deployments fail:
Two of the most common problems are (a) having the wrong container image specified and (b) trying to use private images without providing registry credentials. These are especially tricky when starting to work with Kubernetes or wiring up CI/CD for the first time.
Let's see an example. First, we'll create a deployment named `fail` pointing to a non-existent Docker image:

```
$ kubectl run fail --image=rosskukulinski/dne:v1.0.0
```
We can then inspect our Pods and see that we have one Pod with a status of `ErrImagePull` or `ImagePullBackOff`:

```
$ kubectl get pods
NAME                    READY     STATUS             RESTARTS   AGE
fail-1036623984-hxoas   0/1       ImagePullBackOff   0          2m
```
For some additional information, we can `describe` the failing Pod:

```
$ kubectl describe pod fail-1036623984-hxoas
```

If we look in the `Events` section of the output of the `describe` command, we will see something like:

```
Events:
  FirstSeen  LastSeen  Count  From                                             SubObjectPath          Type     Reason      Message
  ---------  --------  -----  ----                                             -------------          ----     ------      -------
  5m         5m        1      {default-scheduler }                                                    Normal   Scheduled   Successfully assigned fail-1036623984-hxoas to gke-nrhk-1-default-pool-a101b974-wfp7
  5m         2m        5      {kubelet gke-nrhk-1-default-pool-a101b974-wfp7}  spec.containers{fail}  Normal   Pulling     pulling image "rosskukulinski/dne:v1.0.0"
  5m         2m        5      {kubelet gke-nrhk-1-default-pool-a101b974-wfp7}  spec.containers{fail}  Warning  Failed      Failed to pull image "rosskukulinski/dne:v1.0.0": Error: image rosskukulinski/dne not found
  5m         2m        5      {kubelet gke-nrhk-1-default-pool-a101b974-wfp7}                         Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "fail" with ErrImagePull: "Error: image rosskukulinski/dne not found"
  5m         11s       19     {kubelet gke-nrhk-1-default-pool-a101b974-wfp7}  spec.containers{fail}  Normal   BackOff     Back-off pulling image "rosskukulinski/dne:v1.0.0"
  5m         11s       19     {kubelet gke-nrhk-1-default-pool-a101b974-wfp7}                         Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "fail" with ImagePullBackOff: "Back-off pulling image \"rosskukulinski/dne:v1.0.0\""
```
The error string, `Failed to pull image "rosskukulinski/dne:v1.0.0": Error: image rosskukulinski/dne not found`, tells us that Kubernetes was not able to find the image `rosskukulinski/dne:v1.0.0`.
So then the question is: why couldn't Kubernetes pull the image?
There are three primary culprits besides network connectivity issues:

1. The image tag is incorrect
2. The image doesn't exist (or is in a different registry)
3. Kubernetes doesn't have the permissions needed to pull that image

If you don't notice a typo in your image tag, then it's time to test using your local machine.
I usually start by running `docker pull` on my local development machine with the exact same image tag. In this case, I would run `docker pull rosskukulinski/dne:v1.0.0`.

If that fails, I'll try pulling without an exact tag: `docker pull rosskukulinski/dne`, which will attempt to pull the `latest` tag. If this succeeds, then that means the original tag specified doesn't exist. This could be due to human error, typo, or maybe a misconfiguration of the CI/CD system.
If `docker pull rosskukulinski/dne` (without an exact tag) fails, then we have a bigger problem: that image does not exist at all in our image registry. By default, Kubernetes uses the Dockerhub registry. If you're using Quay.io, AWS ECR, or Google Container Registry, you'll need to specify the registry URL in the image string. For example, on Quay, the image would be `quay.io/rosskukulinski/dne:v1.0.0`.
If you are using Dockerhub, then you should double check the system that is publishing images to the registry. Make sure the name & tag match what your Deployment is trying to use.
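As a quicker check than pulling the full image, `docker manifest inspect` asks the registry whether a tag exists without downloading any layers (on older Docker versions this command may require enabling experimental CLI features):

```
$ docker manifest inspect rosskukulinski/dne:v1.0.0
```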
Note: There is no observable difference in Pod status between a missing image and incorrect registry permissions. In either case, Kubernetes will report an `ErrImagePull` status for the Pods.
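If the failure turns out to be registry permissions rather than a missing image, the Pod spec needs an `imagePullSecret`. Here's a minimal sketch; the secret name `my-registry-creds` and the Quay registry are assumptions, not part of the example above:

```yaml
# The secret would be created beforehand with something like:
#   kubectl create secret docker-registry my-registry-creds \
#     --docker-server=quay.io --docker-username=<user> \
#     --docker-password=<password> --docker-email=<email>
spec:
  containers:
    - name: app
      image: quay.io/rosskukulinski/dne:v1.0.0
  imagePullSecrets:
    - name: my-registry-creds    # hypothetical secret name
```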
Whether you're launching a new application on Kubernetes or migrating an existing platform, having the application crash on startup is a common occurrence.
Let's create a new Deployment with an application that crashes after 1 second:

```
$ kubectl run crasher --image=rosskukulinski/crashing-app
```
Then let's take a look at the status of our Pods:

```
$ kubectl get pods
NAME                       READY     STATUS             RESTARTS   AGE
crasher-2443551393-vuehs   0/1       CrashLoopBackOff   2          54s
```
Ok, so `CrashLoopBackOff` tells us that Kubernetes is trying to launch this Pod, but one or more of the containers is crashing or getting killed.
Let's `describe` the pod to get some more information:

```
$ kubectl describe pod crasher-2443551393-vuehs
Name:           crasher-2443551393-vuehs
Namespace:      fail
Node:           gke-nrhk-1-default-pool-a101b974-wfp7/10.142.0.2
Start Time:     Fri, 10 Feb 2017 14:20:29 -0500
Labels:         pod-template-hash=2443551393
                run=crasher
Status:         Running
IP:             10.0.0.74
Controllers:    ReplicaSet/crasher-2443551393
Containers:
  crasher:
    Container ID:       docker://51c940ab32016e6d6b5ed28075357661fef3282cb3569117b0f815a199d01c60
    Image:              rosskukulinski/crashing-app
    Image ID:           docker://sha256:cf7452191b34d7797a07403d47a1ccf5254741d4bb356577b8a5de40864653a5
    Port:
    State:              Terminated
      Reason:           Error
      Exit Code:        1
      Started:          Fri, 10 Feb 2017 14:22:24 -0500
      Finished:         Fri, 10 Feb 2017 14:22:26 -0500
    Last State:         Terminated
      Reason:           Error
      Exit Code:        1
      Started:          Fri, 10 Feb 2017 14:21:39 -0500
      Finished:         Fri, 10 Feb 2017 14:21:40 -0500
    Ready:              False
    Restart Count:      4
...
```
Awesome! Kubernetes is telling us that this Pod is being `Terminated` due to the application inside the container crashing. Specifically, we can see that the application `Exit Code` is `1`. We might also see an `OOMKilled` error, but we'll get to that later.
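Those exit codes come straight from the container's main process, and we can reproduce them locally with plain `sh`, no cluster required:

```shell
# Exit code 1: a generic application error, like our crasher reports.
sh -c 'exit 1'
echo "exit code: $?"    # prints: exit code: 1

# Exit code 137 = 128 + 9 (SIGKILL), which is what an OOMKilled
# container reports, since the kernel kills it with SIGKILL.
sh -c 'kill -9 $$'
echo "exit code: $?"    # prints: exit code: 137
```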
So our application is crashing ... why?
The first thing we can do is check our application logs. Assuming you are sending your application logs to `stdout` (which you should be!), you can see the application logs using `kubectl logs`:

```
$ kubectl logs crasher-2443551393-vuehs
```
Unfortunately, this Pod doesn't seem to have any log data. It's possible we're looking at a newly-restarted instance of the application, so we should check the previous container:

```
$ kubectl logs crasher-2443551393-vuehs --previous
```
Rats! Our application still isn't giving us anything to work with. It's probably time to add some additional log messages on startup to help debug the issue. We might also want to try running the container locally to see if there are missing environmental variables or mounted volumes.
Kubernetes best practices recommend passing application run-time configuration via ConfigMaps or Secrets. This data could include database credentials, API endpoints, or other configuration flags.
A common mistake that I've seen developers make is to create Deployments that reference properties of ConfigMaps or Secrets that don't exist, or that reference ConfigMaps/Secrets that don't exist at all.
Let's see what that might look like.
For our first example, we're going to try to create a Pod that loads ConfigMap data as environmental variables.
```yaml
# configmap-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: configmap-pod
spec:
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      command: [ "/bin/sh", "-c", "env" ]
      env:
        - name: SPECIAL_LEVEL_KEY
          valueFrom:
            configMapKeyRef:
              name: special-config
              key: special.how
```
Let's create the Pod with `kubectl create -f configmap-pod.yaml`. After waiting a few minutes, we can peek at our pods:

```
$ kubectl get pods
NAME            READY     STATUS              RESTARTS   AGE
configmap-pod   0/1       RunContainerError   0          3s
```
Our Pod's status says `RunContainerError`. We can use `kubectl describe` to learn more:

```
$ kubectl describe pod configmap-pod
[...]
Events:
  FirstSeen  LastSeen  Count  From                                       SubObjectPath                    Type     Reason      Message
  ---------  --------  -----  ----                                       -------------                    ----     ------      -------
  20s        20s       1      {default-scheduler }                                                        Normal   Scheduled   Successfully assigned configmap-pod to gke-ctm-1-sysdig2-35e99c16-tgfm
  19s        2s        3      {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm}  spec.containers{test-container}  Normal   Pulling     pulling image "gcr.io/google_containers/busybox"
  18s        2s        3      {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm}  spec.containers{test-container}  Normal   Pulled      Successfully pulled image "gcr.io/google_containers/busybox"
  18s        2s        3      {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm}                                   Warning  FailedSync  Error syncing pod, skipping: failed to "StartContainer" for "test-container" with RunContainerError: "GenerateRunContainerOptions: configmaps \"special-config\" not found"
```
The last item in the `Events` section explains what went wrong. The Pod is attempting to access a ConfigMap named `special-config`, but it's not found in this namespace. Once we create the ConfigMap, the Pod should restart and pull in the runtime data.
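A minimal sketch of the fix (only the ConfigMap name `special-config` and the key `special.how` come from the manifest above; the value `very` is a placeholder):

```
$ kubectl create configmap special-config --from-literal=special.how=very
```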
Accessing Secrets as environmental variables within your Pod specification will result in similar errors, like we've seen here with ConfigMaps.
But what if you're accessing a Secret or a ConfigMap via a volume?

Here's a Pod spec that references a Secret named `myothersecret` and attempts to mount it as a volume:
```yaml
# missing-secret.yaml
apiVersion: v1
kind: Pod
metadata:
  name: secret-pod
spec:
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      command: [ "/bin/sh", "-c", "env" ]
      volumeMounts:
        - mountPath: /etc/secret/
          name: myothersecret
  restartPolicy: Never
  volumes:
    - name: myothersecret
      secret:
        secretName: myothersecret
```
Let's create this Pod with `kubectl create -f missing-secret.yaml`.

After a few minutes, when we get our Pods, we'll see that it is still in the `ContainerCreating` state:

```
$ kubectl get pods
NAME         READY     STATUS              RESTARTS   AGE
secret-pod   0/1       ContainerCreating   0          4h
```
That's odd ... let's `describe` the Pod to see what's going on:

```
$ kubectl describe pod secret-pod
Name:           secret-pod
Namespace:      fail
Node:           gke-ctm-1-sysdig2-35e99c16-tgfm/10.128.0.2
Start Time:     Sat, 11 Feb 2017 14:07:13 -0500
Labels:
Status:         Pending
IP:
Controllers:
[...]
Events:
  FirstSeen  LastSeen  Count  From                                       SubObjectPath  Type     Reason       Message
  ---------  --------  -----  ----                                       -------------  ----     ------       -------
  18s        18s       1      {default-scheduler }                                      Normal   Scheduled    Successfully assigned secret-pod to gke-ctm-1-sysdig2-35e99c16-tgfm
  18s        2s        6      {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm}                 Warning  FailedMount  MountVolume.SetUp failed for volume "kubernetes.io/secret/337281e7-f065-11e6-bd01-42010af0012c-myothersecret" (spec.Name: "myothersecret") pod "337281e7-f065-11e6-bd01-42010af0012c" (UID: "337281e7-f065-11e6-bd01-42010af0012c") with: secrets "myothersecret" not found
```
Once again, the `Events` section explains the problem. It's telling us that the Kubelet failed to mount a volume from the Secret `myothersecret`. To fix this problem, create `myothersecret` containing the necessary secure credentials. Once `myothersecret` has been created, the container will start correctly.
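A minimal sketch of that fix (only the Secret name `myothersecret` comes from the Pod spec above; the key and value are placeholders):

```
$ kubectl create secret generic myothersecret --from-literal=api-key=abc123
```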
An important lesson for developers to learn when working with containers and Kubernetes is that just because your application container is running, doesn't mean that it's working.
Kubernetes provides two essential features called Liveness Probes and Readiness Probes. Essentially, Liveness/Readiness Probes will periodically perform an action (e.g. make an HTTP request, open a TCP connection, or run a command in your container) to confirm that your application is working as intended.
If the Liveness Probe fails, Kubernetes will kill your container and create a new one. If the Readiness Probe fails, that Pod will not be available as a Service endpoint, meaning no traffic will be sent to that Pod until it becomes `Ready`.
If you attempt to deploy a change to your application that fails the Liveness/Readiness Probe, the rolling deploy will hang as it waits for all of your Pods to become Ready.
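When a rolling deploy hangs like this, you can watch it in real time with `kubectl rollout status` (the Deployment name here is hypothetical):

```
$ kubectl rollout status deployment/my-app
```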
So what does this look like? Here's a Pod spec that defines a Liveness & Readiness Probe that checks for a healthy HTTP response from `/healthz` on port 8080:
```yaml
# liveness.yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-pod
spec:
  containers:
    - name: test-container
      image: rosskukulinski/leaking-app
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 3
        periodSeconds: 3
      readinessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 3
        periodSeconds: 3
```
Let's create this Pod with `kubectl create -f liveness.yaml`, and then see what happens after a few minutes:

```
$ kubectl get pods
NAME           READY     STATUS    RESTARTS   AGE
liveness-pod   0/1       Running   4          2m
```
After 2 minutes, we can see that our Pod is still not "Ready", and it has been restarted four times. Let's `describe` the Pod for more information:

```
$ kubectl describe pod liveness-pod
Name:           liveness-pod
Namespace:      fail
Node:           gke-ctm-1-sysdig2-35e99c16-tgfm/10.128.0.2
Start Time:     Sat, 11 Feb 2017 14:32:36 -0500
Labels:
Status:         Running
IP:             10.108.88.40
Controllers:
Containers:
  test-container:
    Container ID:       docker://8fa6f99e6fda6e56221683249bae322ed864d686965dc44acffda6f7cf186c7b
    Image:              rosskukulinski/leaking-app
    Image ID:           docker://sha256:7bba8c34dad4ea155420f856cd8de37ba9026048bd81f3a25d222fd1d53da8b7
    Port:
    State:              Running
      Started:          Sat, 11 Feb 2017 14:40:34 -0500
    Last State:         Terminated
      Reason:           Error
      Exit Code:        137
      Started:          Sat, 11 Feb 2017 14:37:10 -0500
      Finished:         Sat, 11 Feb 2017 14:37:45 -0500
[...]
Events:
  FirstSeen  LastSeen  Count  From                                       SubObjectPath                    Type     Reason     Message
  ---------  --------  -----  ----                                       -------------                    ----     ------     -------
  8m         8m        1      {default-scheduler }                                                        Normal   Scheduled  Successfully assigned liveness-pod to gke-ctm-1-sysdig2-35e99c16-tgfm
  8m         8m        1      {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm}  spec.containers{test-container}  Normal   Created    Created container with docker id 0fb5f1a56ea0; Security:[seccomp=unconfined]
  8m         8m        1      {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm}  spec.containers{test-container}  Normal   Started    Started container with docker id 0fb5f1a56ea0
  7m         7m        1      {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm}  spec.containers{test-container}  Normal   Created    Created container with docker id 3f2392e9ead9; Security:[seccomp=unconfined]
  7m         7m        1      {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm}  spec.containers{test-container}  Normal   Killing    Killing container with docker id 0fb5f1a56ea0: pod "liveness-pod_fail(d75469d8-f090-11e6-bd01-42010af0012c)" container "test-container" is unhealthy, it will be killed and re-created.
  8m         16s       10     {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm}  spec.containers{test-container}  Warning  Unhealthy  Liveness probe failed: Get http://10.108.88.40:8080/healthz: dial tcp 10.108.88.40:8080: getsockopt: connection refused
  8m         1s        85     {kubelet gke-ctm-1-sysdig2-35e99c16-tgfm}  spec.containers{test-container}  Warning  Unhealthy  Readiness probe failed: Get http://10.108.88.40:8080/healthz: dial tcp 10.108.88.40:8080: getsockopt: connection refused
```
Once again, the `Events` section comes to the rescue. We can see that the Readiness and Liveness probes are both failing. The key string to look for is `container "test-container" is unhealthy, it will be killed and re-created`. This tells us that Kubernetes is killing the container because the Liveness Probe has failed.
There are likely three possibilities:

1. Your probes are incorrect - did the healthcheck path or port change?
2. Your probes are too sensitive - does your application take a while to start or to respond?
3. Your application is no longer responding correctly to the probe - is a database or other dependency unreachable?
Looking at the logs from your Pod is a good place to start debugging. Once you resolve this issue, a fresh Deployment should succeed.
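If a slow-starting application turns out to be the culprit, the gentlest fix is to give the probes more headroom. A sketch of the relevant fragment (the 30-second delay and failure threshold are assumptions for a slow-starting app, not values from the example above):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # allow time for a slow startup
  periodSeconds: 5
  failureThreshold: 3       # require three consecutive failures before a restart
```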
Kubernetes gives cluster administrators the ability to limit the amount of CPU or memory allocated to Pods and Containers. As an application developer, you might not know about the limits and then be surprised when your Deployment fails.
Let's attempt to create this Deployment in a cluster with an unknown CPU/Memory request limit:
```yaml
# gateway.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: gateway
spec:
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: test-container
          image: nginx
          resources:
            requests:
              memory: 5Gi
```
You'll notice that we're setting a resource request of 5Gi of memory. Let's create the deployment with `kubectl create -f gateway.yaml`.

Now we can look at our Pods:

```
$ kubectl get pods
No resources found.
```
Huh? Let's inspect our Deployment using `describe`:

```
$ kubectl describe deployment/gateway
Name:                   gateway
Namespace:              fail
CreationTimestamp:      Sat, 11 Feb 2017 15:03:34 -0500
Labels:                 app=gateway
Selector:               app=gateway
Replicas:               0 updated | 1 total | 0 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  0 max unavailable, 1 max surge
OldReplicaSets:
NewReplicaSet:          gateway-764140025 (0/1 replicas created)
Events:
  FirstSeen  LastSeen  Count  From                      SubObjectPath  Type    Reason             Message
  ---------  --------  -----  ----                      -------------  ----    ------             -------
  4m         4m        1      {deployment-controller }                 Normal  ScalingReplicaSet  Scaled up replica set gateway-764140025 to 1
```
Based on that last line, our deployment created a ReplicaSet (`gateway-764140025`) and scaled it up to 1. The ReplicaSet is the entity that manages the lifecycle of the Pods. We can `describe` the ReplicaSet:

```
$ kubectl describe rs/gateway-764140025
Name:           gateway-764140025
Namespace:      fail
Image(s):       nginx
Selector:       app=gateway,pod-template-hash=764140025
Labels:         app=gateway
                pod-template-hash=764140025
Replicas:       0 current / 1 desired
Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
No volumes.
Events:
  FirstSeen  LastSeen  Count  From                      SubObjectPath  Type     Reason        Message
  ---------  --------  -----  ----                      -------------  ----     ------        -------
  6m         28s       15     {replicaset-controller }                 Warning  FailedCreate  Error creating: pods "gateway-764140025-" is forbidden: [maximum memory usage per Pod is 100Mi, but request is 5368709120., maximum memory usage per Container is 100Mi, but request is 5Gi.]
```
Ahh! There we go. The cluster administrator has set a maximum memory usage per Pod of `100Mi` (what a cheapskate!). You can inspect the current namespace limits by running `kubectl describe limitrange`.

You now have three choices:

1. Ask your cluster administrator to increase the limits
2. Reduce the Request or Limit settings for your Deployment
3. Go rogue and edit the limits yourself (`kubectl edit` FTW!)
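If you take the second option, the fix is a small change to the manifest above, shrinking the request to fit under the namespace maximum (64Mi here is an arbitrary choice under the 100Mi cap):

```yaml
resources:
  requests:
    memory: 64Mi   # fits under the 100Mi per-Pod maximum
```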
And that's the first 5 most common reasons Kubernetes Deployments fail. Click here for Part 2 which has #6-10.
Source: https://juejin.im/entry/5a0978fdf265da4304061f57