Kubernetes Common Troubleshooting and Fixes

I. Troubleshooting commands and methods

kubectl get pods
kubectl describe pods my-pod
kubectl logs my-pod
kubectl exec -it my-pod -- /bin/bash (then troubleshoot from inside the container)
Check the log files on the host:
- /var/log/pods/*
- /var/log/containers/*

II. Pod troubleshooting

1. How to check

$ kubectl get pods -n namespace

The STATUS column of the output shows the state of each pod.

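2. Check STATUS

The status values you may see are: Running, Succeeded, Waiting, ContainerCreating, Failed, Pending, Terminating, Unknown, CrashLoopBackOff, ErrImagePull, ImagePullBackOff

[Table: status definitions]

If a pod is in an abnormal state, inspect its details and logs:

$ kubectl describe pod <pod-name> -n namespace

Check the State field in the output.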

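3. Check Conditions

True means the condition is met; False means it is not.

Initialized: the pod's init containers have completed
Ready: the pod can serve requests
ContainersReady: all containers in the pod are ready
PodScheduled: the scheduler has found a suitable node, bound the pod to it, and recorded the binding in etcd
Unschedulable: the pod cannot be scheduled because no suitable node was found (reported as the reason when PodScheduled is False)

If any condition is False, check the Events section.

If Reason shows Unhealthy, read the error message that follows carefully and fix the specific cause.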
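The Conditions and Events checks above can be combined into one step. The following is a minimal sketch, assuming kubectl is already configured for the cluster; the function name pod_triage and its arguments are hypothetical and not part of kubectl.

```shell
# Hypothetical helper: print each condition of a pod as TYPE=STATUS,
# then list the events recorded for that pod, sorted by last timestamp.
pod_triage() {
  pod="$1"
  ns="${2:-default}"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}
```

Usage: pod_triage my-pod my-namespace. Any condition printed with =False points at the events to read first.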

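4. Common Events errors and fixes

a. Failed to pull image "xxx": Error: image xxx not found

Cause: the image pull failed because the image could not be found.

Fix: point to a reachable image address with the correct tag. If you have not logged in to the registry, log in first. If K8s has no permission to pull the image, grant the permission and pull again.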
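When the pull fails for lack of credentials, one common arrangement is to store the registry credentials in a Secret and reference it from the pod. A minimal sketch follows; the names regcred, private-image-demo and registry.example.com/team/app:1.0 are placeholders chosen for illustration, not values from this article.

```shell
# Write a pod manifest that pulls from a private registry using a
# pre-created docker-registry Secret (assumed here to be named "regcred").
cat > private-image-demo.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: private-image-demo
spec:
  containers:
  - name: app
    image: registry.example.com/team/app:1.0
  imagePullSecrets:
  - name: regcred
EOF

# Apply it against a real cluster:
#   kubectl apply -f private-image-demo.yaml -n namespace
```

The Secret must live in the same namespace as the pod, otherwise the pull still fails.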

b. Warning FailedSync Error syncing pod, skipping: failed to with RunContainerError: "GenerateRunContainerOptions: XXX not found"

Cause: the name XXX referenced by this pod cannot be found in the namespace.

Fix: recreate the pod: kubectl replace --force -f pod.yaml

c. Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "XXX" with RunContainerError: "GenerateRunContainerOptions: configmaps "XXX" not found"

Cause: no ConfigMap named XXX exists in the namespace.

Fix: create the ConfigMap again: kubectl create -f configmap.yaml

d. Warning FailedMount MountVolume.SetUp failed for volume "kubernetes.io/secret/ " (spec.Name: "XXXsecret") pod with: secrets "XXXsecret" not found

Cause: the Secret is missing.

Fix: create the Secret, for example: kubectl create secret docker-registry <secret-name> --docker-server=<registry-url> --docker-username=xxx --docker-password=xxx -n namespace. If you then modify the YAML file, run kubectl apply -f pod.yaml to recreate the pod; the change only takes effect after that.

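e. Normal Killing Killing container with docker id XXX: pod "XXX" container "XXX" is unhealthy, it will be killed and re-created.

The container's liveness probe failed, and Kubernetes is killing the faulty container.

Cause: the probe is misconfigured, the health check URL is wrong, or the application is not responding.

Fix: correct the health check in the YAML file and increase values such as periodSeconds.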
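To make the probe settings concrete, here is a minimal sketch of a pod with a more tolerant liveness probe. The pod name, the nginx image, the path / and the specific numbers are illustrative assumptions; pick values that match how long your application takes to start and respond.

```shell
# Write an example manifest with relaxed liveness probe timings.
cat > liveness-demo.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: liveness-demo
spec:
  containers:
  - name: app
    image: nginx
    livenessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 30   # wait before the first probe
      periodSeconds: 20         # probe interval (default is 10)
      timeoutSeconds: 5         # per-probe timeout (default is 1)
      failureThreshold: 3       # consecutive failures before a restart
EOF

# Apply it against a real cluster:
#   kubectl apply -f liveness-demo.yaml
```

Raising periodSeconds alone only slows the restarts down; if the URL or port is wrong, the probe will keep failing until httpGet is corrected.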

f. Warning FailedCreate Error creating: pods "XXXX" is forbidden:[maximum memory usage per Pod is XXX, but request is XXX, maximum memory usage per Container is XXX, but request is XXX.]

Cause: the memory limit configured in K8s is lower than the amount the pod requests.

Fix: raise the K8s memory limit, or reduce the pod's memory request.

g. pod (XXX) failed to fit in any node fit failure on node (XXX): Insufficient cpu

Cause: the node does not have enough CPU available for the pod.

Fix: reduce the pod's CPU request in the YAML file.
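Both the memory and the CPU errors above come down to the resources section of the container spec. A minimal sketch, with a hypothetical pod name and illustrative numbers:

```shell
# Write an example manifest with explicit CPU and memory requests/limits.
cat > resources-demo.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: resources-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:        # what the scheduler reserves on a node
        cpu: 250m
        memory: 128Mi
      limits:          # ceiling enforced at runtime
        cpu: 500m
        memory: 256Mi
EOF

# Compare against what the cluster allows (needs a real cluster):
#   kubectl describe limitrange -n namespace
#   kubectl describe node <node-name>    # see "Allocated resources"
```

The scheduler places pods by their requests, so lowering requests.cpu is what resolves "Insufficient cpu"; the "maximum memory usage" message is checked against the namespace's LimitRange.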

h. FailedMount Unable to mount volumes for pod "XXX": timeout expired waiting for volumes to attach/mount for pod "XXX"/"fail". list of unattached/unmounted volumes=XXX FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "XXX"/"fail". list of unattached/unmounted volumes=XXX

Cause: mounting a volume for pod XXX failed.

Fix: check that the volume has been created and that the volume's mountPath is correct. Create the volume and mount it through the YAML file.

i. FailedMount Failed to attach volume "XXX" on node "XXX" with: GCE persistent disk not found: diskName="XXX disk" zone=""

Fix: check that the persistent disk was created correctly. The persistent volume can be declared in a YAML file.
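As an illustration of declaring such a volume in YAML, here is a minimal sketch of a PersistentVolume backed by a GCE persistent disk. The names gce-pv-demo and my-data-disk and the 10Gi size are assumptions. It uses the in-tree gcePersistentDisk volume type that matches the error message above; newer Kubernetes releases have deprecated that type in favour of the CSI driver, so treat this as applicable to older clusters.

```shell
# Write an example PersistentVolume that points at an existing GCE disk.
cat > gce-pv-demo.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: gce-pv-demo
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  gcePersistentDisk:
    pdName: my-data-disk   # must match the name of an existing GCE disk
    fsType: ext4
EOF

# Apply it against a real cluster:
#   kubectl apply -f gce-pv-demo.yaml
```

The disk itself has to exist in the same zone as the node before the pod is scheduled; the manifest only references it.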

j. error: error validating "XXX.yaml": error validating data: found invalid field resources for PodSpec; if you choose to ignore these errors, turn validation off with --validate=false

Cause: the YAML file is invalid, usually because of extra or missing spaces in the indentation.

Fix: validate the YAML, for example with the kubeval tool.

k. The container image is not updated

Fix: force a pull on every start by setting imagePullPolicy: Always in the Deployment.

l. (combined from similar events): Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with: Number of node(s) with BGP peering established = 0

Cause: Calico networking on the affected node is down.

Fix: check that the Calico images were pulled successfully and that the calico-node container started normally. If both the images and the container are fine, reset K8s on that node and rejoin it to the cluster.

$ kubeadm reset

$ kubeadm join ip:6443 --token XXXXX.XXXXXXXXX --discovery-token-ca-cert-hash sha256:XXXXXXXXXXXXXXXXXXX

m. RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed pulling image "gcr.io/google_containers/pause-amd64:": Get https://gcr.io/v1/_ping: dial tcp :443: i/o timeout

Cause: gcr.io is blocked by the GFW and cannot be reached.

Fix: pull the image from an available mirror, such as Alibaba Cloud or googlecontainer, then docker tag it as gcr.io/google_containers/pause-amd64.

n. Warning FailedCreatePodSandBox 3m (x13 over 3m) kubelet, Failed create pod sandbox

Running journalctl -xe | grep cni shows: failed to find plugin "loopback" in path [/opt/loopback/bin /usr/local/bin]

Fix: copy the loopback plugin binary into /usr/local/bin.

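III. Node troubleshooting

$ kubectl get nodes

Check the node status: STATUS Ready is normal, NotReady is not.

Note that the version must be the same on all nodes. If a node is NotReady, restart kubelet on that node, or restart docker. If that does not help, reset the node and join it to the cluster again. To see node details, run kubectl describe node <node-name>. If it reports "node ip" not found, check whether the node IP responds to ping; the usual cause is that the node IP or the VIP is down.

The node errors collected so far, with their fixes: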

1. The connection to the server localhost:8080 was refused - did you specify the right host or port?

This error appears when running kubectl get XXX, for example:

$ kubectl get nodes

Cause: the node has no admin.conf.

Fix: copy admin.conf from the master to the node, then on the node run echo "export KUBECONFIG=/etc/kubernetes/admin.conf" >> ~/.bash_profile and reload the profile with source ~/.bash_profile.

2. Kubernetes NodePort is not reachable

Cause: usually iptables or SELinux.

Fix: switch SELinux to permissive mode and flush the iptables rules.

$ setenforce 0
$ iptables --flush
$ iptables -t nat --flush
$ service docker restart
$ iptables -P FORWARD ACCEPT

Then restart docker once more.

3. Failed to start inotify_add_watch /sys/fs/cgroup/blkio: no space left on device, or Failed to start inotify_add_watch /sys/fs/cgroup/cpu,cpuacct: no space left on device

Cause: the disk is full, or a kernel parameter is set too low.

Fix: check whether any filesystem is 100% full. Check the current watch limit with cat /proc/sys/fs/inotify/max_user_watches and raise it: sysctl fs.inotify.max_user_watches=1048576

4. Failed to start reboot.target: Connection timed out

Cause: unknown; the reboot reports a timeout.

Fix: run systemctl --force --force reboot

5. System OOM encountered

Cause: memory usage exceeded the limit, so containers may be OOMKilled.

Fix: adjust the memory settings and allocate memory sensibly.

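6. Unable to register node "" with API server: Post https://localhost:6443/api/v1/nodes: dial tcp 127.0.0.1:6443: getsockopt: connection refused

Cause: the node cannot reach the master, or the master refuses the connection.

Fix: restart kubelet on the node. If it does not recover, check CPU, memory, disk and other resources on the node.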

7. Pod stuck in Terminating

ContainerGCFailed rpc error: code = DeadlineExceeded desc = context deadline exceeded

Cause: possibly a bug in dockerd version 17.

Fix:

$ systemctl daemon-reexec
$ systemctl restart docker

If it does not recover, upgrade docker to version 18.

8. Container runtime is down,PLEG is not healthy: pleg was last seen active 10m ago; threshold is 3m0s

Cause: the Pod Lifecycle Event Generator (PLEG) timed out. Either the container runtime responded too slowly to RPC calls, or the node runs too many pods, so relist could not finish within 3 minutes.

Fix:

$ systemctl daemon-reload
$ systemctl daemon-reexec
$ systemctl restart docker

Then reboot the node. If none of the above helps, upgrade docker to the latest version. If the problem still persists, upgrade Kubernetes to 1.16 or later.

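9. No valid private key and/or certificate found, reusing existing private key or creating a new one

Cause: after kubelet starts on the node, it requests a certificate from the master through a CSR; until the request is approved, no certificate is found.

Fix: approve the certificate request on the master.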
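Approving the request uses two standard kubectl subcommands, get csr and certificate approve. A minimal sketch follows; the wrapper name approve_node_csr is hypothetical, and the CSR name you pass comes from the list printed by the first command.

```shell
# Hypothetical helper: list all certificate signing requests, then
# approve the one whose name is passed as the first argument.
approve_node_csr() {
  kubectl get csr
  kubectl certificate approve "$1"
}
```

Usage on the master: approve_node_csr <csr-name>. Requests shown as Pending in the list are the ones waiting for approval.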

10. failed to run Kubelet: Running with swap on is not supported, please disable swap! or set --fail-swap-on flag to false. /proc/swaps containe

Cause: swap is enabled.

Fix: turn off swap (for example with swapoff -a), then restart kubelet: systemctl restart kubelet
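Turning swap off at runtime does not survive a reboot if /etc/fstab still lists a swap entry. The sketch below comments such entries out; the function name comment_out_swap is hypothetical, and it assumes fstab lines in the usual whitespace-separated layout.

```shell
# Hypothetical helper: comment out active swap entries in an fstab file so
# swap stays off after a reboot. A backup is kept next to it as <file>.bak.
comment_out_swap() {
  fstab="${1:-/etc/fstab}"
  sed -i.bak -E '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' "$fstab"
}

# On the node (as root):
#   swapoff -a
#   comment_out_swap /etc/fstab
#   systemctl restart kubelet
```

Lines that are already commented out are left alone, so running the helper twice is harmless.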

11. The node was low on resource: [DiskPressure]

Cause: kubelet on the node periodically collects resource usage data and compares it with preset thresholds. If a threshold is exceeded, kubelet kills some pods to reclaim the resource.

Fix: edit /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Configure the parameter --eviction-hard=nodefs.available<5%, then clean up the disk and restart kubelet.

12. Node status Unknown

Listing processes fails with -bash: fork: Cannot allocate memory. Check whether any memory is still free, and whether /proc/sys/kernel/pid_max is too small.

Fix: add memory, or increase /proc/sys/kernel/pid_max.

13. provided port is not in the valid range. The range of valid ports is 30000-32767

Cause: the port is outside the NodePort range. By default a NodePort must be within 30000-32767.

Fix: edit /etc/kubernetes/manifests/kube-apiserver.yaml, set --service-node-port-range= to the port range you need, and restart the apiserver.

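14. 1 node(s) had taints that the pod didn't tolerate

Cause: the node is not schedulable. By default the master is not schedulable.

Fix:

$ kubectl describe nodes

Check the Taints field, then run kubectl taint nodes <node-name> <key>:NoSchedule- to remove the taint that makes the node unschedulable.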

IV. Master troubleshooting

1. unable to fetch the kubeadm-config ConfigMap: failed to get configmap: Unauthorized

Cause: the token has expired. Tokens are valid for 24 hours by default.

Fix: generate a new token on the master and join the node again.

$ kubeadm token create

$ openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
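kubeadm can also produce the token and the CA certificate hash in one step through its --print-join-command flag, which avoids assembling the join command by hand. A minimal sketch, where the wrapper name new_join_command is hypothetical:

```shell
# Hypothetical wrapper around a real kubeadm flag: creates a new token and
# prints a complete "kubeadm join ..." command, including the CA cert hash.
new_join_command() {
  kubeadm token create --print-join-command
}
```

Run new_join_command on the master, then paste the printed command on the node that needs to rejoin.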

2. Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")

Cause: certificate authentication failed. Follow the instructions printed on the console.

Fix: as the console output suggests:

$ mkdir -p $HOME/.kube
$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config

3. Unable to update cni config: No networks found in /etc/cni/net Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message

Cause: no CNI network configuration was found.

Fix:

$ sysctl net.bridge.bridge-nf-call-iptables=1

Install the flannel or calico network plugin.

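4. coredns stuck in Pending or ContainerCreating

Cause: a network problem.

Fix: install the flannel or calico network plugin. If the error is plugin flannel does not support config version, open /etc/cni/net.d/10-flannel.conflist and check whether the cniVersion values are consistent. If they are not, make them consistent, or set them to a version that the current K8s supports.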

5. WARNING IsDockerSystemdCheck

[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/

Cause: the systemd cgroup driver is not configured.

Fix: edit or create /etc/docker/daemon.json, add "exec-opts": ["native.cgroupdriver=systemd"], and restart docker.
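For a host that has no daemon.json yet, the complete file can be as small as the sketch below. It is written to the current directory first so it can be reviewed; if /etc/docker/daemon.json already exists, merge the key into it instead of overwriting the file.

```shell
# Write a minimal daemon.json that selects the systemd cgroup driver.
cat > daemon.json <<'EOF'
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF

# Afterwards, on the host (as root):
#   cp daemon.json /etc/docker/daemon.json
#   systemctl restart docker
#   docker info | grep -i 'cgroup driver'
```

The last command is a quick way to confirm that docker picked up the new driver after the restart.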

6. WARNING FileExisting-socat

[WARNING FileExisting-socat]: socat not found in system path

Cause: socat is not installed.

Fix: yum install socat

7. Permission denied cannot create /var/log/fluentd.log: Permission denied

Cause: permission denied.

Fix: this is usually caused by SELinux. In /etc/selinux/config, change SELINUX=enforcing to disabled. If that does not solve it, grant write permission on the directory.

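8. The apiserver fails to start, reporting the same error on every start

Fix: configure a ServiceAccount and create it from a YAML file.

9. repository does not exist or may require 'docker login': denied: requested access to the resource is denied

Cause: the node has no permission to pull the image from Harbor.

Fix: grant access from the master with kubectl create secret.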

10. etcd fails to start

etcd: raft save state and entries error: open /var/lib/etcd/default.etcd/member/wal/xxx.tmp: is a directory

Cause: a file error in the etcd member directory.

Fix: delete the offending tmp files and directories, then restart the etcd service.

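11. etcd node failure

Running etcdctl cluster-health shows an unhealthy member.

Cause: etcd on that node has failed.

Fix: log in to the faulty node:

$ systemctl stop etcd
$ systemctl restart etcd

If it is still unhealthy, delete the data:

$ rm -rf /var/lib/etcd/default.etcd/member/* (remember to back it up first)

Then restart etcd.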
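Because the rm step above is destructive, the backup is worth spelling out. The sketch below simply archives the data directory with tar while etcd is stopped; the function name backup_etcd_dir and the default archive name are assumptions, and this is a file-level copy rather than an etcdctl snapshot.

```shell
# Hypothetical helper: archive an etcd data directory into a .tar.gz file
# and print the archive path. Run it while etcd is stopped.
backup_etcd_dir() {
  data_dir="${1:-/var/lib/etcd/default.etcd}"
  out="${2:-etcd-backup-$(date +%Y%m%d%H%M%S).tar.gz}"
  tar -czf "$out" -C "$(dirname "$data_dir")" "$(basename "$data_dir")"
  echo "$out"
}
```

Usage: systemctl stop etcd, then backup_etcd_dir /var/lib/etcd/default.etcd /root/etcd-backup.tar.gz, and only then delete the member directory.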

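To avoid unnecessary problems, operations and development staff should use the K8s cluster in a disciplined way and keep failures caused by poor design or misuse to a minimum. See the guidelines below.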

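V. Kubernetes usage guidelines

1. K8s worker nodes are highly available by design, so users only need to plan for high availability of the master. Enterprises should use a dual-master or multi-master architecture to avoid a single point of failure on the master.

2. NTP time must be synchronized on every node of the K8s cluster.

3. Prefer OVS or calico networking; flannel is not recommended.

4. Use a recent stable version, which has fewer bugs: at least 1.12, which offers the ipvs mode in addition to iptables. This matters for performance.

5. Adopt naming conventions. Namespaces, masters, nodes, pods, services and ingresses should all follow a naming convention to avoid confusion.

6. Prefer Deployments over RCs. Deployments support version rollback and other features. Run pods with multiple replicas, set replicas to more than one, and release with rolling updates.

7. Manage k8s through YAML files or the dashboard where possible. Do not rely on running commands directly over the long term.

8. Limit pod CPU, memory, storage and other resources through YAML files.

9. Avoid exposing pod ports directly on the node; access them through a Service.

10. In the cloud, use a LoadBalancer Service for load balancing. Self-hosted k8s can introduce Ingress.

11. K8s containers must be monitored; kube-prometheus is recommended.

12. Deploy a logging agent: have a node agent collect logs centrally instead of relying on native k8s logs. For microservices, a sidecar is the preferred option.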
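Guidelines 6 and 8 can be seen together in one manifest. The following is a minimal sketch of a Deployment with several replicas, a rolling update strategy and resource limits; the name web-demo, the nginx:1.25 image and all numbers are illustrative assumptions.

```shell
# Write an example Deployment: 3 replicas, rolling updates, resource limits.
cat > deployment-demo.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-demo
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most one pod down during a release
      maxSurge: 1         # at most one extra pod during a release
  selector:
    matchLabels:
      app: web-demo
  template:
    metadata:
      labels:
        app: web-demo
    spec:
      containers:
      - name: web
        image: nginx:1.25
        resources:
          requests:
            cpu: 100m
            memory: 64Mi
          limits:
            cpu: 200m
            memory: 128Mi
EOF

# On a real cluster:
#   kubectl apply -f deployment-demo.yaml
#   kubectl rollout status deployment/web-demo
#   kubectl rollout undo deployment/web-demo    # roll back a bad release
```

Keeping the manifest in version control also serves guideline 7, since every change to the cluster then goes through a reviewed file rather than an ad hoc command.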
