Troubleshooting a Kubernetes Node NotReady Fault

Fault Discovery

kubectl get node
Logging into the k8s cluster and running kubectl get node showed one node in NotReady status; the node was no longer schedulable.
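
A hypothetical rendering of what that looked like (node names and ages match the verification output later in this post, but this listing is reconstructed, not captured from the incident):

[root@k8s-master ~]# kubectl get node
NAME         STATUS     ROLES                  AGE    VERSION
k8a-node-1   NotReady   <none>                 310d   v1.20.2
k8a-node-2   Ready      <none>                 310d   v1.20.2
k8s-master   Ready      control-plane,master   310d   v1.20.2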

Fault Analysis

a. Check node status
Ping the node to confirm it is alive; log into the node and inspect its state (load, memory, disk, etc.). All normal.
b. Check the node's kubelet
Check the kubelet service status: systemctl status kubelet. Normal.
c. Check the kubelet logs
Since everything above looked healthy, follow the kubelet logs: journalctl -fu kubelet
This turned up "node not found" errors (see the sketch after this list).
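
A minimal sketch of these checks, run on the node. The journalctl line at the end is a representative example of the kind of error a v1.20 kubelet logs when its node object cannot be found, not a verbatim capture from this incident:

# a. Basic node health: load, memory, disk
uptime
free -h
df -h

# b. kubelet service status
systemctl status kubelet

# c. Follow the kubelet logs
journalctl -fu kubelet
# Representative error (hypothetical rendering):
#   kubelet: E1122 09:50:01.123456   13487 kubelet.go:2263] node "k8s-node-1" not found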

Summary
This is a shared test cluster, and someone (it is not clear who) changed the node's hostname, leaving the node unable to be scheduled by the cluster. The node has to be rejoined to the cluster.
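
One quick way to confirm this kind of root cause is to compare the machine's static hostname with the node names registered in the cluster, and to restore the hostname before rejoining. A sketch, treating k8s-node-1 as the intended name (an assumption based on the final node listing in this post):

# On the node: the hostname the kubelet will register under
hostnamectl --static

# On the master: the node names the cluster actually knows
kubectl get nodes -o name

# If they disagree, restore the intended hostname before rejoining
hostnamectl set-hostname k8s-node-1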

Remediation Plan

a. Drain the affected node: cordon it and evict the pods running on it
b. Delete the node from the cluster
c. Reset the node
d. Regenerate a join token on the master
e. Rejoin the node to the cluster

Remediation

  • Evict the node's pods
# Run on the master
[root@k8s-master ~]# kubectl drain k8a-node-1 --delete-local-data --force --ignore-daemonsets

node/k8a-node-1 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/kube-flannel-ds-pzr8q, kube-system/kube-proxy-tts9k, longhorn/longhorn-manager-sl2nr
evicting pod kubernetes-dashboard/dashboard-metrics-scraper-7b59f7d4df-n4d5f
evicting pod default/rc-demo-5c4bm
evicting pod default/rc-demo-4sbk6
evicting pod default/rc-demo-hprf5
evicting pod default/rc-demo-m8w8k
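
On this v1.20 cluster --delete-local-data still works, but the flag is deprecated; on newer Kubernetes releases the same drain is spelled with --delete-emptydir-data instead:

# Equivalent drain on newer Kubernetes versions
kubectl drain k8a-node-1 --delete-emptydir-data --force --ignore-daemonsets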
  • Delete the node

# Run on the master
[root@k8s-master ~]# kubectl delete node k8a-node-1
node "k8a-node-1" deleted
  • Reset the node

# Run on the affected node
[root@k8s-node-1 ~]# kubeadm reset
[reset] WARNING: Changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] Are you sure you want to proceed? [y/N]: y
[preflight] Running pre-flight checks
W1122 09:57:15.005549 13487 removeetcdmember.go:79] [reset] No kubeadm config, using etcd pod spec to get data directory
[reset] No etcd config found. Assuming external etcd
[reset] Please, manually reset etcd to prevent further issues
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]
[reset] Deleting contents of stateful directories: [/var/lib/kubelet /var/lib/dockershim /var/run/kubernetes /var/lib/cni]

The reset process does not clean CNI configuration. To do so, you must remove /etc/cni/net.d
The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually by using the "iptables" command.
If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.

The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
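
As the reset output itself warns, CNI configuration, iptables/IPVS rules, and kubeconfig files are not cleaned up automatically. A minimal cleanup sketch following those warnings (run on the node; adapt to your environment):

# Remove CNI configuration left behind by the old cluster membership
rm -rf /etc/cni/net.d

# Flush iptables rules (including NAT and mangle), per the reset output's advice
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X

# If the cluster uses IPVS, clear the IPVS tables as well
ipvsadm --clear

# Remove stale kubeconfig files
rm -f $HOME/.kube/config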
  • Regenerate the join token
# Run on the master
[root@k8s-master ~]# kubeadm token create --print-join-command

kubeadm join 172.20.5.123:6443 --token r9la1w.pzopz4pu0nnl08m3 --discovery-token-ca-cert-hash sha256:28d2b6b52ff7448ced872f97290289278fcf12d03d25b65be33ae5e4909c68d6
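
A token minted this way uses kubeadm's default TTL of 24 hours; if the node cannot be rejoined right away, existing tokens can be inspected or a longer-lived one created, for example:

# List current bootstrap tokens and their expirations
kubeadm token list

# Create a token with a longer TTL (use sparingly outside test clusters)
kubeadm token create --ttl 48h --print-join-command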
  • Rejoin the node to the cluster
# Run on the affected node
[root@k8s-node-1 ~]# kubeadm join 172.20.5.123:6443 --token r9la1w.pzopz4pu0nnl08m3 --discovery-token-ca-cert-hash sha256:28d2b6b52ff7448ced872f97290289278fcf12d03d25b65be33ae5e4909c68d6
[preflight] Running pre-flight checks
[WARNING SystemVerification]: this Docker version is not on the list of validated versions: 20.10.2. Latest validated version: 19.03
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

Verification

[root@k8s-master ~]# kubectl get node
NAME         STATUS   ROLES                  AGE    VERSION
k8a-node-2   Ready    <none>                 310d   v1.20.2
k8s-master   Ready    control-plane,master   310d   v1.20.2
k8s-node-1   Ready    <none>                 31s    v1.20.2

# Repeat the same procedure for node2
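
To confirm the rejoined node is actually usable, it also helps to check that the DaemonSet pods seen during the drain (kube-flannel, kube-proxy, longhorn-manager in this cluster) come back up on it. A sketch, using the node name from this incident:

# Confirm pods have been scheduled back onto the rejoined node
kubectl get pods -A -o wide --field-selector spec.nodeName=k8s-node-1

# Inspect node conditions and any leftover taints
kubectl describe node k8s-node-1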