
Disaster Scenarios

This page explores several disaster scenarios: how Tanzu behaves in each of them and how to repair the environment.

Info

This page is based on vSphere IaaS Control Plane, formerly known as vSphere with Tanzu (TKGS). You should be familiar with the basic concepts before reading this page.

Crash of a Worker Node on a guest cluster

Power off Worker node in vCenter

What happens if we manually power off a Virtual Machine in vCenter which is a worker node of a guest cluster?

  • the corresponding Kubernetes node enters the state NotReady
  • the Supervisor Cluster still shows the corresponding Machine (see Cluster API Machine) as Running and the VirtualMachine (a VM Operator object) as poweredOn
  • pods that should be scheduled on that node are in Pending state
  • logs in capi-controller-manager (from MachineSet Controller):

    Waiting for the Kubernetes node on the machine to report ready state
    
  • after 5 minutes, the Supervisor Cluster (actually the Cluster API controllers) deletes the Virtual Machine in vCenter and the Kubernetes node gets deleted

  • finally, a new Virtual Machine gets created and the new node joins the cluster (see the commands after this list)
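
To follow the remediation from both sides, commands like the following can be used. The vSphere namespace (my-namespace) and the machine name are placeholders; the Supervisor commands assume you are logged in to the Supervisor Cluster context.

    # Guest cluster: the affected node shows up as NotReady
    kubectl get nodes -o wide

    # Supervisor Cluster (vSphere namespace of the guest cluster):
    # Cluster API Machine and VM Operator VirtualMachine objects
    kubectl get machines -n my-namespace
    kubectl get virtualmachines -n my-namespace
    kubectl describe machine my-machine -n my-namespace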

Delete Worker VM in vCenter (Delete from Disk)

The behavior is quite similar to powering off the VM:

  • corresponding Kubernetes node marked as NotReady
  • after a few minutes the node enters SchedulingDisabled, but the Supervisor Cluster still shows the Machine as Running
  • after some time, logs in capi-controller-manager:

    reason="UnhealthyNode" message="Condition Ready on node is reporting status Unknown for more than 5m0s"
    
  • after some minutes the Machine enters Deleting state

  • the Supervisor Cluster provisions a new node with the same name
  • the node persists in state NotReady,SchedulingDisabled because it is not reachable (as kubelet is not running)
  • the Machine is still in Deleting state
  • after some time the VM gets successfully deleted by CAPI
  • a new VM with a new name gets provisioned; the node joins the cluster and operates successfully
  • we have a functioning cluster again 13 minutes after deleting the VM (see the MachineHealthCheck commands after this list)
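
The 5-minute threshold in the UnhealthyNode message corresponds to a Cluster API MachineHealthCheck. If one exists for the guest cluster, it can be inspected on the Supervisor Cluster roughly like this (namespace and cluster name are placeholders):

    # Supervisor Cluster: show the health check and its unhealthy conditions
    kubectl get machinehealthchecks -n my-namespace
    kubectl get machinehealthcheck my-cluster -n my-namespace -o yaml

    # Watch the remediation: the old Machine goes to Deleting, a new one appears
    kubectl get machines -n my-namespace --watch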

Crash of a Control Plane Node on a guest cluster

Power off Control Plane node of guest cluster

  • the Kubernetes API Server is unavailable for a few seconds. This is because we were connected to the node we powered off; had we powered off another control plane node, we would not even have noticed it
  • the VM is powered on again after a few seconds
  • it is so quick that no pods get terminated (the polling loop below illustrates the short interruption)
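
A simple way to measure the interruption is to poll the guest cluster API server while powering off the VM; the loop below is just a sketch:

    # Print a timestamp and the API server readiness once per second;
    # during the brief failover the request fails, then succeeds again
    while true; do
      date +%T
      kubectl get --raw /readyz || echo "API server not reachable"
      sleep 1
    done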

Delete CP VM in vCenter (Delete from Disk)

  • in less than a minute a new VM with the same name gets deployed in vCenter and powered on
  • the node is still marked NotReady because kubelet is not running and keeps logging:

    "command failed" err="failed to load kubelet config file, error: failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file \"/var/lib/kubelet/config.yaml\", error: open /var/lib/kubelet/config.yaml: no such file or directory, path: /var/lib/kubelet/config.yaml"
    
  • this indicates that kubeadm does not successfully join the control plane node to the cluster (see the commands after this list); indeed, the cloud-init-output.log file prints:

    [2024-09-12 14:52:39] I0912 14:52:39.589511    1244 token.go:223] [discovery] The cluster-info ConfigMap does not yet contain a JWS signature for token ID "5j3q2p", will try again
    [2024-09-12 14:52:43] error execution phase preflight: couldn't validate the identity of the API Server: could not find a JWS signature in the cluster-info ConfigMap for token ID "5j3q2p"
    [2024-09-12 14:52:43] To see the stack trace of this error execute with --v=5 or higher
    [2024-09-12 14:52:43] !!! [2024-09-12T14:52:43+00:00] kubeadm reported failed action(s) for 'kubeadm join phase preflight --ignore-preflight-errors=DirAvailable--etc-kubernetes-manifests'
    [2024-09-12 14:52:58] +++ [2024-09-12T14:52:58+00:00] running 'kubeadm join phase preflight --ignore-preflight-errors=DirAvailable--etc-kubernetes-manifests'
    
  • deploying pods still works, although the control plane node is in the NotReady state and the other etcd pods are not able to connect to this etcd instance. The other two etcd instances are logging:

    "dial tcp 10.244.0.34:2380: connect: connection refused"
    
  • after a few minutes the node enters SchedulingDisabled

  • after some more minutes, the Machine enters Deleting state, but the VirtualMachine is still poweredOn
  • after some time the VM gets deleted and a new one with a new name gets provisioned and joined to the cluster successfully
  • we have a functioning cluster again ~15 minutes after deleting the VM
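
On the re-created control plane VM, the failed join can be inspected directly. This assumes SSH access to the node and the standard kubeadm file locations:

    # kubelet crash-loops because the join never wrote its config file
    systemctl status kubelet --no-pager
    journalctl -u kubelet --no-pager | tail -n 20
    ls -l /var/lib/kubelet/config.yaml

    # cloud-init drives the kubeadm join; its log shows the JWS/token error
    tail -n 50 /var/log/cloud-init-output.log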

Kubelet Crashing

Kubelet not running on a worker node

Stop kubelet on a worker node

In this scenario, we simply stopped kubelet service on a node.

The behavior is the same as in Power off Worker node in vCenter.
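
For reference, this is all we did on the worker node (via SSH):

    # Stop the kubelet systemd unit and confirm it is no longer running
    systemctl stop kubelet
    systemctl status kubelet --no-pager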

Let kubelet crash on a worker node

In this scenario, we removed the kubelet config file.

The behavior is the same as in Power off Worker node in vCenter.
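
A sketch of how the crash was simulated; the path is the standard kubeadm location of the kubelet config:

    # Move the kubelet config away and restart the service so it crash-loops
    mv /var/lib/kubelet/config.yaml /var/lib/kubelet/config.yaml.bak
    systemctl restart kubelet
    journalctl -u kubelet --no-pager | tail -n 5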

Kubelet not running on a control plane node

One would guess the behaviour is similar to worker nodes, but surprisingly it is not.

Stop kubelet on a control plane node

In this scenario, we simply stopped kubelet service on a node.

The node enters NotReady state and stays there until we fix and start kubelet again. Cluster API does not recreate the node.
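
Since Cluster API does not remediate the control plane node in this case, recovery is manual; a minimal sketch:

    # On the control plane node: bring kubelet back
    systemctl start kubelet

    # On the guest cluster: the node should return to Ready shortly afterwards
    kubectl get nodes --watch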

Let kubelet crash on a control plane node

In this scenario, we removed the kubelet config file.

The behaviour is the same as in Stop kubelet on a control plane node.

etcd not working properly

only 1 out of 3 etcd instances running

  • if only 2 of the 3 etcd instances are running and you kill one of them, etcd starts a new leader election, even if you explicitly do not kill the leader
  • because etcd does not reach quorum with only 1 of 3 instances running, the remaining instance will never become leader (the etcdctl commands after these lists show this)
  • as a result, kube-apiserver fails with the error watch chan error: etcdserver: no leader; apparently, when the apiserver receives this error, it terminates all watchers, according to this description
  • the apiserver is not able to recreate the watchers because there is no etcd leader
  • Tanzu (Cluster API) does not recreate a node with a crashing etcd

As a result:

  • you are not able to communicate with the cluster using kubectl (or otherwise with the API Server)
  • because etcd is effectively read-only, the self-healing mechanisms in Kubernetes no longer work
  • running apps on the cluster are still working
  • ingress to web apps is still functioning
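
To confirm the missing quorum, etcd can be queried on the surviving control plane node. The commands below assume etcdctl is available there (otherwise exec into the etcd container) and use the usual kubeadm certificate paths, which may differ per TKG release:

    # Usual kubeadm certificate locations for the local etcd member
    CERTS="--cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key"

    # The member list still shows all three members ...
    ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 $CERTS member list -w table

    # ... but endpoint status reports errors / no leader while quorum is lost
    ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 $CERTS endpoint status -w table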

Kubernetes API-Server Crashing

kube-apiserver not running

  • we removed the kube-apiserver static pod on one control plane node, which did not have any effect; the cluster was still operational
  • we removed the kube-apiserver static pod on the second control plane node as well, which again had no effect; the cluster was still operational
  • after removing kube-apiserver from the third control plane node, the API Server was of course no longer reachable, but apps were still running fine (the snippet below shows how the static pods were removed)
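
In practice, "removing the kube-apiserver static pod" means moving its manifest out of the directory that kubelet watches; the standard kubeadm path and an available crictl are assumed here:

    # On a control plane node: kubelet stops the kube-apiserver container
    # shortly after its manifest disappears from the watched directory
    mv /etc/kubernetes/manifests/kube-apiserver.yaml /root/

    # Verify that the container is gone
    crictl ps | grep kube-apiserver || echo "kube-apiserver not running"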