Handling Pods When Nodes Fail

HungWei Chiu
Aug 27, 2023

Introduction

In addition to the basic Pod types, Kubernetes offers a variety of higher-level workload types, such as Deployment, DaemonSet, and StatefulSet. These higher-level controllers allow you to provide services with multiple replicas of your Pods, making it easier to achieve a high availability architecture.

However, when Kubernetes nodes experience failures such as crashes, network disruptions, or system failures, what happens to the Pods running on those nodes?

From a high availability perspective, some might think that having multiple replicas of an application ensures that the service remains unaffected by node failures. However, in certain cases where the application belongs to a StatefulSet, horizontal scaling isn’t an option. In such scenarios, it becomes necessary to quickly reschedule the related Pods to maintain service availability in the event of node failures.

Experimental Environment

KIND (Kubernetes in Docker) runs Kubernetes nodes as Docker containers. For this article, we'll set up an experimental environment on Ubuntu 20.04 using KIND to create a three-node Kubernetes cluster: one node serves as the control plane, while the other two act as regular worker nodes.
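
A minimal kind configuration that would produce such a cluster might look like the sketch below (the cluster name k8slab is assumed so that it matches the node names that appear later, e.g. k8slab-worker2):

# kind-config.yaml: one control-plane node and two workers
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker

The cluster can then be created with kind create cluster --name k8slab --config kind-config.yaml.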

Nodes

Kubelet itself acts as an agent on each node: it periodically computes the node's status and reports it to the control plane. There have been two distinct implementations of this reporting, with the watershed around Kubernetes version 1.17.

First, let’s understand what information Kubelet needs to report to the control plane. Take the command kubectl get nodes k8slab-worker2 -o yaml as an example:

status:
  addresses:
  - address: 172.18.0.2
    type: InternalIP
  - address: k8slab-worker2
    type: Hostname
  allocatable:
    cpu: "4"
    ephemeral-storage: 30298176Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 16386892Ki
    pods: "110"
  capacity:
    cpu: "4"
    ephemeral-storage: 30298176Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 16386892Ki
    pods: "110"
  conditions:
  - lastHeartbeatTime: "2023-08-27T03:48:19Z"
    lastTransitionTime: "2023-08-23T17:07:50Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2023-08-27T03:48:19Z"
    ...

Essentially, all the information within the status field needs to be reported by the Kubelet. The amount of information transferred is quite extensive.

The information the Kubelet returns falls into two kinds of reporting:

  1. Reporting the node status content
  2. Updating the heartbeat (node health)

This article focuses on how the cluster reacts when a node is marked as “NotReady,” which is driven by (2), the so-called heartbeat information.

As documented in the official documentation, node heartbeats can be categorized into two methods:

  1. Updates to the .status of a Node
  2. Lease objects within the kube-node-lease namespace. Each Node has an associated Lease object.

Next, let’s delve into the concepts and relevant configuration files for these two implementation methods.

NodeStatus

As mentioned earlier, the Kubelet needs to provide various operational status updates and heartbeat information about the nodes. Initially, Kubernetes combined these two pieces of information and updated them together. The default update frequency for this combination was set to 10 seconds and could be adjusted through the nodeStatusUpdateFrequency field in KubeletConfiguration.

Furthermore, each status update has a built-in retry mechanism: the kubelet attempts the transmission up to five times, a value that is currently hardcoded in the code (kubelet.go):

const (
    // nodeStatusUpdateRetry specifies how many times kubelet retries when posting node status failed.
    nodeStatusUpdateRetry = 5
    ....
)

func (kl *Kubelet) updateNodeStatus(ctx context.Context) error {
    klog.V(5).InfoS("Updating node status")
    for i := 0; i < nodeStatusUpdateRetry; i++ {
        if err := kl.tryUpdateNodeStatus(ctx, i); err != nil {
            if i > 0 && kl.onRepeatedHeartbeatFailure != nil {
                kl.onRepeatedHeartbeatFailure()
            }
            klog.ErrorS(err, "Error updating node status, will retry")
        } else {
            return nil
        }
    }
    return fmt.Errorf("update node status exceeds retry count")
}

The following diagram illustrates this update process:

However, this implementation led to performance bottlenecks in practice. With a significant amount of status information being reported by kubelets every ten seconds, especially in large clusters, it caused stress on the etcd backend, resulting in decreased cluster performance. Consequently, starting from version 1.13, a new implementation was adopted, and it was officially declared as stable in version 1.17.

Lease

To improve the efficiency and performance issues of the NodeStatus updates, a Proposal was introduced, aiming to utilize the Lease structure. The core concept was to separate the NodeStatus and Heartbeat information, handling these two aspects independently.

The heartbeat itself carries minimal traffic, so it can keep its historical frequency without causing significant performance issues. NodeStatus information, on the other hand, is relatively bulky, so its update frequency is reduced.

Heartbeat

In terms of Heartbeat, Kubernetes adopts the built-in API Lease structure. When this framework is operational, a namespace named kube-node-lease is automatically created. Each Kubernetes node corresponds to a Lease object with the same name.

azureuser@course:~$ kubectl -n kube-node-lease get lease
NAME                   HOLDER                 AGE
k8slab-control-plane   k8slab-control-plane   3d13h
k8slab-worker          k8slab-worker          3d13h
k8slab-worker2         k8slab-worker2         3d13h
azureuser@course:~$ kubectl -n kube-node-lease get lease k8slab-worker -o yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2023-08-23T17:08:00Z"
  name: k8slab-worker
  namespace: kube-node-lease
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: k8slab-worker
    uid: 70c3c25d-dc3d-4ad3-ba3a-36637d3b5b60
  resourceVersion: "623846"
  uid: 08013bd8-2dd9-45fe-ad6d-38b77253b437
spec:
  holderIdentity: k8slab-worker
  leaseDurationSeconds: 60
  renewTime: "2023-08-27T03:21:06.584188Z"

Each Lease object uses holderIdentity to identify the node and renewTime to record the last time that node renewed its heartbeat. The Controller later uses this timestamp to determine whether the node is Ready or NotReady.

According to the kubelet source code, the calculation for renewInterval is nodeLeaseRenewIntervalFraction multiplied by NodeLeaseDurationSeconds. The former is a fixed constant of 0.25, while the latter, as mentioned in kubelet's description of nodeLeaseDurationSeconds, has a default value of 40.

const (
    // nodeLeaseRenewIntervalFraction is the fraction of lease duration to renew the lease
    nodeLeaseRenewIntervalFraction = 0.25
)
...

leaseDuration := time.Duration(kubeCfg.NodeLeaseDurationSeconds) * time.Second
renewInterval := time.Duration(float64(leaseDuration) * nodeLeaseRenewIntervalFraction)
klet.nodeLeaseController = lease.NewController(...)

Based on this calculation, we get 0.25 * 40 = 10, indicating that kubelet updates every 10 seconds.

Lease renew interval = nodeLeaseDurationSeconds * 0.25
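
If you want to reproduce the 60-second experiment shown later in this section, one option (assuming you can restart the kubelet with a modified configuration) is to raise the lease duration in the KubeletConfiguration, which stretches the renew interval to 60 * 0.25 = 15 seconds:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# The kubelet writes this value into the node's Lease object and renews
# the Lease every nodeLeaseDurationSeconds * 0.25 seconds (here: 15s).
nodeLeaseDurationSeconds: 60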

Using the following command, you can attempt to observe changes in the Lease object:

kubectl -n kube-node-lease get lease k8slab-worker2 -o yaml -w

---
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2023-08-23T17:08:00Z"
  name: k8slab-worker2
  namespace: kube-node-lease
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: k8slab-worker2
    uid: 5ad224c5-11ad-4939-8cfa-0066eb86d6b9
  resourceVersion: "655385"
  uid: bf558925-25b8-4483-9f9d-4a78521afa4c
spec:
  holderIdentity: k8slab-worker2
  leaseDurationSeconds: 40
  renewTime: "2023-08-27T07:39:35.899240Z"
---
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2023-08-23T17:08:00Z"
  name: k8slab-worker2
  namespace: kube-node-lease
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: k8slab-worker2
    uid: 5ad224c5-11ad-4939-8cfa-0066eb86d6b9
  resourceVersion: "655405"
  uid: bf558925-25b8-4483-9f9d-4a78521afa4c
spec:
  holderIdentity: k8slab-worker2
  leaseDurationSeconds: 40
  renewTime: "2023-08-27T07:39:45.982209Z"

Comparing the renewTime values in the two updates, 07:39:35 and 07:39:45, the interval is indeed 10 seconds, consistent with the calculation above.

Now change the lease duration to 60 seconds (by setting the kubelet's nodeLeaseDurationSeconds to 60, as sketched earlier) and observe how renewTime changes:

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2023-08-23T17:08:00Z"
  name: k8slab-worker
  namespace: kube-node-lease
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: k8slab-worker
    uid: 70c3c25d-dc3d-4ad3-ba3a-36637d3b5b60
  resourceVersion: "654971"
  uid: 08013bd8-2dd9-45fe-ad6d-38b77253b437
spec:
  holderIdentity: k8slab-worker
  leaseDurationSeconds: 60
  renewTime: "2023-08-27T07:36:14.454825Z"
---
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  creationTimestamp: "2023-08-23T17:08:00Z"
  name: k8slab-worker
  namespace: kube-node-lease
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: k8slab-worker
    uid: 70c3c25d-dc3d-4ad3-ba3a-36637d3b5b60
  resourceVersion: "655003"
  uid: 08013bd8-2dd9-45fe-ad6d-38b77253b437
spec:
  holderIdentity: k8slab-worker
  leaseDurationSeconds: 60
  renewTime: "2023-08-27T07:36:29.654757Z"

The two renewTime values, 07:36:14 and 07:36:29, are 15 seconds apart, matching the calculation (60 * 0.25 = 15).

Through the Lease object, each node's latest heartbeat is kept up to date. How the Controller uses this information to decide whether a node is Ready or NotReady is explained below.

The following diagram summarizes the update process of Heartbeat under the Lease structure:

Status

To address the stress on etcd caused by frequent Status transmissions, the new architecture splits Status handling into two stages:

  1. Computation
  2. Reporting

The computation phase involves collecting and summarizing the current node’s information. The reporting phase involves sending this information to the API Server. These two stages operate independently, leading to different operating cycles.

In the computation phase, calculations are done every 10 seconds by default. The reporting phase then follows two rules:

  1. If the calculation produces a meaningful change, it is reported to the API Server immediately.
  2. Otherwise, the kubelet waits 5 minutes (by default) before updating the API Server.

According to the KubeletConfiguration explanation, you can adjust the respective frequencies using the variables nodeStatusUpdateFrequency and nodeStatusReportFrequency.

For nodeStatusUpdateFrequency:

nodeStatusUpdateFrequency is the frequency that kubelet computes node status.
If node lease feature is not enabled, it is also the frequency that kubelet posts node status to master.
Note: When node lease feature is not enabled, be cautious when changing the constant,
it must work with nodeMonitorGracePeriod in nodecontroller. Default: "10s"

And for nodeStatusReportFrequency:

nodeStatusReportFrequency is the frequency that kubelet posts
node status to master if node status does not change. Kubelet will ignore this frequency and post node status immediately if any
change is detected. It is only used when node lease feature is enabled. nodeStatusReportFrequency's default value is 5m.
But if nodeStatusUpdateFrequency is set explicitly, nodeStatusReportFrequency's default value will be set
to nodeStatusUpdateFrequency for backward compatibility. Default: "5m"
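
Putting the two knobs together, a hedged KubeletConfiguration excerpt that simply makes the defaults explicit would be:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# How often node status is computed (default 10s).
nodeStatusUpdateFrequency: "10s"
# How often an unchanged status is still posted to the API Server (default 5m).
nodeStatusReportFrequency: "5m"

Setting both fields explicitly also avoids the backward-compatibility behavior quoted above, where an explicitly set nodeStatusUpdateFrequency silently becomes the report frequency as well.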

According to the kubelet’s related code, each NodeStatus update attempt involves comparing with the current object. Information is only sent to the API Server when there are changes in the object or when the time exceeds nodeStatusReportFrequency.

originalNode, err := kl.heartbeatClient.CoreV1().Nodes().Get(ctx, string(kl.nodeName), opts)
if err != nil {
    return fmt.Errorf("error getting node %q: %v", kl.nodeName, err)
}
if originalNode == nil {
    return fmt.Errorf("nil %q node object", kl.nodeName)
}

node, changed := kl.updateNode(ctx, originalNode)
shouldPatchNodeStatus := changed || kl.clock.Since(kl.lastStatusReportTime) >= kl.nodeStatusReportFrequency

This mechanism reduces the overall update frequency and mitigates the performance impact of frequent updates.

When combining both mechanisms, the schematic diagram is as follows:

Under default settings, the operational logic diagram appears as follows: Kubelet now generates two distinct paths, one responsible for Status and the other for Heartbeat.

Controller Manager

The discussion above explained how the Kubelet reports node information to the control plane; the actual determination of whether a node is Ready or NotReady is performed by the node lifecycle controller inside the kube-controller-manager.

The concept is simple: periodically check the status of each node’s corresponding Lease object. If the Lease object hasn’t been renewed for a certain period, the node is considered NotReady due to the lack of recent information updates.

There are two parameters associated with this concept:

  1. How often does the Controller check the Lease objects?
  2. How stale can the renew time be before the node is considered NotReady?

According to the kube-controller-manager documentation, these two parameters can be adjusted: --node-monitor-period and --node-monitor-grace-period.

For node-monitor-period:

The period for syncing NodeStatus in cloud-node-lifecycle-controller. default: 5s

And for node-monitor-grace-period:

Amount of time which we allow running Node to be unresponsive before marking it unhealthy. Must be N times more than kubelet’s nodeStatusUpdateFrequency, where N means number of retries allowed for kubelet to post node status. default: 40s
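
In a kubeadm-based cluster (which is what kind provisions), these flags are typically changed by editing the controller manager's static Pod manifest on the control-plane node; a hedged excerpt, with the default values written out, might look like this:

# /etc/kubernetes/manifests/kube-controller-manager.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-controller-manager
    - --node-monitor-period=5s
    - --node-monitor-grace-period=40s
    # ... remaining flags unchanged ...

The kubelet watches the static Pod manifests, so saving the file restarts the controller manager with the new values.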

For node-monitor-period, the default is 5 seconds, as the code below confirms.

...
// Incorporate the results of node health signal pushed from kubelet to master.
go wait.UntilWithContext(ctx, func(ctx context.Context) {
    if err := nc.monitorNodeHealth(ctx); err != nil {
        logger.Error(err, "Error monitoring node health")
    }
}, nc.nodeMonitorPeriod)
...

Using a goroutine, the monitorNodeHealth function is executed every nodeMonitorPeriod to check the nodes' status.

For node-monitor-grace-period, the default is 40 seconds. The Controller's source code shows how this value is used.

func (nc *Controller) tryUpdateNodeHealth(ctx context.Context, node *v1.Node) (time.Duration, v1.NodeCondition, *v1.NodeCondition, error) {
    ...
    if currentReadyCondition == nil {
        ...
    } else {
        // If ready condition is not nil, make a copy of it, since we may modify it in place later.
        observedReadyCondition = *currentReadyCondition
        gracePeriod = nc.nodeMonitorGracePeriod
    }
    ...
    if nc.now().After(nodeHealth.probeTimestamp.Add(gracePeriod)) {
        // NodeReady condition or lease was last set longer ago than gracePeriod, so
        // update it to Unknown (regardless of its current value) in the master.
        ....
    }
    ....
}

The code shows that the configured nodeMonitorGracePeriod is assigned to the local variable gracePeriod. The controller then checks whether the current time has passed the last probe timestamp (the Lease renewal) plus gracePeriod. If it has, the node's Ready condition is set to Unknown, and the node shows as NotReady:

status:
  conditions:
  - lastHeartbeatTime: "2023-08-27T14:03:45Z"
    lastTransitionTime: "2023-08-27T14:04:30Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: Ready

To tie these concepts together: the Kubelet and the Controller work asynchronously, one responsible for reporting the node's status and heartbeat, the other for checking them and updating the node's condition.

The overall workflow logic can be summarized in the following diagram.

Evict Pod

The discussion so far explained how Kubernetes identifies a node as faulty (NotReady). But when a node becomes NotReady, how long does it take for the Pods running on the node to be rescheduled?

In this regard, Kubernetes uses the Taint/Toleration mechanism, called Taint-Based Evictions, to automate rescheduling.

When a node is deemed faulty, it is automatically tainted with node.kubernetes.io/not-ready (or node.kubernetes.io/unreachable when its status becomes unknown). Each Pod can declare a Toleration with tolerationSeconds to state how long it tolerates that taint; once the time expires, the Pod is evicted and rescheduled.
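
One way to observe this during the experiment (assuming the kind cluster above) is to stop a worker's kubelet, for example with docker stop k8slab-worker2, wait for the node to turn NotReady, and then inspect its taints:

kubectl get node k8slab-worker2 -o jsonpath='{.spec.taints}'

The output should include a NoExecute taint such as node.kubernetes.io/unreachable or node.kubernetes.io/not-ready.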

The kube-apiserver documentation reveals two related settings.

For default-not-ready-toleration-seconds:

Default: 300
Indicates the tolerationSeconds of the toleration for notReady:NoExecute
that is added by default to every pod that does
not already have such a toleration.

And for default-unreachable-toleration-seconds:

Default: 300
Indicates the tolerationSeconds of the toleration for unreachable:NoExecute
that is added by default to every pod that does
not already have such a toleration.

Both settings default to 300 seconds. This means that when a node is marked as failed, the Pods running on it stay bound to it for another 300 seconds before being evicted and rescheduled.
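
If you want to change these cluster-wide defaults rather than tune individual Pods, the flags belong to the kube-apiserver; in a kubeadm-based cluster a hedged excerpt of its static Pod manifest could look like this (60 is only an illustrative value):

# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-apiserver
    - --default-not-ready-toleration-seconds=60
    - --default-unreachable-toleration-seconds=60
    # ... remaining flags unchanged ...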

This value can also be set per Pod, as in the following example:

tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 20

Therefore, if you want Pods to react to node failures and be rescheduled faster, you can shorten tolerationSeconds in your Pod specs (see the sketch after this paragraph). You can also tune the kubelet/controller parameters so that node failures are detected sooner, triggering the rescheduling earlier.
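
Putting it all together, a sketch of a Deployment whose Pods tolerate both failure taints for only 20 seconds (the name and image are illustrative) might look like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: quick-failover-demo   # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: quick-failover-demo
  template:
    metadata:
      labels:
        app: quick-failover-demo
    spec:
      containers:
      - name: web
        image: nginx:1.25   # illustrative image
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 20
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 20

With these tolerations, a Pod on a failed node is evicted roughly 20 seconds after the taint is applied, instead of the 300-second default.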

Summary

  • Since version 1.17, the Kubelet uses the Lease mechanism to report its heartbeat.
  • The Kubelet and the Controller work asynchronously, one reporting and the other monitoring; pay attention to the timeout settings between them.
  • nodeLeaseDurationSeconds on the kubelet determines how often the Lease object is renewed; the renew interval is that value multiplied by 0.25, in seconds.
  • node-monitor-period and node-monitor-grace-period on the Controller determine how often the Controller checks and how long until it's considered NotReady.
  • By default, it takes at least 40 seconds to detect a node failure.
  • By default, a Pod can survive on a faulty node for 300 seconds.
  • By default, a Pod takes a minimum of 340 seconds to be rescheduled from a faulty node.
  • Pods can adjust their reaction time through Taint-Based Eviction.
