Kubernetes 101: Assigning Pods to Nodes (I)

HungWei Chiu
Aug 5, 2023

This article explores the various mechanisms Kubernetes provides to influence how Pods are assigned to Nodes.

As Kubernetes can manage multiple nodes, and each node can have different characteristics, there are situations where we might want to allocate Pods to specific nodes based on:

  1. CPU/Memory resources
  2. Disk size and speed
  3. Network card speed
  4. Other specialized devices

Apart from considering the node capabilities, high availability is often a crucial factor, especially in cloud environments, where nodes may have specific Availability Zone (AZ) characteristics. Consequently, there are considerations like:

  1. Ensuring Pods are evenly distributed across different zones
  2. Deploying different services preferably within the same zone

For some cloud providers, consideration (2) matters because traffic that crosses zones is billed. In practice, there are therefore many scenarios that require adjusting how Pods are allocated.

Environment

The experiments discussed next are conducted on a Kubernetes v1.26.0 cluster created with KIND 0.18.0. The cluster comprises one control-plane node and four worker nodes. In addition to the default labels, zone labels have been set on the worker nodes to simulate Availability Zones.

KIND K8s Cluster
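
For reference, a KIND configuration along the following lines can reproduce this topology. It is only a sketch: the cluster name, per-node labels, and zone assignments are assumptions chosen to match the examples later in the article (two workers in zone1, two in zone2), not necessarily the exact file used.

# kind-config.yaml - a sketch; the cluster name and zone assignments are assumed
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: k8slab
nodes:
- role: control-plane
- role: worker            # becomes k8slab-worker
  labels:
    kind.zone: zone1
- role: worker            # becomes k8slab-worker2
  labels:
    kind.zone: zone1
- role: worker            # becomes k8slab-worker3
  labels:
    kind.zone: zone2
- role: worker            # becomes k8slab-worker4
  labels:
    kind.zone: zone2

A cluster based on such a file can be created with kind create cluster --config kind-config.yaml.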

NodeName

In general, the scheduling of Pods is determined by the Scheduler, which dynamically modifies the pod.spec.nodeName field to deploy Pods onto the selected nodes. However, we also have the option to directly set the pod.spec.nodeName field, effectively forcing the Pod to be scheduled on a specific node. This approach is the most direct and has the highest priority, bypassing the Scheduler's decision-making process. However, if the target node does not exist, the Pod will be stuck in a Pending state and cannot be deployed.

When using this approach, some important considerations are:

  1. If Kubernetes itself is configured with Cluster AutoScaler, and the nodes are generated with random names, using nodeName can become very risky.
  2. This method lacks flexibility and, therefore, is not commonly used in most situations.

apiVersion: v1
kind: Pod
metadata:
  name: nodename-pod
spec:
  nodeName: k8slab-worker
  containers:
  - name: server
    image: hwchiu/netutils
Example of NodeName

Using the example above for deployment, the Pod is assigned directly to the k8slab-worker node. When you run kubectl describe pod, you will not see any events from the scheduler, because the nodeName field explicitly binds the Pod to k8slab-worker and bypasses the Scheduler's decision-making process.

Events:
  Type    Reason   Age   From     Message
  ----    ------   ----  ----     -------
  Normal  Pulling  90s   kubelet  Pulling image "hwchiu/netutils"
  Normal  Pulled   88s   kubelet  Successfully pulled image "hwchiu/netutils" in 1.582945894s (1.582973894s including waiting)
  Normal  Created  88s   kubelet  Created container server
  Normal  Started  88s   kubelet  Started container server

It is essential to be cautious when using this approach and consider other scheduling mechanisms for better flexibility and stability.

NodeSelector

Compared to hard-coding a node with NodeName, NodeSelector offers a simple selection strategy based on Node Labels.

The usage is quite straightforward. By setting the target label in pod.spec.nodeSelector, the Scheduler will filter out all nodes that do not match this configuration.

For instance, let’s target nodes with the label kind.zone: zone1. In this scenario, the k8slab-worker and k8slab-worker2 nodes are the final candidates for the Pod's placement.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-selector
spec:
  replicas: 10
  selector:
    matchLabels:
      app: nodeSelector
  template:
    metadata:
      labels:
        app: nodeSelector
    spec:
      containers:
      - name: www-server
        image: hwchiu/netutils
      nodeSelector:
        kind.zone: zone1

Deploying the Deployment above with 10 replicas, all 10 Pods end up on nodes within kind.zone: zone1.

It is important to note that NodeSelector’s conditions must be met. If no nodes are found that fulfill the specified requirements, the Pods will not be able to run and will remain in a Pending state.

NodeAffinity

Node Affinity can be seen as an enhanced version of NodeSelector, still based on Node Labels for node selection but offering more fine-grained operations. Besides having two different selection strategies, it provides more flexibility in the decision-making process for Labels, going beyond simple “Equal” comparisons.

Currently, Node Affinity offers two selection strategies: requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution.

  • requiredDuringSchedulingIgnoredDuringExecution: Similar to the NodeSelector, this is a hard requirement. If no nodes meet the specified conditions, the Pod will be in a Pending state.
  • preferredDuringSchedulingIgnoredDuringExecution: On the other hand, this strategy is more flexible, aiming to satisfy the condition if possible. If not met, the Scheduler will fall back to the standard Scheduling process.

As these operations are closely related to Node Labels, some may wonder if dynamically adjusting Node Labels affects running Pods. This is where IgnoredDuringExecution comes into play, as it indicates that these decisions will ignore any label changes during execution. Once Pods are selected and assigned, the Scheduler will no longer interfere.

requiredDuringSchedulingIgnoredDuringExecution

It is recommended to use kubectl explain to explore any unfamiliar fields and learn their concepts and usage.

In the example below, you can observe that nodeSelectorTerms under requiredDuringSchedulingIgnoredDuringExecution describes the nodes that meet the conditions as a list. If multiple conditions are specified, they operate under the "OR" relationship, meaning that a node will be considered by the Scheduler if it satisfies any one of the specified conditions.

$ kubectl explain pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution
KIND: Pod
VERSION: v1

RESOURCE: requiredDuringSchedulingIgnoredDuringExecution <Object>

DESCRIPTION:
If the affinity requirements specified by this field are not met at
scheduling time, the pod will not be scheduled onto the node. If the
affinity requirements specified by this field cease to be met at some point
during pod execution (e.g. due to an update), the system may or may not try
to eventually evict the pod from its node.

A node selector represents the union of the results of one or more label
queries over a set of nodes; that is, it represents the OR of the selectors
represented by the node selector terms.

FIELDS:
nodeSelectorTerms <[]Object> -required-
Required. A list of node selector terms. The terms are ORed.

Continuing down, we can see that nodeSelectorTerms supports two formats: Node Labels and Node Fields. This means that at the API level, it's not limited to using only labels.

$ kubectl explain pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms
KIND: Pod
VERSION: v1

RESOURCE: nodeSelectorTerms <[]Object>

DESCRIPTION:
Required. A list of node selector terms. The terms are ORed.

A null or empty node selector term matches no objects. The requirements of
them are ANDed. The TopologySelectorTerm type implements a subset of the
NodeSelectorTerm.

FIELDS:
matchExpressions <[]Object>
A list of node selector requirements by node's labels.

matchFields <[]Object>
A list of node selector requirements by node's fields.
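
matchFields is not used anywhere in this article, but for completeness: it selects on node fields rather than labels, and in practice metadata.name is the supported field. Below is a minimal sketch (not taken from the article; the node name simply reuses the lab node from the earlier NodeName example):

# Sketch only: select a node by its metadata.name field instead of a label
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchFields:
        - key: metadata.name
          operator: In
          values:
          - k8slab-worker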

Let’s now proceed to explore the introduction and usage of matchExpressions.

$ kubectl explain pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms.matchExpressions
KIND: Pod
VERSION: v1

RESOURCE: matchExpressions <[]Object>

DESCRIPTION:
A list of node selector requirements by node's labels.

A node selector requirement is a selector that contains values, a key, and
an operator that relates the key and values.

FIELDS:
key <string> -required-
The label key that the selector applies to.

operator <string> -required-
Represents a key's relationship to a set of values. Valid operators are In,
NotIn, Exists, DoesNotExist. Gt, and Lt.

Possible enum values:
- `"DoesNotExist"`
- `"Exists"`
- `"Gt"`
- `"In"`
- `"Lt"`
- `"NotIn"`

values <[]string>
An array of string values. If the operator is In or NotIn, the values array
must be non-empty. If the operator is Exists or DoesNotExist, the values
array must be empty. If the operator is Gt or Lt, the values array must
have a single element, which will be interpreted as an integer. This array
is replaced during a strategic merge patch.

Under matchExpressions, you will find the usage of three fields:

  1. Key (Label Key)
  2. Operator
  3. Value (Label Value)

These fields are used to determine whether a node meets the specified condition. In comparison to NodeSelector, this approach offers more diverse conditions, such as Exists, NotIn, Gt, and others. Please refer to the Operator definitions to learn more about each Operator and their meanings.
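Operators such as Gt and Lt never appear in this article's examples. As a quick illustration only, the following snippet assumes a hypothetical numeric label (gpu-count) has been added to some nodes; it is a sketch, not part of the lab setup:

# Sketch only: gpu-count is a hypothetical label; Gt/Lt take exactly one
# value, which is interpreted as an integer.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: gpu-count
          operator: Gt
          values:
          - "1"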

The following example requires that the Pods be deployed only to nodes that have the kind.zone label.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-affinity-1
spec:
  replicas: 10
  selector:
    matchLabels:
      app: node-affinity-1
  template:
    metadata:
      labels:
        app: node-affinity-1
    spec:
      containers:
      - name: www-server
        image: hwchiu/netutils
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kind.zone
                operator: Exists
                values: []

The deployment result is shown below. All four worker nodes carry the kind.zone label, so they all meet the requirement.

Node Affinity with Required Match

Since the terms under nodeSelectorTerms are ORed, the following YAML expresses multiple terms that are evaluated with OR logic and yields a similar deployment outcome to the previous example.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-affinity-2
spec:
  replicas: 10
  selector:
    matchLabels:
      app: node-affinity-2
  template:
    metadata:
      labels:
        app: node-affinity-2
    spec:
      containers:
      - name: www-server
        image: hwchiu/netutils
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kind.zone
                operator: In
                values:
                - zone1
            - matchExpressions:
              - key: kind.zone
                operator: In
                values:
                - zone2

Other considerations:

  • If the intention is to achieve anti-affinity, keeping Pods away from specific nodes, use reverse operators such as NotIn and DoesNotExist (a short sketch follows the Pending example below).
  • NodeAffinity can be used in conjunction with NodeSelector, but both conditions must be satisfied when used together.
  • When there are multiple sets of (key/operator/value) conditions under MatchExpressions, the conditions are evaluated using AND logic.

For example, the following expression requires a node's kind.zone label to be both zone1 and zone2 at the same time. Since a label key can hold only a single value, no node can satisfy this condition, and all Pods remain in a Pending state.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-affinity-3
spec:
  replicas: 10
  selector:
    matchLabels:
      app: node-affinity-3
  template:
    metadata:
      labels:
        app: node-affinity-3
    spec:
      containers:
      - name: www-server
        image: hwchiu/netutils
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kind.zone
                operator: In
                values:
                - zone1
              - key: kind.zone
                operator: In
                values:
                - zone2

$ kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
node-affinity-3-7c5574dd67-7spjw   0/1     Pending   0          71s
node-affinity-3-7c5574dd67-98pbj   0/1     Pending   0          71s
node-affinity-3-7c5574dd67-9bgfr   0/1     Pending   0          71s
node-affinity-3-7c5574dd67-b9bs4   0/1     Pending   0          71s
node-affinity-3-7c5574dd67-bqm8m   0/1     Pending   0          71s
node-affinity-3-7c5574dd67-glmkx   0/1     Pending   0          71s
node-affinity-3-7c5574dd67-hcmgp   0/1     Pending   0          71s
node-affinity-3-7c5574dd67-l45bv   0/1     Pending   0          71s
node-affinity-3-7c5574dd67-mn7js   0/1     Pending   0          71s
node-affinity-3-7c5574dd67-th2pd   0/1     Pending   0          71s
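
Following up on the first consideration above, node anti-affinity is expressed with the reverse operators. The snippet below is a minimal sketch (not taken from the article) that keeps Pods away from the zone1 nodes by using NotIn:

# Sketch only: exclude any node whose kind.zone label value is zone1.
# Nodes without a kind.zone label at all also satisfy NotIn.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kind.zone
          operator: NotIn
          values:
          - zone1

With the lab's labels, this would steer the Pods toward the zone2 workers.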

preferredDuringSchedulingIgnoredDuringExecution

Compared to requiredDuringSchedulingIgnoredDuringExecution, the preferred version is a preference-based setting. Nodes that meet the criteria are given preference, but it is not mandatory. As a result, unless there are Taints or node resource constraints, Pods will rarely encounter a Pending state.

$ kubectl explain pod.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution
KIND: Pod
VERSION: v1

RESOURCE: preferredDuringSchedulingIgnoredDuringExecution <[]Object>

DESCRIPTION:
The scheduler will prefer to schedule pods to nodes that satisfy the
affinity expressions specified by this field, but it may choose a node that
violates one or more of the expressions. The node that is most preferred is
the one with the greatest sum of weights, i.e. for each node that meets all
of the scheduling requirements (resource request, requiredDuringScheduling
affinity expressions, etc.), compute a sum by iterating through the
elements of this field and adding "weight" to the sum if the node matches
the corresponding matchExpressions; the node(s) with the highest sum are
the most preferred.

An empty preferred scheduling term matches all objects with implicit weight
0 (i.e. it's a no-op). A null preferred scheduling term matches no objects
(i.e. is also a no-op).

FIELDS:
preference <Object> -required-
A node selector term, associated with the corresponding weight.

weight <integer> -required-
Weight associated with matching the corresponding nodeSelectorTerm, in the
range 1-100.

Here, we can observe that the preferred strategy uses the preference field to describe a node selector term. Since this is a preference rather than a hard requirement, it also introduces the concept of weight (1 to 100), allowing you to bias Pod distribution toward the terms with the higher total weight.

The following example uses two rules with different weights, so that Pods are deployed to nodes carrying kind.zone: zone2 whenever possible.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-affinity-4
spec:
  replicas: 10
  selector:
    matchLabels:
      app: node-affinity-4
  template:
    metadata:
      labels:
        app: node-affinity-4
    spec:
      containers:
      - name: www-server
        image: hwchiu/netutils
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: kind.zone
                operator: In
                values:
                - zone1
          - weight: 4
            preference:
              matchExpressions:
              - key: kind.zone
                operator: In
                values:
                - zone2

The deployment result is shown in the diagram: the majority of Pods land on k8slab-worker3 and k8slab-worker4, the zone2 nodes, which matches the expected outcome.

Node Affinity with Preferred Match

Both required and preferred configurations can be combined. For example, the following YAML achieves the following:

  1. The service can only be deployed to nodes containing kind.zone: zone1.
  2. Among those nodes, prefer the one named k8slab-worker2 by giving it a preferred weight.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-affinity-5
spec:
  replicas: 10
  selector:
    matchLabels:
      app: node-affinity-5
  template:
    metadata:
      labels:
        app: node-affinity-5
    spec:
      containers:
      - name: www-server
        image: hwchiu/netutils
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kind.zone
                operator: In
                values:
                - zone1
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 4
            preference:
              matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - k8slab-worker2
Node Affinity with both Required and Preferred

Summary

This article explores three node-centric configuration methods:

  1. NodeName
  2. NodeSelector
  3. NodeAffinity

While NodeName relies on the node’s name for configuration, the subsequent methods are based on Node Labels, providing more flexible settings. NodeAffinity can be considered an enhanced version of NodeSelector, offering both “Required” and “Preferred” modes for finer-grained operations.

The next article will continue the discussion, focusing on Inter-Pod Affinity and Pod TopologySpreadConstraints.
