Kubernetes 101: Assigning Pod to Nodes (II)

HungWei Chiu
12 min read · Aug 12, 2023

This article explores the various mechanisms within Kubernetes that can influence how Pods are assigned to Nodes.

For readers who are unfamiliar with the environment and previous sections, please refer to the preceding article for context.

Kubernetes 101: Assigning Pod to Nodes (I)

Inter-Pod (Anti)Affinity

Inter-Pod Affinity and Anti-Affinity allow users to restrict Pod scheduling to specific Nodes based on the labels of Pods already running on those nodes, rather than the labels of the Nodes themselves.

So, the comparison is no longer just about Nodes but also involves the labels of Pods.

More precisely, in terms of implementation, the objects evaluated during scheduling are not individual Nodes, but the groups that result from grouping them.

For example, we can have different grouping results as follows:

Example of TopologyKey (i)
Example of TopologyKey (ii)
Example of TopologyKey (iii)

Grouping is achieved using labels on Nodes: Nodes that share the same value for a given label key are classified into the same group. Which key is used is defined by the “topologyKey” field.

For instance, if we set topologyKey = kubernetes.io/hostname, grouping will be based on the hostname of the Nodes. As no two Nodes have the same hostname, the four Nodes will automatically form four separate groups, as shown in the first diagram. However, if we use topologyKey = kind.zone, grouping will be based on the kind.zone. Consequently, the grouping results will appear as shown in the second diagram, with a final division into two groups.
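
For instance, in a kind cluster such as the one used in the previous article, the kind.zone labels could be applied by hand; the node names below are assumptions and depend on your own cluster:

$ kubectl label node kind-worker kind-worker2 kind.zone=zone1
$ kubectl label node kind-worker3 kind-worker4 kind.zone=zone2
$ kubectl get nodes -L kind.zone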

Hence, when utilizing Inter-Pod (Anti)Affinity, it’s necessary to consider how to group the Nodes. Common methods include:

  • Grouping by Node names.
  • Grouping by Zone, especially in cloud environments where services should be deployable across different Zones for high availability. It’s important to note that the official documentation mentions that if some Nodes lack the label key described by topologyKey, unexpected behavior might arise. Therefore, ensuring that all Nodes have the corresponding Label key is essential.

Similar to NodeAffinity, Inter-Pod Affinity also offers two different selection strategies: “Required” and “Preferred”.

$ kubectl explain pod.spec.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution
KIND: Pod
VERSION: v1

RESOURCE: requiredDuringSchedulingIgnoredDuringExecution <[]Object>

DESCRIPTION:
If the affinity requirements specified by this field are not met at
scheduling time, the pod will not be scheduled onto the node. If the
affinity requirements specified by this field cease to be met at some point
during pod execution (e.g. due to a pod label update), the system may or
may not try to eventually evict the pod from its node. When there are
multiple elements, the lists of nodes corresponding to each podAffinityTerm
are intersected, i.e. all terms must be satisfied.

Defines a set of pods (namely those matching the labelSelector relative to
the given namespace(s)) that this pod should be co-located (affinity) or
not co-located (anti-affinity) with, where co-located is defined as running
on a node whose value of the label with key <topologyKey> matches that of
any node on which a pod of the set of pods is running

FIELDS:
labelSelector <Object>
A label query over a set of resources, in this case pods.

namespaceSelector <Object>
A label query over the set of namespaces that the term applies to. The term
is applied to the union of the namespaces selected by this field and the
ones listed in the namespaces field. null selector and null or empty
namespaces list means "this pod's namespace". An empty selector ({})
matches all namespaces.

namespaces <[]string>
namespaces specifies a static list of namespace names that the term applies
to. The term is applied to the union of the namespaces listed in this field
and the ones selected by namespaceSelector. null or empty namespaces list
and null namespaceSelector means "this pod's namespace".

topologyKey <string> -required-
This pod should be co-located (affinity) or not co-located (anti-affinity)
with the pods matching the labelSelector in the specified namespaces, where
co-located is defined as running on a node whose value of the label with
key topologyKey matches that of any node on which any of the selected pods
is running. Empty topologyKey is not allowed.
$ kubectl explain pod.spec.affinity.podAffinity.preferredDuringSchedulingIgnoredDuringExecution.podAffinityTerm
KIND: Pod
VERSION: v1

RESOURCE: podAffinityTerm <Object>

DESCRIPTION:
Required. A pod affinity term, associated with the corresponding weight.

Defines a set of pods (namely those matching the labelSelector relative to
the given namespace(s)) that this pod should be co-located (affinity) or
not co-located (anti-affinity) with, where co-located is defined as running
on a node whose value of the label with key <topologyKey> matches that of
any node on which a pod of the set of pods is running

FIELDS:
labelSelector <Object>
A label query over a set of resources, in this case pods.

namespaceSelector <Object>
A label query over the set of namespaces that the term applies to. The term
is applied to the union of the namespaces selected by this field and the
ones listed in the namespaces field. null selector and null or empty
namespaces list means "this pod's namespace". An empty selector ({})
matches all namespaces.

namespaces <[]string>
namespaces specifies a static list of namespace names that the term applies
to. The term is applied to the union of the namespaces listed in this field
and the ones selected by namespaceSelector. null or empty namespaces list
and null namespaceSelector means "this pod's namespace".

topologyKey <string> -required-
This pod should be co-located (affinity) or not co-located (anti-affinity)
with the pods matching the labelSelector in the specified namespaces, where
co-located is defined as running on a node whose value of the label with
key topologyKey matches that of any node on which any of the selected pods
is running. Empty topologyKey is not allowed.

Consequently, the setup logic and format are almost identical. The only notable differences are:

  1. It’s necessary to use topologyKey to specify how to group Nodes.
  2. Since the decision is based on Pod labels, and Pods are namespaced objects, only Pods in the same namespace as the incoming Pod are compared by default. If there’s a specific requirement, you can use namespaceSelector or namespaces to target other namespaces, and all Pods within those namespaces will be considered, as sketched below.
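
As a rough sketch of a term that also matches Pods in other namespaces (the app: backend Pod label and the team: platform namespace label are hypothetical):

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: backend
      namespaceSelector:
        matchLabels:
          team: platform
      topologyKey: kind.zone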

Furthermore, there are some differences in implementation between Anti-Affinity and Affinity, concerning symmetry. The detailed information can be found in the official design documents.

For Anti-Affinity, if service A doesn’t want to be scheduled together with service B, it implies that service B also doesn’t want to be scheduled with service A. However, for Affinity, there isn’t this kind of symmetry. Hence, the deployment algorithms for the two have slight differences. The following excerpts are from the official design documents:

Anti-Affinity:

If service S1 has the aforementioned RequiredDuringScheduling anti-affinity rule:

  1. If a node is empty, you can schedule either S1 or S2 onto the node.
  2. If a node is running S1 (S2), you cannot schedule S2 (S1) onto the node.

This means that if service A uses Anti-Affinity to avoid being scheduled together with service B, then deploying either service A or service B checks whether the rule is violated; if not, the deployment proceeds without issue.

Affinity:

If service S1 has the aforementioned RequiredDuringScheduling affinity rule:

  1. If a node is empty, you can schedule S2 onto the node.
  2. If a node is empty, you cannot schedule S1 onto the node.
  3. If a node is running S2, you can schedule S1 onto the node.
  4. If a node is running both S1 and S2 and S1 terminates, S2 continues running.
  5. If a node is running both S1 and S2 and S2 terminates, the system terminates S1 (eventually).

In the same example, if service A uses Affinity to request scheduling together with service B, according to the second rule, if service B doesn’t exist, service A will be stuck and cannot be scheduled, resulting in a Pending state.

Anti-Affinity

The first example uses the “required” approach to restrict scheduling, with the Deployment’s own Pods as the target.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-affinity-1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pod-affinity-1
  template:
    metadata:
      labels:
        app: pod-affinity-1
    spec:
      containers:
      - name: www-server
        image: hwchiu/netutils
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - pod-affinity-1
            topologyKey: "kind.zone"

  1. Based on kind.zone as the TopologyKey
  2. Using AntiAffinity with Required, demanding that no second Pod exist in the same topologyKey group
  3. Deploying three replicas

As described by the symmetry rule earlier, if no Pods in the current environment match the rule, deployment can proceed freely. Therefore, the first Pod is deployed successfully.

When deploying the second Pod, it’s observed that Pod1 has already occupied the left side, leaving only the right group available.

When deploying the third Pod, since both groups already have Pods running on top and there are no other nodes meeting the conditions in the environment, it eventually gets stuck in Pending.

From the results, it’s observed that with TopologyKey=kind.zone, the cluster can only form two groups. The third Pod therefore cannot be deployed because of AntiAffinity + Required, and stays Pending. When using this pattern, pay particular attention to the relationship between the number of replicas and the number of groups, especially when replicas are adjusted dynamically via HPA, which can easily push the replica count past the number of groups and lead to errors.

If the “required” condition is changed to “preferred,” the Pods can still be spread out without encountering Pending situations. This is because the conditions are no longer strict requirements, only preferences. Therefore, the first two Pods will try to be distributed across the groups, and the third Pod still has a chance of being scheduled.
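
As a sketch, the preferred variant of the affinity block could look like the following (the weight value and the pod-affinity-2 name are assumptions, not the exact manifest used for the diagram):

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - pod-affinity-2
        topologyKey: "kind.zone"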

In this scenario, the result might look like the following diagram:

The second example involves preparing two files to simulate Anti-Affinity settings between services A and B.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-affinity-3
spec:
  replicas: 3
  selector:
    matchLabels:
      app: pod-affinity-3
  template:
    metadata:
      labels:
        app: pod-affinity-3
    spec:
      containers:
      - name: www-server
        image: hwchiu/netutils
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - pod-affinity-4
            topologyKey: "kind.zone"

This example again uses the Anti-Affinity + required approach, but this time the target is a different service, pod-affinity-4, which does not currently exist in the environment.

There won’t be any issues with deployment, and all Pods can run normally. Next, let’s try deploying a pod-affinity-4 service without any Affinity settings.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-affinity-4
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pod-affinity-4
  template:
    metadata:
      labels:
        app: pod-affinity-4
    spec:
      containers:
      - name: www-server
        image: hwchiu/netutils

It can be observed that now the pod-affinity-4 service is stuck in Pending.

Using “kubectl describe”, it’s evident that this is because the new Pod does not satisfy the Anti-Affinity rules of the Pods already running.

This is the symmetry described in the design document. Therefore, even if subsequent services don’t explicitly write Anti-Affinity, the scheduling process will still consider information from other running Pods to determine whether it can be scheduled.

Affinity

Next, based on the same concept, let’s test Affinity. Service A depends on Service B (pod-affinity-6).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-affinity-5
spec:
  replicas: 5
  selector:
    matchLabels:
      app: pod-affinity-5
  template:
    metadata:
      labels:
        app: pod-affinity-5
    spec:
      containers:
      - name: www-server
        image: hwchiu/netutils
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - pod-affinity-6
            topologyKey: "kind.zone"

Unlike Anti-Affinity, if the target service cannot be found in the current environment, a Pod using required Affinity gets stuck in a Pending state and cannot be scheduled.

Note: Referencing one’s own labels is treated as a special case and does not result in Pending; otherwise the first replica could never be scheduled and the workload would deadlock.
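
As a rough sketch of that special case (the pod-affinity-self name is hypothetical), a Deployment whose Pods reference their own label can still get its first replica scheduled, because an incoming Pod is allowed to satisfy its own affinity term:

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - pod-affinity-self   # matches this Deployment's own Pod label
      topologyKey: "kind.zone"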

Next, deploy Service B.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-affinity-6
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pod-affinity-6
  template:
    metadata:
      labels:
        app: pod-affinity-6
    spec:
      containers:
      - name: www-server
        image: hwchiu/netutils

After deployment, it can be observed that Services A and B almost simultaneously complete their scheduling decisions and are assigned to the same kind.zone.

During the process, initially, Service A gets stuck in a Pending state as it can’t find a matching Service B.

Meanwhile, Service B itself doesn’t have any Affinity rules defined, so it gets scheduled successfully.

When Service B is scheduled to kind.zone=zone1, all the stuck Service A instances now have a reference object to compare against, leading to the direct deployment of all Pods.

PodTopologySpread

The discussed concepts of NodeAffinity and Inter-Pod (Anti)Affinity can satisfy many users’ needs for controlling Pod scheduling. However, practical usage may encounter some issues:

  1. Using NodeAffinity does not guarantee an even distribution of Pods across nodes; it might lead to uneven distribution scenarios (66981).
  2. Inter-Pod Anti-Affinity encounters problems during Deployment rolling upgrades. New Pods need to be created first, but due to Anti-Affinity restrictions, no available nodes can accommodate them. As a result, the new version of Pods remains in a Pending state, leading to the entire Deployment update getting stuck (40358).

Due to the above issues, the development of Pod Topology Spread emerged. The most critical factor within Pod Topology Spread is called Skew. This value is used to address uneven Pod distribution and is defined as:

skew = (number of matching Pods in the current topology) - (minimum number of matching Pods across all topologies)

Whenever PodTopologySpread places a Pod, it calculates the current skew for the topology domain each candidate node belongs to and uses this value to influence the scheduling decision.

The example below illustrates this:

  1. topologyKey divides nodes into three topologies.
  2. The current number of running Pods on each topology is 3, 2, 1 respectively.
  3. The minimum number of running Pods in topology is 1.
  4. Calculating the Skew value for each topology results in 2, 1, 0.
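
Plugging those numbers into the formula (the zone names are placeholders for the three topologies):

skew(zoneA) = 3 - 1 = 2
skew(zoneB) = 2 - 1 = 1
skew(zoneC) = 1 - 1 = 0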

The following field description is obtained through kubectl explain. Due to space limitations, only the key fields are retained here. In reality, there are many more options, some of which are still in Beta even as of version 1.26.

$ kubectl explain pod.spec.topologySpreadConstraints
KIND: Pod
VERSION: v1

RESOURCE: topologySpreadConstraints <[]Object>

DESCRIPTION:
TopologySpreadConstraints describes how a group of pods ought to spread
across topology domains. Scheduler will schedule pods in a way which abides
by the constraints. All topologySpreadConstraints are ANDed.

TopologySpreadConstraint specifies how to spread matching pods among the
given topology.

FIELDS:
labelSelector <Object>
LabelSelector is used to find matching pods. Pods that match this label
selector are counted to determine the number of pods in their corresponding
topology domain.

maxSkew <integer> -required-
MaxSkew describes the degree to which pods may be unevenly distributed.
required field. Default value is 1 and 0 is not allowed.


topologyKey <string> -required-
TopologyKey is the key of node labels. Nodes that have a label with this
key and identical values are considered to be in the same topology. We
consider each <key, value> as a "bucket", and try to put balanced number of
pods into each bucket.

whenUnsatisfiable <string> -required-
WhenUnsatisfiable indicates how to deal with a pod if it doesn't satisfy
the spread constraint. Possible enum values:
- `"DoNotSchedule"` instructs the scheduler not to schedule the pod when
constraints are not satisfied.
- `"ScheduleAnyway"` instructs the scheduler to schedule the pod even if
constraints are not satisfied.
  1. maxSkew controls the upper limit of skew. If placing the Pod on a node would push that topology's skew above this limit, the node is skipped.
  2. topologyKey, similar to the Inter-Pod settings, determines how nodes are grouped.
  3. whenUnsatisfiable determines what to do when no node satisfies the maxSkew constraint: the Pod either stays Pending (DoNotSchedule) or is scheduled anyway based on the other criteria (ScheduleAnyway).
  4. labelSelector determines which Pods are counted when calculating skew.

Therefore, it is common to use the following approach to evenly distribute Pods across different zones:

spec:
  containers:
  - name: www-server
    image: hwchiu/netutils
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kind.zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: pod-ts-1

For fine-tuning, there are additional parameters available (a combined sketch follows the list below), but before using them, please refer to the official documentation.

  1. nodeAffinityPolicy: [Honor|Ignore]
  2. nodeTaintsPolicy: [Honor|Ignore]
  3. matchLabelKeys: <list>
  4. minDomains: <integer>
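
A sketch that combines these parameters, reusing the app: pod-ts-1 labels from the earlier example; whether each field is available and which defaults apply depends on your Kubernetes version and feature gates:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kind.zone
  whenUnsatisfiable: DoNotSchedule
  minDomains: 2                 # only takes effect with DoNotSchedule
  nodeAffinityPolicy: Honor     # count only nodes matching the Pod's nodeAffinity/nodeSelector
  nodeTaintsPolicy: Ignore      # include tainted nodes when counting
  matchLabelKeys:
  - pod-template-hash           # distinguish revisions during rolling updates
  labelSelector:
    matchLabels:
      app: pod-ts-1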

Summary

This series of articles explores how to influence the Scheduler’s scheduling decisions through Kubernetes built-in methods. Such configurations can achieve better high availability settings in structures like Zones/Regions. Additionally, if Pods have strong dependencies and require low network latency, these settings can be considered.

When using these methods, it’s essential to understand the meaning of each field, especially when parameters are of List type. Confirm whether the entries are combined with OR or AND semantics, so you don't spend excessive time debugging outcomes that don't match expectations.
