Introduction to Kubernetes Resources, Capacity and Allocatable
Introduction
Kubernetes Pods can influence how they are scheduled and how they run through resource requests and limits. The request is used by the scheduler to find a node with sufficient resources, while the limit caps how much of a resource the Pod may actually consume.
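As a quick illustration (the Pod name, image, and numbers below are arbitrary examples, not values used later in this article), requests and limits are declared per container in the Pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo        # arbitrary example name
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:              # used by the scheduler to pick a node
        cpu: 100m
        memory: 128Mi
      limits:                # enforced at runtime; exceeding the memory limit can trigger OOM
        cpu: 200m
        memory: 256Mi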
By using “kubectl describe node”, you can observe the resource allocation on each node. There are two important concepts related to resource allocation: Capacity and Allocatable.
This article will examine the differences between these two concepts and what to pay attention to in practical applications.
For a Kubernetes node, the available resources usually include CPU, Memory, Ephemeral-Storage, etc. The total amount of resources available on the node is referred to as Capacity, while the total amount that can be allocated to Kubernetes Pods is referred to as Allocatable.
You can access this information through “kubectl describe node”. For example, here is the node information from a default cluster installed with kubeadm in my VM:
Note: All environments in this article are based on Kubernetes v1.21.8.
Capacity:
cpu: 2
ephemeral-storage: 64800356Ki
hugepages-2Mi: 0
memory: 4039556Ki
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 59720007991
hugepages-2Mi: 0
memory: 3937156Ki
pods: 110
The above information is maintained and reported by the kubelet on each node. From the kubelet’s point of view, Capacity is divided into Allocatable plus several reserved portions, which are described below.
System-Reserved
A Kubernetes node does not only run containers for Kubernetes applications; the system itself also has services to run, such as sshd, dhclient, and other system daemons.
Take CPU as an example: if all of the node’s CPU were allocated to Kubernetes Pods, could there be too little left to keep basic system services such as sshd running?
System-Reserved is designed for this scenario. Its purpose is to tell the kubelet to reserve some resources for system-related services rather than allocating all of the node’s resources to Pods.
- Default value: Not enabled by default unless specified
- Configuration: kubelet sets the reserved amounts through “--system-reserved” (example below)
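For example, the same reservation can be expressed on the kubelet command line (the values here are purely illustrative):

--system-reserved=cpu=250m,memory=500Mi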
Kube-Reserved
The concept of Kube-Reserved is exactly the same as System-Reserved, except that Kube-Reserved targets Kubernetes-related applications. The simplest example is the kubelet itself.
- Default value: Not enabled by default unless specified
- Configuration: kubelet sets the reserved amounts through “--kube-reserved” (example below)
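Again as a command-line sketch with illustrative values:

--kube-reserved=cpu=500m,memory=1Gi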
Eviction-Thresholds
For example, when memory usage on a node is too high, it may lead to OOM situations in which running Kubernetes Pods are killed by the kernel.
To reduce the likelihood of this problem, the kubelet starts evicting running Pods when it finds that node resources are running low, so that the scheduler can try to place those Pods on another node with more headroom.
Eviction Thresholds are therefore resource thresholds: when the available resources on the node fall below a threshold, the eviction mechanism is triggered.
- Default value: Enabled by default
  - Memory is 100MiB
  - Ephemeral-Storage is 10%
- Configuration: kubelet sets different thresholds through “--eviction-hard” (example below)
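As a command-line sketch, using only the two default thresholds discussed above:

--eviction-hard=memory.available<100Mi,nodefs.available<10%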
Capacity:
cpu: 2
ephemeral-storage: 64800356Ki
hugepages-2Mi: 0
memory: 4039556Ki
pods: 110
Allocatable:
cpu: 2
ephemeral-storage: 59720007991
hugepages-2Mi: 0
memory: 3937156Ki
pods: 110
Let’s review the above resources again!
The relationship among these node resources is:
Capacity = Allocatable + System-Reserved + Kube-Reserved + Eviction-Thresholds
As mentioned earlier, System-Reserved and Kube-Reserved are disabled by default, hence
Capacity = Allocatable + 0 + 0 + Eviction-Thresholds
which gives us the equation below:
Eviction-Thresholds = Capacity - Allocatable
- CPU: no difference in this case.
- Memory: 4039556Ki-3937156Ki, the difference is 102400Ki, which is 100Mi
- Ephemeral-Storage: 64800356Ki vs. 59720007991 bytes; converting the former to bytes by multiplying by 1024 gives 66355564544, so 59720007991 / 66355564544 is approximately 0.8999999851, or roughly 0.9
Therefore, the actual capacity that can be allocated is only 90%.
Note: Mi, Ki, Gi are base 1024, not 1000
The above values are exactly the defaults mentioned in the Eviction-Thresholds section.
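If you prefer to pull these numbers straight from the API rather than reading the describe output, here is a small sketch (replace the node name with your own):

# memory capacity vs. allocatable, taken directly from the node object
kubectl get node <node-name> -o jsonpath='{.status.capacity.memory} {.status.allocatable.memory}{"\n"}'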
Experiment
Add the following content to /var/lib/kubelet/config.yaml, targeting the Kube-Reserved, System-Reserved, and evictionHard parameters:
Note: The Eviction Threshold setting is actually configured through the evictionHard parameter.
systemReserved:
  memory: 500Mi
  cpu: 250m
kubeReserved:
  memory: 1Gi
  cpu: 500m
evictionHard:
  memory.available: 200Mi
  nodefs.available: 20Gi
Let’s calculate how much the configuration reserves in total, as shown in the following table. Note that memory is specified in both Gi and Mi; since 1Gi is 1024Mi, the memory total is 500Mi + 1024Mi + 200Mi = 1724Mi.
| Type    | System-Reserved | Kube-Reserved | EvictionHard | Total  |
| ------- | --------------- | ------------- | ------------ | ------ |
| CPU     | 250m            | 500m          | 0            | 750m   |
| Memory  | 500Mi           | 1Gi           | 200Mi        | 1724Mi |
| Storage | 0               | 0             | 20Gi         | 20Gi   |
Then restart kubelet with sudo systemctl restart kubelet to load the new settings. Once it is running again, use kubectl describe node to observe the changes.
Capacity:
cpu: 2
ephemeral-storage: 64800356Ki
hugepages-2Mi: 0
memory: 4039556Ki
pods: 110
Allocatable:
cpu: 1250m
ephemeral-storage: 43828836Ki
hugepages-2Mi: 0
memory: 2274180Ki
pods: 110
Now let’s calculate the value again
System-Reserved + Kube-Reserved + Eviction-Thresholds = Capacity - Allocatable
- CPU: 2 means 2000m, so 2000m - 1250m = 750m.
- Memory: 4039556Ki-2274180Ki = 1765376Ki, then 1765376Ki/1024 = 1724 Mi.
- Storage: 64800356Ki-43828836Ki = 20971520Ki, then 20971520Ki/1024/1024 = 20 Gi
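If you want to sanity-check the arithmetic from a shell, here is a quick sketch using the Ki values printed above:

# differences between Capacity and Allocatable, in Ki
echo "$(( 4039556 - 2274180 )) Ki"    # 1765376 Ki = 1724 Mi
echo "$(( 64800356 - 43828836 )) Ki"  # 20971520 Ki = 20 Gi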
The result of the calculation is consistent with expectations. The following diagram summarizes the above concepts.
Enforce Node Allocatable
After understanding the basic concepts and calculation methods of Capacity and Allocatable, the next step is to understand the more detailed application control.
In fact, the meaning of the system-reserved and kube-reserved parameters is: “ask the kubelet to reserve the specified amounts of system resources so that Kubernetes Pods cannot occupy too much.”
Now, let’s ask a few questions to think about:
- What happens if a system application or a Kubernetes-related application uses more system resources than the set (system-reserved, kube-reserved) parameters?
- What type of application belongs to system-reserved?
- What type of application belongs to kube-reserved?
- Can self-developed applications be added to one of the categories?
With the default configuration, the answers to the above questions are:
- Nothing will happen; there are no consequences.
- Since exceeding the reservation has no consequences, there is no point in classifying applications into either category.
- For the same reason, there is no point in putting your own applications under either category.
So what should you do if you want an application in the system-reserved or kube-reserved category to be killed when it uses too much?
At this point, you need another kubelet parameter, --enforce-node-allocatable, which accepts three values that can be combined: pods, system-reserved, and kube-reserved.
This parameter tells kubelet which categories should have their resource usage enforced, i.e. whose processes get killed by the system when they exceed their share.
The default value is pods, which is why a Pod that sets a request/limit and exceeds its limit may trigger OOM and be killed directly by the system. By default, system-reserved and kube-reserved are not included, so exceeding those reservations causes no problem.
In terms of implementation, kubelet does not do any monitoring or make any decisions about which application to kill. It simply writes these settings into cgroups (Control Groups) and lets the kernel handle the enforcement.
Therefore, the documentation tells you that if you include system-reserved or kube-reserved in enforce-node-allocatable, you must also set the parameters --system-reserved-cgroup and --kube-reserved-cgroup respectively.
Add the related configuration to /var/lib/kubelet/config.yaml as below.
systemReservedCgroup: /system.slice
enforceNodeAllocatable:
- pods
- system-reserved
The above example instructs kubelet to enforce resource limits for applications belonging to the system-reserved group, which is defined as all applications under the cgroup path /system.slice.
In my Ubuntu 18.04 environment, the following services are present under /system.slice in the cpu cgroup subsystem.
○ → lscgroup cpu:/system.slice
cpu,cpuacct:/system.slice/
cpu,cpuacct:/system.slice/irqbalance.service
cpu,cpuacct:/system.slice/systemd-update-utmp.service
cpu,cpuacct:/system.slice/vboxadd-service.service
cpu,cpuacct:/system.slice/lvm2-monitor.service
cpu,cpuacct:/system.slice/systemd-journal-flush.service
cpu,cpuacct:/system.slice/containerd.service
cpu,cpuacct:/system.slice/systemd-sysctl.service
cpu,cpuacct:/system.slice/systemd-networkd.service
cpu,cpuacct:/system.slice/systemd-udevd.service
cpu,cpuacct:/system.slice/lxd-containers.service
cpu,cpuacct:/system.slice/cron.service
cpu,cpuacct:/system.slice/sys-fs-fuse-connections.mount
cpu,cpuacct:/system.slice/networking.service
cpu,cpuacct:/system.slice/sys-kernel-config.mount
cpu,cpuacct:/system.slice/docker.service
cpu,cpuacct:/system.slice/polkit.service
cpu,cpuacct:/system.slice/systemd-remount-fs.service
cpu,cpuacct:/system.slice/networkd-dispatcher.service
cpu,cpuacct:/system.slice/sys-kernel-debug.mount
cpu,cpuacct:/system.slice/accounts-daemon.service
cpu,cpuacct:/system.slice/systemd-tmpfiles-setup.service
cpu,cpuacct:/system.slice/kubelet.service
cpu,cpuacct:/system.slice/console-setup.service
cpu,cpuacct:/system.slice/vboxadd.service
cpu,cpuacct:/system.slice/systemd-journald.service
cpu,cpuacct:/system.slice/atd.service
cpu,cpuacct:/system.slice/systemd-udev-trigger.service
cpu,cpuacct:/system.slice/lxd.socket
cpu,cpuacct:/system.slice/ssh.service
cpu,cpuacct:/system.slice/dev-mqueue.mount
cpu,cpuacct:/system.slice/ufw.service
cpu,cpuacct:/system.slice/systemd-random-seed.service
cpu,cpuacct:/system.slice/snapd.seeded.service
cpu,cpuacct:/system.slice/rsyslog.service
cpu,cpuacct:/system.slice/systemd-modules-load.service
cpu,cpuacct:/system.slice/blk-availability.service
cpu,cpuacct:/system.slice/systemd-tmpfiles-setup-dev.service
cpu,cpuacct:/system.slice/rpcbind.service
cpu,cpuacct:/system.slice/lxcfs.service
cpu,cpuacct:/system.slice/grub-common.service
cpu,cpuacct:/system.slice/ebtables.service
cpu,cpuacct:/system.slice/snapd.socket
cpu,cpuacct:/system.slice/kmod-static-nodes.service
cpu,cpuacct:/system.slice/run-rpc_pipefs.mount
cpu,cpuacct:/system.slice/lvm2-lvmetad.service
cpu,cpuacct:/system.slice/docker.socket
cpu,cpuacct:/system.slice/apport.service
cpu,cpuacct:/system.slice/apparmor.service
cpu,cpuacct:/system.slice/systemd-resolved.service
cpu,cpuacct:/system.slice/system-lvm2\x2dpvscan.slice
cpu,cpuacct:/system.slice/dev-hugepages.mount
cpu,cpuacct:/system.slice/dbus.service
cpu,cpuacct:/system.slice/system-getty.slice
cpu,cpuacct:/system.slice/keyboard-setup.service
cpu,cpuacct:/system.slice/systemd-user-sessions.service
cpu,cpuacct:/system.slice/systemd-logind.service
cpu,cpuacct:/system.slice/setvtrgb.service
As you can see from the above file names, there are many system services.
After modifying the configuration and restarting kubelet, you will find that it fails to start, with a log message indicating that the /system.slice path does not exist.
kubelet.go:1391] “Failed to start ContainerManager” err=”Failed to enforce System Reserved Cgroup Limits on “/system.slice”: [“system.slice”] cgroup does not exist”
The issue is that kubelet checks the path under several cgroup subsystems, and if even one of them is missing, it treats that as an error, as the following source code shows:
func (m *cgroupManagerImpl) Exists(name CgroupName) bool {
    if libcontainercgroups.IsCgroup2UnifiedMode() {
        cgroupPath := m.buildCgroupUnifiedPath(name)
        neededControllers := getSupportedUnifiedControllers()
        enabledControllers, err := readUnifiedControllers(cgroupPath)
        if err != nil {
            return false
        }
        difference := neededControllers.Difference(enabledControllers)
        if difference.Len() > 0 {
            klog.V(4).InfoS("The cgroup has some missing controllers", "cgroupName", name, "controllers", difference)
            return false
        }
        return true
    }

    // Get map of all cgroup paths on the system for the particular cgroup
    cgroupPaths := m.buildCgroupPaths(name)

    // the presence of alternative control groups not known to runc confuses
    // the kubelet existence checks.
    // ideally, we would have a mechanism in runc to support Exists() logic
    // scoped to the set control groups it understands. this is being discussed
    // in https://github.com/opencontainers/runc/issues/1440
    // once resolved, we can remove this code.
    allowlistControllers := sets.NewString("cpu", "cpuacct", "cpuset", "memory", "systemd", "pids")
    if _, ok := m.subsystems.MountPoints["hugetlb"]; ok {
        allowlistControllers.Insert("hugetlb")
    }

    var missingPaths []string
    // If even one cgroup path doesn't exist, then the cgroup doesn't exist.
    for controller, path := range cgroupPaths {
        // ignore mounts we don't care about
        if !allowlistControllers.Has(controller) {
            continue
        }
        if !libcontainercgroups.PathExists(path) {
            missingPaths = append(missingPaths, path)
        }
    }

    if len(missingPaths) > 0 {
        klog.V(4).InfoS("The cgroup has some missing paths", "cgroupName", name, "paths", missingPaths)
        return false
    }

    return true
}
kubelet will search for the cpu, cpuacct, cpuset, memory, systemd, pids, and hugetlb subsystems in my system, but unfortunately the cpuset, hugetlb, and systemd subsystems do not include the /system.slice path.
Therefore, if you want to use this parameter for control, you need to properly configure the target cgroup path in order to start kubelet successfully.
○ → mkdir -p /sys/fs/cgroup/hugetlb/system.slice
○ → mkdir -p /sys/fs/cgroup/cpuset/system.slice
○ → mkdir -p /sys/fs/cgroup/systemd/system.slice
○ → systemctl restart kubelet
After everything has been executed successfully, we can use cgroup commands to check if the system-reserved resources have been set to the corresponding cgroup. Let’s review the table from earlier:
| Type    | System-Reserved | Kube-Reserved | EvictionHard | Total  |
| ------- | --------------- | ------------- | ------------ | ------ |
| CPU     | 250m            | 500m          | 0            | 750m   |
| Memory  | 500Mi           | 1Gi           | 200Mi        | 1724Mi |
| Storage | 0               | 0             | 20Gi         | 20Gi   |
The SystemReserved CPU is 250m and the memory is 500Mi.
○ → cat /sys/fs/cgroup/cpu/system.slice/cpu.shares
256
○ → cat /sys/fs/cgroup/memory/system.slice/memory.limit_in_bytes
524288000
Here the cpu.shares value is 256; since 1024 shares correspond to one full CPU (1000m), 256/1024 is 25%, or 250m. The memory limit is in bytes, and 524288000/1024/1024 = 500Mi.
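If you want to do the conversion on the node itself, a quick sketch:

# 1024 cpu.shares correspond to one full CPU (1000m); integer math is close enough here
echo "$(( $(cat /sys/fs/cgroup/cpu/system.slice/cpu.shares) * 1000 / 1024 ))m"    # prints 250m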
Summary
After all the above, let’s go back to the three questions:
- What happens if system applications or Kubernetes-related applications use more system resources than the set (system-reserved, kube-reserved)?
- What kinds of applications belong to system-reserved? What kinds of applications belong to kube-reserved?
- Can self-developed applications be added to one of these categories?
The answers are:
- It depends on whether you have used enforce-node-allocatable to ask the kubelet to set up cgroup limits so that the kernel controls resource usage.
- Use the system-reserved-cgroup and kube-reserved-cgroup parameters to specify the cgroup paths.
- Add your application to the corresponding cgroup (see the sketch below), and remember that different resources (subsystems) use different paths.
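For example, one way to put a self-developed daemon into the system-reserved group is to start it under the slice whose cgroup path kubelet enforces; here is a sketch using systemd-run, with a hypothetical unit name and binary path:

sudo systemd-run --slice=system.slice --unit=my-agent.service /usr/local/bin/my-agent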
Finally, if you are interested in these settings and believe they can give you better control over system resource usage, be sure to test them and to understand the relevant cgroup concepts first, so that you are not completely lost when debugging later.
In addition, before setting system-reserved and kube-reserved, observe the actual usage of those applications over a long period with a monitoring system so that you have a realistic baseline. And if you do enable enforcement for these two categories, be mentally prepared for the possibility of those applications being killed due to OOM, so that you are not caught off guard when problems occur later.