Why does my 2vCPU application run faster in a VM than in a container?
With the rise and popularity of Kubernetes and container concepts, many teams not only develop and deploy new applications directly to Kubernetes but also need to migrate existing services to containers. These existing services could be deployed either on bare metal or virtual machines (VMs).
Containers emphasize “Build Once, Run Everywhere,” allowing development and operations teams to manage applications in a more lightweight and systematic way. However, it is common to observe that after moving applications as-is to Kubernetes, their performance does not meet expectations.
This article explores, from the CPU perspective, why moving services from VMs to the Kubernetes (container) world may run into certain issues and how those issues can lead to performance degradation. Performance bottlenecks in the application itself, such as network I/O or disk I/O, are beyond the scope of this article.
This article aims to address potential issues that may arise when transitioning your code from the VM world to the Kubernetes container environment. The actual performance can be influenced by various factors, with overcommitment being one of them.
CPU Throttling
Background
Traditionally, when deploying services using VMs, it was common to request a fixed amount of system resources, such as a VM with specifications like 4 vCPUs and 16 GB of RAM. Users could then connect to the VM and install and run the required applications.
However, in the Kubernetes world, the deployment of containerized applications is more dynamic. To avoid impacting other containers and exhausting node resources, it is essential to configure resources.requests and resources.limits appropriately when deploying Pods.
When an application’s resource usage reaches the configured limit, corresponding actions are taken: for CPU, this means CPU throttling; for memory, it triggers the Out-Of-Memory (OOM) Killer.
When application developers configure resources.limits without considering the program’s operational logic, CPU throttling is easily triggered. This often manifests as unexpected latency at percentiles like p95 or p99, or as performance that degrades as more threads are created.
In contrast, VMs require no such resource configuration, so CPU throttling is never triggered and this kind of degradation does not occur. Next, let’s delve into what CPU throttling is and why triggering it can result in lower-than-expected performance.
What Is CPU Throttling
Readers familiar with the Kubernetes documentation have likely encountered the fact that Kubernetes manages container runtime resources through cgroups. To understand CPU throttling, it’s crucial to grasp the practical relationship between cgroups and Kubernetes’ resources.requests/limits.
CPU Request
In Kubernetes, CPU Requests serve two main purposes:
- Kubernetes aggregates the Requests of all containers in a Pod, using this value to filter nodes that don’t have sufficient resources for scheduling. This aspect is primarily related to scheduling.
- The Linux kernel’s CFS (Completely Fair Scheduler) uses the request to determine how much CPU time is assigned to the target container.
Let’s consider the deployment of a service with three containers, each with CPU Requests of 250m, 250m, and 300m.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: www-deployment-resource
spec:
  replicas: 1
  selector:
    matchLabels:
      app: www-resource
  template:
    metadata:
      labels:
        app: www-resource
    spec:
      containers:
        - name: www-server
          image: hwchiu/python-example
          resources:
            requests:
              cpu: "250m"
        - name: app
          image: hwchiu/netutils
          resources:
            requests:
              cpu: "250m"
        - name: app2
          image: hwchiu/netutils
          resources:
            requests:
              cpu: "300m"
Assuming the convention that 1 vCPU corresponds to 1000ms of CPU time per second, the usage can be visualized as follows:
However, CPU operations don’t occur in units of 1 second (1000ms). During CFS operation, the time for each run is determined by the cgroup parameter cfs_period_us, which defaults to 100ms.
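As a quick check, the default period can be read straight from the cgroup filesystem (a hedged example; the exact path depends on whether the host uses cgroup v1 or v2, and <group> is a placeholder):

cat /sys/fs/cgroup/cpu/cpu.cfs_period_us   # cgroup v1: prints 100000 (µs) by default
cat /sys/fs/cgroup/<group>/cpu.max         # cgroup v2: the second field is the period in µs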
So, the above settings translated into actual runtime conditions are more like:
But CPU resources are inherently competitive. How can we ensure that these applications have at least the requested time in each period?
CFS uses the setting of CPU shares to ensure that applications can use the requested time within each period. This can be seen as the lower limit of runtime. The remaining time is left for the applications to compete for.
In the example above, the requested time for each application within a 100ms period is 25ms, 25ms, and 30ms respectively. The remaining 20ms is then contested among them, so in practice various combinations are possible, such as:
Regardless of the scenario, the CPU share requirements are met, fulfilling the minimum usage conditions.
In environments using cgroup v2, the underlying implementation uses cpu.weight.
ubuntu@hwchiu:/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podacdcc83d_4cca_4271_9145_7af6c44b1858.slice$ cat cri-containerd-*/cpu.weight
12
10
10
ubuntu@hwchiu:/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podacdcc83d_4cca_4271_9145_7af6c44b1858.slice$ cat cpu.weight
32
In the above example, CPU weights are calculated proportionally as 25:25:30, which translates to 10:10:12. The total weight is 32.
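As a rough sketch of where these numbers come from (assuming the conversion commonly used by the container runtime: a request in millicores first becomes cgroup v1 shares as millicores × 1024 / 1000, which is then mapped onto the cgroup v2 weight range), the values above can be reproduced with integer arithmetic:

# Hedged sketch: derive cpu.weight from the CPU requests.
# shares = millicores * 1024 / 1000 ; weight = 1 + (shares - 2) * 9999 / 262142
for millicores in 250 250 300 800; do      # 800m is the Pod-level sum of requests
  shares=$(( millicores * 1024 / 1000 ))
  weight=$(( 1 + (shares - 2) * 9999 / 262142 ))
  echo "${millicores}m -> shares=${shares}, cpu.weight=${weight}"
done
# 250m -> shares=256, cpu.weight=10
# 250m -> shares=256, cpu.weight=10
# 300m -> shares=307, cpu.weight=12
# 800m -> shares=819, cpu.weight=32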
CPU Limit
CPU limit in Kubernetes is described as an upper bound on CPU usage. When the CPU usage reaches this level, it triggers CPU throttling, acting as a ceiling for CPU usage.
Modifying the above YAML to the following example, each application is given a corresponding limit:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-limit
spec:
  replicas: 1
  selector:
    matchLabels:
      app: www-resource
  template:
    metadata:
      labels:
        app: www-resource
    spec:
      containers:
        - name: www-server
          image: hwchiu/python-example
          resources:
            requests:
              cpu: "250m"
            limits:
              cpu: "300m"
        - name: www-server2
          image: hwchiu/netutils
          resources:
            requests:
              cpu: "250m"
            limits:
              cpu: "300m"
        - name: www-server3
          image: hwchiu/netutils
          resources:
            requests:
              cpu: "300m"
            limits:
              cpu: "350m"
Considering a 100ms period, the distribution of CPU usage among the applications looks like:
When an application exceeds its allocated time, it enters a throttled state, unable to continue using the CPU. In the example given, assuming the three applications want to compete for excess CPU, the result may look like:
Ultimately, the CPU has 5ms of idle time in each period, because all three applications are throttled and none of them has quota left to keep running.
In cgroup v1, CPU quota (cfs_quota_us, cfs_period_us) is used to specify the allowed running time, while cgroup v2 uses cpu.max for the same purpose.
ubuntu@hwchiu:/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podacdcc83d_4cca_4271_9145_7af6c44b1858.slice$ cat cri-containerd-*/cpu.max
35000 100000
30000 100000
30000 100000
ubuntu@hwchiu:/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podacdcc83d_4cca_4271_9145_7af6c44b1858.slice$ cat cpu.max
95000 100000
In the above example, the three containers are part of the same Kubernetes Pod, with cpu.max quotas of 35000µs, 30000µs, and 30000µs respectively, out of a 100000µs (100ms) period. The Pod’s cpu.max is the sum of all container quotas, 95000µs.
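A hedged sketch of the arithmetic: each quota is simply the CPU limit in millicores scaled to the 100000µs period (quota = limit × period / 1000):

# Hedged sketch: derive the cpu.max quotas (µs) from the CPU limits.
period_us=100000
for limit_millicores in 350 300 300 950; do   # 950m is the sum of the three limits
  echo "${limit_millicores}m -> $(( limit_millicores * period_us / 1000 )) ${period_us}"
done
# 350m -> 35000 100000
# 300m -> 30000 100000
# 300m -> 30000 100000
# 950m -> 95000 100000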
With this understanding, it becomes clear that CPU throttling limits how much CPU time an application can use within each period, so the application can be held back even when the node has idle CPU at that moment.
Thread
CPU throttling seems like a reasonable and normal behavior, but when can it lead to service performance issues, including the previously mentioned poor performance in p95, p99?
CPU usage is accounted for at the level of the container’s cgroup, so if the process spawns multiple threads, the quota is consumed by the sum of all of its threads.
Consider deploying an application with a limit of 100ms, but its CPU workload only takes 50ms to complete:
There are no issues with basic usage. Now, if you suddenly decide to increase the number of threads and run it with two threads:
Note: The IDLE time could be consumed by other processes. However, for the purpose of this example, I assume there are no other processes in order to concentrate on the main process performance issue.
Since each thread only requires 50ms, the total is still 100ms, not exceeding the limit (quota). Therefore, there are no issues with usage.
However, what happens when you open 3 threads? (Assuming the system has at least three vCPUs available)
As the total quota is only 100ms, with three threads competing, each thread can only use about 33ms on average. This triggers throttling, and the threads can do nothing during the remaining 67ms of the period.
So, all threads must wait until the second CPU period to continue running the required tasks:
In the second period, each thread needs about 17ms, and although all three threads eventually complete successfully, each thread took 117ms (50ms of work and 67ms of idle time) from start to finish.
If you change the example to 8 threads, the situation becomes even more severe:
In this case, each thread can only use 12.5ms per CPU period, requiring four periods to complete its 50ms of work. Thus, what initially took 50ms now takes 312.5ms, leading to potential client request timeouts or a significant increase in latency.
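A hedged back-of-the-envelope check of these numbers, assuming a 100ms quota per 100ms period, 50ms of CPU work per thread, and enough idle vCPUs for every thread to run in parallel:

# Hedged sketch: completion time vs. thread count under a fixed CFS quota.
awk 'BEGIN {
  quota = 100; period = 100; work = 50;          # all values in ms
  n = split("1 2 3 8", threads, " ");
  for (i = 1; i <= n; i++) {
    per_period = quota / threads[i];             # CPU time each thread gets per period
    periods = int(work / per_period);            # full periods of running needed
    if (periods * per_period < work) periods++;  # round up
    done = (periods - 1) * period + (work - (periods - 1) * per_period);
    printf "%d thread(s): finishes at %.1f ms\n", threads[i], done;
  }
}'
# 1 thread(s): finishes at 50.0 ms
# 2 thread(s): finishes at 50.0 ms
# 3 thread(s): finishes at 116.7 ms   (~117ms with the rounded 33ms/17ms split above)
# 8 thread(s): finishes at 312.5 ms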
Incorrect thread counts and CPU limits can cause the application to trigger CPU throttling extensively. While the CPU eventually completes all the work, each job takes longer, and the increased system busyness shows up as degraded p95 and p99 latency.
How To Avoid
There are at least two ways to avoid CPU throttling:
- Increase the CPU limit.
- Review the application’s thread usage for correctness.
Increasing the CPU limit is the most direct approach, but determining an appropriate value is not straightforward. Generally, it is recommended to use a monitoring system to observe the data and evaluate a suitable value. Keep in mind that for Guaranteed Pod QoS, the limit must equal the request.
Regarding the number of threads, it depends on the application developer’s attention. Thread count is typically set in two ways:
- Manually setting the required quantity.
- Automatic detection by the programming language or framework.
For (1), developers need to keep the thread count in sync with the Kubernetes resource settings. For example, opening more threads to handle load may require raising requests.cpu and limits.cpu to avoid more frequent CPU throttling.
As for (2), automatic detection relies on the developer’s familiarity with the language and framework in use. For example, running a container (request: 2 vCPU, limit: 4 vCPU) on a server with 128 vCPUs raises the question of what these languages or frameworks base their sizing on: the node’s 128 vCPUs or the container’s own request/limit.
Taking Java as an example: starting from version 8u131, the JVM can detect the resources available to it inside a container. Without this capability, older Java applications would size thread pools based on the physical machine (128 vCPUs), making them prone to triggering throttling.
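A hedged way to see this mismatch yourself (assuming a local Docker daemon on a cgroup v2 host with the default private cgroup namespace; the image is just an example):

# The CFS quota that Docker enforces for --cpus=2:
docker run --rm --cpus=2 ubuntu:22.04 cat /sys/fs/cgroup/cpu.max
# 200000 100000   -> 200ms of CPU per 100ms period, i.e. "2 CPUs"

# What a naive tool sees: nproc follows CPU affinity, not the quota,
# so on a 128-vCPU host it still reports 128.
docker run --rm --cpus=2 ubuntu:22.04 nproc

A runtime that is not container-aware sizes its thread pools from the larger number, which is exactly how the throttling described above gets triggered.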
Monitor
If Prometheus is installed in the environment, three metrics can be used to observe whether the application encounters CPU throttling:
- container_cpu_cfs_throttled_seconds_total
- container_cpu_cfs_periods_total
- container_cpu_cfs_throttled_periods_total
The container_cpu_cfs_throttled_seconds_total metric records, in seconds, how long the target container has been throttled. It is a Counter, so functions like rate are needed to observe its variation. Be cautious: this metric reports the total throttled time, so if the application uses a high thread count, the reported seconds can be very large and require careful interpretation.
Therefore, it is often more practical to use the other two metrics. container_cpu_cfs_periods_total is a Counter representing the cumulative number of CFS periods (typically 100ms each) the container has experienced so far, while container_cpu_cfs_throttled_periods_total counts only the periods in which the container was throttled.
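These metrics are ultimately read from the container cgroup’s cpu.stat file (cgroup v2 shown; the path and numbers below are illustrative):

cat /sys/fs/cgroup/kubepods.slice/<pod-slice>/<container-scope>/cpu.stat
# nr_periods 1200        -> container_cpu_cfs_periods_total
# nr_throttled 400       -> container_cpu_cfs_throttled_periods_total
# throttled_usec 9000000 -> container_cpu_cfs_throttled_seconds_total (exported in seconds)
# (other fields omitted)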
The throttling percentage can then be calculated as container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total. In the example below, the application runs for five periods and CPU throttling occurs in four of them, so the percentage is 4/5 = 80%.
Note that these Counters accumulate everything since the container started, whereas what usually matters is the current behavior. Expressions like increase are therefore used to compute the difference over a time window before converting it to a percentage.
For example:
sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace)
/
sum(increase(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace)
Exclusive CPU
Background
Another factor influencing CPU performance between Containers and VMs is the efficiency of CPU utilization. Consider the following scenario: there is a server with 16 vCPUs, and a service that needs to use 4 vCPUs with 4 threads is to be deployed. What differences might arise when running this service in a Container versus a VM?
The diagram below represents a server with 16 vCPUs, with each circle representing a vCPU.
For a VM, a common example is allocating a fixed 4 vCPUs dedicated to the VM. The service will then be scheduled on these 4 vCPUs by the VM’s kernel.
However, for Containers, resources are shared, and the concept of CPU (request) is to meet the usage of 4 vCPUs in each period. In practice, the specific CPUs used are not guaranteed. Therefore, for an application with 4 threads, the CPU distribution during execution might exhibit various patterns that change over time.
Why does CPU distribution affect performance? To understand this, let’s return to the basic concept of CPU. Why do we typically talk about CPUs, but in the cloud, we use the term vCPU (Virtual CPU)?
With the advancement of technology and the introduction of Hyper-Threading, each physical core can execute multiple hardware threads, which are exposed as virtual CPUs (vCPUs). Conventionally, a physical CPU is referred to as a core, and a socket holds many cores.
For a given machine, you can use the lscpu command to view details about the CPU.
ubuntu@blog-test:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 57 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
In the above example, you can see:
- There is 1 socket.
- Each socket has 16 cores.
- Each core has 2 threads.
So, the total calculated vCPUs are 1 * 16 * 2 = 32.
The following command is from a different machine:
ubuntu@hwchiu:~$ lscpu
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-15
NUMA node1 CPU(s): 16-31
Although the total is also 2 * 8 * 2 = 32, the underlying architecture is clearly different.
Due to the Socket and Core architecture, both of these scenarios can represent a server with 32 vCPUs.
In addition to the concepts mentioned above, there is also the NUMA (Non-Uniform Memory Access) architecture. NUMA divides the server into multiple NUMA nodes, each containing multiple cores (physical CPUs) that share the same memory controller. Memory access is therefore fastest between cores within the same NUMA node, while crossing NUMA nodes incurs some performance loss.
The first example above represents a single NUMA node, while the second example has two NUMA nodes: CPU numbers 0–15 belong to the first node and 16–31 to the second. You can also use numactl to inspect this:
ubuntu@hwchiu:~$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 16058 MB
node 0 free: 14530 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 16075 MB
node 1 free: 14284 MB
Combining this with cat /proc/cpuinfo to view the relationship between each CPU number and its core, you can roughly draw the following diagram (a quicker way to obtain the same mapping is shown below):
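A quicker, hedged alternative to parsing /proc/cpuinfo by hand is lscpu’s extended output, which prints one row per logical CPU together with its core, socket, and NUMA node:

lscpu --extended=CPU,CORE,SOCKET,NODE
# CPU CORE SOCKET NODE
# ...one row per logical CPU; sibling hardware threads share the same CORE value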
In addition to the basic concepts mentioned earlier, CPU architecture has different levels of Cache between different blocks. Adding these caches to the architecture diagram:
With these basic concepts, let’s revisit the initial question: why does the CPU’s distribution affect performance? The reason is Cache + Context Switch (scheduling across different CPUs).
When vCPUs running simultaneously need to access data from each other, where is this data stored? Is there a cache between them? If these running vCPUs span different sockets or even different NUMA nodes, accessing this information may result in Cache Miss. Moreover, waiting for the next CPU period might lead to being scheduled on a different core, causing a Context Switch that can lower performance.
For most applications, this might not be noticeable, but for high-performance-demanding applications, avoiding the mentioned issues can slightly improve system performance.
Implementation
Linux has long had a tool called taskset. According to the man page:
The taskset command is used to set or retrieve the CPU affinity of a running process given its pid, or to launch a new command with a given CPU affinity. CPU affinity is a scheduler property that “bonds” a process to a given set of CPUs on the system. The Linux scheduler will honor the given CPU affinity, and the process will not run on any other.
So, for applications on bare-metal servers, system administrators have traditionally used taskset to bind processes to specific CPUs and improve performance. But how is this handled for containers and Kubernetes?
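A few hedged taskset examples (the PID and CPU list are purely illustrative):

taskset -c 0-3 ./my-server   # launch a process pinned to vCPUs 0-3
taskset -cp 12345            # show the current CPU affinity of PID 12345
taskset -cp 0-3 12345        # re-pin the already-running PID 12345 to vCPUs 0-3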
In Docker, the official documentation explains that you can use --cpuset-cpus to specify the CPU numbers a container may use:
--cpuset-cpus=0-2   # Limit the specific CPUs or cores a container can use.
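For example (a hedged illustration; the image is arbitrary), pinning a container to three vCPUs and checking what it sees:

docker run --rm --cpuset-cpus="0-2" ubuntu:22.04 nproc
# 3   -> the cpuset restricts the container’s CPU affinity, so nproc reflects it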
Kubernetes noticed this issue early on and introduced the “CPU Manager” feature from Kubernetes v1.8. This feature became Beta in v1.10 and officially GA in v1.26.
The CPU Manager supports two policies, configured on the kubelet:
- None
- Static
Since the configuration is based on the Kubelet, if you need to enable this feature, you must set it through the Kubelet’s startup parameters. Also, be aware of differences between different Kubernetes versions; refer to the official documentation to confirm which features need specific feature gates opened.
The “None” setting means no action is taken, and the CPU Manager won’t perform any special operations. This is the default value in Kubernetes.
The “Static” setting allows certain containers to run on dedicated vCPUs. This is essentially accomplished through the cpuset cgroup controller.
Conditions for containers to be considered:
- QoS is Guaranteed.
- The number of CPUs is an integer.
Since the setting grants exclusive vCPUs, it demands integer units; reserving 0.5 vCPU of an exclusively allocated vCPU would clearly waste system resources. Additionally, while exclusive CPUs can improve container performance, they waste vCPU resources if the service doesn’t fully utilize them. Therefore, the feature is not applied to all Pods but only to Pods with Guaranteed QoS.
When set to Static, there are additional features that can be enabled:
- full-pcpus-only
- distribute-cpus-across-numa
- align-by-socket
For example, the purpose of (1), full-pcpus-only, is to allocate whole physical cores (pCPUs) to a container rather than individual vCPUs. The goal is to reduce the noisy-neighbor problem by avoiding performance loss caused by whatever runs on the sibling hardware thread.
For instance, without full-pcpus-only, vCPUs are exclusively reserved, but two different containers may still land on the two hardware threads of the same core. Since they run different workloads, the shared L1/L2 cache is more likely to suffer cache misses.
After enabling full-pcpus-only, a container reserves entire physical cores, so it has the core’s cache to itself, which can improve performance. However, if the application only needs one vCPU, a full core is still allocated; exclusively reserving two vCPUs while using only one wastes resources. Careful attention to the underlying architecture and precise configuration is therefore necessary.
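A hedged sketch of what this can look like on the kubelet side (the file path, reserved CPUs, and values are illustrative; consult the documentation for your Kubernetes version):

cat /var/lib/kubelet/config.yaml     # illustrative path; often managed by your installer
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  full-pcpus-only: "true"
reservedSystemCPUs: "0,1"            # CPUs kept back for system daemons

On the Pod side, only containers in Guaranteed Pods with an integer CPU count (for example, requests and limits both set to cpu: "4") receive exclusive vCPUs; everything else keeps using the shared pool. Note that changing the policy on an existing node may also require clearing the kubelet’s cpu_manager_state file and restarting the kubelet.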
Summary
When migrating applications from VMs to Kubernetes, performance issues that fall short of expectations are common. Without understanding the details, the simplest approach is to increase various resources or deploy more replicas. While these methods might alleviate symptoms, they provide limited improvement if the root problem isn’t addressed. Understanding every setting and detail inside Containers and VMs allows for a different perspective to address performance bottlenecks in the future.
Additionally, CPU throttling itself is not infallible. There have been kernel bugs that made CFS throttling imprecise, causing applications to be throttled unnecessarily. When problems occur, it is worth checking whether the running kernel has such known bugs.
Reference
1. https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3570-cpumanager/README.md
2. https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2625-cpumanager-policies-thread-placement
3. https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2902-cpumanager-distribute-cpus-policy-option
4. https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/108
5. https://www.datadoghq.com/blog/kubernetes-cpu-requests-limits/
6. https://www.cnblogs.com/charlieroro/p/17074808.html