Tricky memory management for Java applications on Kubernetes

How to use Kubernetes memory requests and limits together with the JVM heap, and stay out of trouble.

Running Java applications in a container environment requires an understanding of both JVM memory mechanics and Kubernetes memory management. When the two work well together, the result is a stable application; misconfiguration, however, leads to infrastructure overspend at best and unstable or crashing applications at worst. We'll first take a closer look at how JVM memory works, then move on to Kubernetes, and finally put the two concepts together.

Introduction to the JVM memory model

JVM memory management is a highly complex mechanism that has been refined continuously over successive releases and is one of the strengths of the platform. For this article, we'll cover only the basics relevant to the topic. At a high level, JVM memory consists of two spaces: Heap and Metaspace.

Non-heap memory

The JVM uses many memory regions besides the heap. The most notable is Metaspace, which serves several purposes. It primarily acts as the method area, where the application's class structures and method definitions are stored, including those of the standard library. String and constant pools hold immutable objects, such as interned strings and class constants. The stack area is a last-in, first-out structure used for thread execution, holding primitives and references to objects passed to methods. Depending on the JVM implementation and version, some details of what each of these spaces is used for may vary.

I like to think of the Metaspace space as a management area. The size of this space can range from a few to several hundred megabytes, depending on the size of the codebase and its dependencies, and remains virtually constant throughout the life of the application. By default, this space is unbounded and expands as the application needs it.
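If you want to put an explicit ceiling on this area, the JVM offers the -XX:MaxMetaspaceSize flag; for example (the 256M value here is purely illustrative):

~ java -XX:MaxMetaspaceSize=256M -jar app/build/libs/app.jar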

Metaspace was introduced in Java 8 to replace Permanent Generation, which had garbage collection issues.

Other non-heap memory areas worth mentioning are the code cache, thread stacks, and garbage collection structures. For more on non-heap memory, refer here.

Heap memory

If Metaspace is the management space, then Heap is the action space. This is where all instance objects are stored, and garbage collection is most active here. The size of this memory varies from application to application and depends on the workload: the memory needed to satisfy an individual request and the traffic characteristics. Large applications often have heap sizes measured in gigabytes.

We'll use a sample application to explore memory mechanisms. The source code is here.

This demo application simulates a real-world scenario in which a system serving incoming requests accumulates objects on the heap, and those objects become candidates for garbage collection once each request completes. At its core, the program is an infinite loop that fills the heap by adding large objects to a list and periodically clearing that list.

// HEAP_TO_FILL, INCREMENTS_IN_MB and BYTES_TO_MB are constants defined in the sample project
val list = mutableListOf<ByteArray>()

generateSequence(0) { it + 1 }.forEach {
    // Once roughly HEAP_TO_FILL megabytes have accumulated, drop the references ("clear the state")
    if (it % (HEAP_TO_FILL / INCREMENTS_IN_MB) == 0) list.clear()
    // Allocate another INCREMENTS_IN_MB-sized byte array on the heap
    list.add(ByteArray(INCREMENTS_IN_MB * BYTES_TO_MB))
}

The following is the output of the application. At a preset interval (in this example, at about 350 MB of heap used), the state is cleared. It is important to understand that clearing the state does not empty the heap; it is the garbage collector's internal implementation that decides when to actually evict objects from memory. Let's run this application with a few different heap settings to see how they affect JVM behavior.

First, we'll use a maximum heap size of 4 GB (controlled by the -Xmx flag).

~ java -jar -Xmx4G app/build/libs/app.jar

INFO           Used          Free            Total
INFO       14.00 MB      36.00 MB       50.00 MB
INFO       66.00 MB      16.00 MB       82.00 MB
INFO      118.00 MB     436.00 MB      554.00 MB
INFO      171.00 MB     383.00 MB      554.00 MB
INFO      223.00 MB     331.00 MB      554.00 MB
INFO      274.00 MB     280.00 MB      554.00 MB
INFO      326.00 MB     228.00 MB      554.00 MB
INFO  State cleared at ~ 350 MB.
INFO           Used          Free            Total
INFO      378.00 MB     176.00 MB      554.00 MB
INFO      430.00 MB     208.00 MB      638.00 MB
INFO      482.00 MB     156.00 MB      638.00 MB
INFO      534.00 MB     104.00 MB      638.00 MB
INFO      586.00 MB      52.00 MB      638.00 MB
INFO      638.00 MB      16.00 MB      654.00 MB
INFO      690.00 MB      16.00 MB      706.00 MB
INFO  State cleared at ~ 350 MB.
INFO           Used          Free            Total
INFO      742.00 MB      16.00 MB      758.00 MB
INFO      794.00 MB      16.00 MB      810.00 MB
INFO      846.00 MB      16.00 MB      862.00 MB
INFO      899.00 MB      15.00 MB      914.00 MB
INFO      951.00 MB      15.00 MB      966.00 MB
INFO     1003.00 MB      15.00 MB     1018.00 MB
INFO     1055.00 MB      15.00 MB     1070.00 MB
...
...           

Interestingly, although the state has been cleared and is ready for garbage collection, you can see that the memory used (the first column) keeps growing. Why is that? Because the heap has plenty of room to expand, the JVM postpones garbage collection, which typically costs significant CPU, and optimizes for serving the main thread instead. Let's see how different heap sizes affect this behavior.
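If you want to observe when collections actually run, rather than inferring it from the used/free columns, you can, for example, enable GC logging with the unified logging flag available since JDK 9 (shown here with the same heap setting):

~ java -Xlog:gc -Xmx4G -jar app/build/libs/app.jar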

~ java -jar -Xmx380M app/build/libs/app.jar

INFO           Used          Free            Total
INFO       19.00 MB     357.00 MB      376.00 MB
INFO       70.00 MB     306.00 MB      376.00 MB
INFO      121.00 MB     255.00 MB      376.00 MB
INFO      172.00 MB     204.00 MB      376.00 MB
INFO      208.00 MB     168.00 MB      376.00 MB
INFO      259.00 MB     117.00 MB      376.00 MB
INFO      310.00 MB      66.00 MB      376.00 MB
INFO  State cleared at ~ 350 MB.
INFO           Used          Free            Total
INFO       55.00 MB     321.00 MB      376.00 MB
INFO      106.00 MB     270.00 MB      376.00 MB
INFO      157.00 MB     219.00 MB      376.00 MB
INFO      208.00 MB     168.00 MB      376.00 MB
INFO      259.00 MB     117.00 MB      376.00 MB
INFO      310.00 MB      66.00 MB      376.00 MB
INFO      361.00 MB      15.00 MB      376.00 MB
INFO  State cleared at ~ 350 MB.
INFO           Used          Free            Total
INFO       55.00 MB     321.00 MB      376.00 MB
INFO      106.00 MB     270.00 MB      376.00 MB
INFO      157.00 MB     219.00 MB      376.00 MB
INFO      208.00 MB     168.00 MB      376.00 MB
INFO      259.00 MB     117.00 MB      376.00 MB
INFO      310.00 MB      66.00 MB      376.00 MB
INFO      361.00 MB      15.00 MB      376.00 MB
INFO  State cleared at ~ 350 MB.
INFO           Used          Free            Total
INFO       55.00 MB     321.00 MB      376.00 MB
INFO      106.00 MB     270.00 MB      376.00 MB
INFO      157.00 MB     219.00 MB      376.00 MB
INFO      208.00 MB     168.00 MB      376.00 MB
...
...           

In this case, we allocated just enough heap (380 MB) to handle the workload. We can see that under this constraint, GC kicks in immediately to avoid the dreaded out-of-memory error. That is the promise of the JVM: it will always attempt garbage collection before failing due to insufficient memory. For completeness, let's take a look at what that failure looks like:

~ java -jar -Xmx150M app/build/libs/app.jar

INFO           Used          Free            Total
INFO       19.00 MB     133.00 MB      152.00 MB
INFO       70.00 MB      82.00 MB      152.00 MB
INFO      106.00 MB      46.00 MB      152.00 MB
Exception in thread "main"
...
...
Caused by: java.lang.OutOfMemoryError: Java heap space
 at com.dansiwiec.HeapDestroyerKt.blowHeap(HeapDestroyer.kt:28)
 at com.dansiwiec.HeapDestroyerKt.main(HeapDestroyer.kt:18)
 ... 8 more           

With a maximum heap size of 150 MB, the process cannot handle a workload that needs roughly 350 MB and fails once the heap fills up, though not before the garbage collector has attempted to salvage the situation.

Let's also look at the size of Metaspace. For this, we'll use jstat (output trimmed to the MU column, Metaspace utilization in KB):

~ jstat -gc 35118

MU
4731.0           

The output indicates that Metaspace utilization is approximately 5 MB. Keeping in mind that Metaspace is responsible for storing class definitions, let's, as an experiment, add the popular Spring Boot framework to our application.
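In a Gradle build this is, for example, a one-line dependency change (coordinates shown for illustration; the demo project's actual build setup may differ):

implementation("org.springframework.boot:spring-boot-starter-web")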

~ jstat -gc 34643

MU
28198.6           

Metaspace jumped to almost 30 MB, because the class loader now has far more classes to store. For larger applications, it is not uncommon for this space to exceed 100 MB. Now let's move on to the Kubernetes side.

Kubernetes memory management

Kubernetes memory control operates at the operating-system level, in contrast to the JVM, which manages the memory allocated to it. The goal of the Kubernetes memory management mechanism is to ensure that workloads are scheduled onto nodes with sufficient resources and are kept within defined limits.

When defining a workload, users can set two parameters: requests and limits. These are defined at the container level, but for simplicity we'll think of them as pod-level parameters, which are simply the sum of the container settings.

When a pod is created, kube-scheduler, a control-plane component, looks at the resource requests and selects a node with enough resources to host the pod. Once scheduled, a pod is allowed to exceed its memory request (as long as the node has free memory) but never its limit.

The kubelet (the node agent that drives the container runtime) monitors the memory utilization of pods; if a pod exceeds its memory limit, the kubelet restarts it, or evicts it from the node entirely if the node runs low on resources (see the official documentation on this topic for more details). This results in the infamous OOMKilled (out of memory) pod state.

An interesting scenario arises when a pod stays within its limits but exceeds the node's available memory. This is possible because the scheduler looks at the pod's requests (rather than its limits) when placing it on a node. In that case, the kubelet performs a process called node-pressure eviction. In short, the pod is terminated in order to reclaim resources on the node. Depending on how severe the resource pressure on the node is, the eviction can be soft (allowing the pod to terminate gracefully) or hard. This scenario is shown in the figure below.

There is much more to learn about the inner workings of eviction; for more information on this complex process, click here. Let's stop here and look at how the two mechanisms, JVM memory management and Kubernetes memory management, work together.

JVM and Kubernetes

Java 10 introduced a new JVM flag, -XX:+UseContainerSupport (enabled by default), that allows the JVM to detect the available memory and CPU when it runs inside a resource-constrained container. The flag is used together with -XX:MaxRAMPercentage, which lets us set the maximum heap size as a percentage of the total available memory. In Kubernetes, the limits setting on the container is the basis for this calculation. For example, if a pod has a 2 GB limit and MaxRAMPercentage is set to 75%, the resulting maximum heap size is roughly 1.5 GB.
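A quick way to check what the JVM actually computes is to print the final flag values from inside the container, for example:

~ java -XX:MaxRAMPercentage=75.0 -XX:+PrintFlagsFinal -version | grep -i maxheapsize

The MaxHeapSize value printed there should reflect the container's memory limit rather than the host's total memory.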

This requires some care because, as we saw earlier, a Java application's overall memory footprint is larger than its heap (there is also Metaspace, thread stacks, garbage collection structures, APM agents, and so on). This means you need to balance the maximum heap size, the non-heap memory usage, and the pod limit. Specifically, the sum of the first two must not exceed the last, or the result is OOMKilled (see the previous section).

To see both mechanisms in action, we'll use the same example project, but this time we'll deploy it on a (local) Kubernetes cluster. To run the application on Kubernetes, we define it as a pod:

apiVersion: v1
kind: Pod
metadata:
  name: heapkiller
spec:
  containers:
    - name: heapkiller
      image: heapkiller
      imagePullPolicy: Never
      resources:
        requests:
          memory: "500Mi"
          cpu: "500m"
        limits:
          memory: "500Mi"
          cpu: "500m"
      env:
        - name: JAVA_TOOL_OPTIONS
          value: '-XX:MaxRAMPercentage=70.0'           

A quick refresher from the first part: we determined that the application requires at least 380 MB of heap memory to function properly.

Scenario 1 — Java Out Of Memory error

Let's first recap the parameters we can manipulate: the pod's memory requests and limits, and Java's maximum heap size, which in our case is controlled by the MaxRAMPercentage flag.

In the first case, we allocate 70% of the total memory to the heap. Both pod requests and limits are set to 500MB, which results in a maximum heap of 350MB (70% of 500MB).

We execute kubectl apply -f pod.yaml to deploy the pod, and then observe the logs with kubectl logs -f pod/heapkiller. Shortly after the application starts, we see the following output:

INFO  Started HeapDestroyerKt in 5.599 seconds (JVM running for 6.912)
INFO           Used          Free            Total
INFO       17.00 MB       5.00 MB       22.00 MB
...
INFO      260.00 MB      78.00 MB      338.00 MB
...
Exception in thread "main" java.lang.reflect.InvocationTargetException
Caused by: java.lang.OutOfMemoryError: Java heap space           

If we do kubectl describe pod/heapkiller to pull out the pod details, we will find the following information:

Containers:
  heapkiller:
    ....
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
...
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
...
  Warning  BackOff    7s (x7 over 89s)   kubelet            Back-off restarting failed container           

In short, this means that the pod exited with status code 1 (the JVM's exit code after the uncaught OutOfMemoryError), and Kubernetes keeps restarting it with the standard backoff strategy (exponentially increasing the pause between restarts). The following diagram depicts this scenario.

The key takeaway in this case is that if Java fails with an OutOfMemory error, you'll see it in the pod logs.

Scenario 2 — Pod exceeds memory limit

To trigger this scenario, our Java application needs more memory, so we'll increase MaxRAMPercentage from 70% to 90% and see what happens. We follow the same steps as before and look at the logs. The application runs fine for a while:

...
...
INFO      323.00 MB      83.00 MB      406.00 MB
INFO      333.00 MB      73.00 MB      406.00 MB           

And then... bang. No more logs. We run the same describe command as before to get the details of the pod state.

Containers:
  heapkiller:
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
Events:
  Type     Reason     Age                  From              Message
 ----     ------     ----                 ----               ------
...
...
 Warning  BackOff    6s (x7 over 107s)    kubelet            Back-off restarting failed container           

At first glance, this looks similar to the previous scenario: the pod crashed and is now in CrashLoopBackOff (Kubernetes keeps restarting it), but it's actually very different. Previously, the process inside the pod exited on its own (the JVM crashed with an out-of-memory error); this time, it was Kubernetes that killed the pod. The OOMKilled state indicates that Kubernetes stopped the pod because it exceeded its allocated memory limit. How is this possible?

By allocating 90% of the available memory to the heap, we assumed that everything else would fit into the remaining 10% (50 MB). That is not the case for our application, and the result is an overall memory footprint exceeding the 500 MB limit. The diagram below shows the pod exceeding its memory limit.

Key takeaway: look for OOMKilled in the pod's state.
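If you don't want to scan the full describe output, the termination reason can also be pulled out directly, for example:

~ kubectl get pod heapkiller -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

which for this scenario should print OOMKilled.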

Scenario 3 — The pod exceeds the node's available memory

The last, less common failure scenario is pod eviction. In this case, the memory request and limit differ. Kubernetes schedules pods onto nodes based on the request parameter, not the limit: if a node can satisfy the request, kube-scheduler will choose it, regardless of whether the node can also satisfy the limit. Before we schedule the pod onto a node, let's look at some details of that node:

~ kubectl describe node/docker-desktop

Allocatable:
  cpu:                4
  memory:             1933496Ki
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                850m (21%)   0 (0%)
  memory             240Mi (12%)  340Mi (18%)           

We can see that the node has about 2 GB of allocatable memory and that about 240 MB is already requested (by kube-system pods such as etcd and coredns).

For this case, we adjust the pod's parameters: the request stays at 500Mi, while the limit is raised to 2500Mi, and we reconfigure the application to fill the heap up to 2500 MB (previously 350 MB).
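The resources section of the manifest would then look roughly like this (a sketch; only the memory limit changes from the earlier pod spec):

resources:
  requests:
    memory: "500Mi"
    cpu: "500m"
  limits:
    memory: "2500Mi"
    cpu: "500m"

Once the pod is scheduled onto the node, we can see this allocation in the node description: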

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                1350m (33%)  500m (12%)
  memory             740Mi (39%)  2840Mi (150%)           

When the pod's memory usage reaches the node's available memory, it is killed, and we see the following details in the pod description:

~ kubectl describe pod/heapkiller

Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: memory.
Containers:
  heapkiller:
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Reason:       OOMKilled           

This indicates that the pod was evicted due to insufficient node memory. We can see more details in the node description:

~ kubectl describe node/docker-desktop

Events:
  Type     Reason                   Age                 From     Message
  ----     ------                   ----                ----     -------
  Warning  SystemOOM                1s                  kubelet  System OOM encountered, victim process: java, pid: 67144           

At this point, the CrashLoopBackOff starts and the pod keeps restarting. The following diagram depicts this scenario.

Key takeaway: look for the Evicted status on the pod and for low-memory events on the node.

Scenario 4 - Parameters are well configured and the application runs well

In the final scenario, the application runs properly with well-adjusted parameters: we set both the pod's memory request and limit to 500 MB and -XX:MaxRAMPercentage to 80%, as sketched below.
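The relevant parts of the manifest would look roughly like this (a sketch; everything else is unchanged from the earlier pod spec):

resources:
  requests:
    memory: "500Mi"
    cpu: "500m"
  limits:
    memory: "500Mi"
    cpu: "500m"
env:
  - name: JAVA_TOOL_OPTIONS
    value: '-XX:MaxRAMPercentage=80.0'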

Let's gather some statistics to understand what's happening, first at the node level and then inside the pod.

~ kubectl describe node/docker-desktop

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                1350m (33%)  500m (12%)
  memory             740Mi (39%)  840Mi (44%)           

The node looks healthy and has idle resources. Let's look inside the pod.

# Run from within the container
~ cat /sys/fs/cgroup/memory.current

523747328           

This shows the container's current memory usage: about 499 MB, right at the edge of the 500 MB limit. Let's see what is taking up this memory:

# Run from within the container
~ ps -o pid,rss,command ax

  PID   RSS   COMMAND
    1 501652  java -XX:NativeMemoryTracking=summary -jar /app.jar
   36   472   /bin/sh
   55  1348   ps -o pid,rss,command ax           

RSS (Resident Set Size) is a good estimate of the memory a process occupies. It shows that about 490 MB (501,652 KB) is taken by the Java process. Let's peel off another layer and look at the JVM's memory allocation. The flag we pass to the Java process, -XX:NativeMemoryTracking=summary, allows us to collect detailed runtime statistics about the Java memory spaces.

~ jcmd 1 VM.native_memory summary

Total: reserved=1824336KB, committed=480300KB
-                 Java Heap (reserved=409600KB, committed=409600KB)
                            (mmap: reserved=409600KB, committed=409600KB)

-                     Class (reserved=1049289KB, committed=4297KB)
                            (classes #6760)
                            (  instance classes #6258, array classes #502)
                            (malloc=713KB #15321)
                            (mmap: reserved=1048576KB, committed=3584KB)
                            (  Metadata:   )
                            (    reserved=32768KB, committed=24896KB)
                            (    used=24681KB)
                            (    waste=215KB =0.86%)
                            (  Class space:)
                            (    reserved=1048576KB, committed=3584KB)
                            (    used=3457KB)
                            (    waste=127KB =3.55%)

-                    Thread (reserved=59475KB, committed=2571KB)
                            (thread #29)
                            (stack: reserved=59392KB, committed=2488KB)
                            (malloc=51KB #178)
                            (arena=32KB #56)

-                      Code (reserved=248531KB, committed=14327KB)
                            (malloc=800KB #4785)
                            (mmap: reserved=247688KB, committed=13484KB)
                            (arena=43KB #45)

-                        GC (reserved=1365KB, committed=1365KB)
                            (malloc=25KB #83)
                            (mmap: reserved=1340KB, committed=1340KB)

-                  Compiler (reserved=204KB, committed=204KB)
                            (malloc=39KB #316)
                            (arena=165KB #5)

-                  Internal (reserved=283KB, committed=283KB)
                            (malloc=247KB #5209)
                            (mmap: reserved=36KB, committed=36KB)

-                     Other (reserved=26KB, committed=26KB)
                            (malloc=26KB #3)

-                    Symbol (reserved=6918KB, committed=6918KB)
                            (malloc=6206KB #163986)
                            (arena=712KB #1)

-    Native Memory Tracking (reserved=3018KB, committed=3018KB)
                            (malloc=6KB #92)
                            (tracking overhead=3012KB)

-        Shared class space (reserved=12288KB, committed=12224KB)
                            (mmap: reserved=12288KB, committed=12224KB)

-               Arena Chunk (reserved=176KB, committed=176KB)
                            (malloc=176KB)

-                   Logging (reserved=5KB, committed=5KB)
                            (malloc=5KB #219)

-                 Arguments (reserved=1KB, committed=1KB)
                            (malloc=1KB #53)

-                    Module (reserved=229KB, committed=229KB)
                            (malloc=229KB #1710)

-                 Safepoint (reserved=8KB, committed=8KB)
                            (mmap: reserved=8KB, committed=8KB)

-           Synchronization (reserved=48KB, committed=48KB)
                            (malloc=48KB #574)

-            Serviceability (reserved=1KB, committed=1KB)
                            (malloc=1KB #14)

-                 Metaspace (reserved=32870KB, committed=24998KB)
                            (malloc=102KB #52)
                            (mmap: reserved=32768KB, committed=24896KB)

-      String Deduplication (reserved=1KB, committed=1KB)
                            (malloc=1KB #8)           

It may go without saying, but this scenario is for illustrative purposes only; in a real-life application, I wouldn't recommend running with margins this tight. How much headroom you feel comfortable with will depend on how mature your observability practices are (in other words, how quickly you notice that something is wrong), how critical the workload is, and other factors such as failover.

Epilogue

Thank you for sticking with this long article! I'd like to offer some advice to help you stay out of trouble:

  1. Set the memory request equal to the limit. This way you avoid pods being evicted due to insufficient node resources (the downside is reduced node resource utilization).
  2. Increase the pod's memory limit only when a Java OutOfMemory error occurs. If an OOMKilled crash occurs instead, leave more headroom for non-heap memory.
  3. Set the maximum and initial heap sizes to the same value. That way you avoid the performance penalty of growing the heap, and you "fail fast" if the heap percentage, non-heap usage, or pod limit is misconfigured. See the sketch below.
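A minimal sketch of how tips 1 and 3 could look in the pod spec used throughout this article (the 80% value is illustrative; InitialRAMPercentage and MaxRAMPercentage are standard JVM flags):

resources:
  requests:
    memory: "500Mi"
  limits:
    memory: "500Mi"
env:
  - name: JAVA_TOOL_OPTIONS
    value: '-XX:InitialRAMPercentage=80.0 -XX:MaxRAMPercentage=80.0'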

Author: Ling Wan

Link: https://juejin.cn/post/7225141192606335031

Source: Rare Earth Nuggets