Java microservice production incidents: notes on an unusual OOM troubleshooting and analysis
After years on the job, no matter how careful you are with the production environment, it is inevitable that you will occasionally break something. In the mild cases you end up sweating and red-faced; in the bad ones the system goes down and money is lost. Behind every production incident there are valuable lessons and the blood-and-tears history of the project members. To better prevent and contain all kinds of incidents in the future, this series was opened to record incidents large and small over the long term. Some are personal experiences, some were heard second-hand, but all are real cases.
Note: to avoid unnecessary trouble and issues of business confidentiality, the specific names mentioned in the text are replaced with pseudonyms and pronouns.
Contents
- 1 Background of the incident
- 2 Incident analysis
- 3 Cause of the incident
- 4 Incident review
- 5 Impact of the incident
1 Background of the incident
At 14:19 on March 10, 2023, developers at Company C reported to the developers at Company A that one of the open interfaces had been inaccessible since 14:00 that day. The system in question is a basic data interface service that communicates over HTTP. As usual, the first step was to check for network problems, and the O&M staff confirmed that network connectivity was fine. The development team at Company A then asked O&M to restart the application service at 14:30, and the service briefly returned to normal. But about ten minutes later the phone rang again: the service was abnormal once more and could not be accessed. To keep the impact from spreading, Company A decided to urgently roll back the program to the last stable version. After the rollback, the business functions returned to normal. After a short sigh of relief, the troubleshooting began.
2 Incident analysis
We asked O&M to copy out the system logs and the application packages from before and after the update. Given the symptoms, the first guess was a memory problem. Fortunately, the application start-stop script already included the parameters -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/app/logs/app.dump (for environments where you cannot run jstack, jmap and similar commands directly against production, and in practice you usually cannot, the dump file is especially important). Sure enough, the app.dump file had appeared in the log directory, and searching the logs turned up several java.lang.OutOfMemoryError: Java heap space errors. Strangely, though, the OOM was thrown from a different place every time, and things gradually got complicated.
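As an aside, when jmap cannot be run against the production process but the application itself can be modified or reached over JMX, a heap dump can also be triggered programmatically. Below is a minimal sketch, assuming a HotSpot JDK (it relies on the com.sun.management API); the class name and output path are illustrative placeholders only:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {

    // Writes a heap dump to the given file; live = true dumps only reachable objects,
    // the same effect as jmap -dump:live. The target file must not already exist.
    public static void dump(String filePath, boolean live) throws Exception {
        HotSpotDiagnosticMXBean diagnosticBean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diagnosticBean.dumpHeap(filePath, live);
    }

    public static void main(String[] args) throws Exception {
        // Illustrative path only; MAT can open the resulting dump file.
        dump("/app/logs/manual.hprof", true);
    }
}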
Opening the dump file with MAT (Memory Analyzer Tool), I expected to find some class of object hogging memory, but to my surprise the Histogram showed only a little over a hundred live objects! Calculating the precise retained size gave much the same result, and checking the Outgoing References and Incoming References showed nothing obviously wrong either, which was baffling.
Wiping off the sweat: the log has clearly told us java.lang.OutOfMemoryError: Java heap space, so this is definitely a heap-space problem. The likely causes are:
- Too much data loaded into memory
- For example, database queries without a row limit, or file reads without a byte limit; the incident system clearly had none of these.
- Memory leaks (resources not closed / objects that cannot be reclaimed)
- OOM also occurs when a system leaves large numbers of IO resources unclosed, or misuses things like ThreadLocal (see the sketch after this list).
- The machine is short on memory
- The available memory simply cannot support the current business scenario: the machine has too little RAM, or the JVM memory parameters are unreasonable.
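To make the second category concrete, here is a minimal sketch of the classic ThreadLocal-in-a-thread-pool mistake (purely illustrative; this is not code from the incident system). Pool threads never die, so a value set on a ThreadLocal and never removed is retained for as long as the thread lives:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalLeakSketch {

    // Per-thread "cache" that is set on every task but never cleaned up.
    private static final ThreadLocal<byte[]> BUFFER = new ThreadLocal<>();

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(50);
        for (int i = 0; i < 10_000; i++) {
            pool.submit(() -> {
                BUFFER.set(new byte[10 * 1024 * 1024]); // simulate a per-request buffer
                // ... business logic ...
                // Missing BUFFER.remove(): each of the 50 pool threads ends up
                // permanently retaining its last 10 MB buffer.
            });
        }
        pool.shutdown();
    }
}

With enough pool threads, or larger per-request values, this pattern alone can exhaust the heap even though no single object looks suspicious in the histogram.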
If all the reasonable options have been ruled out, could the least reasonable one be the answer? So I started checking the machine's memory. According to O&M, the machine has 16 GB of RAM, and top showed the Java process occupying about 7.8 GB, which seemed fine.
But then another colleague noticed something: the application start-stop script had been changed in the last system upgrade. Comparing it with the old version of the script, the difference turned out to be the memory parameters:
The old version was:
-Xms8g -Xmx8g -Xmn3g
The new version had been changed to:
-Xms8g -Xmx8g -Xmn8g
Seeing this, all the colleagues in front of the screen were speechless...
3 Cause of the incident
Why does setting the -Xmn parameter to the same size as the -Xmx parameter cause OOM? The project runs on JDK 1.8, so recall the JDK 8 heap layout:
It is not hard to see that Heap Space Size = Young Space Size + Old Space Size, and the -Xmn parameter controls the size of the Young generation. When the Young generation is allowed to crowd out the entire heap and some objects need to be promoted to the Old generation, there is no room for them, which triggers a Full GC (the Young/Old split is easy to observe from inside the JVM; see the sketch after this list). After the Full GC, one of the following usually happens:
- The Full GC frees just enough room, the objects no longer need to be moved into the Old generation, and everyone is happy;
- The Young generation is still too full, the objects still have to be promoted, but the Old generation has no space for them, and bang: OOM;
- Or the object was headed for the Old generation in the first place (no matter how young it is), the Old generation has no space, and again: OOM.
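The Young/Old split itself is easy to observe from inside the JVM. The following small sketch (an illustration added here, not part of the incident code) prints the maximum capacity of each heap pool through the standard management API; under the JDK 8 Parallel collector the pools are typically named PS Eden Space, PS Survivor Space and PS Old Gen:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class HeapLayout {

    public static void main(String[] args) {
        // List every heap pool (Eden, Survivor, Old) and its maximum capacity.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                System.out.printf("%-18s max = %4d MB%n",
                        pool.getName(), pool.getUsage().getMax() / (1024 * 1024));
            }
        }
    }
}

Running it with java -Xmx512m -Xmn64m HeapLayout and then again with -Xmn512m makes the squeeze visible: as -Xmn approaches -Xmx, the Old pool shrinks to almost nothing.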
This explains why the system works normally for a short while right after startup and then, once something triggers promotion to the Old generation, throws an OOM error in a seemingly random place. So when does an object enter the Old generation? This part is interesting too; you can match the rules below against the places where OOM showed up in the log:
- Objects that have survived enough GC cycles (reached the tenuring threshold)
- Allocation of a large object (for example, one bigger than half of the Eden space)
- The surviving Eden objects after a GC exceed the capacity of the Survivor space
- After a GC of Eden + S0, the survivors cannot all fit into S1
In other words, under normal circumstances the -Xmn parameter should always be smaller than -Xmx, otherwise OOM errors will be triggered. We can construct a simple example to verify this scenario. First, a minimal Spring Boot program:
package com.example.oom;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.Random;

@SpringBootApplication
public class OomApplication {

    // A 128 MB array that stays reachable for the lifetime of the application.
    static final byte[] ARRAY = new byte[128 * 1024 * 1024];

    public static void main(String[] args) {
        SpringApplication.run(OomApplication.class, args);
    }

    @RestController
    public static class OomExampleController {

        @GetMapping("/oom")
        public int oom() {
            // Every request allocates another short-lived 128 MB array.
            byte[] temp = new byte[128 * 1024 * 1024];
            temp[0] = (byte) 0xff;
            temp[temp.length - 1] = (byte) 0xef;
            int noise = new Random().nextInt();
            ARRAY[0] = (byte) (temp[0] + temp[temp.length - 1] + noise);
            return ARRAY[0];
        }
    }
}
After packaging with the mvn clean package command, we start it with the following command:
java -Xms512m -Xmx512m -Xmn512m -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -Xloggc:gc.log -jar oom-1.0.0-RELEASE.jar
Then, with the help of Apache's ab tool (ApacheBench), we can run the verification. First, hit the Spring Boot endpoint above 100 times at a concurrency of 1:
ab -c 1 -n 100 http://localhost:8080/oom
You will find that it works perfectly well. Next, simulate what happens once user load picks up, with 100 requests at a concurrency of 2:
ab -c 2 -n 100 http://localhost:8080/oom
If the previous steps were done correctly, the Spring Boot application console should now be full of java.lang.OutOfMemoryError: Java heap space errors.
Then, in the GC log, you can see that there is essentially no space in the Old generation either before or after a GC; the tiny 512K that does exist was carved out by the JVM itself, which quietly adjusted our -Xmn setting at startup:
{Heap before GC invocations=279 (full 139):
PSYoungGen total 458752K, used 273877K [0x00000000e0080000, 0x0000000100000000, 0x0000000100000000)
eden space 393728K, 69% used [0x00000000e0080000,0x00000000f0bf5798,0x00000000f8100000)
from space 65024K, 0% used [0x00000000fc080000,0x00000000fc080000,0x0000000100000000)
to space 65024K, 0% used [0x00000000f8100000,0x00000000f8100000,0x00000000fc080000)
ParOldGen total 512K, used 506K [0x00000000e0000000, 0x00000000e0080000, 0x00000000e0080000)
object space 512K, 98% used [0x00000000e0000000,0x00000000e007e910,0x00000000e0080000)
Metaspace used 35959K, capacity 38240K, committed 38872K, reserved 1083392K
class space used 4533K, capacity 4953K, committed 5080K, reserved 1048576K
2023-04-07T01:44:25.348+0800: 57.446: [GC (Allocation Failure) --[PSYoungGen: 273877K->273877K(458752K)] 274384K->274384K(459264K), 0.0441401 secs] [Times: user=0.06 sys=0.30, real=0.04 secs]
Heap after GC invocations=279 (full 139):
PSYoungGen total 458752K, used 273877K [0x00000000e0080000, 0x0000000100000000, 0x0000000100000000)
eden space 393728K, 69% used [0x00000000e0080000,0x00000000f0bf5798,0x00000000f8100000)
from space 65024K, 0% used [0x00000000fc080000,0x00000000fc080000,0x0000000100000000)
to space 65024K, 9% used [0x00000000f8100000,0x00000000f86e2070,0x00000000fc080000)
ParOldGen total 512K, used 506K [0x00000000e0000000, 0x00000000e0080000, 0x00000000e0080000)
object space 512K, 98% used [0x00000000e0000000,0x00000000e007e910,0x00000000e0080000)
Metaspace used 35959K, capacity 38240K, committed 38872K, reserved 1083392K
class space used 4533K, capacity 4953K, committed 5080K, reserved 1048576K
}
{Heap before GC invocations=280 (full 140):
PSYoungGen total 458752K, used 273877K [0x00000000e0080000, 0x0000000100000000, 0x0000000100000000)
eden space 393728K, 69% used [0x00000000e0080000,0x00000000f0bf5798,0x00000000f8100000)
from space 65024K, 0% used [0x00000000fc080000,0x00000000fc080000,0x0000000100000000)
to space 65024K, 9% used [0x00000000f8100000,0x00000000f86e2070,0x00000000fc080000)
ParOldGen total 512K, used 506K [0x00000000e0000000, 0x00000000e0080000, 0x00000000e0080000)
object space 512K, 98% used [0x00000000e0000000,0x00000000e007e910,0x00000000e0080000)
Metaspace used 35959K, capacity 38240K, committed 38872K, reserved 1083392K
class space used 4533K, capacity 4953K, committed 5080K, reserved 1048576K
2023-04-07T01:44:25.392+0800: 57.490: [Full GC (Ergonomics) [PSYoungGen: 273877K->142631K(458752K)] [ParOldGen: 506K->506K(512K)] 274384K->143137K(459264K), [Metaspace: 35959K->35959K(1083392K)], 0.0248171 secs] [Times: user=0.14 sys=0.00, real=0.03 secs]
Then, without changing any code, let's adjust the startup parameters like this:
java -Xms512m -Xmx512m -Xmn64m -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -Xloggc:gc.log -jar oom-1.0.0-RELEASE.jar
You will find that everything works again: with the Young generation capped at 64 MB, roughly 448 MB is left for the Old generation, so the 128 MB arrays that no longer fit in Eden can be allocated there instead. This is an extreme example built purely for verification, and real production applications are far more complex, but it does not stop us from understanding the point.
4 Incident review
This was a typical man-made disaster, caused by a colleague's well-intentioned "tuning". More important than assigning blame, though, are the lessons it leaves us:
- Even application start-stop scripts should be treated as part of the program: include them in test verification and the go-live checklist, and forbid arbitrary changes;
- Very often the defaults are good enough; over-tuning easily does more harm than good.
5 Impact of the incident
The incident took a key business of Company C offline for half an hour and forced an emergency rollback of the production system. The person in charge at Company A prepared the incident report overnight.