
Online Full GC troubleshooting practice: a hands-on guide to diagnosing production issues

Author: JD Cloud developer

1. Problem discovery and troubleshooting

1.1 Finding the cause of the problem

The problem was first noticed via a container CPU alert from JDOS: CPU usage had reached 104%.

[Figure: JDOS container CPU usage alert]

Looking at the machine logs, we found that many threads were executing batch tasks at the time. Normally a batch task is low-CPU, high-memory, so we suspected the CPU spike was caused by Full GC (a similar situation had occurred before, and the problem went away after notifying the user and restarting the application).

We then checked the machine's memory usage through Tarzan:

[Figure: machine CPU and memory usage in Tarzan]

CPU usage was indeed high, but memory usage was only 62%, well within the normal range.

This was confusing: if Full GC were the cause, memory should have been full at this point.

Based on other indicators, such as a sudden burst of incoming traffic, we also suspected that a flood of JSF interface calls had saturated the CPU, which would explain why memory usage stayed low; that theory was slowly ruled out as well. We were somewhat stuck: the symptoms did not match any guess. With CPU climbing but memory flat, what could cause a one-sided CPU increase? We investigated in that direction for a long time, and those theories were rejected too.

Then it suddenly dawned on us: there was a "problem" with the monitoring.

More precisely, the problem was with the monitoring we were looking at: it was machine-level monitoring, not JVM monitoring!

CPU consumed by the JVM shows up directly in the machine's metrics, but heavy use of the JVM's heap is not obvious at the machine level, since the OS counts the heap the JVM has already committed as used memory regardless of how full it is.

So we went to SGM to look at the JVM metrics for the corresponding node:

[Figure: JVM heap usage and GC activity in SGM]

You can see that the heap was indeed repeatedly filling up and being collected, and the CPU spikes line up with the GC times.

From this we could conclude that Full GC was causing the problem.

1.2 Finding the cause of the Full GC

We first dumped heap snapshots before and after GC.

Then we used JProfiler for memory analysis. (JProfiler is a heap analysis tool that can attach directly to an online JVM to view information in real time, or analyze a dumped heap snapshot to inspect the heap at a particular moment.)

First unzip the dumped file, change its suffix to .bin, and open it. (We used the dump gadget that comes with Xingyun, but you can also log in to the machine and dump the file manually from the command line.)
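For reference, a manual dump with standard JDK tooling looks something like this (the PID and file paths are placeholders):

# Find the PID of the target Java process
jps -l
# Dump only live objects (this forces a full GC first), producing a binary .bin file
jmap -dump:live,format=b,file=/tmp/heap_live.bin <pid>
# Or dump everything, including unreachable objects, e.g. for a before-GC snapshot
jmap -dump:format=b,file=/tmp/heap_all.bin <pid>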

[Figure: opening the dumped heap snapshot in JProfiler]

Start by selecting Biggest Objects to see the largest objects in heap memory at that time.

As the figure shows, four List objects occupy nearly 900MB of memory, and we saw earlier that the maximum heap is only 1.3GB. Together with other objects, this easily fills the old generation and triggers Full GC.

[Figure: JProfiler Biggest Objects view, four List objects totaling nearly 900MB]

Select one of the biggest objects as the one to inspect.

At this point we can already locate where the large object comes from:

[Figure: reference chain locating the large List object]

At this point we could already roughly locate the problem. If you are still unsure, you can inspect the concrete object contents as follows:

[Figure: inspecting the contents of the large List object]

You can see that the large List object actually contains many Map objects, and each Map holds many key-value pairs.

Here you can also see the relevant property information in the Map.

You can also see the relevant information directly on the following interface:

[Figure: Map entries and their key-value pairs]

Then click all the way down to see the corresponding properties.

At this point, we have in principle found where the large objects live in the code.

2. Problem solving

2.1 Locating the large objects in the code and the root cause

Following the process above, we found the corresponding code and its logic.

The approximate logic of our project is this:

  1. First, the Excel sample uploaded by the user is parsed and loaded into memory as a List variable, the one we saw above. For a 200,000-row sample, each row has A fields at this stage, and the list occupies about 100MB.
  2. The code then iterates over the sample, adding some extra request fields to each row and using that data to request the relevant results. Each row now has A+N fields, and the list already occupies about 200MB.
  3. After the loop completes, this 200MB of data is written to a cache.
  4. Excel generation then fetches the 200MB of data from the cache and, based on the A fields recorded earlier, extracts only the original sample fields to fill the Excel file.

Represented as a flowchart:

[Figure: flowchart of the batch task logic]
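To make this concrete, here is a minimal sketch of the shape of that logic. The helper names (parseExcel, buildRequestData, requestResult, writeExcel) and the cache variable are hypothetical, for illustration only:

// Step 1: parse the uploaded Excel into one Map per row (~100MB for a 200,000-row sample)
List<Map<String, String>> samples = parseExcel(uploadedFile);

// Step 2: enrich every row with N extra request fields and call the downstream service;
// the list grows in place to roughly 200MB
for (Map<String, String> row : samples) {
    row.putAll(buildRequestData(row));     // hypothetical helper
    row.put("result", requestResult(row)); // hypothetical helper
}

// Step 3: the enriched ~200MB list is written to a cache
cache.put(taskId, samples);

// Step 4: read the list back and write only the original A sample fields to Excel
writeExcel(cache.get(taskId), originalFieldNames);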

Combined with screenshots from the actual troubleshooting:

[Figure: heap monitoring during troubleshooting, showing the post-GC floor gradually rising]

One notable phenomenon is that the memory floor after each GC gradually rises, corresponding to step 2 above: the data is gradually expanding in memory.

Conclusion:

The user-uploaded Excel sample is loaded into memory and stored as a List<Map<String, String>>. A 20MB Excel file stored this way expands to occupy roughly 120MB of heap memory, so this step alone consumes a great deal of heap. On top of that, because of the task logic, these large objects live in the JVM for 4 to 12 hours, so with enough concurrent tasks the JVM heap fills up easily.

Here is why HashMap storage causes memory bloat; the main reason is that its space efficiency is low:

Memory footprint of a Long object: in a HashMap<Long, Long>, only the two long values held by the Key and the Value are useful data, 16 bytes in total (2 × 8 bytes). Once these two long values are boxed into java.lang.Long objects, each carries an 8-byte Mark Word and an 8-byte Klass pointer in addition to the 8 bytes of long data, so each wrapper object occupies 24 bytes.

After the two Long objects are combined into a Map.Entry, there is an additional 16-byte object header (8-byte Mark Word + 8-byte Klass pointer), then an 8-byte next field, a 4-byte int hash field, and 4 bytes of alignment padding, making the Entry 32 bytes. Finally, the HashMap holds an 8-byte reference to this Entry. So storing two long values actually consumes Long(24 bytes) × 2 + Entry(32 bytes) + HashMap reference(8 bytes) = 88 bytes, and the space efficiency, useful data divided by total memory, is 16 bytes / 88 bytes ≈ 18%.

- Understanding the Java Virtual Machine 5.2.6
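If you want to verify numbers like these on your own JVM, the OpenJDK JOL (Java Object Layout) library can measure actual retained sizes. A minimal sketch, assuming JOL is on the classpath; note that with compressed pointers enabled the per-entry cost will come out somewhat lower than the book's 88-byte figure:

import java.util.HashMap;
import java.util.Map;
import org.openjdk.jol.info.GraphLayout;

public class MapFootprint {
    public static void main(String[] args) {
        Map<Long, Long> map = new HashMap<>();
        for (long i = 0; i < 10_000; i++) {
            map.put(i, i);
        }
        // Total retained size of the map's object graph vs. the 16 bytes of useful data per entry
        long totalBytes = GraphLayout.parseInstance(map).totalSize();
        long usefulBytes = 10_000L * 16;
        System.out.printf("total=%d bytes, useful=%d bytes, efficiency=%.1f%%%n",
                totalBytes, usefulBytes, 100.0 * usefulBytes / totalBytes);
    }
}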

Below are the heap objects dumped for a freshly uploaded Excel file: they occupy 128MB of memory, while the uploaded file itself is only 17.11MB.

[Figure: JProfiler heap dump showing ~128MB retained by objects from a 17.11MB Excel upload]

Space efficiency: 17.1MB/128MB≈13.4%

2.2 How to resolve this issue

Leaving aside whether the above process is reasonable, the solutions fall into two categories. One is the radical cure: don't keep the objects in JVM memory at all, but store them in a cache; if the data is not in the heap, the large-object problem disappears. The other is palliative: shrink the large objects so that ordinary daily usage no longer triggers frequent Full GC.

Both methods have their advantages and disadvantages:

2.2.1 Radical cure: don't keep the data in memory

The logic is simple: as each sample row is loaded, write it into the Redis cache keyed by its row number. We then only need to know how many rows there are, and we can read them back from the cache in row-number order.
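A minimal sketch of this idea, using Jedis and Jackson purely for illustration; the key scheme, TTL, and method names are assumptions, not our actual code:

import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;
import redis.clients.jedis.Jedis;

public class SampleRowCache {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Write each parsed row to Redis under its row number instead of holding the whole list on the heap
    public void putRow(Jedis jedis, String taskId, int rowNo, Map<String, String> row) throws Exception {
        jedis.setex("sample:" + taskId + ":" + rowNo, 12 * 3600, MAPPER.writeValueAsString(row));
    }

    // Read rows back one at a time, in row-number order, when generating the Excel output
    @SuppressWarnings("unchecked")
    public Map<String, String> getRow(Jedis jedis, String taskId, int rowNo) throws Exception {
        return (Map<String, String>) MAPPER.readValue(jedis.get("sample:" + taskId + ":" + rowNo), Map.class);
    }
}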

Advantages: it solves the problem at the root, so it should not recur; growth in data volume only requires adding the corresponding Redis resources.

Disadvantages: it consumes a lot of extra Redis cache space, and, practically speaking, this part of our codebase is old and obscure, so the change would take significant effort and regression testing.

[Figure: flowchart of the Redis-based approach]

2.2.2 Conservative cure: reduce the amount of data

Analyzing the process in 2.1, step 3 is completely unnecessary: writing the data to a cache only to read it straight back wastes cache space. (Presumably a historical artifact; we won't dig into it here.)

Second, the N extra fields added in step 2 are useless once the request completes, so we can delete them after the request finishes.

There are two options here. One is to simply remove the useless fields to shrink each Map and pass the result along as the parameter for Excel generation. The other is to discard the Maps entirely once the requests complete and re-read the user-uploaded Excel sample when generating the output.

Advantages: small changes, no complex regression testing required.

Disadvantages: with extremely large data volumes, Full GC can still occur.

[Figure: flowchart of the reduced-data approach]

We won't expand on the full implementation details.

Here is one of the implementations:

// Get the useful (original sample) fields
String[] colEnNames = (String[]) colNameMap.get(Constant.BATCH_COL_EN_NAMES);
List<String> colList = Arrays.asList(colEnNames);
// Remove the useless fields from the row's map
param.keySet().removeIf(key -> !colList.contains(key));

3. Further thoughts

First, a note: the monitoring graphs in this article come from an artificial reproduction of the scenario, showing ordinary GC behavior.

In the CPU usage graph you can see that the CPU rises do coincide with the GC times, but the reproduction never reaches the 104% CPU usage seen in the original incident.

[Figure: CPU usage and GC activity during the reproduction, with CPU rises coinciding with GC times]

The direct reason is simple: although the reproduced system does Full GC, it does not do so frequently.

Occasional, low-frequency Full GC rarely causes a CPU spike, which matches what we observed in the reproduction.

So what happened in the original incident?

[Figure: heap usage during the original incident]

As mentioned above, the large objects in heap memory expand gradually as tasks progress. When enough tasks run for long enough, the free space left after each Full GC shrinks further and further. Once that free space becomes small enough, each Full GC finishes only to find there is still not enough room, which immediately triggers the next GC. The end result is back-to-back GC cycles, and that is what drives CPU usage to soar.

4. Problem troubleshooting summary

  • When you hit high CPU usage in production, first check whether Full GC is the cause: watch the JVM monitoring, or use jstat-related commands (see the example after this list). Don't be misled by machine-level memory monitoring.
  • If GC is confirmed as the cause, you can attach JProfiler directly to the online JVM, or dump a heap snapshot and analyze it offline.
  • Look for the biggest objects first; in general it is large objects that cause Full GC. The picture may also be less obvious than our four huge objects, e.g. a dozen evenly sized 50MB objects; each case needs its own analysis.
  • Once the suspect objects are found with the tools above, locate the corresponding code, work out the specific cause through code analysis, check the guess against other observed symptoms, and confirm the real root cause.
  • Resolve the issue according to its cause.
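As an example of the jstat check mentioned in the first point (standard JDK tooling; the PID is a placeholder):

# Sample GC statistics every 1000 ms; key columns: O = old-gen occupancy %, FGC = Full GC count, FGCT = total Full GC time
jstat -gcutil <pid> 1000
# If O stays near 100% and FGC climbs on every sample, the JVM is stuck in back-to-back Full GCs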

Of course, this was not a particularly complicated case. Different systems will have different memory behavior and must be analyzed on their own terms; what this problem teaches is the general approach to troubleshooting and resolving such issues.

Author: Han Kai, JD Technology

Content source: JD Cloud Developer Community