
Uncovering an online JVM memory overflow caused by FileSystem


This article walks through the whole process of analyzing and solving an online memory overflow caused by a memory leak in the use of the FileSystem class.

Memory leak: an object or variable that is no longer used by the program still occupies storage space, and the JVM cannot reclaim it. A single memory leak may not seem to have much impact, but the consequence of leaks accumulating is a memory overflow.

Memory overflow (out of memory): an error where the program can no longer execute because the allocated memory space is insufficient or used improperly; the JVM reports an OOM error.

1. Background

Over the weekend, Xiaoye was in the middle of a match in Honor of Kings when his phone suddenly received a flood of machine CPU alarms (CPU usage above 80%), followed by Full GC alarms for the service. This service is a very important one for Xiaoye's project team, so he put down the game, opened his computer, and started to investigate.


Figure 1.1 CPU alarm and Full GC alarm

2. Problem finding

2.1 Monitoring and viewing

Opening the service monitoring to check the CPU and Full GC metrics, we can see that both show an abnormal spike at the same point in time.


Figure 2.1 CPU usage


Figure 2.2 Full GC count

2.2 Memory Leaks

The frequent Full GCs tell us that something must be wrong with the service's memory reclamation, so we checked the monitoring of the service's heap, old generation, and young generation. The resident memory chart of the old generation shows that more and more old-generation memory cannot be reclaimed, until the resident memory is eventually used up completely. This is an obvious memory leak.


Figure 2.3 Old generation memory


Figure 2.4 JVM memory

2.3 Memory Overflow

The online error logs also clearly show that the service eventually threw an OOM. So the root cause of the problem is a memory leak that led to a memory overflow (OOM), which finally made the service unavailable.


Figure 2.5 OOM logs

3. Problem troubleshooting

3.1 Heap memory analysis

After confirming that the cause was a memory leak, we immediately dumped a memory snapshot of the service, imported the dump file into MAT (Eclipse Memory Analyzer), and opened the Leak Suspects view to look at the suspected leak points.


Figure 3.1 Memory object analysis


Figure 3.2 Object link diagram

As shown in Figure 3.1, the dump file contains 2.3 GB of heap memory, of which org.apache.hadoop.conf.Configuration objects account for 1.8 GB, or 78.63% of the entire heap.

Expanding the object's associated objects and reference paths, we can see that the memory is mainly occupied by a HashMap, which is held by a FileSystem.Cache object, which in turn is held by FileSystem. We can preliminarily conclude that the memory leak is most likely related to FileSystem.

3.2 Source Code Analysis

Having found the leaking objects, the next step is to find the code that leaks them.

In the code shown in Figure 3.3, every interaction with HDFS establishes a new connection and creates a new FileSystem object, but the close() method is never called afterwards to release the connection.

However, both the Configuration instance and the FileSystem instance here are local variables; once the method finishes, both objects should be eligible for garbage collection by the JVM. How can this cause a memory leak?


Figure 3.3
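The screenshot itself is not reproduced here, so below is a hypothetical reconstruction of the pattern it shows: a new Configuration and FileSystem are created for every HDFS interaction via get(uri, conf, user), and close() is never called. Class, method, path, and user names are illustrative, not the project's actual code.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReader {

    // Called for every request: builds a new Configuration and FileSystem each time.
    public void readFromHdfs(String pathStr, String hdfsUser) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI(pathStr), conf, hdfsUser);
        try (FSDataInputStream in = fs.open(new Path(pathStr))) {
            // ... read and process the data ...
        }
        // BUG: fs.close() is never called. Because get(uri, conf, user) builds a new
        // UGI/Subject on every call, each invocation also leaves one more FileSystem
        // (and its Configuration) pinned in the static FileSystem.Cache.
    }
}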

(1) Hypothesis 1: Does FileSystem hold constant (static) objects?

Next, let's look at the source code of the FileSystem class. Its init and get methods are as follows:


Figure 3.4

As the last line of code in Figure 3.4 shows, the FileSystem class has a cache, and the parameter named by disableCacheName controls whether the cache is bypassed. The default value of that parameter is false, so by default the FileSystem instance is returned from the CACHE object.
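For reference, a condensed sketch of that logic, paraphrased from the Hadoop source (scheme/authority defaulting omitted; details vary slightly by version):

/** A cache of filesystems that have been created. Static, so it lives as long as the class. */
private static final Cache CACHE = new Cache();

public static FileSystem get(URI uri, Configuration conf) throws IOException {
    String scheme = uri.getScheme();
    // ... defaulting of scheme/authority from fs.defaultFS omitted ...
    String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
    if (conf.getBoolean(disableCacheName, false)) {
        return createFileSystem(uri, conf);   // caching disabled: always create a new instance
    }
    return CACHE.get(uri, conf);              // default: return the instance from the static CACHE
}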


Figure 3.5

As Figure 3.5 shows, CACHE is a static field of the FileSystem class, which means the cache object lives for the lifetime of the class and is never reclaimed.

Let's take a look at the Cache.get method:

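A condensed sketch of Cache.get and the getInternal helper it delegates to, paraphrased from the Hadoop source (shutdown-hook registration trimmed):

FileSystem get(URI uri, Configuration conf) throws IOException {
    Key key = new Key(uri, conf);
    return getInternal(uri, conf, key);
}

private FileSystem getInternal(URI uri, Configuration conf, Key key) throws IOException {
    FileSystem fs;
    synchronized (this) {
        fs = map.get(key);
    }
    if (fs != null) {
        return fs;                        // cache hit: reuse the existing FileSystem
    }

    fs = createFileSystem(uri, conf);     // cache miss: create a new FileSystem
    synchronized (this) {
        FileSystem oldfs = map.get(key);
        if (oldfs != null) {              // another thread created one in the meantime
            fs.close();
            return oldfs;
        }
        // ... registration of the clientFinalizer shutdown hook omitted ...
        fs.key = key;
        map.put(key, fs);                 // cache the new FileSystem under its Cache.Key
        if (conf.getBoolean("fs.automatic.close", true)) {
            toAutoClose.add(key);         // close it automatically when the client shuts down
        }
        return fs;
    }
}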

As you can see from this code:

  1. The Cache class maintains a Map that caches the FileSystem objects whose connections have been created; the key of the Map is a Cache.Key object. Each time, the FileSystem is looked up by its Cache.Key, and if it is not found, the flow continues and creates a new one.
  2. The Cache class also maintains a Set (toAutoClose) that records the connections to be closed automatically. Connections in this set are closed automatically when the client shuts down.
  3. Every newly created FileSystem is stored in the Cache's Map, with Cache.Key as the key and the FileSystem as the value. Whether the same HDFS URI gets cached multiple times depends on the hashCode method of Cache.Key.

The hashCode method of Cache.Key is as follows:

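Paraphrased from the Hadoop source, it looks roughly like this:

@Override
public int hashCode() {
    return (scheme + authority).hashCode() + ugi.hashCode() + (int) unique;
}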

The scheme and authority variables are Strings; for the same URI their hashCodes are identical, and the unique parameter is 0 every time. Therefore, the hashCode of Cache.Key is effectively determined by ugi.hashCode().

From the code analysis above, we can summarize the following:

  1. The business code creates a new FileSystem connection for every interaction with HDFS and never closes the connection when it is done.
  2. FileSystem has a built-in static Cache, which contains a Map used to cache the FileSystem objects whose connections have already been created.
  3. The parameter fs.hdfs.impl.disable.cache controls whether FileSystem caching is disabled; it defaults to false, i.e., caching is enabled.
  4. In the Cache's Map, the key is a Cache.Key object, which is determined by four fields: scheme, authority, ugi, and unique, as shown in the Cache.Key hashCode method above.

(2) Hypothesis 2: Does FileSystem cache the same HDFS URI multiple times?

The FileSystem.Cache.Key constructor is shown below; ugi is determined by UserGroupInformation's getCurrentUser().

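A sketch of the two constructors, paraphrased from the Hadoop source; note the last line, which captures the current UGI:

Key(URI uri, Configuration conf) throws IOException {
    this(uri, conf, 0);
}

Key(URI uri, Configuration conf, long unique) throws IOException {
    scheme = uri.getScheme() == null ? "" : StringUtils.toLowerCase(uri.getScheme());
    authority = uri.getAuthority() == null ? "" : StringUtils.toLowerCase(uri.getAuthority());
    this.unique = unique;
    this.ugi = UserGroupInformation.getCurrentUser();   // the ugi that feeds into hashCode()
}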

Let's continue with UserGroupInformation's getCurrentUser() method, as follows:

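A sketch paraphrased from the Hadoop source:

public static synchronized UserGroupInformation getCurrentUser() throws IOException {
    AccessControlContext context = AccessController.getContext();
    Subject subject = Subject.getSubject(context);
    if (subject == null || subject.getPrincipals(User.class).isEmpty()) {
        return getLoginUser();                     // no Subject on the context: fall back to the login user
    } else {
        return new UserGroupInformation(subject);  // wrap the Subject carried by the current context
    }
}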

The key is whether a Subject object can be obtained from the AccessControlContext. In this case, when FileSystem is obtained via get(final URI uri, final Configuration conf, final String user), debugging shows that a new Subject object is obtained every time. In other words, a new FileSystem object is cached for the same HDFS path on every call.

Hypothesis 2 is confirmed: the same HDFS URI is cached multiple times, so the cache expands rapidly, and since the cache has no expiration time or eviction policy, this eventually leads to a memory overflow.

(3) Why does FileSystem cache duplicate entries?

So why do we get a new Subject object every time? Let's look at the code that obtains the AccessControlContext:

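In the JDK, AccessController.getContext is roughly:

public static AccessControlContext getContext() {
    AccessControlContext acc = getStackAccessControlContext();
    if (acc == null) {
        // only system code on the stack: return an all-privileged context
        return new AccessControlContext(null, true);
    } else {
        return acc.optimize();
    }
}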

The critical call here is the getStackAccessControlContext method, which is a native method.

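Its declaration in the JDK's AccessController is roughly:

private static native AccessControlContext getStackAccessControlContext();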

This method returns an AccessControlContext built from the protection domains (and their permissions) of the current call stack.

Looking at the get(final URI uri, final Configuration conf, final String user) method in Figure 3.6, we can see the following:

  • First, a UserGroupInformation object is obtained via the UserGroupInformation.getBestUGI method.
  • Then UserGroupInformation's doAs method is used to call the get(URI uri, Configuration conf) method.
  • Figure 3.7 shows the implementation of UserGroupInformation.getBestUGI. Pay attention to the two parameters passed in: ticketCachePath and user. ticketCachePath is the value of the configuration hadoop.security.kerberos.ticket.cache.path; in this case it is not configured, so ticketCachePath is null. The user parameter is the username passed in.
  • Since ticketCachePath is null and user is not, the createRemoteUser method in Figure 3.7 is eventually executed.

Figure 3.6


Figure 3.7


Figure 3.8

As the code highlighted in red in Figure 3.8 shows, the createRemoteUser method creates a new Subject object and then builds a UserGroupInformation object from it. At this point the UserGroupInformation.getBestUGI method completes.
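Putting Figures 3.6 to 3.8 together, the call chain looks roughly like this (paraphrased from the Hadoop source; the real code reads the ticket cache path via a CommonConfigurationKeys constant):

public static FileSystem get(final URI uri, final Configuration conf,
        final String user) throws IOException, InterruptedException {
    String ticketCachePath = conf.get("hadoop.security.kerberos.ticket.cache.path");
    UserGroupInformation ugi = UserGroupInformation.getBestUGI(ticketCachePath, user);
    return ugi.doAs(new PrivilegedExceptionAction<FileSystem>() {
        @Override
        public FileSystem run() throws IOException {
            return get(uri, conf);
        }
    });
}

public static UserGroupInformation getBestUGI(String ticketCachePath, String user)
        throws IOException {
    if (ticketCachePath != null) {
        return getUGIFromTicketCache(ticketCachePath, user);
    } else if (user == null) {
        return getCurrentUser();
    } else {
        return createRemoteUser(user);   // our case: ticketCachePath is null, user is set
    }
}

public static UserGroupInformation createRemoteUser(String user) {
    return createRemoteUser(user, AuthMethod.SIMPLE);
}

public static UserGroupInformation createRemoteUser(String user, AuthMethod authMethod) {
    if (user == null || user.isEmpty()) {
        throw new IllegalArgumentException("Null user");
    }
    Subject subject = new Subject();                        // a brand-new Subject on every call
    subject.getPrincipals().add(new User(user));
    UserGroupInformation result = new UserGroupInformation(subject);
    result.setAuthenticationMethod(authMethod);
    return result;
}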

Next, let's look at the UserGroupInformation.doAs method (the last method executed by FileSystem.get(final URI uri, final Configuration conf, final String user)), as follows:

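A condensed sketch, paraphrased from the Hadoop source (error handling shortened):

public <T> T doAs(PrivilegedExceptionAction<T> action)
        throws IOException, InterruptedException {
    try {
        logPrivilegedAction(subject, action);
        return Subject.doAs(subject, action);   // run the action as this UGI's Subject
    } catch (PrivilegedActionException pae) {
        Throwable cause = pae.getCause();
        if (cause instanceof IOException) {
            throw (IOException) cause;
        } else if (cause instanceof InterruptedException) {
            throw (InterruptedException) cause;
        } else {
            throw new UndeclaredThrowableException(cause);
        }
    }
}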

This in turn calls the Subject.doAs method, as follows:

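A condensed sketch of the JDK implementation (security-manager and null checks omitted; createContext is a private helper inside Subject):

public static <T> T doAs(final Subject subject,
        final java.security.PrivilegedExceptionAction<T> action)
        throws java.security.PrivilegedActionException {
    // Build an AccessControlContext that carries the given Subject ...
    final AccessControlContext currentAcc = AccessController.getContext();
    // ... and run the action under that context.
    return java.security.AccessController.doPrivileged(action, createContext(subject, currentAcc));
}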

Finally, the AccessController.doPrivileged method is called, as follows:

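Its signature in the JDK is roughly:

@CallerSensitive
public static native <T> T doPrivileged(PrivilegedExceptionAction<T> action,
                                        AccessControlContext context)
        throws PrivilegedActionException;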

This is a native method that executes the given PrivilegedExceptionAction under the specified AccessControlContext, i.e., it calls the action's run method, which here is FileSystem.get(uri, conf).

At this point we can explain why, in this case, when FileSystem is created via get(final URI uri, final Configuration conf, final String user), the hashCode of the Cache.Key stored in the FileSystem Cache is different every time.

To sum up:

  1. When FileSystem is created via get(final URI uri, final Configuration conf, final String user), a new UserGroupInformation object and a new Subject object are created on every call.
  2. When the hashCode of the Cache.Key object is computed, the UserGroupInformation.hashCode method is called.
  3. UserGroupInformation.hashCode is computed as System.identityHashCode(subject), i.e., it returns the same hashCode only if the Subject is the same object. Since the Subject is different every time in this case, the computed hashCodes differ (see the sketch after this list).
  4. As a result, the hashCode of Cache.Key is different on every call, and the FileSystem cache keeps accumulating new entries.
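For reference, the two relevant methods of UserGroupInformation, paraphrased from the Hadoop source:

@Override
public int hashCode() {
    return System.identityHashCode(subject);    // identity of the Subject object, not its contents
}

@Override
public boolean equals(Object o) {
    if (o == this) {
        return true;
    } else if (o == null || getClass() != o.getClass()) {
        return false;
    } else {
        return subject == ((UserGroupInformation) o).subject;   // reference equality on the Subject
    }
}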

(4) The correct way to use FileSystem

From the analysis above, FileSystem.Cache was not doing its job. So why was this cache designed in the first place? It turns out we were simply not using it correctly.

In FileSystem, there are two overloaded get methods:

public static FileSystem get(final URI uri, final Configuration conf, final String user)
public static FileSystem get(URI uri, Configuration conf)           

We can see that FileSystem.get(final URI uri, final Configuration conf, final String user) ultimately calls FileSystem.get(URI uri, Configuration conf); the difference is that get(URI uri, Configuration conf) does not create a new Subject on each call.


Figure 3.9

Without the step of creating a new Subject, the subject in Figure 3.9 is null, and the getLoginUser method is finally used to obtain loginUser. loginUser is a static variable, so once it is initialized, the same object is reused afterwards, and UserGroupInformation.hashCode returns the same value every time. In other words, this path can make proper use of FileSystem's CACHE.


Figure 3.10

4. Solution

Based on the analysis above, there are two ways to solve the FileSystem memory leak:

(1) Use public static FileSystem get(URI uri, Configuration conf):

  • This method uses the FileSystem cache, so there is only one FileSystem connection object for the same HDFS URI.
  • Use System.setProperty("HADOOP_USER_NAME", "hive") to set the access user.
  • By default, fs.automatic.close=true, i.e., all connections are closed via the shutdown hook (a usage sketch follows this list).
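A minimal usage sketch of option (1); the user name "hive", the URI, and the path are illustrative:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SharedFileSystemExample {

    public static void main(String[] args) throws Exception {
        System.setProperty("HADOOP_USER_NAME", "hive");    // set the access user once

        Configuration conf = new Configuration();
        URI uri = new URI("hdfs://nameservice1/user/hive/warehouse");
        FileSystem fs = FileSystem.get(uri, conf);

        // The same URI and login user always map to the same Cache.Key, so this
        // FileSystem instance is shared across calls. Do not close it after every
        // request; with fs.automatic.close=true it is closed by the shutdown hook.
        System.out.println(fs.exists(new Path("/user/hive/warehouse")));
    }
}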

(2) Use public static FileSystem get(final URI uri, final Configuration conf, final String user):

  • As analyzed above, this method defeats the FileSystem cache: a new entry is added to the Cache's Map on every call and can never be reclaimed.
  • One solution is to ensure that only one FileSystem connection object exists for the same HDFS URI.
  • The other is to call close() after each use of the FileSystem, which removes the FileSystem from the cache (see the sketch below).
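A minimal sketch of option (2), closing the FileSystem after each use; the path and user parameters are illustrative:

import java.io.ByteArrayOutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CloseAfterUseExample {

    public byte[] readFile(String pathStr, String hdfsUser) throws Exception {
        Configuration conf = new Configuration();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // FileSystem implements Closeable; close() also removes the instance from FileSystem.Cache.
        try (FileSystem fs = FileSystem.get(new URI(pathStr), conf, hdfsUser);
             FSDataInputStream in = fs.open(new Path(pathStr))) {
            IOUtils.copyBytes(in, out, 4096, false);
        }
        return out.toByteArray();
    }
}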

To keep the changes to our existing code minimal, we chose the second approach: closing the FileSystem object every time we are done with it.

5. Optimization results

After the fixed code was released, the old-generation memory monitoring shows that memory can be reclaimed normally again, and the problem is finally solved.


6. Summary

Memory overflow is one of the most common problems in Java development, and it is usually caused by memory leaks that prevent memory from being reclaimed. In this article, we walked through the complete handling of an online memory overflow in detail.

To summarize, here are the common steps we take when we encounter a memory overflow:

(1) Generate a heap dump file:

Add the following to the service startup command:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/base           

This makes the service automatically dump a heap file when an OOM occurs; alternatively, use the jmap command (e.g., jmap -dump:live,format=b,file=heap.hprof <pid>) to dump the heap manually.

(2) Analyze the heap dump: memory analysis tools help us dig deeper into the overflow and find its cause. Commonly used tools include:

  • Eclipse Memory Analyzer: An open-source Java memory analysis tool that helps us quickly locate memory leaks.
  • VisualVM: a GUI-based tool that helps us analyze the memory usage of Java applications.

(3) Locate the specific memory leak code based on the heap memory analysis.

(4) Fix the leaking code, then redeploy and verify.

Memory leaks are a common cause of memory overflows, but they are not the only cause. Other common causes include oversized objects, heap memory settings that are too small, and infinite-loop calls.

When we encounter a memory overflow, we need to think about it from multiple aspects and analyze it from different angles. The methods and tools mentioned above, together with various kinds of monitoring, help us locate and solve problems quickly and improve the stability and availability of our systems.

Author: Ye Jidong

Source: WeChat official account "vivo Internet Technology"

Source: https://mp.weixin.qq.com/s/_OtCE-BBQiLRAS14ZtDDJw
