Crashed! A "Too many open files" problem in a banking system at two o'clock in the morning

Sharing a personal experience: a bank's ESB system had been online for only three days when, at two o'clock in the morning, system monitoring showed a spike in load and calls began failing on a large scale. The bank immediately switched to the backup system, so there were no further losses. At three o'clock in the morning, the project manager called everyone to the bank to solve the problem.

Problem location

At first, the architect and the senior engineers suspected an OOM exception in the application, so they dumped the logs to check the JVM's memory usage. After a long search they found no OOM anomalies; all GC and stack logs looked normal.

The ESB's internal monitoring platform also showed no OOM abnormalities or other alarms in the application itself, which made things a bit troublesome.

After a bout of furious typing, checking Oracle and then the server, the architect finally found that the culprit was a Too many open files error raised while recording messages. What the hell is this?

Too many open files means that the number of open handles has exceeded the system limit. It is a common problem on Linux. "Files" here are not only regular files but also connection sockets, listening ports, and so on: Linux treats everything as a file, so operating on any of these requires a file descriptor.

The cause of the problem

After some troubleshooting, the architect pinned down the problem after breakfast. The ESB system stores messages temporarily, and after a temporary message file was loaded and deleted, its file handle was not released. On top of that, the logging configuration forcibly created log files as well. The number of file handles held eventually exceeded the Linux limit, so subsequent message logs could not be written, and in the end the application could not provide service.

The principle of file handles

In Linux, directories, character devices, block devices, sockets, printers, and so on are all abstracted as files. This is the often-quoted "everything in Linux is a file".

When the system operates on these files, it creates a file descriptor (commonly called an fd) for each open file to track access to it. These fds live in an open files table whose capacity is limited. Once the limit is exceeded, Linux has no fd left to allocate and rejects further file operations, and the application eventually sees a Too many open files error.

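The same error occurs when a server holds more socket connections than its descriptor limit allows. Each socket connection occupies a file descriptor, so the server starts rejecting new connections once the limit is reached. The frequently quoted figure of 65535 is the size of the port range, not the descriptor limit; the actual ceiling is whatever the open files setting permits.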
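To make the point that sockets and files draw from the same pool concrete, here is a minimal sketch that asks the JVM for its own descriptor usage. It assumes a Unix-like JVM that exposes `com.sun.management.UnixOperatingSystemMXBean`; the class name `FdUsage` and the helper `fdCounts` are made up for illustration and are not part of the ESB system.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.net.ServerSocket;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdUsage {
    /** Returns {open, max} descriptor counts for this JVM, or {-1, -1} if unsupported. */
    static long[] fdCounts() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return new long[] {-1, -1}; // not a Unix-like platform
    }

    public static void main(String[] args) throws Exception {
        long[] before = fdCounts();
        System.out.println("open=" + before[0] + " max=" + before[1]);

        // A listening socket occupies a descriptor just like a regular file does.
        try (ServerSocket server = new ServerSocket(0)) {
            System.out.println("open while listening=" + fdCounts()[0]);
        }
        System.out.println("open after close=" + fdCounts()[0]);
    }
}
```

Exposing these two numbers as a metric is one way to see a leak building up long before the limit is hit.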

How to solve it?

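View the file handle limit

In Linux, the ulimit -a command shows the resource limits of the current shell, including the maximum number of open files. Note that it reports limits, not the number of handles currently in use. The output below only illustrates the command; it is not from the real incident.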

[root@localhost ~]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 31142
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 31142
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
[root@localhost ~]#

Here, open files is the maximum number of file handles that a single process is allowed to open.

Next, find the PID of the process in question and look at the file handles it holds.

[root@localhost ~]# ps -ef|grep java
root      7642  7602  0 09:48 pts/0    00:00:00 grep --color=auto java
root     32053     1  0 Sep20 ?       00:07:34 java -jar swly-admin-0.0.9-8082.jar
[root@localhost ~]# lsof -p 32053 | wc -l
97
[root@localhost ~]# lsof -p 32053
COMMAND   PID USER   FD      TYPE             DEVICE  SIZE/OFF      NODE NAME
java    32053 root  cwd       DIR              253,0      4096   4721636 /data/swly/admin
java    32053 root  rtd       DIR              253,0       256       128 /
java    32053 root  txt       REG              253,0      8712 101830342 /usr/local/java/jdk/bin/java
java    32053 root  mem       REG              253,0    292520  34594181 /usr/local/java/jdk/jre/lib/amd64/libjpeg.so
java    32053 root  mem       REG              253,0   1048136  34594178 /usr/local/java/jdk/jre/lib/amd64/libmlib_image.so
java    32053 root  mem       REG              253,0    504840  34594192 /usr/local/java/jdk/jre/lib/amd64/libt2k.so
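Because the incident involved handles to files that had already been deleted, it can help to look for exactly those from inside the process. The sketch below is Linux-only and relies on procfs: it reads the symlinks under /proc/self/fd and keeps the ones whose target ends in " (deleted)". The class name `DeletedFdScan` and the temp-file prefix are hypothetical.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class DeletedFdScan {
    /** Lists this process's descriptors that still point at deleted files (Linux only). */
    static List<String> deletedTargets() throws IOException {
        List<String> result = new ArrayList<>();
        Path fdDir = Paths.get("/proc/self/fd");
        if (!Files.isDirectory(fdDir)) {
            return result; // no procfs available, nothing to report
        }
        try (DirectoryStream<Path> fds = Files.newDirectoryStream(fdDir)) {
            for (Path fd : fds) {
                try {
                    String target = Files.readSymbolicLink(fd).toString();
                    if (target.endsWith(" (deleted)")) {
                        result.add(fd.getFileName() + " -> " + target);
                    }
                } catch (IOException ignored) {
                    // The descriptor was closed while we were scanning; skip it.
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("esb-message-", ".tmp");
        try (FileInputStream in = new FileInputStream(tmp.toFile())) {
            Files.delete(tmp); // the file is unlinked, but the descriptor is still open
            deletedTargets().forEach(System.out::println);
        }
        System.out.println("deleted-but-open after close: " + deletedTargets().size());
    }
}
```

In the lsof output, the same situation shows up as entries whose NAME column is marked (deleted).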

Under normal circumstances, this problem can be solved by raising the open files limit, but in our case that would only be a stopgap. To really fix it, we had to read the code, where we found a problem in the logging logic: the file stream was never closed.
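The ESB code itself is not shown here, so the following is only a sketch of the general fix for an unclosed stream: wrap it in try-with-resources so that the descriptor is released even when the write fails. `MessageStore` and `storeSafely` are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MessageStore {
    /**
     * Writes a message to a file. try-with-resources calls close() on the
     * stream when the block exits, whether normally or via an exception,
     * so the file descriptor never outlives this method.
     */
    static void storeSafely(Path file, String message) throws IOException {
        try (OutputStream out = Files.newOutputStream(file)) {
            out.write(message.getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("esb-message-", ".tmp");
        try {
            storeSafely(tmp, "sample message");
            byte[] stored = Files.readAllBytes(tmp);
            System.out.println(new String(stored, StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(tmp); // clean up the temporary message file
        }
    }
}
```

The leaky variant is the same code without the try block: the stream is opened, written to, and then simply dropped, which leaves the descriptor open until the garbage collector happens to reclaim the stream.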

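As a temporary fix, run the following command to raise the handle limit. It affects only the current shell and the processes started from it, so the application has to be restarted from that shell to pick it up.

ulimit -n 2048 # lost after a reboot; non-root users can only raise it as far as the hard limit, e.g. 4096

Or make the change permanent by editing the configuration file.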

[root@localhost ~]# vi /etc/security/limits.conf
 
# Append at the end of the file
* soft nofile 655360
* hard nofile 655360

Summary

In the end, through the joint efforts of the architect and the senior engineers, the root cause was found by reading the code: the ESB system did not close the file stream when temporarily storing message logs, so as later messages came in, the earlier file descriptors were never released. When the core system rewrite began at one o'clock in the morning, a large number of requests passed through the ESB system, and the message logs ended up holding too many file handles, which caused the production incident. Finally, a job that deletes logs on a regular basis was added.
