
Where does my memory go?

author:Architectural Raiders for Corporal Jritzvat

1. Preface

  In recent years of developing large-scale applications, the problems I have run into most often during performance tuning, or while chasing hard bugs, have been memory-related. For example: the "memory leak" behavior of the glibc allocator ptmalloc and of Google's allocator tcmalloc, that is, memory that is never returned to the operating system; the poor allocation performance of ptmalloc; and, as a system runs for a long time, buffer/cache being used so heavily by some applications that it nearly exhausts system memory, causing other applications' memory requests to fail.

  Memory-related problems keep surfacing because memory simply matters too much. This article is again about memory: tracking down who, that is, which applications, stole the memory counted under buffer/cache!

  Most articles about buffer/cache discuss how to free it and return it to the system; very few analyze the reasons behind the consumption. Why does buffer/cache grow, and which programs are using it? I believe this puzzles many peers too, so in this article I want to share some methods for tracking down buffer/cache consumption, as a reference for diagnosing and fixing similar problems.

2. Problem description

  As shown in Figures 1 and 2 below, buffer/cache occupies 46GB of memory, 37% of total system memory, which is already a very high proportion. It is not freed over long periods, and the memory available to other processes on the system is almost gone.

[Figure 1: memory usage]
[Figure 2: memory usage]

  What problems does this cause over time? The most immediate one is that other processes can no longer get memory, for example requests for larger memory blocks fail. I have shared how to diagnose that kind of problem before (see https://www.cnblogs.com/t-bar/p/16626951.html), where I also described the buffer/cache release methods in detail, which solved the urgent need at the time.

  So why am I entangled with buffer/cache again recently?

  "echo 1 > /proc/sys/vm/drop_caches" frees all caches. The cache holds file data that programs on the system have loaded into memory while running, and its benefit is that the next time the CPU reads those files, it is much faster than the first read from disk. Executing drop_caches empties all of it, which creates a problem: when a program needs data that used to sit in the cache, it must re-read it from disk, producing IO waits and IO contention that drag down performance. On some platforms we have seen high-performance programs suffer performance jitter precisely because of this kind of rough cache emptying. So we can no longer dodge the question of who is using buffer/cache, as we did before; the direct, brute-force release strategy fails on some platforms.
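As a small, hedged sketch of that flush-and-observe cycle (the helper names are mine; writing to drop_caches requires root, and the kernel only drops clean pages, which is why the sync comes first):

```python
import os

def meminfo_kb():
    """Parse /proc/meminfo into a dict of integer kB values."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.split()[0])  # first token after ':' is the number
    return info

def drop_page_cache():
    """Equivalent of: sync; echo 1 > /proc/sys/vm/drop_caches. Returns False without root."""
    os.sync()  # write back dirty pages first; drop_caches discards clean pages only
    try:
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("1\n")
        return True
    except OSError:  # not root, or /proc/sys mounted read-only
        return False

before = meminfo_kb()
dropped = drop_page_cache()
after = meminfo_kb()
print(f"dropped={dropped}, Cached: {before['Cached']} kB -> {after['Cached']} kB")
```

Run as root, Cached should fall sharply after the call; as an ordinary user the function simply reports that it could not drop anything.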

  So the problem in front of us is: who is occupying buffer/cache, and once we know who, can we stop them from using it so heavily? Faced with this, the boss has been getting angry again lately.


3. Buffer/cache usage tracing

  Let's start with the idea behind tracing buffer/cache usage.

  1. hcache

  Some posts on the Internet say hcache can show which files are using the cache. Can hcache really give us a full picture of buffer/cache? Let's take a look.

  As described above, buffer/cache currently occupies 46GB of memory. First use hcache to check the top 100 cache consumers, as shown below (only the top of the output is intercepted).

[Figure: hcache top-100 cache usage (truncated)]

  Summing the sizes of the top 100, or even the top 200, gives only a few GB, nowhere near 46GB. Clearly hcache misses a large share of cache usage.

  hcache can also show the cache currently used by a single process. Let's look at clickhouse's cache usage; the result is shown below.

[Figure: hcache output for the clickhouse process]

  For the running clickhouse, hcache only shows the cache occupied by the program's own executable file; the files the program has opened while running are not counted at all. (One incidental observation: after a program is loaded, the memory used by its executable and its dependent libraries is accounted under buffer/cache.)

[Figure: cache occupied by program executables and library files]

  From the results above, hcache has serious shortcomings: it offers only a rough view of the cache used by executable files or library files, with no statistics on the cache a program uses while running, so its value for investigating cache occupation is very limited.

  2. top + lsof + fincore

  After searching through a lot of material, there really is no other ready-made way, besides hcache, to measure the cache consumed by running programs, and hcache itself is unreliable. With no direct route, the only option is an indirect one (besiege Wei to rescue Zhao, as the idiom goes), which is exactly why the distribution of buffer/cache is so inconvenient to track down.

  Where to start? The top command gives the direction: programs with high CPU usage that also hold a fair amount of memory are the ones to check, because only they are likely to be using the cache continuously. That settles the general direction of the investigation.

[Figure: top output]

  What next? Any use of buffer/cache must involve files; as the saying goes, in Linux everything is a file. Is there a way to see, in real time, which files a process currently has open? The lsof command can! Using lsof on clickhouse, the files open at one moment are shown in Figures 7 and 8 below. The output is long, so Figure 7 intercepts only the first part.

[Figure 7: lsof output for clickhouse (first part)]

  Figure 8 keeps only entries with TYPE=REG (REG means a regular file; DIR is a directory, and so on), that is, the regular files clickhouse currently has open and in use.

[Figure 8: lsof output, REG entries only]

  Running lsof -p $(pidof clickhouse-server) over and over, the file names are different every time. So clickhouse constantly opens, reads, writes, and closes files as it runs. Highly suspicious.

  What next? Is there a way to check, in real time, whether these files are currently in the cache, and how much each occupies? Indeed there is: fincore reports the cache used by a given file (link: https://github.com/david415/linux-ftools). The wheels are all there; what more could you want?

  Command line: lsof -p $(pidof clickhouse-server) | grep 'REG' | awk '{print $9}' | xargs ./fincore --pages=false --summarize --only-cached


[Figure 9: fincore summary of the cache used by files clickhouse has open]

  fincore shows that, at the moment the command ran, the files clickhouse had open used about 1.2GB of cache in total. The exploration is getting closer and closer to the question raised earlier: who exactly is occupying buffer/cache?
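Under the hood, tools like fincore ask the kernel, page by page, whether a file's contents are resident in the page cache, via the mincore(2) system call. A minimal Python sketch of the same idea (Linux only, via ctypes; cached_pages is my own illustrative helper, not fincore's code):

```python
import ctypes
import mmap
import os

libc = ctypes.CDLL("libc.so.6", use_errno=True)

def cached_pages(path):
    """Report (pages_in_cache, total_pages) for one file, like fincore does."""
    size = os.path.getsize(path)
    page = mmap.PAGESIZE
    npages = (size + page - 1) // page
    if npages == 0:
        return 0, 0
    with open(path, "rb") as f:
        # MAP_PRIVATE + PROT_WRITE lets ctypes export a writable buffer
        # without ever modifying the file (copy-on-write, and we never write).
        with mmap.mmap(f.fileno(), 0, flags=mmap.MAP_PRIVATE,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as m:
            vec = (ctypes.c_ubyte * npages)()      # one byte per page
            buf = (ctypes.c_ubyte * size).from_buffer(m)
            rc = libc.mincore(buf, ctypes.c_size_t(size), vec)
            del buf  # release the exported buffer so the mmap can close
            if rc != 0:
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
            # bit 0 of each vec entry = page resident in memory
            return sum(v & 1 for v in vec), npages
```

For a file that was just written, most or all pages will typically be reported resident; for a file untouched since boot, usually none.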

  Through top + lsof we found a very important clue: clickhouse frequently opens many files under /opt/runtime/esch/ch/store. So what is under this directory? Does it use cache? Is clickhouse the big cache consumer? Resolving these doubts creates another need: a tool that can report the cache usage of a specified directory. fincore has an Achilles heel: it can only measure a specified file, not a directory, let alone nested directories. So it is time for vmtouch to take the stage (link: https://github.com/hoytech/vmtouch). Once again: the wheels are all there; what more could you want?

  3. vmtouch

  vmtouch can report the cache occupied by a specified directory, even a nested one.

  I couldn't wait to get straight to the point: what is under clickhouse's directory /opt/runtime/esch/ch/store, and how much cache does it use? Some files in this directory are intercepted in the figure below.

[Figure: files under /opt/runtime/esch/ch/store]

  Directly count the cache size occupied by the /opt/runtime/esch/ch/store directory, and the result is shown in the figure below.

[Figure: vmtouch statistics for /opt/runtime/esch/ch/store]

  Damn, it actually ate 42GB of my RAM! "Even the landlord's family has no surplus grain left," the boss cried.

  Exciting, but is this directory really the user of the 42GB of cache? How to prove it? Use "echo 1 > /proc/sys/vm/drop_caches" again and see whether free memory grows by roughly 42GB after the release.

  Memory distribution before execution:

[Figure: memory distribution before the release]

  Memory distribution after execution:

[Figure: memory distribution after the release]

  After the release, free went from 2GB to 45GB, growing by 43GB, and buffer/cache went from 46GB to 3GB, shrinking by 43GB: clickhouse's 42GB plus about 1GB of cache from other programs. So in our environment, clickhouse is indeed the big cache consumer! Whether the boss was boiling I don't know; I certainly was.

4. ClickHouse cache consumption

  Why does clickhouse consume so much buffer/cache? Driven by curiosity, a new round of investigation began. A lyric came to mind: one wave has yet to subside, another wave rolls in; a vast sea of people, a violent storm...

  1. Why clickhouse consumes cache

  Where to start? Recall the lsof output shown in Figure 9 above: clickhouse spends a lot of time with files open under /opt/runtime/esch/ch/store/032/03216cf6-357f-477f-bc9b-5eedb07a5d07, so there must be many cached files beneath it. Go straight into the directory and run vmtouch again. As expected, the result below shows the 032 directory alone eats 24GB of memory. What a heartache.

[Figure: vmtouch statistics for the 032 directory (24GB)]

  What mechanism of clickhouse consumes cache this wildly? Let's look at the types of files in the directory; some are intercepted below.

[Figure: contents of the 032 directory]

  The directory mainly contains many subdirectories whose names begin with a date, some purely numeric, some containing the word merge. Open one of the May 17 directories at random: 20230517_563264_565136_5.

[Figure: contents of 20230517_563264_565136_5]

  The 20230517_563264_565136_5 directory occupies 2GB of cache. Surprised? Moreover, every file in it is loaded into the cache in full; for example cuid.bin occupies 743MB on disk and 742MB in the cache.

[Figure: cache usage of files in the partition directory]

  After reading up on clickhouse, I learned that the numeric directories are clickhouse partition parts, and the clickhouse service automatically merges these parts in the background according to its policies. Think of the performance cost if every merge had to read the parts back from disk. So clickhouse keeps the result of the last merge in the cache, and the next time that part is merged with others, the data can be read directly from memory. That is why clickhouse consumes such a huge amount of cache. Naturally, its cache consumption is positively correlated with the amount of data stored in your environment.
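Incidentally, the parts and their sizes can also be listed from inside clickhouse through the system.parts system table; an illustrative query (my_table is a placeholder name):

```sql
SELECT partition, name, rows, formatReadableSize(bytes_on_disk) AS on_disk
FROM system.parts
WHERE database = 'default' AND table = 'my_table' AND active
ORDER BY bytes_on_disk DESC
LIMIT 20;
```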

  Now another question: do yesterday's partitions, and the data loaded for them, stay in the cache?

  I looked at yesterday's partition directories and found their files no longer occupy any cache. For partitions from previous days, clickhouse considers them fully merged, with no further merging needed, and naturally clears their cache occupation. The developers thought of that too.

[Figure: partitions from previous days no longer occupy cache]

  2. Adjusting clickhouse's cache consumption

  Still, the cache consumed by the current day's partitions and by the merge process left me no room to maneuver, especially since our clickhouse service is not deployed on a dedicated machine. Does clickhouse itself support changing this behavior? I kept digging with that question in mind, and couldn't stop.

  Later I found a related issue in the clickhouse community about saving cache (link: https://github.com/ClickHouse/ClickHouse/issues/1566). There is a setting, min_part_size_for_direct_merge, meaning that once min_part_size is exceeded, direct_io is enabled: clickhouse then reads and writes the source and destination files of a merge via direct_io rather than through the cache, reducing cache usage.

[Figure: ClickHouse issue #1566]

  Our clickhouse version is fairly new. Looking at the community records, clickhouse has since renamed min_part_size_for_direct_merge to min_merge_bytes_to_use_direct_io: "Minimal amount of bytes to enable O_DIRECT in merge (0 - disabled)". By default, merges over 10GB use direct_io.
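For reference, a MergeTree setting like this is normally overridden in the server configuration. A sketch, assuming a stock config.xml layout (the 1 GiB threshold is only an example; on older releases the root tag is &lt;yandex&gt; rather than &lt;clickhouse&gt;):

```xml
<clickhouse>
    <merge_tree>
        <!-- merges whose total size exceeds this many bytes use O_DIRECT -->
        <min_merge_bytes_to_use_direct_io>1073741824</min_merge_bytes_to_use_direct_io>
    </merge_tree>
</clickhouse>
```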

[Figure: documentation of min_merge_bytes_to_use_direct_io]

  So, could I set min_merge_bytes_to_use_direct_io small enough, even to 1 byte, and avoid using the cache altogether? The answer is no. The setting only makes the merge read and write table data with direct_io in place of the usual buffer_io; in other words, the cache is bypassed only during the data transfer itself, saving that link's cache consumption. After the merge completes and the data has been written to disk via direct_io, the merged part's data is still cached, to make the next merge with other parts fast. And because each merge combines old data with new, the cache used by the newly produced part is only ever larger than before the merge. The difference between direct_io and buffer_io is shown below.

[Figure: direct_io vs buffer_io]

  Note that setting min_merge_bytes_to_use_direct_io has a side effect: disk IO rises sharply whenever a merge occurs, because direct_io operates on the disk directly and hits it harder than buffer_io (which uses the page cache as a buffer layer). Of course, if the budget allows for disk RAID or SSDs, so the disks can absorb direct_io while still serving silky-smooth queries up front, that is another matter!


  I also tried the related setting max_bytes_to_merge_at_max_space_in_pool; it was not very useful, so I won't go into it here, and readers can verify it themselves. In short, clickhouse currently provides no mechanism for cleaning up the cache occupied by merged part files.

5. Cache cleanup

  Furious activity, and yet stuck in place. clickhouse itself cannot limit its cache consumption, and "echo 1 > /proc/sys/vm/drop_caches" is too crude: it also throws away what other processes have loaded into the cache. What if I want to clear only the large amount of cache occupied by clickhouse?

  Sometimes you have to believe that when the cart reaches the mountain there will be a road, and the boat will straighten itself at the bridge. Time to bring out vmtouch again!

  vmtouch's help lists a "-e" option, "evict pages from memory", which, as the name suggests, evicts pages from the page cache. Since vmtouch can count the cache occupied by a specified file or directory, it can naturally also clean the cache of a specified file or directory!

[Figure: vmtouch help, -e option]

  Let's see it in action. Before execution, here are the memory distribution and the cache usage of one directory:

[Figure: memory distribution and directory cache usage before vmtouch -e]

  After executing vmtouch -e ./*, the memory distribution is shown in Figure 28 below. Surprised? Exactly the 30GB of cache occupied by that directory was released, and free memory grew by 30GB: targeted cleanup of the cache consumed by a specified directory!

[Figure 28: memory distribution after vmtouch -e]

  Curious how vmtouch implements this targeted cleanup, I read the source; two screenshots roughly tell the story.

  1. Take the specified path;

  2. If the path contains no directories, release its cache via the vmtouch_file function; if there are other files (including subdirectories) under it, traverse them all and recurse via vmtouch_crawl, back to step 1.

[Figure: vmtouch source, the vmtouch_crawl recursion]

  So how does vmtouch_file release the cache? Through the posix_fadvise system call, using the POSIX_FADV_DONTNEED advice. The definition is as follows:

int posix_fadvise(int fd, off_t offset, off_t len, int advice);

advice:
POSIX_FADV_DONTNEED: the specified data will not be accessed in the near future; discard it from the page cache

  With POSIX_FADV_DONTNEED, the kernel first flushes dirty pages and then drops the relevant page-cache pages, achieving the cache release!

[Figure: kernel handling of POSIX_FADV_DONTNEED]

  However, further reading shows that posix_fadvise initiates dirty-page writeback with WB_SYNC_NONE, i.e. without waiting for it to finish (nor for writeback initiated by other processes). That can cause a problem: the cache under the specified path may not be fully released, since dirty pages cannot be dropped. So, before releasing, call fsync to write back all dirty pages, and then call posix_fadvise; that releases all of the cache under the path. If you don't care whether the cache backed by dirty pages is fully released, posix_fadvise alone is enough.
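The fsync-then-posix_fadvise sequence, together with the recursive walk vmtouch performs, can be sketched in a few lines of Python (evict_file and evict_dir are my own illustrative names, not vmtouch's functions):

```python
import os

def evict_file(path):
    """Write back dirty pages, then drop the file's pages from the page cache."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # write back dirty pages so DONTNEED can drop everything
        # offset=0, len=0 means "apply to the whole file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

def evict_dir(root):
    """Recursively evict every regular file under root, like `vmtouch -e`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):  # skip sockets, FIFOs, broken links
                evict_file(path)
```

Pointed at a directory such as clickhouse's store path, this releases only that subtree's cache and leaves what other processes have cached untouched.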

  Having written this far, some readers may ask: is cleaning up the cache like this really OK? I think yes, for two reasons:

  First, some services touch their cached data once and never again; that kind of cache consumption of course deserves to be killed.

  Second, there are applications like clickhouse in our environment that do reuse previously cached data. This case is genuinely hard to handle, but doing nothing means that eventually everyone stops running; you cannot have both the fish and the bear's paw. For this type of application, a workable compromise is to clean up the cache periodically so that everyone can coexist. Cache-dependent applications then pay the price of re-reading from disk ("sitting quietly at home when the pot drops from the sky").

6. Conclusion

  The exploration of the cache was long and winding, but in the end a solution acceptable to everyone was found. The article is fairly long, so to briefly summarize it in four points:

  1. Verified that hcache cannot account for overall cache consumption, and recommended lsof + fincore as a way to detect the directories and files a process actively uses, which becomes the key clue in a cache-consumption investigation. The method suits many scenarios, not just clickhouse.

  2. vmtouch can report the cache used by a file, a directory, or even files under nested directories, clarifying how cache consumption is distributed.

  3. Analyzed why clickhouse consumes so much cache, and whether clickhouse itself has any mechanism to reduce its cache usage.

  4. Provided a general method for targeted cleanup of the cache occupied by a specified file or directory.

  Technology grows through constant practice and accumulation; I share this here with everyone!

  If this article was of some help, a like from fellow technology enthusiasts would be much appreciated. Thank you!

Original link: https://www.cnblogs.com/t-bar/p/17359545.html