
Dissect the internals of CPU performance flame graph generation

Author: Java Architecture Diary

When optimizing CPU performance, we often need to find out which functions in our application consume the most CPU so that we can optimize where it matters. An excellent tool for this kind of analysis is the flame graph, invented by Brendan Gregg, the author of Systems Performance.


Today we will look at how to use the flame graph and how it works under the hood.

1. Using the flame graph

To better illustrate how the flame graph works, I wrote a small piece of test code:

int main() {
    for (int i = 0; i < 100; i++) {
        if (i < 10) {
            funcA();
        } else if (i < 16) {
            funcB();
        } else {
            funcC();
        }
    }
}
           

The complete source code is in the GitHub repo for my "develop internal strength cultivation" series: https://github.com/yanfeizhang/coder-kung-fu/blob/main/tests/cpu/test09/main.c.
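
For reference, here is a hypothetical sketch of what the helper functions might look like. This is my own guess, reconstructed from the call stacks shown later in the article (funcA → funcD → funcE → caculate, funcB → caculate, funcC → caculate); the real definitions are in the repo linked above.

// Hypothetical sketch only -- the real definitions are in the linked repo.
// caculate() does the CPU-intensive busy work; the other functions just
// forward to it, which is why they show up as thin "pass-through" frames.
void caculate() {
    volatile long sum = 0;               // volatile keeps gcc from optimizing the loop away
    for (long i = 0; i < 100000000L; i++)
        sum += i;
}
void funcE() { caculate(); }
void funcD() { funcE(); }
void funcA() { funcD(); }
void funcB() { caculate(); }
void funcC() { caculate(); }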

Next, let's use this code to actually walk through how a flame graph is generated. This section only covers how to use the tools; the following sections explain the underlying principles.

# gcc -o main main.c
# perf record -g ./main
           

At this point, a perf.data file is generated in the directory where you ran the command. Next, we need to download Brendan Gregg's FlameGraph project to turn it into a flame graph; we only need two Perl scripts from it.

# git clone https://github.com/brendangregg/FlameGraph.git
           

Next, we use perf script to parse the output file, pipe the result into FlameGraph/stackcollapse-perf.pl for further processing, and finally pipe that into FlameGraph/flamegraph.pl to generate a flame graph in SVG format. The whole thing can be done in one line:

# perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > out.svg
           

In this way, a flame graph is generated.

[Figure: flame graph generated from the demo program]

I chose a demo program because it is simple and clear enough for everyone to follow. In the flame graph above, you can see that main calls funcA, funcB, and funcC, and that funcA in turn calls funcD, which calls funcE. None of these functions spends the time in its own body; the cost comes from a CPU-intensive function called caculate. The time spent in every call stack of the program is laid out clearly.

Suppose you want to optimize this program. Although funcA, funcB, funcC, funcD, and funcE all appear to take a long time in the flame graph above, none of that time is spent in their own bodies; it is all spent executing their callees. What you should really focus on is the long, flat bar at the top of the flame graph: caculate. That is the code that actually burns the CPU time. The same applies to real projects: once you have the flame graph, start from the top, find the wide frames, and optimize those.

In a real project there may be far more functions than in this toy example, and many frame labels will be too narrow to display. That is not a problem: the SVG is interactive, so you can click on any frame to zoom in and see a detailed flame graph of just that function and its children.

See, the flame graph is quite easy to use. In the next sections we will look at the internal principles behind the whole generation process. With that understanding, you will be able to use flame graphs with much more confidence.

2. Perf sampling

2.1 Introduction to Perf

The first step in generating a flame graph is to sample the process or server you want to observe. There are several tools that can do this; here we use perf record.

# perf record -g ./main
           

In the command above, -g tells perf to record call-stack information with each sample, and ./main starts the main program so that only this process is sampled. This is just the simplest usage; perf record is actually very feature-rich.

You can specify which event to collect. The list of events supported by the current system can be viewed with perf list. By default, the cycles event under Hardware event is collected. If we want to sample cache-misses events instead, we can specify that with the -e parameter:

# perf record -e cache-misses  sleep 5 // specify the event to sample
           

You can also specify how sampling is triggered. Two modes are supported: frequency-based sampling and event-count (period) based sampling. The -F parameter sets how many samples to take per second; the -c parameter sets the period, i.e. take one sample every N occurrences of the event:

# perf record -F 100 sleep 5           // sample 100 times per second
# perf record -c 100 sleep 5           // take one sample every 100 events
           

You can also specify which CPU cores to record:

# perf record -C 0,1 sleep 5           // specify the CPU numbers to record
# perf record -C 0-2 sleep 5           // specify a range of CPUs to record
           

You can also record system-wide with -a, which captures kernel call stacks as well:

# perf record -a -g ./main
           

After perf record finishes, the sampled data is written to the perf.data file. In the experiment above, even though we only sampled for a few seconds, the resulting file was fairly large, more than 800 KB. We can parse and view its contents with the perf script command; the output runs to more than 50,000 lines. Each entry is the call-stack information captured when a cycles event was sampled:

......
59848 main 412201 389052.225443:     676233 cycles:u:
59849             55651b8b5132 caculate+0xd (/data00/home/zhangyanfei.allen/work_test/test07/main)
59850             55651b8b5194 funcC+0xe (/data00/home/zhangyanfei.allen/work_test/test07/main)
59851             55651b8b51d6 main+0x3f (/data00/home/zhangyanfei.allen/work_test/test07/main)
59852             7f8987d6709b __libc_start_main+0xeb (/usr/lib/x86_64-linux-gnu/libc-2.28.so)
59853         41fd89415541f689 [unknown] ([unknown])
......
           

In addition to perf script, you can use perf report to view an aggregated report:

# perf report -n --stdio
           
[Figure: perf report --stdio output]

2.2 Kernel Work Process

Let's take a brief look at how the kernel works.

Perf sampling is roughly a two-step process: first call perf_event_open to open an event file descriptor, then call read, mmap, and other system calls to fetch the data the kernel has sampled. The overall workflow looks roughly like this:

[Figure: overall workflow of perf sampling]

perf_event_open accomplishes several very important tasks (a userspace sketch of calling it follows this list):

  • Create various event kernel objects
  • Create various event file handles
  • Registers the sample-handling callback
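
Before diving into the kernel side, here is a minimal userspace sketch of calling perf_event_open directly. It is my own illustration, not code from the perf tool, and it only uses the simple counting mode (read one total at the end) rather than the sampling mode plus mmap ring buffer that perf record uses; it just shows the event attributes and the event file descriptor the points above refer to.

// Minimal sketch: count user-space CPU cycles for a busy loop via
// perf_event_open (counting mode only; error handling trimmed).
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

// glibc has no wrapper for this system call, so we go through syscall(2).
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;   // the default "cycles" event
    attr.disabled = 1;                        // enable explicitly below
    attr.exclude_kernel = 1;                  // user space only, like cycles:u

    // pid = 0, cpu = -1: measure the calling process on any CPU.
    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    for (volatile long i = 0; i < 100000000L; i++)   // the "workload"
        ;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = 0;
    read(fd, &count, sizeof(count));          // counting mode: read() returns the total
    printf("cycles: %lld\n", count);

    close(fd);
    return 0;
}

With exclude_kernel set, this usually runs without root. perf record itself instead sets a sample period or frequency and mmaps a ring buffer, which is the path the rest of this section walks through.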

Let's look at a few key steps in its execution. perf_event_alloc, called from perf_event_open, sets the sample-handling callback, e.g. perf_event_output_backward or perf_event_output_forward:

static struct perf_event *
perf_event_alloc(struct perf_event_attr *attr, ...)
{   
    ...
    if (overflow_handler) {
        event->overflow_handler = overflow_handler;
        event->overflow_handler_context = context;
    } else if (is_write_backward(event)){
        event->overflow_handler = perf_event_output_backward;
        event->overflow_handler_context = NULL;
    } else {
        event->overflow_handler = perf_event_output_forward;
        event->overflow_handler_context = NULL;
    }
    ...
}
           

Once perf_event_open has created the event kernel object and opened it, hardware events are ready to be handled. The kernel registers perf_event_nmi_handler as the corresponding hardware interrupt handler:

//file:arch/x86/events/core.c
register_nmi_handler(NMI_LOCAL, perf_event_nmi_handler, 0, "PMI");
           

With this in place, the CPU hardware raises an interrupt at the period specified in the perf_event_open call, and perf_event_nmi_handler is invoked so the kernel can do the sampling:

//file:arch/x86/events/core.c
static int perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
{    
    ret = x86_pmu.handle_irq(regs);
    ...
}
           

From the interrupt handler, the call chain goes through x86_pmu_handle_irq and eventually reaches perf_event_overflow, the key sampling function. It is invoked for both hardware and software event sampling, and it calls the overflow_handler registered during perf_event_open. Let's assume that overflow_handler is perf_event_output_forward:

void
perf_event_output_forward(struct perf_event *event, ...)
{
    __perf_event_output(event, data, regs, perf_output_begin_forward);
}
           

The actual sampling happens in __perf_event_output:

//file:kernel/events/core.c
static __always_inline int
__perf_event_output(struct perf_event *event, ...)
{
    ...
    // take the sample
    perf_prepare_sample(&header, data, event, regs);
    // save the sample into the ring buffer
    perf_output_sample(&handle, &header, data, event);
}
           

If PERF_SAMPLE_CALLCHAIN is enabled, not only the currently executing function (the instruction pointer) is collected, but the entire call chain is recorded as well:

//file:kernel/events/core.c
void perf_prepare_sample(...)
{

    //1. Sample the IP register: the function currently executing
    if (sample_type & PERF_SAMPLE_IP)
        data->ip = perf_instruction_pointer(regs);

    //2. Sample the current call chain
    if (sample_type & PERF_SAMPLE_CALLCHAIN) {
        int size = 1;

        if (!(sample_type & __PERF_SAMPLE_CALLCHAIN_EARLY))
            data->callchain = perf_callchain(event, regs);

        size += data->callchain->nr;

        header->size += size * sizeof(u64);
    }
    ...
}
           

In this way, the hardware and the kernel work together to sample the function call stacks. The perf tool can then read this data and process it further.
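
To make the userspace half of that picture concrete (the perf_event_open plus mmap path from the workflow above), here is a rough sketch of my own that requests PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN and then walks the ring buffer once at the end. It is deliberately simplified: a real reader such as perf record handles records that wrap around the ring edge, uses proper wakeups, and resolves the raw addresses to symbols.

// Rough sketch of a sampler: PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN samples
// are written by the kernel into an mmap'd ring buffer, which we walk once
// at the end. Simplified: no wrap-around handling, no symbol resolution.
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.sample_period = 1000000;                 // one sample per 1M cycles
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
    attr.disabled = 1;
    attr.exclude_kernel = 1;                      // user-space stacks only

    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    // Ring buffer layout: one metadata page followed by 2^n data pages.
    long psz = sysconf(_SC_PAGESIZE);
    size_t data_len = 8 * psz;
    struct perf_event_mmap_page *meta =
        mmap(NULL, psz + data_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (meta == MAP_FAILED) { perror("mmap"); return 1; }
    char *data = (char *)meta + psz;

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (volatile long i = 0; i < 200000000L; i++)   // the "workload"
        ;
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    // Walk whatever records are in the buffer right now.
    __u64 head = meta->data_head;
    __sync_synchronize();                 // pairs with the kernel's write barrier
    for (__u64 tail = meta->data_tail; tail < head; ) {
        struct perf_event_header *hdr =
            (struct perf_event_header *)(data + (tail % data_len));
        if (hdr->type == PERF_RECORD_SAMPLE) {
            // For our sample_type the body is: u64 ip; u64 nr; u64 ips[nr];
            __u64 *p = (__u64 *)(hdr + 1);
            printf("sample ip=%#llx callchain_depth=%llu\n",
                   (unsigned long long)p[0], (unsigned long long)p[1]);
        }
        tail += hdr->size;
    }

    munmap(meta, psz + data_len);
    close(fd);
    return 0;
}

Roughly speaking, each dumped ip / call chain pair is the same information you see in the perf script output above; perf.data is essentially these records written to disk, plus the metadata needed to resolve addresses to symbols later.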

3. How FlameGraph works

The function call-stack information we saw earlier with perf script is fairly long:

......
59848 main 412201 389052.225443:     676233 cycles:u:
59849             55651b8b5132 caculate+0xd (/data00/home/zhangyanfei.allen/work_test/test07/main)
59850             55651b8b5194 funcC+0xe (/data00/home/zhangyanfei.allen/work_test/test07/main)
59851             55651b8b51d6 main+0x3f (/data00/home/zhangyanfei.allen/work_test/test07/main)
59852             7f8987d6709b __libc_start_main+0xeb (/usr/lib/x86_64-linux-gnu/libc-2.28.so)
59853         41fd89415541f689 [unknown] ([unknown])
......
           

Before drawing the flame graph, this data needs to be preprocessed. The stackcollapse-perf.pl script counts how many times each call stack occurs and collapses every stack onto a single line: the first part of the line is the call stack, with frames separated by semicolons, and the number at the end is the count of sampled events attributed to that stack.

# perf script | ../FlameGraph/stackcollapse-perf.pl
main;[unknown];__libc_start_main;main;funcA;funcD;funcE;caculate 554118432
main;[unknown];__libc_start_main;main;funcB;caculate 338716787
main;[unknown];__libc_start_main;main;funcC;caculate 4735052652
main;[unknown];_dl_sysdep_start;dl_main;_dl_map_object_deps 9208
main;[unknown];_dl_sysdep_start;init_tls;[unknown] 29747
main;_dl_map_object;_dl_map_object_from_fd 9147
main;_dl_map_object;_dl_map_object_from_fd;[unknown] 3530
main;_start 273
main;version_check_doit 16041
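
The core idea of this preprocessing can be sketched in a few lines of C (a toy of my own; the real stackcollapse-perf.pl is a Perl script that parses the actual text format emitted by perf script): reverse each sampled stack into root-first order, join the frames with semicolons, and count identical lines.

#include <stdio.h>
#include <string.h>

// Toy illustration of the stackcollapse idea (not the Perl script itself):
// each sample is a stack listed leaf-first, the way perf script prints it.
// We reverse it into root-first "a;b;c" form and count identical lines.
#define MAX_STACKS 16
#define MAX_LINE   256

static const char *samples[][4] = {
    { "caculate", "funcC", "main", NULL },
    { "caculate", "funcC", "main", NULL },
    { "caculate", "funcB", "main", NULL },
};

int main(void)
{
    char folded[MAX_STACKS][MAX_LINE];
    int counts[MAX_STACKS], nfolded = 0;
    int nsamples = sizeof(samples) / sizeof(samples[0]);

    for (int s = 0; s < nsamples; s++) {
        // Build the root-first, semicolon-joined key for this sample.
        char key[MAX_LINE] = "";
        int depth = 0;
        while (samples[s][depth]) depth++;
        for (int d = depth - 1; d >= 0; d--) {
            strcat(key, samples[s][d]);
            if (d > 0) strcat(key, ";");
        }
        // Aggregate identical stacks by counting duplicates.
        int found = 0;
        for (int f = 0; f < nfolded; f++) {
            if (strcmp(folded[f], key) == 0) { counts[f]++; found = 1; break; }
        }
        if (!found) { strcpy(folded[nfolded], key); counts[nfolded++] = 1; }
    }
    for (int f = 0; f < nfolded; f++)
        printf("%s %d\n", folded[f], counts[f]);
    return 0;
}

For the three hard-coded samples it prints "main;funcC;caculate 2" and "main;funcB;caculate 1".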
           

The perf script output above was more than 50,000 lines; after stackcollapse-perf.pl preprocessing it shrinks to fewer than 10 lines, a huge simplification. In the FlameGraph project directory you can see many scripts whose names begin with stackcollapse:

[Figure: stackcollapse-* scripts in the FlameGraph project directory]

This is because different languages and profiling tools produce output in different formats, so each format naturally needs its own preprocessing script.

With the stackcollapse output in hand, the flame graph itself can be drawn. The flamegraph.pl script draws each line as a column of frames, and the bigger the count, the wider the column; frames with the same name at the same level (under the same parent) are merged into one. For example, suppose we have the following data file:

funcA;funcB;funcC 2
funcA 1
funcD 1
           

I can draw the corresponding flame graph by hand, as follows:

[Figure: hand-drawn flame graph for the example data]

Here funcA gets a width of 3 because the first two records are merged. funcD is not merged and gets a width of 1. funcB and funcC are drawn above funcA, with a width of 2.
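
To show the merge rule in code form, here is a toy sketch of my own (the real flamegraph.pl is a Perl script that also handles SVG output, colors, and interactivity). It hard-codes the three folded lines above and prints each merged frame together with its computed width.

#include <stdio.h>
#include <string.h>

// Toy input in "folded" format: semicolon-separated stack, then a count.
struct folded { const char *stack; int count; };
static struct folded input[] = {
    { "funcA;funcB;funcC", 2 },
    { "funcA",             1 },
    { "funcD",             1 },
};
#define N (sizeof(input) / sizeof(input[0]))

// Copy the frame name at a given depth of a folded stack into buf,
// or return NULL if the stack is not that deep.
static const char *frame_at(const char *stack, int depth, char *buf)
{
    while (depth-- > 0) {
        stack = strchr(stack, ';');
        if (!stack) return NULL;
        stack++;
    }
    const char *end = strchr(stack, ';');
    size_t len = end ? (size_t)(end - stack) : strlen(stack);
    memcpy(buf, stack, len);
    buf[len] = '\0';
    return buf;
}

// Merge frames that share the same name at the same depth (same ancestors,
// guaranteed by recursing on an index subset) and print each merged frame
// with its total width; then recurse to draw the frames above it.
static void render(const int *idx, int n, int depth)
{
    int done[N] = { 0 };
    for (int i = 0; i < n; i++) {
        char name[128];
        if (done[i] || !frame_at(input[idx[i]].stack, depth, name))
            continue;
        int child_idx[N], children = 0, width = 0;
        char other[128];
        for (int j = i; j < n; j++) {
            if (done[j]) continue;
            if (frame_at(input[idx[j]].stack, depth, other) &&
                strcmp(name, other) == 0) {
                done[j] = 1;
                width += input[idx[j]].count;
                child_idx[children++] = idx[j];
            }
        }
        printf("%*s%s (width %d)\n", depth * 2, "", name, width);
        render(child_idx, children, depth + 1);
    }
}

int main(void)
{
    int idx[N];
    for (int i = 0; i < (int)N; i++) idx[i] = i;
    render(idx, (int)N, 0);
    return 0;
}

It prints funcA with width 3, funcB and funcC above it with width 2, and funcD with width 1, matching the hand-drawn figure.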

Summary

The flame graph is an excellent tool for analyzing hot functions; anyone who cares about performance optimization should learn to use it on their own programs. Today's article described not only how the flame graph is generated but also how it works under the hood. Generating a flame graph consists of two main steps: sampling and rendering.

The sampling step relies mainly on the perf_event_open system call provided by the kernel. Its internal handling is quite involved; in the end, the kernel and the hardware cooperate to periodically record the currently executing function together with its complete call chain.

The rendering step uses the scripts provided by Brendan Gregg to preprocess the perf.data file produced by the perf tool, and then renders the preprocessed data into an SVG image. The more samples a function accounts for, the wider it appears in the SVG, so we can see at a glance which functions consume the most CPU.

Finally, keep in mind that the flame graph is rendered from samples, so it does not capture everything that actually happened, but for performance analysis it is more than sufficient.

The original article is from the WeChat public account "develop internal strength cultivation".

Original link: https://mp.weixin.qq.com/s/A19RlLhSgbzw8UU4p1TZNA