

Repost; thanks to the original author.

Linux perf 1.1: the perf_event kernel framework

Copyright notice: this is an original article by the blogger, licensed under CC 4.0 BY-SA; please include the original link and this notice when reposting. Original link: https://blog.csdn.net/pwl999/article/details/81200439

Why perf when ftrace already exists? Because ftrace only captures trace data and does not analyze it, whereas perf has done a lot of work on analyzing trace data.

For data collection, perf reuses all of ftrace's instrumentation points and adds sampling (the hardware PMU). The PMU is a very important collection method: because it is mostly implemented in hardware, it can do things software cannot and expose information about the underlying hardware.

perf's basic wrapping model is this: each event gets its own perf_event structure, and every operation on an event revolves around that perf_event:

  • After perf_event_open allocates a perf_event, the syscall returns a file descriptor fd, so the perf_event structure can be driven through the generic file interfaces read/write/ioctl/mmap.
  • A perf_event provides two kinds of trace data: count and sample. A count only records how many times the event occurred; a sample records much more information (e.g. IP, ADDR, TID, TIME, CPU, backtrace). To use sampling, a ring buffer must be allocated for the perf_event and mmap'ed into user space. This matches the coarse-to-fine way problems are debugged: first use counter ratios to find which module the hotspot is in, then capture detailed records to narrow it down further. The two modes correspond to the "perf stat" and "perf record/report" commands respectively.
  • perf's overhead should be larger than ftrace's: every event gets its own perf_event structure with its own attr and pmu, so the cost of recording data is certainly higher than ftrace's. On the other hand, each event has its own ring buffer, so ftrace's complicated ring-buffer manipulation is not needed, and perf's sample ring buffer is mmap'ed into user space, which saves one copy when reading the data. perf was never designed to run hundreds of events at the same time anyway; you pick a few events to debug in depth.

0. The organization of perf_event

From the description above, perf is just a collection of perf_event structures, which does not look complicated. So where is the complexity? The real difficulty is in organizing the events: how to split the globally available event resources into a cpu dimension and a task dimension according to what the user asks for.

When analyzing a problem we do not dive straight into raw per-event data (without further qualification an event records system-wide data); we follow a system -> cpu -> process path. To match that, perf organizes perf_events along a cpu dimension and a task dimension.

Let's look at how perf_event is organized in detail:

  • 1. cpu dimension:

A perf_event_context-typed list links the perf_events of this cpu. There are two such lists (perf_hw_context = 0, perf_sw_context = 1); the list head lives in the per_cpu variable pmu->pmu_cpu_context.ctx and is shared by all pmus of the same type.

  • 2. task dimension:
  • A perf_event_context-typed list links the perf_events of this task. Again there are two such lists (perf_hw_context = 0, perf_sw_context = 1); the list head lives in the task->perf_event_ctxp[ctxn] variable.
  • 3. The perf_event_open() system call uses the cpu and pid arguments to choose the perf_event's cpu/task dimension.
  • pid == 0: bind the event to the current process;

    pid > 0: bind the event to the specified process;

    pid == -1: bind the event to all processes on the given cpu.

    cpu >= 0: bind the event to the specified cpu;

    cpu == -1: bind the event to all cpus;

    When both are specified, the task dimension takes precedence over the cpu dimension, so pid and cpu combine into the following cases:

    Combination 1: pid >= 0, cpu >= 0. The perf_event is bound to the task-dimension context. When the task is scheduled onto a cpu, the perf_events hanging on its context start running as well; but if the cpu specified for the event is not the cpu the task is currently running on, the event is not scheduled, which is exactly what this combination means;

    Combination 2: pid >= 0, cpu == -1. The perf_event is bound to the task-dimension context and runs whenever the task is scheduled;

    Combination 3: pid == -1, cpu >= 0. The perf_event is bound to the cpu-dimension context and runs as long as that cpu runs. At present a perf_event can only be bound while the cpu is online, so cpu hotplug support may be problematic;

    Combination 4: pid == -1, cpu == -1. This combination is currently invalid; it would mean every cpu and every process in the system.
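    A user-space sketch of the four combinations (my own illustration, not from the original article; perf_event_open() has no glibc wrapper, so a raw syscall() is used, and attr is assumed to be an already-filled perf_event_attr):

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static long open_event(struct perf_event_attr *attr, pid_t pid, int cpu)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, /*group_fd*/ -1, 0);
    }

    void combos(struct perf_event_attr *attr)
    {
        long fd1 = open_event(attr, 0, 2);    /* combo 1: this task, only while it runs on cpu 2 */
        long fd2 = open_event(attr, 0, -1);   /* combo 2: this task on whichever cpu it runs */
        long fd3 = open_event(attr, -1, 2);   /* combo 3: every task, but only on cpu 2 (needs privilege) */
        long fd4 = open_event(attr, -1, -1);  /* combo 4: rejected, returns -1 with errno = EINVAL */
        (void)fd1; (void)fd2; (void)fd3; (void)fd4;
    }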

  • 4. group leader:
  • The cpu/task context links all of its perf_events on the ->event_list list. perf_events may additionally be organized into groups: the context links the group-leader perf_events on its ->pinned_groups/->flexible_groups lists, and a group leader links all of its member perf_events on its ->sibling_list (see the diagram in the original article for the layout).

    The point of a group is that a single read of the count returns the counts of every perf_event in the group.

    perf_event_open() uses the group_fd argument to choose the perf_event's group leader: >= 0 makes the perf_event behind that fd the group leader of this event, == -1 creates a new group leader.

    pinned: a group leader can sit on one of two lists (->pinned_groups/->flexible_groups); attr.pinned = 1 puts it on the higher-priority ->pinned_groups list.

    (See the later analysis of perf_install_in_context().)
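    A user-space sketch of a two-event group whose counts come back from a single read() on the leader (my own illustration; the helper open_group_event() is hypothetical, and read_format is assumed to be PERF_FORMAT_GROUP | PERF_FORMAT_ID):

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static int open_group_event(uint64_t config, int group_fd)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID;
        /* pid = 0, cpu = -1: this task on any cpu */
        return syscall(__NR_perf_event_open, &attr, 0, -1, group_fd, 0);
    }

    void group_read_demo(void)
    {
        /* group_fd == -1: the first event becomes a new group leader */
        int leader = open_group_event(PERF_COUNT_HW_CPU_CYCLES, -1);
        /* group_fd == leader: the second event joins the leader's group */
        int member = open_group_event(PERF_COUNT_HW_INSTRUCTIONS, leader);

        struct {
            uint64_t nr;                          /* number of events in the group */
            struct { uint64_t value, id; } cnt[2];
        } data;
        /* one read() on the leader returns the counts of both events */
        read(leader, &data, sizeof(data));
        (void)member;
    }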

  • 5. perf_task_sched.
  • A cpu-dimension perf_event runs for as long as the cpu is online, while a task-dimension perf_event only runs while its task is scheduled in. So on each cpu only one task-dimension context can be running at a time, and the cpu-dimension context uses the pmu->pmu_cpu_context->task_ctx pointer to record the task context currently running.

    This is the essence of the perf driver layer: switching the right perf_events on and off at the right moments. (See the later analysis of perf_event_task_sched_in() and perf_event_task_sched_out().)

    How context/perf_event get switched on and off on a single cpu as multiple tasks are scheduled: (diagram in the original article)

    How context/perf_event get switched on and off for a single task scheduled across multiple cpus: (diagram in the original article)
  • 6. inherit:
  • The inherit attribute specifies that when the task the perf_event is bound to creates child processes, the event automatically traces the children too. This is extremely useful in practice: we trace an app, and every child process/thread it spawns is traced automatically.

    Every event with attr.inherit = 1 in the parent is inherited and copied by the child process; when the event's count is read with PERF_FORMAT_GROUP, the values of all inherited children are accumulated in. (See the later analysis of perf_event_init_task() and perf_read_group().)
  • 7. exclusive:
  • If a pmu has the PERF_PMU_CAP_EXCLUSIVE capability, its events can be bound either only in the cpu dimension or only in the task dimension, never a mix. (See the later analysis of exclusive_event_init().)
  • 8. Feeding data to the pmu:
  • Each pmu owns a per_cpu list; a perf_event joins the list of whichever cpu it wants data from. When the event fires, the pmu pushes data to all perf_events on the list of the cpu it is currently running on.

    cpu-dimension context: all perf_events linked on this_cpu_ptr(pmu->pmu_cpu_context->ctx) are, according to the pmu they are bound to, linked onto that pmu's per_cpu ->perf_events list.

    task-dimension context: all perf_events linked on this_cpu_ptr(pmu->pmu_cpu_context->task_ctx) are, according to the pmu they are bound to, linked onto that pmu's per_cpu ->perf_events list. The perf_event must also match the cpu: only events satisfying (event->cpu == -1 || event->cpu == smp_processor_id()) are linked onto the pmu.
  • 9. enable_on_exec:
  • The typical values of a perf_event's state (event->state) are these three:

    disable: PERF_EVENT_STATE_OFF = -1, // initial value of event->state when attr.disabled = 1

    enable/inactive: PERF_EVENT_STATE_INACTIVE = 0, // initial value of event->state when attr.disabled = 0

    active: PERF_EVENT_STATE_ACTIVE = 1,

    The attr.disabled attribute decides the perf_event's initial state (disabled/enabled). Only an enabled perf_event takes part in scheduling; while it is scheduled in it is active, and when it is switched off it falls back to the enabled/inactive state.

    There are three ways a perf_event becomes enabled:

    1. attr.disabled = 0;

    2. attr.disabled = 1, then enable it after creation with the PERF_EVENT_IOC_ENABLE ioctl command;

    3. attr.disabled = 1 and attr.enable_on_exec = 1, so the event is enabled when execl runs the new program; this is a very clever synchronization trick;
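    A small user-space sketch of these three paths (my own illustration; fd is assumed to come from perf_event_open()):

    #include <sys/ioctl.h>
    #include <linux/perf_event.h>

    /* path 1: attr.disabled = 0 before perf_event_open() -> event starts INACTIVE (enabled) */

    /* path 2: attr.disabled = 1, then enable explicitly on the returned fd */
    static void enable_now(int fd)
    {
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);    /* OFF -> INACTIVE */
        /* ... measured section ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);   /* back to OFF */
    }

    /* path 3: attr.disabled = 1 and attr.enable_on_exec = 1 before perf_event_open();
     * the event flips to enabled when the task exec()s the program to be measured */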

  • 10. ringbuffer:
  • To read a perf_event's sample-type data, a ring buffer must first be allocated for the perf_event; to reduce overhead the ring buffer is mmap'ed to a user-space address.

    The ring buffer space is split into three parts (see the diagram in the original article):

    head: size = 1 page. Mainly holds control data, pointers and so on

    data: size = 2^n pages. Mainly holds the perf_event's sample data

    aux data: size = 2^n pages. Its purpose is not yet clear to me

    If the perf_event has the inherit attribute, the sample data of all perf_events inherited by its child processes is stored in the parent perf_event's ring buffer; an inherited perf_event does not get a new ring buffer, it shares the parent event's.
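    A user-space sketch of mapping such a ring buffer (my own illustration; fd is assumed to be a sampling perf_event fd, and the number of data pages must be a power of two):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <linux/perf_event.h>

    void *map_ring_buffer(int fd, int data_pages /* 2^n */)
    {
        long page = sysconf(_SC_PAGESIZE);
        size_t len = (size_t)(1 + data_pages) * page;   /* 1 head page + 2^n data pages */

        void *rb = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (rb == MAP_FAILED)
            return NULL;

        /* the first page is the control page */
        struct perf_event_mmap_page *meta = rb;
        /* the kernel advances data_head as it writes samples; user space
         * consumes [data_tail, data_head) and then stores the new data_tail */
        printf("data_head = %llu\n", (unsigned long long)meta->data_head);
        return rb;
    }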
  • 11. sample_period/sample_freq:

1. perf_event initialization

    During perf_event initialization, the various pmus are registered onto the pmus list.

    start_kernel() -> perf_event_init():

    void __init perf_event_init(void)
    {
        int ret;
    
        /* (1) initialize the idr used to allocate types for dynamic-type pmus */
        idr_init(&pmu_idr);
    
        /* (2) initialize the per_cpu variable swevent_htable */
        perf_event_init_all_cpus();
        init_srcu_struct(&pmus_srcu);
    
        /* (3) register the "software" pmu */
        perf_pmu_register(&perf_swevent, "software", PERF_TYPE_SOFTWARE);
        perf_pmu_register(&perf_cpu_clock, NULL, -1);
        perf_pmu_register(&perf_task_clock, NULL, -1);
    
        /* (4) register the "tracepoint" pmu */
        perf_tp_register();
        perf_cpu_notifier(perf_cpu_notify);
        idle_notifier_register(&perf_event_idle_nb);
        register_reboot_notifier(&perf_reboot_notifier);
    
        /* (5) register the "breakpoint" pmu */
        ret = init_hw_breakpoint();
        WARN(ret, "hw_breakpoint initialization failed with: %d", ret);
    
        /* do not patch jump label more than once per second */
        jump_label_rate_limit(&perf_sched_events, HZ);
    
        /*
         * Build time assertion that we keep the data_head at the intended
         * location.  IOW, validation we got the __reserved[] size right.
         */
        BUILD_BUG_ON((offsetof(struct perf_event_mmap_page, data_head))
                 != 1024);
    }
    
    ↓
    
    int perf_pmu_register(struct pmu *pmu, const char *name, int type)
    {
        int cpu, ret;
    
        mutex_lock(&pmus_lock);
        ret = -ENOMEM;
        pmu->pmu_disable_count = alloc_percpu(int);
        if (!pmu->pmu_disable_count)
            goto unlock;
    
        /* (3.1) if name == NULL, then pmu->name = NULL and pmu->type = -1 */
        pmu->type = -1;
        if (!name)
            goto skip_type;
    
        /* (3.1.1) pmu->name = name */
        pmu->name = name;
    
        /* (3.1.2) if type < 0, allocate a dynamic value for pmu->type from the idr */
        if (type < 0) {
            type = idr_alloc(&pmu_idr, pmu, PERF_TYPE_MAX, 0, GFP_KERNEL);
            if (type < 0) {
                ret = type;
                goto free_pdc;
            }
        }
        pmu->type = type;
    
        if (pmu_bus_running) {
            ret = pmu_dev_alloc(pmu);
            if (ret)
                goto free_idr;
        }
    
        /* (3.2) initialize the cpu-dimension perf_cpu_context;
           perf_cpu_context's job is to link together the perf_events of one dimension
         */
    skip_type:
        /* (3.2.1) if a pmu with the same task_ctx_nr has already created a perf_cpu_context,
            reference it directly
         */
        pmu->pmu_cpu_context = find_pmu_context(pmu->task_ctx_nr);
        if (pmu->pmu_cpu_context)
            goto got_cpu_context;
    
        ret = -ENOMEM;
        /* (3.2.2) if nobody has created a perf_cpu_context for this pmu's task_ctx_nr yet,
            create a new one
         */
        pmu->pmu_cpu_context = alloc_percpu(struct perf_cpu_context);
        if (!pmu->pmu_cpu_context)
            goto free_dev;
    
        /* (3.2.3) initialize the per_cpu perf_cpu_context structure */
        for_each_possible_cpu(cpu) {
            struct perf_cpu_context *cpuctx;
    
            cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
            __perf_event_init_context(&cpuctx->ctx);
            lockdep_set_class(&cpuctx->ctx.mutex, &cpuctx_mutex);
            lockdep_set_class(&cpuctx->ctx.lock, &cpuctx_lock);
            cpuctx->ctx.pmu = pmu;
    
            __perf_mux_hrtimer_init(cpuctx, cpu);
    
            cpuctx->unique_pmu = pmu;
        }
    
    got_cpu_context:
        /* (3.3) give the pmu some default operation callbacks */
        if (!pmu->start_txn) {
            if (pmu->pmu_enable) {
                /*
                 * If we have pmu_enable/pmu_disable calls, install
                 * transaction stubs that use that to try and batch
                 * hardware accesses.
                 */
                pmu->start_txn  = perf_pmu_start_txn;
                pmu->commit_txn = perf_pmu_commit_txn;
                pmu->cancel_txn = perf_pmu_cancel_txn;
            } else {
                pmu->start_txn  = perf_pmu_nop_txn;
                pmu->commit_txn = perf_pmu_nop_int;
                pmu->cancel_txn = perf_pmu_nop_void;
            }
        }
    
        if (!pmu->pmu_enable) {
            pmu->pmu_enable  = perf_pmu_nop_void;
            pmu->pmu_disable = perf_pmu_nop_void;
        }
    
        if (!pmu->event_idx)
            pmu->event_idx = perf_event_idx_default;
    
        /* (3.4) the most important step: add the new pmu to the pmus list */
        list_add_rcu(&pmu->entry, &pmus);
        atomic_set(&pmu->exclusive_cnt, 0);
        ret = 0;
    unlock:
        mutex_unlock(&pmus_lock);
    
        return ret;
    
    free_dev:
        device_del(pmu->dev);
        put_device(pmu->dev);
    
    free_idr:
        if (pmu->type >= PERF_TYPE_MAX)
            idr_remove(&pmu_idr, pmu->type);
    
    free_pdc:
        free_percpu(pmu->pmu_disable_count);
        goto unlock;
    }
               
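    To make the registration flow concrete, here is a minimal, hypothetical sketch (my own, not from the original article) of how a driver module could register its own pmu with do-nothing callbacks; the callback names are invented, while struct pmu and perf_pmu_register() are the kernel API analyzed above:

    #include <linux/module.h>
    #include <linux/perf_event.h>

    /* hypothetical no-op callbacks, just enough to fill struct pmu */
    static int dummy_event_init(struct perf_event *event)
    {
        /* a dynamic-type pmu rejects events opened with a different attr.type */
        if (event->attr.type != event->pmu->type)
            return -ENOENT;
        return 0;
    }
    static int  dummy_add(struct perf_event *event, int flags)   { return 0; }
    static void dummy_del(struct perf_event *event, int flags)   { }
    static void dummy_start(struct perf_event *event, int flags) { }
    static void dummy_stop(struct perf_event *event, int flags)  { }
    static void dummy_read(struct perf_event *event)             { }

    static struct pmu dummy_pmu = {
        .task_ctx_nr = perf_sw_context,   /* share the software perf_cpu_context */
        .event_init  = dummy_event_init,
        .add         = dummy_add,
        .del         = dummy_del,
        .start       = dummy_start,
        .stop        = dummy_stop,
        .read        = dummy_read,
    };

    static int __init dummy_pmu_init(void)
    {
        /* name != NULL and type == -1: perf_pmu_register() allocates a dynamic
         * type id from pmu_idr, and the pmu shows up under
         * /sys/bus/event_source/devices/dummy_pmu */
        return perf_pmu_register(&dummy_pmu, "dummy_pmu", -1);
    }
    module_init(dummy_pmu_init);
    MODULE_LICENSE("GPL");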
    A second function, perf_event_sysfs_init(), runs a bit later as a device_initcall and creates a device for every pmu whose pmu->name != NULL:
    static int __init perf_event_sysfs_init(void)
    {
        struct pmu *pmu;
        int ret;
    
        mutex_lock(&pmus_lock);
    
        /* (1) register pmu_bus */
        ret = bus_register(&pmu_bus);
        if (ret)
            goto unlock;
    
        /* (2) walk the pmus list and create a device for each pmu */
        list_for_each_entry(pmu, &pmus, entry) {
            /* skip pmus with pmu->name == NULL or pmu->type < 0 */
            if (!pmu->name || pmu->type < 0)
                continue;
    
            ret = pmu_dev_alloc(pmu);
            WARN(ret, "Failed to register pmu: %s, reason %d\n", pmu->name, ret);
        }
    
        /* (3) set pmu_bus_running */
        pmu_bus_running = 1;
        ret = 0;
    
    unlock:
        mutex_unlock(&pmus_lock);
    
        return ret;
    }
    device_initcall(perf_event_sysfs_init);           
    The corresponding devices can be seen under /sys:
    # ls /sys/bus/event_source/devices/
    armv8_pmuv3 software tracepoint           

2. The perf_event_open system call

    perf_event_open creates the perf_event structure for the event, binds the perf_event to the corresponding pmu and perf_cpu_context (cpu dimension / task dimension) according to the perf_event_attr argument, and finally ties the perf_event to perf_fops and an fd, returning the fd to be used for file operations.

    To understand the perf_event_open system call, first understand its 5 parameters:

    perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu, int group_fd, unsigned long flags)
    • Parameter 1: struct perf_event_attr *attr. This is the most complex and also the most important parameter:
    struct perf_event_attr {
    
        /*
         * Major type: hardware/software/tracepoint/etc.
         */
        /* (1) selects the pmu type:
            enum perf_type_id {
                PERF_TYPE_HARDWARE          = 0,
                PERF_TYPE_SOFTWARE          = 1,
                PERF_TYPE_TRACEPOINT        = 2,
                PERF_TYPE_HW_CACHE          = 3,
                PERF_TYPE_RAW               = 4,
                PERF_TYPE_BREAKPOINT        = 5,

                PERF_TYPE_MAX,
            };
         */
        __u32           type;
    
        /*
         * Size of the attr structure, for fwd/bwd compat.
         */
        /* (2) size of the whole perf_event_attr structure */
        __u32           size;
    
        /*
         * Type specific configuration information.
         */
        /* (3) the meaning of config depends on the pmu type:
            1. type = PERF_TYPE_HARDWARE:
                enum perf_hw_id {
                    PERF_COUNT_HW_CPU_CYCLES        = 0,
                    PERF_COUNT_HW_INSTRUCTIONS      = 1,
                    PERF_COUNT_HW_CACHE_REFERENCES      = 2,
                    PERF_COUNT_HW_CACHE_MISSES      = 3,
                    PERF_COUNT_HW_BRANCH_INSTRUCTIONS   = 4,
                    PERF_COUNT_HW_BRANCH_MISSES     = 5,
                    PERF_COUNT_HW_BUS_CYCLES        = 6,
                    PERF_COUNT_HW_STALLED_CYCLES_FRONTEND   = 7,
                    PERF_COUNT_HW_STALLED_CYCLES_BACKEND    = 8,
                    PERF_COUNT_HW_REF_CPU_CYCLES        = 9,

                    PERF_COUNT_HW_MAX,
                };
            2. type = PERF_TYPE_SOFTWARE:
                enum perf_sw_ids {
                    PERF_COUNT_SW_CPU_CLOCK         = 0,
                    PERF_COUNT_SW_TASK_CLOCK        = 1,
                    PERF_COUNT_SW_PAGE_FAULTS       = 2,
                    PERF_COUNT_SW_CONTEXT_SWITCHES      = 3,
                    PERF_COUNT_SW_CPU_MIGRATIONS        = 4,
                    PERF_COUNT_SW_PAGE_FAULTS_MIN       = 5,
                    PERF_COUNT_SW_PAGE_FAULTS_MAJ       = 6,
                    PERF_COUNT_SW_ALIGNMENT_FAULTS      = 7,
                    PERF_COUNT_SW_EMULATION_FAULTS      = 8,
                    PERF_COUNT_SW_DUMMY         = 9,
                    PERF_COUNT_SW_BPF_OUTPUT        = 10,

                    PERF_COUNT_SW_MAX,
                };
            3. type = PERF_TYPE_TRACEPOINT:
                config is the id of the trace_event behind the tracepoint: "/sys/kernel/debug/tracing/events/x/x/id"
            4. type = PERF_TYPE_HW_CACHE:
                enum perf_hw_cache_id {
                    PERF_COUNT_HW_CACHE_L1D         = 0,
                    PERF_COUNT_HW_CACHE_L1I         = 1,
                    PERF_COUNT_HW_CACHE_LL          = 2,
                    PERF_COUNT_HW_CACHE_DTLB        = 3,
                    PERF_COUNT_HW_CACHE_ITLB        = 4,
                    PERF_COUNT_HW_CACHE_BPU         = 5,
                    PERF_COUNT_HW_CACHE_NODE        = 6,

                    PERF_COUNT_HW_CACHE_MAX,
                };
         */
        __u64           config;
    
        /* (4) the actual value for period/freq sample mode */
        union {
            __u64       sample_period;
            __u64       sample_freq;
        };
    
        /* (5) which data to record when taking a sample:
            enum perf_event_sample_format {
                PERF_SAMPLE_IP              = 1U << 0,
                PERF_SAMPLE_TID             = 1U << 1,
                PERF_SAMPLE_TIME            = 1U << 2,
                PERF_SAMPLE_ADDR            = 1U << 3,
                PERF_SAMPLE_READ            = 1U << 4,
                PERF_SAMPLE_CALLCHAIN           = 1U << 5,
                PERF_SAMPLE_ID              = 1U << 6,
                PERF_SAMPLE_CPU             = 1U << 7,
                PERF_SAMPLE_PERIOD          = 1U << 8,
                PERF_SAMPLE_STREAM_ID           = 1U << 9,
                PERF_SAMPLE_RAW             = 1U << 10,
                PERF_SAMPLE_BRANCH_STACK        = 1U << 11,
                PERF_SAMPLE_REGS_USER           = 1U << 12,
                PERF_SAMPLE_STACK_USER          = 1U << 13,
                PERF_SAMPLE_WEIGHT          = 1U << 14,
                PERF_SAMPLE_DATA_SRC            = 1U << 15,
                PERF_SAMPLE_IDENTIFIER          = 1U << 16,
                PERF_SAMPLE_TRANSACTION         = 1U << 17,
                PERF_SAMPLE_REGS_INTR           = 1U << 18,

                PERF_SAMPLE_MAX = 1U << 19,
            };
         */
        __u64           sample_type;
    
        /* (6) the format used when reading counter data:
             *
             * The format of the data returned by read() on a perf event fd,
             * as specified by attr.read_format:
             *
             * struct read_format {
             *  { u64       value;
             *    { u64     time_enabled; } && PERF_FORMAT_TOTAL_TIME_ENABLED
             *    { u64     time_running; } && PERF_FORMAT_TOTAL_TIME_RUNNING
             *    { u64     id;           } && PERF_FORMAT_ID
             *  } && !PERF_FORMAT_GROUP
             *
             *  { u64       nr;
             *    { u64     time_enabled; } && PERF_FORMAT_TOTAL_TIME_ENABLED
             *    { u64     time_running; } && PERF_FORMAT_TOTAL_TIME_RUNNING
             *    { u64     value;
             *      { u64   id;           } && PERF_FORMAT_ID
             *    }     cntr[nr];
             *  } && PERF_FORMAT_GROUP
             * };
             *
            enum perf_event_read_format {
                PERF_FORMAT_TOTAL_TIME_ENABLED      = 1U << 0,
                PERF_FORMAT_TOTAL_TIME_RUNNING      = 1U << 1,
                PERF_FORMAT_ID              = 1U << 2,
                PERF_FORMAT_GROUP           = 1U << 3,

                PERF_FORMAT_MAX = 1U << 4,
            };
         */
        __u64           read_format;
    
        /* (7) bit flags */
                    /* (7.1) sets the event's initial state to disabled/enabled.
                        If it starts out disabled, it can be enabled later via ioctl/prctl.
                     */
        __u64           disabled       :  1, /* off by default        */

                    /* (7.2) if set, children and further descendants of the event's task are counted too */
                    inherit        :  1, /* children inherit it   */

                    /* (7.3) if set, the event must always stay on the PMU (hardware counters, group leaders only) */
                    pinned         :  1, /* must always be on PMU */

                    /* (7.4) if set, this group must be the only group using the CPU's counters while it is on the CPU */
                    exclusive      :  1, /* only group on PMU     */

                    /* (7.5) exclude_user/exclude_kernel/exclude_hv/exclude_idle mark contexts
                        whose data should not be recorded
                     */
                    exclude_user   :  1, /* don't count user      */
                    exclude_kernel :  1, /* ditto kernel          */
                    exclude_hv     :  1, /* ditto hypervisor      */
                    exclude_idle   :  1, /* don't count when idle */

                    /* (7.6) allow recording of PROT_EXEC mmap operations */
                    mmap           :  1, /* include mmap data     */

                    /* (7.7) allow recording of comm data when processes are created */
                    comm           :  1, /* include comm data     */

                    /* (7.8) selects the sample mode: freq/period */
                    freq           :  1, /* use freq, not period  */


                    inherit_stat   :  1, /* per task counts       */
                    enable_on_exec :  1, /* next exec enables     */
                    task           :  1, /* trace fork/exit       */
                    watermark      :  1, /* wakeup_watermark      */
                    /*
                     * precise_ip:
                     *
                     *  0 - SAMPLE_IP can have arbitrary skid
                     *  1 - SAMPLE_IP must have constant skid
                     *  2 - SAMPLE_IP requested to have 0 skid
                     *  3 - SAMPLE_IP must have 0 skid
                     *
                     *  See also PERF_RECORD_MISC_EXACT_IP
                     */
                    precise_ip     :  2, /* skid constraint       */
                    mmap_data      :  1, /* non-exec mmap data    */
                    sample_id_all  :  1, /* sample_type all events */

                    exclude_host   :  1, /* don't count in host   */
                    exclude_guest  :  1, /* don't count in guest  */

                    exclude_callchain_kernel : 1, /* exclude kernel callchains */
                    exclude_callchain_user   : 1, /* exclude user callchains */
                    mmap2          :  1, /* include mmap with inode data     */
                    comm_exec      :  1, /* flag comm events that are due to an exec */
                    use_clockid    :  1, /* use @clockid for time fields */
                    context_switch :  1, /* context switch data */
                    constraint_duplicate : 1,

                    __reserved_1   : 36;
    
        union {
            __u32       wakeup_events;    /* wakeup every n events */
            __u32       wakeup_watermark; /* bytes before wakeup   */
        };
    
        __u32           bp_type;
        union {
            __u64       bp_addr;
            __u64       config1; /* extension of config */
        };
        union {
            __u64       bp_len;
            __u64       config2; /* extension of config1 */
        };
        __u64   branch_sample_type; /* enum perf_branch_sample_type */
    
        /*
         * Defines set of user regs to dump on samples.
         * See asm/perf_regs.h for details.
         */
        __u64   sample_regs_user;
    
        /*
         * Defines size of the user stack to dump on samples.
         */
        __u32   sample_stack_user;
    
        __s32   clockid;
        /*
         * Defines set of regs to dump for each sample
         * state captured on:
         *  - precise = 0: PMU interrupt
         *  - precise > 0: sampled instruction
         *
         * See asm/perf_regs.h for details.
         */
        __u64   sample_regs_intr;
    
        /*
         * Wakeup watermark for AUX area
         */
        __u32   aux_watermark;
        __u32   __reserved_2;   /* align to __u64 */
    }           
    • Parameter 2: pid_t pid:

    pid == 0: bind the event to the current process;

    pid > 0: bind the event to the specified process;

    pid == -1: bind the event to all processes on the given cpu.

  • Parameter 3: int cpu:
  • cpu >= 0: bind the event to the specified cpu;

    cpu == -1: bind the event to all cpus;

    (Note that 'pid == -1' together with 'cpu == -1' is invalid.)

  • Parameter 4: int group_fd:
  • group_fd = -1: create a new group leader;

    group_fd > 0: join the group leader created earlier.

  • Parameter 5: unsigned long flags:
    #define PERF_FLAG_FD_NO_GROUP       (1UL << 0)
    #define PERF_FLAG_FD_OUTPUT     (1UL << 1)
    #define PERF_FLAG_PID_CGROUP        (1UL << 2) /* pid=cgroup id, per-cpu mode only */
    #define PERF_FLAG_FD_CLOEXEC        (1UL << 3) /* O_CLOEXEC */           
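    Putting the five parameters together, here is a minimal, self-contained user-space example (my own sketch in the style of the perf_event_open man page, not code from the original article): it counts the CPU cycles of the calling task on any cpu, with no group and the CLOEXEC flag:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.disabled = 1;            /* start OFF, enable explicitly below */
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        /* pid = 0, cpu = -1: this task on any cpu; group_fd = -1: its own group leader */
        int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1,
                         PERF_FLAG_FD_CLOEXEC);
        if (fd < 0) {
            perror("perf_event_open");
            return 1;
        }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... workload to be measured ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count;
        read(fd, &count, sizeof(count));   /* default read_format: a single u64 */
        printf("cycles: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }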
    A complete walkthrough of perf_event_open() follows:
    SYSCALL_DEFINE5(perf_event_open,
            struct perf_event_attr __user *, attr_uptr,
            pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
    {
        struct perf_event *group_leader = NULL, *output_event = NULL;
        struct perf_event *event, *sibling;
        struct perf_event_attr attr;
        struct perf_event_context *ctx, *uninitialized_var(gctx);
        struct file *event_file = NULL;
        struct fd group = {NULL, };
        struct task_struct *task = NULL;
        struct pmu *pmu;
        int event_fd;
    int move_group = 0;
    int err;
    int f_flags = O_RDWR;
    int cgroup_fd = -1;
    
    /* (1) a series of sanity checks and preparation */
    /* for future expandability... */
        if (flags & ~PERF_FLAG_ALL)
            return -EINVAL;
    
    /* (1.1) permission check */
        if (perf_paranoid_any() && !capable(CAP_SYS_ADMIN))
            return -EACCES;
    
    /* (1.2) copy the user-space perf_event_attr into the kernel */
        err = perf_copy_attr(attr_uptr, &attr);
        if (err)
            return err;
    
        if (attr.constraint_duplicate || attr.__reserved_1)
            return -EINVAL;
    
        if (!attr.exclude_kernel) {
            if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
                return -EACCES;
        }
    
    /* (1.3) if sampling in freq mode, check that sample_freq is valid */
    if (attr.freq) {
        if (attr.sample_freq > sysctl_perf_event_sample_rate)
            return -EINVAL;
    /* (1.4) if sampling in period mode, check that sample_period is valid */
    } else {
        if (attr.sample_period & (1ULL << 63))
            return -EINVAL;
    }
    
        /*
         * In cgroup mode, the pid argument is used to pass the fd
         * opened to the cgroup directory in cgroupfs. The cpu argument
         * designates the cpu on which to monitor threads from that
         * cgroup.
         */
    if ((flags & PERF_FLAG_PID_CGROUP) && (pid == -1 || cpu == -1))
            return -EINVAL;
    
        if (flags & PERF_FLAG_FD_CLOEXEC)
            f_flags |= O_CLOEXEC;
    
    /* (1.5) grab a new fd number in the current process */
    event_fd = get_unused_fd_flags(f_flags);
    if (event_fd < 0)
            return event_fd;
    
    /* (1.6) if this event is to join a specified group leader, look up
        the fd structure and the perf_event structure behind group_fd
     */
    if (group_fd != -1) {
            err = perf_fget_light(group_fd, &group);
            if (err)
                goto err_fd;
            group_leader = group.file->private_data;
            if (flags & PERF_FLAG_FD_OUTPUT)
                output_event = group_leader;
            if (flags & PERF_FLAG_FD_NO_GROUP)
                group_leader = NULL;
        }
    
        /*
         * Take the group_leader's group_leader_mutex before observing
         * anything in the group leader that leads to changes in ctx,
         * many of which may be changing on another thread.
         * In particular, we want to take this lock before deciding
         * whether we need to move_group.
         */
        if (group_leader)
            mutex_lock(&group_leader->group_leader_mutex);
    
    /* (1.7) find the task_struct corresponding to pid */
    if (pid != -1 && !(flags & PERF_FLAG_PID_CGROUP)) {
            task = find_lively_task_by_vpid(pid);
            if (IS_ERR(task)) {
                err = PTR_ERR(task);
                goto err_group_fd;
            }
        }
    
    /* (1.8) when joining a group leader, both must have the same attr.inherit */
        if (task && group_leader &&
            group_leader->attr.inherit != attr.inherit) {
            err = -EINVAL;
            goto err_task;
        }
    
        get_online_cpus();
    
        if (task) {
            err = mutex_lock_interruptible(&task->signal->cred_guard_mutex);
            if (err)
                goto err_cpus;
    
            /*
             * Reuse ptrace permission checks for now.
             *
             * We must hold cred_guard_mutex across this and any potential
             * perf_install_in_context() call for this new event to
             * serialize against exec() altering our credentials (and the
             * perf_event_exit_task() that could imply).
             */
            err = -EACCES;
            if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
                goto err_cred;
        }
    
    /* (1.9) cgroup fd handling; note this is different from the group leader above */
        if (flags & PERF_FLAG_PID_CGROUP)
            cgroup_fd = pid;
    
    /* (2) the main course: allocate and initialize the perf_event according to the arguments */
        event = perf_event_alloc(&attr, cpu, task, group_leader, NULL,
                     NULL, NULL, cgroup_fd);
        if (IS_ERR(event)) {
            err = PTR_ERR(event);
            goto err_cred;
        }
    
    /* (3.1) if attr asks for sample data but the pmu cannot interrupt, fail (mainly for hardware pmus) */
        if (is_sampling_event(event)) {
            if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
                err = -ENOTSUPP;
                goto err_alloc;
            }
        }
    
        /*
         * Special case software events and allow them to be part of
         * any hardware group.
         */
        pmu = event->pmu;
    
    /* (3.2) if the user specified a clock source, set event->clock to it */
        if (attr.use_clockid) {
            err = perf_event_set_clock(event, attr.clockid);
            if (err)
                goto err_alloc;
        }
    
    /* (3.3) handle the case where the event's and the group_leader's pmu types differ */
    if (group_leader &&
        (is_software_event(event) != is_software_event(group_leader))) {
        /* (3.3.1) pmu type: event == software, group_leader != software
            add the event into group_leader's group
         */
            if (is_software_event(event)) {
                /*
                 * If event and group_leader are not both a software
                 * event, and event is, then group leader is not.
                 *
                 * Allow the addition of software events to !software
                 * groups, this is safe because software events never
                 * fail to schedule.
                 */
                pmu = group_leader->pmu;
    
        /* (3.3.2) pmu type: event != software, group_leader == software
            try to move the whole group into the hardware context
         */
            } else if (is_software_event(group_leader) &&
                   (group_leader->group_flags & PERF_GROUP_SOFTWARE)) {
                /*
                 * In case the group is a pure software group, and we
                 * try to add a hardware event, move the whole group to
                 * the hardware context.
                 */
            move_group = 1;
            }
        }
    
        /*
         * Get the target context (task or percpu):
         */
    /* (4) get the perf_event_context; depending on the perf_event type this is the cpu-dimension or task-dimension context:
        if pid == -1 (task == NULL), get the cpu-dimension context, i.e. the pmu->pmu_cpu_context->ctx allocated at pmu registration according to pmu->task_ctx_nr
        if pid >= 0 (task != NULL), get the task-dimension context, i.e. task->perf_event_ctxp[ctxn], creating it if it does not exist yet
     */
        ctx = find_get_context(pmu, task, event);
        if (IS_ERR(ctx)) {
            err = PTR_ERR(ctx);
            goto err_alloc;
        }
    
    /* (5.1) the event wants to join group_leader; if (pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE), fail */
        if ((pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE) && group_leader) {
            err = -EBUSY;
            goto err_context;
        }
    
        /*
         * Look up the group leader (we will attach this event to it):
         */
    /* (5.2) the event wants to join group_leader; validate a few conditions */
        if (group_leader) {
            err = -EINVAL;
    
            /*
             * Do not allow a recursive hierarchy (this new sibling
             * becoming part of another group-sibling):
             */
        /* (5.2.1) recursive ->group_leader chains are not allowed */
            if (group_leader->group_leader != group_leader)
                goto err_context;
    
            /* All events in a group should have the same clock */
        /* (5.2.2) to join a group, the clock sources must match */
            if (group_leader->clock != event->clock)
                goto err_context;
    
            /*
             * Do not allow to attach to a group in a different
             * task or CPU context:
             */
        /* (5.2.3) to join a group, the task/cpu context must match */
            if (move_group) {
                /*
                 * Make sure we're both on the same task, or both
                 * per-cpu events.
                 */
                if (group_leader->ctx->task != ctx->task)
                    goto err_context;
    
                /*
                 * Make sure we're both events for the same CPU;
                 * grouping events for different CPUs is broken; since
                 * you can never concurrently schedule them anyhow.
                 */
                if (group_leader->cpu != event->cpu)
                    goto err_context;
            } else {
                if (group_leader->ctx != ctx)
                    goto err_context;
            }
    
            /*
             * Only a group leader can be exclusive or pinned
             */
        /* (5.2.4) only a group leader may set the exclusive/pinned attributes */
            if (attr.exclusive || attr.pinned)
                goto err_context;
        }
    
    /* (5.3) set up output_event */
        if (output_event) {
            err = perf_event_set_output(event, output_event);
            if (err)
                goto err_context;
        }
    
    /* (6) allocate the file structure for the perf_event:
        file->private_data = event; // links the file and the event together
        file->f_op = &perf_fops;    // the file operations for the fd
        later the fd and the file are linked together: fd_install(event_fd, event_file);
     */
        event_file = anon_inode_getfile("[perf_event]", &perf_fops, event,
                        f_flags);
        if (IS_ERR(event_file)) {
            err = PTR_ERR(event_file);
            event_file = NULL;
            goto err_context;
        }
    
        if (move_group) {
            gctx = group_leader->ctx;
            mutex_lock_double(&gctx->mutex, &ctx->mutex);
        } else {
            mutex_lock(&ctx->mutex);
        }
    
    /* (7.1) compute event->read_size, event->header_size and event->id_header_size from attr,
        and check that none of them is too large
     */
        if (!perf_event_validate_size(event)) {
            err = -E2BIG;
            goto err_locked;
        }
    
        /*
         * Must be under the same ctx::mutex as perf_install_in_context(),
         * because we need to serialize with concurrent event creation.
         */
    /* (7.2) for an exclusive event ((pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE)),
        check whether the events already on the context allow this new event to be installed
     */
        if (!exclusive_event_installable(event, ctx)) {
            /* exclusive and group stuff are assumed mutually exclusive */
            WARN_ON_ONCE(move_group);
    
            err = -EBUSY;
            goto err_locked;
        }
    
        WARN_ON_ONCE(ctx->parent_ctx);
    
        /*
         * This is the point on no return; we cannot fail hereafter. This is
         * where we start modifying current state.
         */
    
    /* (8) if the group's pmu type differs from this event's,
        try to move the group over to this event's context
     */
        if (move_group) {
            /*
             * See perf_event_ctx_lock() for comments on the details
             * of swizzling perf_event::ctx.
             */
        /* (8.1) remove group_leader from its old context */
            perf_remove_from_context(group_leader, false);
    
        /* (8.2) remove all of group_leader's sibling events from the old context */
            list_for_each_entry(sibling, &group_leader->sibling_list,
                        group_entry) {
                perf_remove_from_context(sibling, false);
                put_ctx(gctx);
            }
    
            /*
             * Wait for everybody to stop referencing the events through
             * the old lists, before installing it on new lists.
             */
            synchronize_rcu();
    
            /*
             * Install the group siblings before the group leader.
             *
             * Because a group leader will try and install the entire group
             * (through the sibling list, which is still in-tact), we can
             * end up with siblings installed in the wrong context.
             *
             * By installing siblings first we NO-OP because they're not
             * reachable through the group lists.
             */
        /* (8.3) install all of group_leader's sibling events into the new context */
            list_for_each_entry(sibling, &group_leader->sibling_list,
                        group_entry) {
                perf_event__state_init(sibling);
                perf_install_in_context(ctx, sibling, sibling->cpu);
                get_ctx(ctx);
            }
    
            /*
             * Removing from the context ends up with disabled
             * event. What we want here is event in the initial
             * startup state, ready to be add into new context.
             */
        /* (8.4) install group_leader itself into the new context */
            perf_event__state_init(group_leader);
            perf_install_in_context(ctx, group_leader, group_leader->cpu);
            get_ctx(ctx);
    
            /*
             * Now that all events are installed in @ctx, nothing
             * references @gctx anymore, so drop the last reference we have
             * on it.
             */
            put_ctx(gctx);
        }
    
        /*
         * Precalculate sample_data sizes; do while holding ctx::mutex such
         * that we're serialized against further additions and before
         * perf_install_in_context() which is the point the event is active and
         * can use these values.
         */
    /* (9.1) recompute event->read_size, event->header_size and event->id_header_size */
        perf_event__header_size(event);
        perf_event__id_header_size(event);
    
    /* (10) install the event into the context */
        perf_install_in_context(ctx, event, event->cpu);
        perf_unpin_context(ctx);
    
        if (move_group)
            mutex_unlock(&gctx->mutex);
        mutex_unlock(&ctx->mutex);
        if (group_leader)
            mutex_unlock(&group_leader->group_leader_mutex);
    
        if (task) {
            mutex_unlock(&task->signal->cred_guard_mutex);
            put_task_struct(task);
        }
    
        put_online_cpus();
    
        event->owner = current;
    
    /* (9.2) add every event created by the current process to the current->perf_event_list list */
        mutex_lock(&current->perf_event_mutex);
        list_add_tail(&event->owner_entry, &current->perf_event_list);
        mutex_unlock(&current->perf_event_mutex);
    
        /*
         * Drop the reference on the group_event after placing the
         * new event on the sibling_list. This ensures destruction
         * of the group leader will find the pointer to itself in
         * perf_group_detach().
         */
        fdput(group);
    /* (9.3) link the fd and the file together */
        fd_install(event_fd, event_file);
        return event_fd;
    
    err_locked:
        if (move_group)
            mutex_unlock(&gctx->mutex);
        mutex_unlock(&ctx->mutex);
    /* err_file: */
        fput(event_file);
    err_context:
        perf_unpin_context(ctx);
        put_ctx(ctx);
    err_alloc:
        /*
         * If event_file is set, the fput() above will have called ->release()
         * and that will take care of freeing the event.
         */
        if (!event_file)
            free_event(event);
    err_cred:
        if (task)
            mutex_unlock(&task->signal->cred_guard_mutex);
    err_cpus:
        put_online_cpus();
    err_task:
        if (task)
            put_task_struct(task);
    err_group_fd:
        if (group_leader)
            mutex_unlock(&group_leader->group_leader_mutex);
        fdput(group);
    err_fd:
        put_unused_fd(event_fd);
        return err;
    }
    
    |→
    
    static struct perf_event *
    perf_event_alloc(struct perf_event_attr *attr, int cpu,
             struct task_struct *task,
             struct perf_event *group_leader,
             struct perf_event *parent_event,
             perf_overflow_handler_t overflow_handler,
             void *context, int cgroup_fd)
    {
        struct pmu *pmu;
        struct perf_event *event;
        struct hw_perf_event *hwc;
        long err = -EINVAL;
    
        if ((unsigned)cpu >= nr_cpu_ids) {
        if (!task || cpu != -1)
                return ERR_PTR(-EINVAL);
        }
    
    /* (2.1) allocate the perf_event structure */
        event = kzalloc(sizeof(*event), GFP_KERNEL);
        if (!event)
            return ERR_PTR(-ENOMEM);
    
    /*
     * Single events are their own group leaders, with an
     * empty sibling list:
     */
    /* (2.2) if group_fd == -1, then group_leader = self */
        if (!group_leader)
            group_leader = event;
    
        mutex_init(&event->group_leader_mutex);
        mutex_init(&event->child_mutex);
        INIT_LIST_HEAD(&event->child_list);
    
        INIT_LIST_HEAD(&event->group_entry);
        INIT_LIST_HEAD(&event->event_entry);
        INIT_LIST_HEAD(&event->sibling_list);
        INIT_LIST_HEAD(&event->rb_entry);
        INIT_LIST_HEAD(&event->active_entry);
        INIT_LIST_HEAD(&event->drv_configs);
        INIT_HLIST_NODE(&event->hlist_entry);
    
    
        init_waitqueue_head(&event->waitq);
        init_irq_work(&event->pending, perf_pending_event);
    
        mutex_init(&event->mmap_mutex);
    
    atomic_long_set(&event->refcount, 1);
        event->cpu      = cpu;
        event->attr     = *attr;
        event->group_leader = group_leader;
        event->pmu      = NULL;
    event->oncpu        = -1;
    
        event->parent       = parent_event;
    
        event->ns       = get_pid_ns(task_active_pid_ns(current));
        event->id       = atomic64_inc_return(&perf_event_id);
    
        event->state        = PERF_EVENT_STATE_INACTIVE;
    
    /* (2.3) if task != NULL */
        if (task) {
            event->attach_state = PERF_ATTACH_TASK;
            /*
             * XXX pmu::event_init needs to know what task to account to
             * and we cannot use the ctx information because we need the
             * pmu before we get a ctx.
             */
            event->hw.target = task;
        }
    
        event->clock = &local_clock;
        if (parent_event)
            event->clock = parent_event->clock;
    
        if (!overflow_handler && parent_event) {
            overflow_handler = parent_event->overflow_handler;
            context = parent_event->overflow_handler_context;
        }
    
        event->overflow_handler = overflow_handler;
        event->overflow_handler_context = context;
    
    /* (2.4) set the event's initial state from the value of attr.disabled:
        event->state = PERF_EVENT_STATE_OFF/PERF_EVENT_STATE_INACTIVE
     */
        perf_event__state_init(event);
    
        pmu = NULL;
    
    /* (2.5) initialize event->hw from the event's attr->freq and attr->sample_period/sample_freq:
        hwc->sample_period
        hwc->last_period
        hwc->period_left
     */
        hwc = &event->hw;
        hwc->sample_period = attr->sample_period;
        if (attr->freq && attr->sample_freq)
        hwc->sample_period = 1;
        hwc->last_period = hwc->sample_period;
    
        local64_set(&hwc->period_left, hwc->sample_period);
    
        /*
         * we currently do not support PERF_FORMAT_GROUP on inherited events
         */
    /* (2.6) PERF_FORMAT_GROUP is not supported on inherited events */
        if (attr->inherit && (attr->read_format & PERF_FORMAT_GROUP))
            goto err_ns;
    
    /* (2.7) if !(attr.sample_type & PERF_SAMPLE_BRANCH_STACK),
        then attr.branch_sample_type = 0
     */
    if (!has_branch_stack(event))
        event->attr.branch_sample_type = 0;
    
    /* (2.8) if (flags & PERF_FLAG_PID_CGROUP),
        attach this event to the cgroup
     */
    if (cgroup_fd != -1) {
            err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
            if (err)
                goto err_ns;
        }
    
    /* (2.9) the crucial step:
        find the matching pmu on the pmus list from attr.type,
        and call pmu->event_init(event) to initialize the event
     */
        pmu = perf_init_event(event);
        if (!pmu)
            goto err_ns;
        else if (IS_ERR(pmu)) {
            err = PTR_ERR(pmu);
            goto err_ns;
        }
    
    /* (2.10) handling for events of an exclusive pmu */
        err = exclusive_event_init(event);
        if (err)
            goto err_pmu;
    
        if (!event->parent) {
            if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN) {
                err = get_callchain_buffers();
                if (err)
                    goto err_per_task;
            }
        }
    
        /* symmetric to unaccount_event() in _free_event() */
    /* (2.11) some accounting of event operations */
        account_event(event);
    
        return event;
    
    err_per_task:
        exclusive_event_destroy(event);
    
    err_pmu:
        if (event->destroy)
            event->destroy(event);
        module_put(pmu->module);
    err_ns:
        if (is_cgroup_event(event))
            perf_detach_cgroup(event);
        if (event->ns)
            put_pid_ns(event->ns);
        kfree(event);
    
        return ERR_PTR(err);
    }
    
    |→
    
    static struct perf_event_context *
    find_get_context(struct pmu *pmu, struct task_struct *task,
            struct perf_event *event)
    {
        struct perf_event_context *ctx, *clone_ctx = NULL;
        struct perf_cpu_context *cpuctx;
        void *task_ctx_data = NULL;
        unsigned long flags;
        int ctxn, err;
        int cpu = event->cpu;
    
    /* (4.1) if task == NULL, i.e. pid == -1, get the cpu-dimension context */
        if (!task) {
            /* Must be root to operate on a CPU event: */
        /* (4.1.1) permission check */
            if (event->owner != EVENT_OWNER_KERNEL && perf_paranoid_cpu() &&
                !capable(CAP_SYS_ADMIN))
                return ERR_PTR(-EACCES);
    
            /*
             * We could be clever and allow to attach a event to an
             * offline CPU and activate it when the CPU comes up, but
             * that's for later.
             */
        /* (4.1.2) is the cpu specified in attr online? */
            if (!cpu_online(cpu))
                return ERR_PTR(-ENODEV);
    
        /* (4.1.3) from cpu, get the pmu's cpu-dimension context: per_cpu_ptr(pmu->pmu_cpu_context, cpu)->ctx */
            cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
            ctx = &cpuctx->ctx;
            get_ctx(ctx);
            ++ctx->pin_count;
    
            return ctx;
        }
    
    /* (4.2) if task != NULL, i.e. pid >= 0, get the task-dimension context */
    
        err = -EINVAL;
        ctxn = pmu->task_ctx_nr;
    if (ctxn < 0)
            goto errout;
    
    /* (4.2.1) some architectures need ctx->task_ctx_data allocated for the context */
        if (event->attach_state & PERF_ATTACH_TASK_DATA) {
            task_ctx_data = kzalloc(pmu->task_ctx_size, GFP_KERNEL);
            if (!task_ctx_data) {
                err = -ENOMEM;
                goto errout;
            }
        }
    
    retry:
    /* (4.2.2) get the task-dimension context: task->perf_event_ctxp[pmu->task_ctx_nr];
        if nobody has created this context before, allocate and create it
     */
        ctx = perf_lock_task_context(task, ctxn, &flags);
    /* (4.2.2.1) the task already has this context, reuse it */
        if (ctx) {
            clone_ctx = unclone_ctx(ctx);
            ++ctx->pin_count;
    
            if (task_ctx_data && !ctx->task_ctx_data) {
                ctx->task_ctx_data = task_ctx_data;
                task_ctx_data = NULL;
            }
            raw_spin_unlock_irqrestore(&ctx->lock, flags);
    
            if (clone_ctx)
                put_ctx(clone_ctx);
    /* (4.2.2.2) otherwise, create task->perf_event_ctxp[pmu->task_ctx_nr] from scratch */
        } else {
            ctx = alloc_perf_context(pmu, task);
            err = -ENOMEM;
            if (!ctx)
                goto errout;
    
            if (task_ctx_data) {
                ctx->task_ctx_data = task_ctx_data;
                task_ctx_data = NULL;
            }
    
        err = 0;
            mutex_lock(&task->perf_event_mutex);
            /*
             * If it has already passed perf_event_exit_task().
             * we must see PF_EXITING, it takes this mutex too.
             */
            if (task->flags & PF_EXITING)
                err = -ESRCH;
            else if (task->perf_event_ctxp[ctxn])
                err = -EAGAIN;
            else {
                get_ctx(ctx);
                ++ctx->pin_count;
                rcu_assign_pointer(task->perf_event_ctxp[ctxn], ctx);
            }
            mutex_unlock(&task->perf_event_mutex);
    
            if (unlikely(err)) {
                put_ctx(ctx);
    
                if (err == -EAGAIN)
                    goto retry;
                goto errout;
            }
        }
    
        kfree(task_ctx_data);
        return ctx;
    
    errout:
        kfree(task_ctx_data);
        return ERR_PTR(err);
    }
    
    |→
    
    static void
    perf_install_in_context(struct perf_event_context *ctx,
                struct perf_event *event,
                int cpu)
    {
        struct task_struct *task = ctx->task;
    
        lockdep_assert_held(&ctx->mutex);
    
    /* (10.1) store the context in event->ctx */
        event->ctx = ctx;
    if (event->cpu != -1)
            event->cpu = cpu;
    
    /* (10.2) for a cpu-dimension context,
        use a cross-cpu call to run __perf_install_in_context() on the specified cpu
        and bind the context and the event there
     */
        if (!task) {
            /*
             * Per cpu events are installed via an smp call and
             * the install is always successful.
             */
            cpu_function_call(cpu, __perf_install_in_context, event);
            return;
        }
    
    retry:
    /* (10.3) for a task-dimension context whose task is currently running,
        use a cross-cpu call to run __perf_install_in_context() on the cpu the task runs on (task_cpu(p))
        and bind the context and the event there
     */
        if (!task_function_call(task, __perf_install_in_context, event))
            return;
    
        raw_spin_lock_irq(&ctx->lock);
        /*
         * If we failed to find a running task, but find the context active now
         * that we've acquired the ctx->lock, retry.
         */
        if (ctx->is_active) {
            raw_spin_unlock_irq(&ctx->lock);
            /*
             * Reload the task pointer, it might have been changed by
             * a concurrent perf_event_context_sched_out().
             */
            task = ctx->task;
            goto retry;
        }
    
        /*
         * Since the task isn't running, its safe to add the event, us holding
         * the ctx->lock ensures the task won't get scheduled in.
         */
    /* (10.4) for a task-dimension context whose task is not running,
        the event can safely be bound to the context
     */
        add_event_to_ctx(event, ctx);
        raw_spin_unlock_irq(&ctx->lock);
    }
    
    ||→
    
    static void add_event_to_ctx(struct perf_event *event,
                       struct perf_event_context *ctx)
    {
        u64 tstamp = perf_event_time(event);
    
    /* (10.4.1) add the event to the context's lists */
        list_add_event(event, ctx);
    
    /* (10.4.2) add the event to the group leader's list */
        perf_group_attach(event);
        event->tstamp_enabled = tstamp;
        event->tstamp_running = tstamp;
        event->tstamp_stopped = tstamp;
    }
    
    |||→
    
    static void
    list_add_event(struct perf_event *event, struct perf_event_context *ctx)
    {
        WARN_ON_ONCE(event->attach_state & PERF_ATTACH_CONTEXT);
    
    /* (10.4.1.1) set PERF_ATTACH_CONTEXT in event->attach_state */
        event->attach_state |= PERF_ATTACH_CONTEXT;
    
        /*
         * If we're a stand alone event or group leader, we go to the context
         * list, group events are kept attached to the group so that
         * perf_group_detach can, at all times, locate all siblings.
         */
    /* (10.4.1.2) if the event is a group_leader,
        add its event->group_entry to the top-level group lists ctx->flexible_groups/pinned_groups;
        the ctx->flexible_groups/pinned_groups lists only link group-leader events
     */
        if (event->group_leader == event) {
            struct list_head *list;
    
            if (is_software_event(event))
                event->group_flags |= PERF_GROUP_SOFTWARE;
    
            list = ctx_group_list(event, ctx);
            list_add_tail(&event->group_entry, list);
        }
    
        if (is_cgroup_event(event))
            ctx->nr_cgroups++;
    
    /* (10.4.1.3) add event->event_entry to the ctx->event_list list;
        ctx->event_list links every event of the context
     */
        list_add_rcu(&event->event_entry, &ctx->event_list);
        ctx->nr_events++;
        if (event->attr.inherit_stat)
            ctx->nr_stat++;
    
        ctx->generation++;
    }
    
    |||→
    
    static void perf_group_attach(struct perf_event *event)
    {
        struct perf_event *group_leader = event->group_leader, *pos;
    
        /*
         * We can have double attach due to group movement in perf_event_open.
         */
        /* (10.4.2.1) 设置event->attach_state的PERF_ATTACH_GROUP */
        if (event->attach_state & PERF_ATTACH_GROUP)
            return;
    
        event->attach_state |= PERF_ATTACH_GROUP;
    
        /* (10.4.2.2) 如果event本身就是group_leader,不需要继续操作 */
        if (group_leader == event)
            return;
    
        WARN_ON_ONCE(group_leader->ctx != event->ctx);
    
        /* (10.4.2.3) move_group的处理? */
        if (group_leader->group_flags & PERF_GROUP_SOFTWARE &&
                !is_software_event(event))
            group_leader->group_flags &= ~PERF_GROUP_SOFTWARE;
    
        /* (10.4.2.4) 把event->group_entry加入到group_leader->sibling_list链表 */
        list_add_tail(&event->group_entry, &group_leader->sibling_list);
        group_leader->nr_siblings++;
    
        /* (10.4.2.5) 重新计算group_leader的event->read_size、event->header_size */
        perf_event__header_size(group_leader);
    
        /* (10.4.2.6) 重新计算group_leader所有子event的event->read_size、event->header_size */
        list_for_each_entry(pos, &group_leader->sibling_list, group_entry)
            perf_event__header_size(pos);
    }
               
    经过perf_event_open()调用以后返回perf_event对应的fd,后续的文件操作对应perf_fops:
    static const struct file_operations perf_fops = {
        .llseek         = no_llseek,
        .release        = perf_release,
        .read           = perf_read,
        .poll           = perf_poll,
        .unlocked_ioctl     = perf_ioctl,
        .compat_ioctl       = perf_compat_ioctl,
        .mmap           = perf_mmap,
        .fasync         = perf_fasync,
    };           
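    As a side note on how this fd is driven from userspace, here is a minimal sketch (my own illustration, not from the kernel or the original text; error handling is mostly omitted). It opens a hardware instruction counter for the current task on any CPU and pulls the count back through perf_fops.read, i.e. perf_read():

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* glibc has no wrapper for perf_event_open, so issue the raw syscall */
    static int sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                   int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        uint64_t count;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.exclude_kernel = 1;

        /* pid = 0, cpu = -1: bind to the current task, any CPU */
        fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0)
            return 1;

        /* ... workload being measured ... */

        read(fd, &count, sizeof(count));   /* perf_fops.read -> perf_read() */
        printf("instructions: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }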
    后续会对其重点的函数perf_read()、perf_ioctl()、perf_mmap()一一进行解析。

    2.1、inherit

    进程通过task contex(task->perf_event_ctxp[ctxn])挂载了很多面向task的perf_event,如果event支持inherit属性,当进程创建子进程时需要给子进程创建task context和继承的event。
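    Before walking through the kernel code below, a small userspace sketch of what attr.inherit means in practice (my own illustration, not from the original text; error handling is mostly omitted). A child forked after the event is created is counted as well, because perf_event_init_task() clones the event into the child's task context and the child's counts are folded back into the parent event when it is read:

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    int main(void)
    {
        struct perf_event_attr attr;
        uint64_t count;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.type     = PERF_TYPE_HARDWARE;
        attr.size     = sizeof(attr);
        attr.config   = PERF_COUNT_HW_CPU_CYCLES;
        attr.disabled = 1;
        attr.inherit  = 1;   /* clone this event into children at fork() time */

        fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);  /* current task, any CPU */
        if (fd < 0)
            return 1;
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        if (fork() == 0) {   /* child gets its own cloned event via perf_event_init_task() */
            for (volatile int i = 0; i < 1000000; i++)
                ;
            _exit(0);
        }
        wait(NULL);

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        read(fd, &count, sizeof(count));   /* includes the cycles spent in the child */
        close(fd);
        return 0;
    }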

    copy_process() -> perf_event_init_task() -> perf_event_init_context() -> inherit_task_group() -> inherit_group() -> inherit_event():

    int perf_event_init_task(struct task_struct *child)
    {
        int ctxn, ret;
    
        /* () 初始化child task的perf_event相关结构 */
        memset(child->perf_event_ctxp, 0, sizeof(child->perf_event_ctxp));
        mutex_init(&child->perf_event_mutex);
        INIT_LIST_HEAD(&child->perf_event_list);
    
        /* () 进程通过task contex(task->perf_event_ctxp[ctxn])挂载了很多面向task的perf_event,
            如果event支持inherit属性,当进程创建子进程时需要给子进程创建task context和继承的event
         */
        for_each_task_context_nr(ctxn) {
            ret = perf_event_init_context(child, ctxn);
            if (ret) {
                perf_event_free_task(child);
                return ret;
            }
        }
    
        return 0;
    }
    
    |→
    
    static int perf_event_init_context(struct task_struct *child, int ctxn)
    {
        struct perf_event_context *child_ctx, *parent_ctx;
        struct perf_event_context *cloned_ctx;
        struct perf_event *event;
        struct task_struct *parent = current;
        int inherited_all = 1;
        unsigned long flags;
        int ret = 0;
    
        /* () 父进程即为当前进程(current) */
        if (likely(!parent->perf_event_ctxp[ctxn]))
            return 0;
    
        /*
         * If the parent's context is a clone, pin it so it won't get
         * swapped under us.
         */
        parent_ctx = perf_pin_task_context(parent, ctxn);
        if (!parent_ctx)
            return 0;
    
        /*
         * No need to check if parent_ctx != NULL here; since we saw
         * it non-NULL earlier, the only reason for it to become NULL
         * is if we exit, and since we're currently in the middle of
         * a fork we can't be exiting at the same time.
         */
    
        /*
         * Lock the parent list. No need to lock the child - not PID
         * hashed yet and not running, so nobody can access it.
         */
        mutex_lock(&parent_ctx->mutex);
    
        /*
         * We dont have to disable NMIs - we are only looking at
         * the list, not manipulating it:
         */
        /* () 遍历父进程context上的高优先级group leader event链表:parent_ctx->pinned_groups,
            在子进程上复制需要inherit的event
         */
        list_for_each_entry(event, &parent_ctx->pinned_groups, group_entry) {
            ret = inherit_task_group(event, parent, parent_ctx,
                         child, ctxn, &inherited_all);
            if (ret)
                break;
        }
    
        /*
         * We can't hold ctx->lock when iterating the ->flexible_group list due
         * to allocations, but we need to prevent rotation because
         * rotate_ctx() will change the list from interrupt context.
         */
        raw_spin_lock_irqsave(&parent_ctx->lock, flags);
        parent_ctx->rotate_disable = 1;
        raw_spin_unlock_irqrestore(&parent_ctx->lock, flags);
    
        /* () 遍历父进程context上的低优先级group leader event链表:parent_ctx->flexible_groups,
            在子进程上复制需要inherit的event
         */
        list_for_each_entry(event, &parent_ctx->flexible_groups, group_entry) {
            ret = inherit_task_group(event, parent, parent_ctx,
                         child, ctxn, &inherited_all);
            if (ret)
                break;
        }
    
        raw_spin_lock_irqsave(&parent_ctx->lock, flags);
        parent_ctx->rotate_disable = 0;
    
        child_ctx = child->perf_event_ctxp[ctxn];
    
        /* () 如果inherited_all>0,表示父进程挂载的所有event都是inherit属性,都被子进程复制继承, 
            那么给子进程的task context->parent_ctx赋值为父进程的context
         */
        if (child_ctx && inherited_all) {
            /*
             * Mark the child context as a clone of the parent
             * context, or of whatever the parent is a clone of.
             *
             * Note that if the parent is a clone, the holding of
             * parent_ctx->lock avoids it from being uncloned.
             */
            cloned_ctx = parent_ctx->parent_ctx;
            if (cloned_ctx) {
                child_ctx->parent_ctx = cloned_ctx;
                child_ctx->parent_gen = parent_ctx->parent_gen;
            } else {
                child_ctx->parent_ctx = parent_ctx;
                child_ctx->parent_gen = parent_ctx->generation;
            }
            get_ctx(child_ctx->parent_ctx);
        }
    
        raw_spin_unlock_irqrestore(&parent_ctx->lock, flags);
        mutex_unlock(&parent_ctx->mutex);
    
        perf_unpin_context(parent_ctx);
        put_ctx(parent_ctx);
    
        return ret;
    }
    
    ||→
    
    static int
    inherit_task_group(struct perf_event *event, struct task_struct *parent,
               struct perf_event_context *parent_ctx,
               struct task_struct *child, int ctxn,
               int *inherited_all)
    {
        int ret;
        struct perf_event_context *child_ctx;
    
        /* () 如果event不支持inherit,直接返回 */
        if (!event->attr.inherit) {
            *inherited_all = 0;
            return 0;
        }
    
        /* () 如果子进程的context为空,分配创建 */
        child_ctx = child->perf_event_ctxp[ctxn];
        if (!child_ctx) {
            /*
             * This is executed from the parent task context, so
             * inherit events that have been marked for cloning.
             * First allocate and initialize a context for the
             * child.
             */
    
            child_ctx = alloc_perf_context(parent_ctx->pmu, child);
            if (!child_ctx)
                return -ENOMEM;
    
            child->perf_event_ctxp[ctxn] = child_ctx;
        }
    
        /* () 子进程对父进程的event进行复制继承 */
        ret = inherit_group(event, parent, parent_ctx,
                    child, child_ctx);
    
        if (ret)
            *inherited_all = 0;
    
        return ret;
    }
    
    |||→
    
    static int inherit_group(struct perf_event *parent_event,
              struct task_struct *parent,
              struct perf_event_context *parent_ctx,
              struct task_struct *child,
              struct perf_event_context *child_ctx)
    {
        struct perf_event *leader;
        struct perf_event *sub;
        struct perf_event *child_ctr;
    
        /* () 对group_leader event进行复制继承, 
            新建group leader的继承者是新的group leader
         */
        leader = inherit_event(parent_event, parent, parent_ctx,
                     child, NULL, child_ctx);
        if (IS_ERR(leader))
            return PTR_ERR(leader);
    
        /* () 对group_leader下面的子event进行复制继承: 
            如果父进程group leader下面有树成员,他会把这棵树给子进程也复制一份,
            ---oops:这里有个疑问,如果父进程group成员不是group leader所在的父进程,但是inherit会给子进程复制一个event,这样合理吗?
         */
        list_for_each_entry(sub, &parent_event->sibling_list, group_entry) {
            child_ctr = inherit_event(sub, parent, parent_ctx,
                            child, leader, child_ctx);
            if (IS_ERR(child_ctr))
                return PTR_ERR(child_ctr);
        }
        return 0;
    }
    
    ||||→
    
    static struct perf_event *
    inherit_event(struct perf_event *parent_event,
              struct task_struct *parent,
              struct perf_event_context *parent_ctx,
              struct task_struct *child,
              struct perf_event *group_leader,
              struct perf_event_context *child_ctx)
    {
        enum perf_event_active_state parent_state = parent_event->state;
        struct perf_event *child_event;
        unsigned long flags;
    
        /*
         * Instead of creating recursive hierarchies of events,
         * we link inherited events back to the original parent,
         * which has a filp for sure, which we use as the reference
         * count:
         */
        /* () 获得子进程新建event的parent event,
            如果是多级子进程,把所有子进程的event都挂在原始parent下面,而不是一级一级的挂载
         */
        if (parent_event->parent)
            parent_event = parent_event->parent;
    
        /* () 对group_leader event进行复制,重新分配初始化: 
            和父event创建不同,这指定了子event的parent:event->parent = parent_event;
            随后子event也会被加入到父event的parent_event->child_list链表中
         */
        child_event = perf_event_alloc(&parent_event->attr,
                           parent_event->cpu,
                           child,
                           group_leader, parent_event,
                           NULL, NULL, -1);
        if (IS_ERR(child_event))
            return child_event;
    
        if (is_orphaned_event(parent_event) ||
            !atomic_long_inc_not_zero(&parent_event->refcount)) {
            free_event(child_event);
            return NULL;
        }
    
        get_ctx(child_ctx);
    
        /*
         * Make the child state follow the state of the parent event,
         * not its attr.disabled bit.  We hold the parent's mutex,
         * so we won't race with perf_event_{en, dis}able_family.
         */
        /* () 如果父event的状态已经enable,子event的状态也为enable */
        if (parent_state >= PERF_EVENT_STATE_INACTIVE)
            child_event->state = PERF_EVENT_STATE_INACTIVE;
        else
            child_event->state = PERF_EVENT_STATE_OFF;
    
        if (parent_event->attr.freq) {
            u64 sample_period = parent_event->hw.sample_period;
            struct hw_perf_event *hwc = &child_event->hw;
    
            hwc->sample_period = sample_period;
            hwc->last_period   = sample_period;
    
            local64_set(&hwc->period_left, sample_period);
        }
    
        child_event->ctx = child_ctx;
        child_event->overflow_handler = parent_event->overflow_handler;
        child_event->overflow_handler_context
            = parent_event->overflow_handler_context;
    
        /*
         * Precalculate sample_data sizes
         */
        perf_event__header_size(child_event);
        perf_event__id_header_size(child_event);
    
        /*
         * Link it up in the child's context:
         */
        /* () 将新的event加入到子进程的context中 */
        raw_spin_lock_irqsave(&child_ctx->lock, flags);
        add_event_to_ctx(child_event, child_ctx);
        raw_spin_unlock_irqrestore(&child_ctx->lock, flags);
    
        /*
         * Link this into the parent event's child list
         */
        /* () 将子event加入到父event的->child_list链表中 */
        WARN_ON_ONCE(parent_event->ctx->parent_ctx);
        mutex_lock(&parent_event->child_mutex);
        list_add_tail(&child_event->child_list, &parent_event->child_list);
        mutex_unlock(&parent_event->child_mutex);
    
        return child_event;
    }
               

    2.2、perf task sched

    task context上链接的perf_event需要跟随task进程调度一起动态启动停止,在task得到调度时相关perf_event开始工作,在task被其他任务调度出去时相关perf_event停止工作。

    为了支持这种行为,在task switch时调用perf的回调函数:perf_event_task_sched_out()/perf_event_task_sched_in()
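    The effect of this per-task scheduling is directly observable from userspace through the read format: time_enabled keeps growing as wall time once the event is enabled, while time_running only grows while the target task (and therefore its task context) is actually on a CPU. A minimal sketch follows (my own illustration; the target pid is a placeholder taken from argv):

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    struct read_fmt {
        uint64_t value;          /* event count */
        uint64_t time_enabled;   /* time since the event was enabled */
        uint64_t time_running;   /* time the task context was actually scheduled in */
    };

    int main(int argc, char **argv)
    {
        struct perf_event_attr attr;
        struct read_fmt r;
        pid_t target_pid;
        int fd;

        if (argc < 2)
            return 1;
        target_pid = (pid_t)atoi(argv[1]);   /* placeholder: pid of the task to observe */

        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
                           PERF_FORMAT_TOTAL_TIME_RUNNING;

        fd = syscall(__NR_perf_event_open, &attr, target_pid, -1, -1, 0);
        if (fd < 0)
            return 1;

        sleep(1);
        read(fd, &r, sizeof(r));
        /* for a mostly-sleeping task, time_running << time_enabled */
        printf("cycles=%llu enabled=%llu running=%llu\n",
               (unsigned long long)r.value,
               (unsigned long long)r.time_enabled,
               (unsigned long long)r.time_running);
        close(fd);
        return 0;
    }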

    context_switch() -> prepare_task_switch() -> perf_event_task_sched_out():

    static inline void perf_event_task_sched_out(struct task_struct *prev,
                             struct task_struct *next)
    {
        /* (1) "software" pmu 子类型为“context-switches”的数据采集点 */
        perf_sw_event_sched(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 0);
    
        /* (2) 随着taskswitch,和task关联的相关event需要停止工作:即sched_out 
            perf_sched_events.key:是进行perf_sched_in/out的开关,在event分配(perf_event_alloc() -> account_event())和释放(_free_event() -> unaccount_event())时操作
         */
        if (static_key_false(&perf_sched_events.key))
            __perf_event_task_sched_out(prev, next);
    }
    
    |→
    
    void __perf_event_task_sched_out(struct task_struct *task,
                     struct task_struct *next)
    {
        int ctxn;
    
        /* (2.1) 回调函数pmu->sched_task的调用点 */
        if (__this_cpu_read(perf_sched_cb_usages))
            perf_pmu_sched_task(task, next, false);
    
        /* (2.2) */
        if (atomic_read(&nr_switch_events))
            perf_event_switch(task, next, false);
    
        /* (2.3) 遍历task->[perf_event_ctxp]中的context, 
            逐个关闭context链接中的每个perf_event
         */
        for_each_task_context_nr(ctxn)
            perf_event_context_sched_out(task, ctxn, next);
    
        /*
         * if cgroup events exist on this CPU, then we need
         * to check if we have to switch out PMU state.
         * cgroup event are system-wide mode only
         */
        /* (2.4) task cgroup相关sched_out */
        if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
            perf_cgroup_sched_out(task, next);
    }
    
    ||→
    
    static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
                         struct task_struct *next)
    {
        struct perf_event_context *ctx = task->perf_event_ctxp[ctxn];
        struct perf_event_context *next_ctx;
        struct perf_event_context *parent, *next_parent;
        struct perf_cpu_context *cpuctx;
        int do_switch = 1;
    
        if (likely(!ctx))
            return;
    
        cpuctx = __get_cpu_context(ctx);
        if (!cpuctx->task_ctx)
            return;
    
        rcu_read_lock();
        next_ctx = next->perf_event_ctxp[ctxn];
        if (!next_ctx)
            goto unlock;
    
        parent = rcu_dereference(ctx->parent_ctx);
        next_parent = rcu_dereference(next_ctx->parent_ctx);
    
        /* If neither context have a parent context; they cannot be clones. */
        if (!parent && !next_parent)
            goto unlock;
    
        /* (2.3.1) 如果curr task和next task的context刚好是parent关系,我们使用快捷路径来切换任务, 
            context的parent关系,只有在创建子进程,且所有的父进程event都有inherit属性被子进程全部复制继承,
         */
        if (next_parent == ctx || next_ctx == parent || next_parent == parent) {
            /*
             * Looks like the two contexts are clones, so we might be
             * able to optimize the context switch.  We lock both
             * contexts and check that they are clones under the
             * lock (including re-checking that neither has been
             * uncloned in the meantime).  It doesn't matter which
             * order we take the locks because no other cpu could
             * be trying to lock both of these tasks.
             */
            raw_spin_lock(&ctx->lock);
            raw_spin_lock_nested(&next_ctx->lock, SINGLE_DEPTH_NESTING);
            if (context_equiv(ctx, next_ctx)) {
                /*
                 * XXX do we need a memory barrier of sorts
                 * wrt to rcu_dereference() of perf_event_ctxp
                 */
                task->perf_event_ctxp[ctxn] = next_ctx;
                next->perf_event_ctxp[ctxn] = ctx;
                ctx->task = next;
                next_ctx->task = task;
    
                swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
    
                do_switch = 0;
    
                perf_event_sync_stat(ctx, next_ctx);
            }
            raw_spin_unlock(&next_ctx->lock);
            raw_spin_unlock(&ctx->lock);
        }
    unlock:
        rcu_read_unlock();
    
        /* (2.3.2) 慢速路径的task context切换:
         */
        if (do_switch) {
            raw_spin_lock(&ctx->lock);
            ctx_sched_out(ctx, cpuctx, EVENT_ALL);
            cpuctx->task_ctx = NULL;
            raw_spin_unlock(&ctx->lock);
        }
    }
    
    |||→
    
    static void ctx_sched_out(struct perf_event_context *ctx,
                  struct perf_cpu_context *cpuctx,
                  enum event_type_t event_type)
    {
        struct perf_event *event;
        int is_active = ctx->is_active;
    
        ctx->is_active &= ~event_type;
        if (likely(!ctx->nr_events))
            return;
    
        update_context_time(ctx);
        update_cgrp_time_from_cpuctx(cpuctx);
        if (!ctx->nr_active)
            return;
    
        perf_pmu_disable(ctx->pmu);
        /* (2.3.2.1) 遍历context的高优先级event链表->pinned_groups:
            任务已经被切出,关联的所有event需要停工
         */
        if ((is_active & EVENT_PINNED) && (event_type & EVENT_PINNED)) {
            list_for_each_entry(event, &ctx->pinned_groups, group_entry)
                group_sched_out(event, cpuctx, ctx);
        }
    
        /* (2.3.2.2) 遍历context的低优先级event链表->flexible_groups:
            任务已经被切出,关联的所有event需要停工
         */
        if ((is_active & EVENT_FLEXIBLE) && (event_type & EVENT_FLEXIBLE)) {
            list_for_each_entry(event, &ctx->flexible_groups, group_entry)
                group_sched_out(event, cpuctx, ctx);
        }
        perf_pmu_enable(ctx->pmu);
    }
    
    ||||→
    
    static void
    group_sched_out(struct perf_event *group_event,
            struct perf_cpu_context *cpuctx,
            struct perf_event_context *ctx)
    {
        struct perf_event *event;
        int state = group_event->state;
    
        /* (2.3.2.2.1) 对group leader event进行停工 */
        event_sched_out(group_event, cpuctx, ctx);
    
        /*
         * Schedule out siblings (if any):
         */
        /* (2.3.2.2.2) 对group leader下的子event进行停工 */
        list_for_each_entry(event, &group_event->sibling_list, group_entry)
            event_sched_out(event, cpuctx, ctx);
    
        if (state == PERF_EVENT_STATE_ACTIVE && group_event->attr.exclusive)
            cpuctx->exclusive = 0;
    }
    
    |||||→
    
    static void
    event_sched_out(struct perf_event *event,
              struct perf_cpu_context *cpuctx,
              struct perf_event_context *ctx)
    {
        u64 tstamp = perf_event_time(event);
        u64 delta;
    
        WARN_ON_ONCE(event->ctx != ctx);
        lockdep_assert_held(&ctx->lock);
    
        /*
         * An event which could not be activated because of
         * filter mismatch still needs to have its timings
         * maintained, otherwise bogus information is return
         * via read() for time_enabled, time_running:
         */
        if (event->state == PERF_EVENT_STATE_INACTIVE
            && !event_filter_match(event)) {
            delta = tstamp - event->tstamp_stopped;
            event->tstamp_running += delta;
            event->tstamp_stopped = tstamp;
        }
    
        /* (2.3.2.2.1.1) 如果当前event不是开工状态(PERF_EVENT_STATE_ACTIVE)直接返回 */
        if (event->state != PERF_EVENT_STATE_ACTIVE)
            return;
    
        perf_pmu_disable(event->pmu);
    
        event->tstamp_stopped = tstamp;
    
        /* (2.3.2.2.1.2) 调用event的停工函数:pmu->del() */
        event->pmu->del(event, 0);
        event->oncpu = -1;
    
        /* (2.3.2.2.1.3) 状态设置为停工:PERF_EVENT_STATE_INACTIVE */
        event->state = PERF_EVENT_STATE_INACTIVE;
        if (event->pending_disable) {
            event->pending_disable = 0;
            event->state = PERF_EVENT_STATE_OFF;
        }
    
        if (!is_software_event(event))
            cpuctx->active_oncpu--;
        if (!--ctx->nr_active)
            perf_event_ctx_deactivate(ctx);
        if (event->attr.freq && event->attr.sample_freq)
            ctx->nr_freq--;
        if (event->attr.exclusive || !cpuctx->active_oncpu)
            cpuctx->exclusive = 0;
    
        if (is_orphaned_child(event))
            schedule_orphans_remove(ctx);
    
        perf_pmu_enable(event->pmu);
    }           
    context_switch() -> finish_task_switch() -> perf_event_task_sched_in():
    static inline void perf_event_task_sched_in(struct task_struct *prev,
                            struct task_struct *task)
    {
        /* (1) 新进程上相关的event需要开工 */
        if (static_key_false(&perf_sched_events.key))
            __perf_event_task_sched_in(prev, task);
    
        if (perf_sw_migrate_enabled() && task->sched_migrated) {
            struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]);
    
            perf_fetch_caller_regs(regs);
            ___perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, regs, 0);
            task->sched_migrated = 0;
        }
    }
    
    |→
    
    void __perf_event_task_sched_in(struct task_struct *prev,
                    struct task_struct *task)
    {
        struct perf_event_context *ctx;
        int ctxn;
    
        /* (1.1) 遍历task->[perf_event_ctxp]中的context, 
            逐个打开context链接中的每个perf_event
         */
        for_each_task_context_nr(ctxn) {
            ctx = task->perf_event_ctxp[ctxn];
            if (likely(!ctx))
                continue;
    
            perf_event_context_sched_in(ctx, task);
        }
        /*
         * if cgroup events exist on this CPU, then we need
         * to check if we have to switch in PMU state.
         * cgroup event are system-wide mode only
         */
        if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
            perf_cgroup_sched_in(prev, task);
    
        if (atomic_read(&nr_switch_events))
            perf_event_switch(task, prev, true);
    
        if (__this_cpu_read(perf_sched_cb_usages))
            perf_pmu_sched_task(prev, task, true);
    }
    
    ||→
    
    static void perf_event_context_sched_in(struct perf_event_context *ctx,
                        struct task_struct *task)
    {
        struct perf_cpu_context *cpuctx;
    
        cpuctx = __get_cpu_context(ctx);
        if (cpuctx->task_ctx == ctx)
            return;
    
        perf_ctx_lock(cpuctx, ctx);
        perf_pmu_disable(ctx->pmu);
        /*
         * We want to keep the following priority order:
         * cpu pinned (that don't need to move), task pinned,
         * cpu flexible, task flexible.
         */
        /* (1.1.1) 将新task context对应的本cpu维度的ctx->flexible_groups停工
            ctx->pinned_groups这时不会停工,就体现出优先级了?
         */
        cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
    
        /* (1.1.2) 切换本cpu的当前task context指针:cpuctx->task_ctx */
        if (ctx->nr_events)
            cpuctx->task_ctx = ctx;
    
        /* (1.1.3) 将新对应的本cpu维度的ctx->flexible_groups开工 
            将task context对应的ctx->flexible_groups、ctx->pinned_groups开工
         */
        perf_event_sched_in(cpuctx, cpuctx->task_ctx, task);
    
        perf_pmu_enable(ctx->pmu);
        perf_ctx_unlock(cpuctx, ctx);
    }
    
    |||→
    
    static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
                    struct perf_event_context *ctx,
                    struct task_struct *task)
    {
        cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task);
        if (ctx)
            ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task);
        cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task);
        if (ctx)
            ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
    }
    
    ||||→
    
    static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
                     enum event_type_t event_type,
                     struct task_struct *task)
    {
        struct perf_event_context *ctx = &cpuctx->ctx;
    
        ctx_sched_in(ctx, cpuctx, event_type, task);
    }
    
    |||||→
    
    static void
    ctx_sched_in(struct perf_event_context *ctx,
             struct perf_cpu_context *cpuctx,
             enum event_type_t event_type,
             struct task_struct *task)
    {
        u64 now;
        int is_active = ctx->is_active;
    
        ctx->is_active |= event_type;
        if (likely(!ctx->nr_events))
            return;
    
        now = perf_clock();
        ctx->timestamp = now;
        perf_cgroup_set_timestamp(task, ctx);
        /*
         * First go through the list and put on any pinned groups
         * in order to give them the best chance of going on.
         */
        if (!(is_active & EVENT_PINNED) && (event_type & EVENT_PINNED))
            ctx_pinned_sched_in(ctx, cpuctx);
    
        /* Then walk through the lower prio flexible groups */
        if (!(is_active & EVENT_FLEXIBLE) && (event_type & EVENT_FLEXIBLE))
            ctx_flexible_sched_in(ctx, cpuctx);
    }
    
    ||||||→
    
    static void
    ctx_pinned_sched_in(struct perf_event_context *ctx,
                struct perf_cpu_context *cpuctx)
    {
        struct perf_event *event;
    
        list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
            if (event->state <= PERF_EVENT_STATE_OFF)
                continue;
            if (!event_filter_match(event))
                continue;
    
            /* may need to reset tstamp_enabled */
            if (is_cgroup_event(event))
                perf_cgroup_mark_enabled(event, ctx);
    
            if (group_can_go_on(event, cpuctx, 1))
                group_sched_in(event, cpuctx, ctx);
    
            /*
             * If this pinned group hasn't been scheduled,
             * put it in error state.
             */
            if (event->state == PERF_EVENT_STATE_INACTIVE) {
                update_group_times(event);
                event->state = PERF_EVENT_STATE_ERROR;
            }
        }
    }
    
    |||||||→
    
    static int
    group_sched_in(struct perf_event *group_event,
               struct perf_cpu_context *cpuctx,
               struct perf_event_context *ctx)
    {
        struct perf_event *event, *partial_group = NULL;
        struct pmu *pmu = ctx->pmu;
        u64 now = ctx->time;
        bool simulate = false;
    
        if (group_event->state == PERF_EVENT_STATE_OFF)
            return 0;
    
        pmu->start_txn(pmu, PERF_PMU_TXN_ADD);
    
        if (event_sched_in(group_event, cpuctx, ctx)) {
            pmu->cancel_txn(pmu);
            perf_mux_hrtimer_restart(cpuctx);
            return -EAGAIN;
        }
    
        /*
         * Schedule in siblings as one group (if any):
         */
        list_for_each_entry(event, &group_event->sibling_list, group_entry) {
            if (event_sched_in(event, cpuctx, ctx)) {
                partial_group = event;
                goto group_error;
            }
        }
    
        if (!pmu->commit_txn(pmu))
            return 0;
    
    group_error:
        /*
         * Groups can be scheduled in as one unit only, so undo any
         * partial group before returning:
         * The events up to the failed event are scheduled out normally,
         * tstamp_stopped will be updated.
         *
         * The failed events and the remaining siblings need to have
         * their timings updated as if they had gone thru event_sched_in()
         * and event_sched_out(). This is required to get consistent timings
         * across the group. This also takes care of the case where the group
         * could never be scheduled by ensuring tstamp_stopped is set to mark
         * the time the event was actually stopped, such that time delta
         * calculation in update_event_times() is correct.
         */
        list_for_each_entry(event, &group_event->sibling_list, group_entry) {
            if (event == partial_group)
                simulate = true;
    
            if (simulate) {
                event->tstamp_running += now - event->tstamp_stopped;
                event->tstamp_stopped = now;
            } else {
                event_sched_out(event, cpuctx, ctx);
            }
        }
        event_sched_out(group_event, cpuctx, ctx);
    
        pmu->cancel_txn(pmu);
    
        perf_mux_hrtimer_restart(cpuctx);
    
        return -EAGAIN;
    }
    
    ||||||||→
    
    static int
    event_sched_in(struct perf_event *event,
             struct perf_cpu_context *cpuctx,
             struct perf_event_context *ctx)
    {
        u64 tstamp = perf_event_time(event);
        int ret = 0;
    
        lockdep_assert_held(&ctx->lock);
    
        if (event->state <= PERF_EVENT_STATE_OFF)
            return 0;
    
        WRITE_ONCE(event->oncpu, smp_processor_id());
        /*
         * Order event::oncpu write to happen before the ACTIVE state
         * is visible.
         */
        /* 设置event为开工状态:PERF_EVENT_STATE_ACTIVE */
        smp_wmb();
        WRITE_ONCE(event->state, PERF_EVENT_STATE_ACTIVE);
    
        /*
         * Unthrottle events, since we scheduled we might have missed several
         * ticks already, also for a heavily scheduling task there is little
         * guarantee it'll get a tick in a timely manner.
         */
        if (unlikely(event->hw.interrupts == MAX_INTERRUPTS)) {
            perf_log_throttle(event, 1);
            event->hw.interrupts = 0;
        }
    
        /*
         * The new state must be visible before we turn it on in the hardware:
         */
        smp_wmb();
    
        perf_pmu_disable(event->pmu);
    
        perf_set_shadow_time(event, ctx, tstamp);
    
        perf_log_itrace_start(event);
    
        /* 调用event的开工函数:pmu->add() */
        if (event->pmu->add(event, PERF_EF_START)) {
            event->state = PERF_EVENT_STATE_INACTIVE;
            event->oncpu = -1;
            ret = -EAGAIN;
            goto out;
        }
    
        event->tstamp_running += tstamp - event->tstamp_stopped;
    
        if (!is_software_event(event))
            cpuctx->active_oncpu++;
        if (!ctx->nr_active++)
            perf_event_ctx_activate(ctx);
        if (event->attr.freq && event->attr.sample_freq)
            ctx->nr_freq++;
    
        if (event->attr.exclusive)
            cpuctx->exclusive = 1;
    
        if (is_orphaned_child(event))
            schedule_orphans_remove(ctx);
    
    out:
        perf_pmu_enable(event->pmu);
    
        return ret;
    }           

    如果是pid>0,cpu!=-1的event,在sched_in的时候会调用event_filter_match()判断当前cpu和event绑定的cpu(event->cpu)是否一致,只有符合条件event才能被使能:

    ctx_pinned_sched_in()/ctx_flexible_sched_in() -> event_filter_match():

    static inline int
    event_filter_match(struct perf_event *event)
    {
        /* 如果没有指定cpu:event->cpu == -1 
            或者event绑定cpu和当前cpu一致:event->cpu == smp_processor_id()
            cgroup的filter条件:perf_cgroup_match(event)
            pmu的filter条件:pmu_filter_match(event)
            上述条件符合的情况下,才能event才能被使能
         */
        return (event->cpu == -1 || event->cpu == smp_processor_id())
            && perf_cgroup_match(event) && pmu_filter_match(event);
    }           
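    In userspace terms this is the "pid >= 0 && cpu >= 0" combination: the event hangs off the target task's context, but event_filter_match() only lets it count while that task happens to run on the chosen CPU. A hedged sketch (my own illustration; target_pid and the CPU number are placeholders):

    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* Count cycles of one task, but only while it runs on one particular CPU. */
    static int open_task_on_cpu_counter(pid_t target_pid, int cpu)
    {
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.type   = PERF_TYPE_HARDWARE;
        attr.size   = sizeof(attr);
        attr.config = PERF_COUNT_HW_CPU_CYCLES;

        /* pid >= 0, cpu >= 0: installed on the task context, filtered by event->cpu */
        return syscall(__NR_perf_event_open, &attr, target_pid, cpu, -1, 0);
    }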

    2.3、cgroup

    暂不分析

    3、perf_ioctl

    通过perf_event_open()系统调用获得perf_event对应的fd后,可以通过操作fd的ioctl命令来配置perf_event。
    static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
    {
        struct perf_event *event = file->private_data;
        struct perf_event_context *ctx;
        long ret;
    
        ctx = perf_event_ctx_lock(event);
        ret = _perf_ioctl(event, cmd, arg);
        perf_event_ctx_unlock(event, ctx);
    
        return ret;
    }
    
    ↓
    
    static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned long arg)
    {
        void (*func)(struct perf_event *);
        u32 flags = arg;
    
        switch (cmd) {
        case PERF_EVENT_IOC_ENABLE:
            func = _perf_event_enable;
            break;
        case PERF_EVENT_IOC_DISABLE:
            func = _perf_event_disable;
            break;
        case PERF_EVENT_IOC_RESET:
            func = _perf_event_reset;
            break;
    
        case PERF_EVENT_IOC_REFRESH:
            return _perf_event_refresh(event, arg);
    
        case PERF_EVENT_IOC_PERIOD:
            return perf_event_period(event, (u64 __user *)arg);
    
        case PERF_EVENT_IOC_ID:
        {
            u64 id = primary_event_id(event);
    
            if (copy_to_user((void __user *)arg, &id, sizeof(id)))
                return -EFAULT;
            return 0;
        }
    
        case PERF_EVENT_IOC_SET_OUTPUT:
        {
            int ret;
            if (arg != -1) {
                struct perf_event *output_event;
                struct fd output;
                ret = perf_fget_light(arg, &output);
                if (ret)
                    return ret;
                output_event = output.file->private_data;
                ret = perf_event_set_output(event, output_event);
                fdput(output);
            } else {
                ret = perf_event_set_output(event, NULL);
            }
            return ret;
        }
    
        case PERF_EVENT_IOC_SET_FILTER:
            return perf_event_set_filter(event, (void __user *)arg);
    
        case PERF_EVENT_IOC_SET_BPF:
            return perf_event_set_bpf_prog(event, arg);
    
        case PERF_EVENT_IOC_SET_DRV_CONFIGS:
            return perf_event_drv_configs(event, (void __user *)arg);
    
        default:
            return -ENOTTY;
        }
    
        if (flags & PERF_IOC_FLAG_GROUP)
            perf_event_for_each(event, func);
        else
            perf_event_for_each_child(event, func);
    
        return 0;
    }           
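    From userspace these commands map onto plain ioctl() calls on the event fd. A short sketch (my own illustration; it assumes fd is a group-leader fd returned by perf_event_open()):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/perf_event.h>

    static void drive_event(int fd)
    {
        uint64_t id, new_period = 100000;

        ioctl(fd, PERF_EVENT_IOC_ID, &id);                     /* copy_to_user of primary_event_id() */
        ioctl(fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);  /* reset the leader and all its siblings */
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);                   /* enable this event (and its inherited children) */
        /* ... measured region ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(fd, PERF_EVENT_IOC_PERIOD, &new_period);         /* perf_event_period(): update the sample period */
    }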

    3.1、_perf_event_enable

    简单分析一下enable命令。该命令的主要作用就是把处于PERF_EVENT_STATE_OFF状态的event设置成PERF_EVENT_STATE_INACTIVE,以便该event能参与到perf sched当中去。
    static void _perf_event_enable(struct perf_event *event)
    {
        struct perf_event_context *ctx = event->ctx;
        struct task_struct *task = ctx->task;
    
        if (!task) {
            /*
             * Enable the event on the cpu that it's on
             */
            cpu_function_call(event->cpu, __perf_event_enable, event);
            return;
        }
    
        raw_spin_lock_irq(&ctx->lock);
        if (event->state >= PERF_EVENT_STATE_INACTIVE)
            goto out;
    
        /*
         * If the event is in error state, clear that first.
         * That way, if we see the event in error state below, we
         * know that it has gone back into error state, as distinct
         * from the task having been scheduled away before the
         * cross-call arrived.
         */
        if (event->state == PERF_EVENT_STATE_ERROR)
            event->state = PERF_EVENT_STATE_OFF;
    
    retry:
        if (!ctx->is_active) {
            __perf_event_mark_enabled(event);
            goto out;
        }
    
        raw_spin_unlock_irq(&ctx->lock);
    
        if (!task_function_call(task, __perf_event_enable, event))
            return;
    
        raw_spin_lock_irq(&ctx->lock);
    
        /*
         * If the context is active and the event is still off,
         * we need to retry the cross-call.
         */
        if (ctx->is_active && event->state == PERF_EVENT_STATE_OFF) {
            /*
             * task could have been flipped by a concurrent
             * perf_event_context_sched_out()
             */
            task = ctx->task;
            goto retry;
        }
    
    out:
        raw_spin_unlock_irq(&ctx->lock);
    }
    
    ↓
    
    static void __perf_event_mark_enabled(struct perf_event *event)
    {
        struct perf_event *sub;
        u64 tstamp = perf_event_time(event);
    
        event->state = PERF_EVENT_STATE_INACTIVE;
        event->tstamp_enabled = tstamp - event->total_time_enabled;
        list_for_each_entry(sub, &event->sibling_list, group_entry) {
            if (sub->state >= PERF_EVENT_STATE_INACTIVE)
                sub->tstamp_enabled = tstamp - sub->total_time_enabled;
        }
    }           
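    Putting 3.1 together from the user side: create the event disabled (so it sits in PERF_EVENT_STATE_OFF), then flip it to PERF_EVENT_STATE_INACTIVE with the ENABLE ioctl right around the region of interest. A minimal sketch (my own illustration; the busy loop stands in for real work):

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    int main(void)
    {
        struct perf_event_attr attr;
        uint64_t count;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.type     = PERF_TYPE_HARDWARE;
        attr.size     = sizeof(attr);
        attr.config   = PERF_COUNT_HW_INSTRUCTIONS;
        attr.disabled = 1;                     /* event starts in PERF_EVENT_STATE_OFF */

        fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0)
            return 1;

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);   /* _perf_event_enable(): OFF -> INACTIVE, joins perf sched */
        for (volatile int i = 0; i < 1000000; i++)   /* stand-in for the measured region */
            ;
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        read(fd, &count, sizeof(count));
        close(fd);
        return 0;
    }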
