天天看點

揭秘 cache 通路延遲背後的計算機原理

CPU 的 cache 往往是分多級的金字塔模型,L1 最靠近 CPU,通路延遲最小,但 cache 的容量也最小。本文介紹如何測試多級 cache 的訪存延遲,以及背後蘊含的計算機原理。

揭秘 cache 通路延遲背後的計算機原理
圖源:https://cs.brown.edu/courses/csci1310/2020/assign/labs/lab4.html

Cache Latency

Wikichip[1] 提供了不同 CPU 型号的 cache 延遲,機關一般為 cycle,通過簡單的運算,轉換為 ns。以 skylake 為例,CPU 各級 cache 延遲的基準值為:

揭秘 cache 通路延遲背後的計算機原理

CPU Frequency:2654MHz (0.3768 nanosec/clock)

設計實驗

1. naive thinking

揭秘 cache 通路延遲背後的計算機原理

申請一個 buffer,buffer size 為 cache 對應的大小,第一次周遊進行預熱,将資料全部加載到 cache 中。第二次周遊統計耗時,計算每次 read 的延遲平均值。

代碼實作 mem-lat.c 如下:

#include <sys/types.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <unistd.h>

#define ONE p = (char **)*p;
#define FIVE    ONE ONE ONE ONE ONE
#define TEN FIVE FIVE
#define FIFTY   TEN TEN TEN TEN TEN
#define HUNDRED FIFTY FIFTY

static void usage()
{
    printf("Usage: ./mem-lat -b xxx -n xxx -s xxx\n");
    printf("   -b buffer size in KB\n");
    printf("   -n number of read\n\n");
    printf("   -s stride skipped before the next access\n\n");
    printf("Please don't use non-decimal based number\n");
}


int main(int argc, char* argv[])
{
  unsigned long i, j, size, tmp;
    unsigned long memsize = 0x800000; /* 1/4 LLC size of skylake, 1/5 of broadwell */
    unsigned long count = 1048576; /* memsize / 64 * 8 */
    unsigned int stride = 64; /* skipped amount of memory before the next access */
    unsigned long sec, usec;
    struct timeval tv1, tv2;
    struct timezone tz;
    unsigned int *indices;

    while (argc-- > 0) {
        if ((*argv)[0] == '-') {  /* look at first char of next */
            switch ((*argv)[1]) {   /* look at second */
                case 'b':
                    argv++;
                    argc--;
                    memsize = atoi(*argv) * 1024;
                    break;

                case 'n':
                    argv++;
                    argc--;
                    count = atoi(*argv);
                    break;

                case 's':
                    argv++;
                    argc--;
                    stride = atoi(*argv);
                    break;

                default:
                    usage();
                    exit(1);
                    break;
            }
        }
        argv++;
    }


  char* mem = mmap(NULL, memsize, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
    // trick3: init pointer chasing, per stride=8 byte
    size = memsize / stride;
    indices = malloc(size * sizeof(int));

    for (i = 0; i < size; i++)
        indices[i] = i;
    
    // trick 2: fill mem with pointer references
    for (i = 0; i < size - 1; i++)
        *(char **)&mem[indices[i]*stride]= (char*)&mem[indices[i+1]*stride];
    *(char **)&mem[indices[size-1]*stride]= (char*)&mem[indices[0]*stride];

    char **p = (char **) mem;
    tmp = count / 100;

    gettimeofday (&tv1, &tz);
    for (i = 0; i < tmp; ++i) {
        HUNDRED;  //trick 1
    }
    gettimeofday (&tv2, &tz);
    if (tv2.tv_usec < tv1.tv_usec) {
        usec = 1000000 + tv2.tv_usec - tv1.tv_usec;
        sec = tv2.tv_sec - tv1.tv_sec - 1;
    } else {
        usec = tv2.tv_usec - tv1.tv_usec;
        sec = tv2.tv_sec - tv1.tv_sec;
    }

    printf("Buffer size: %ld KB, stride %d, time %d.%06d s, latency %.2f ns\n", 
            memsize/1024, stride, sec, usec, (sec * 1000000  + usec) * 1000.0 / (tmp *100));
    munmap(mem, memsize);
    free(indices);
}           

這裡用到了 3 個小技巧:

  • HUNDRED 宏:通過宏展開,盡可能避免其他指令對訪存的幹擾。
  • 二級指針:通過二級指針将buffer串起來,避免訪存時計算偏移。
  • char* 和 char** 為 8 位元組,是以,stride 為 8。

測試方法:

#set -x

work=./mem-lat
buffer_size=1
stride=8

for i in `seq 1 15`; do
    taskset -ac 0 $work -b $buffer_size -s $stride
    buffer_size=$(($buffer_size*2))
done           

測試結果如下:

//L1
Buffer size: 1 KB, stride 8, time 0.003921 s, latency 3.74 ns
Buffer size: 2 KB, stride 8, time 0.003928 s, latency 3.75 ns
Buffer size: 4 KB, stride 8, time 0.003935 s, latency 3.75 ns
Buffer size: 8 KB, stride 8, time 0.003926 s, latency 3.74 ns
Buffer size: 16 KB, stride 8, time 0.003942 s, latency 3.76 ns
Buffer size: 32 KB, stride 8, time 0.003963 s, latency 3.78 ns
//L2
Buffer size: 64 KB, stride 8, time 0.004043 s, latency 3.86 ns
Buffer size: 128 KB, stride 8, time 0.004054 s, latency 3.87 ns
Buffer size: 256 KB, stride 8, time 0.004051 s, latency 3.86 ns
Buffer size: 512 KB, stride 8, time 0.004049 s, latency 3.86 ns
Buffer size: 1024 KB, stride 8, time 0.004110 s, latency 3.92 ns
//L3
Buffer size: 2048 KB, stride 8, time 0.004126 s, latency 3.94 ns
Buffer size: 4096 KB, stride 8, time 0.004161 s, latency 3.97 ns
Buffer size: 8192 KB, stride 8, time 0.004313 s, latency 4.11 ns
Buffer size: 16384 KB, stride 8, time 0.004272 s, latency 4.07 ns           

相比基準值,L1 延遲偏大,L2 和 L3 延遲偏小,不符合預期。

2. thinking with hardware: cache line

現代處理器,記憶體以 cache line 為粒度,組織在 cache 中。訪存的讀寫粒度都是一個 cache line,最常見的緩存線大小是 64 位元組。

揭秘 cache 通路延遲背後的計算機原理

如果我們簡單的以 8 位元組為粒度,順序讀取 128KB 的 buffer,假設資料命中的是 L2,那麼資料就會被緩存到 L1,一個 cache line 其他的訪存操作都隻會命中 L1,進而導緻我們測量的 L2 延遲明顯偏小。

本文測試的 CPU,cacheline 大小 64 位元組,隻需将 stride 設為 64。

揭秘 cache 通路延遲背後的計算機原理
//L1
Buffer size: 1 KB, stride 64, time 0.003933 s, latency 3.75 ns
Buffer size: 2 KB, stride 64, time 0.003930 s, latency 3.75 ns
Buffer size: 4 KB, stride 64, time 0.003925 s, latency 3.74 ns
Buffer size: 8 KB, stride 64, time 0.003931 s, latency 3.75 ns
Buffer size: 16 KB, stride 64, time 0.003935 s, latency 3.75 ns
Buffer size: 32 KB, stride 64, time 0.004115 s, latency 3.92 ns
//L2
Buffer size: 64 KB, stride 64, time 0.007423 s, latency 7.08 ns
Buffer size: 128 KB, stride 64, time 0.007414 s, latency 7.07 ns
Buffer size: 256 KB, stride 64, time 0.007437 s, latency 7.09 ns
Buffer size: 512 KB, stride 64, time 0.007429 s, latency 7.09 ns
Buffer size: 1024 KB, stride 64, time 0.007650 s, latency 7.30 ns
Buffer size: 2048 KB, stride 64, time 0.007670 s, latency 7.32 ns
//L3
Buffer size: 4096 KB, stride 64, time 0.007695 s, latency 7.34 ns
Buffer size: 8192 KB, stride 64, time 0.007786 s, latency 7.43 ns
Buffer size: 16384 KB, stride 64, time 0.008172 s, latency 7.79 ns           

雖然相比方案 1,L2 和 L3 的延遲有所增大,但還是不符合預期。

3. thinking with hardware: prefetch

現代處理器,通常支援預取(prefetch)。資料預取通過将代碼中後續可能使用到的資料提前加載到 cache 中,減少 CPU 等待資料從記憶體中加載的時間,提升 cache 命中率,進而提升軟體的運作效率。

Intel 處理器支援 4 種硬體預取 [2],可以通過 MSR 控制關閉和打開:

揭秘 cache 通路延遲背後的計算機原理

這裡我們簡單的将 stride 設為 128 和 256,避免硬體預取。測試的 L3 訪存延遲明顯增大:

// stride 128
Buffer size: 1 KB, stride 256, time 0.003927 s, latency 3.75 ns
Buffer size: 2 KB, stride 256, time 0.003924 s, latency 3.74 ns
Buffer size: 4 KB, stride 256, time 0.003928 s, latency 3.75 ns
Buffer size: 8 KB, stride 256, time 0.003923 s, latency 3.74 ns
Buffer size: 16 KB, stride 256, time 0.003930 s, latency 3.75 ns
Buffer size: 32 KB, stride 256, time 0.003929 s, latency 3.75 ns
Buffer size: 64 KB, stride 256, time 0.007534 s, latency 7.19 ns
Buffer size: 128 KB, stride 256, time 0.007462 s, latency 7.12 ns
Buffer size: 256 KB, stride 256, time 0.007479 s, latency 7.13 ns
Buffer size: 512 KB, stride 256, time 0.007698 s, latency 7.34 ns
Buffer size: 512 KB, stride 128, time 0.007597 s, latency 7.25 ns
Buffer size: 1024 KB, stride 128, time 0.009169 s, latency 8.74 ns
Buffer size: 2048 KB, stride 128, time 0.010008 s, latency 9.55 ns
Buffer size: 4096 KB, stride 128, time 0.010008 s, latency 9.55 ns
Buffer size: 8192 KB, stride 128, time 0.010366 s, latency 9.89 ns
Buffer size: 16384 KB, stride 128, time 0.012031 s, latency 11.47 ns

// stride 256
Buffer size: 512 KB, stride 256, time 0.007698 s, latency 7.34 ns
Buffer size: 1024 KB, stride 256, time 0.012654 s, latency 12.07 ns
Buffer size: 2048 KB, stride 256, time 0.025210 s, latency 24.04 ns
Buffer size: 4096 KB, stride 256, time 0.025466 s, latency 24.29 ns
Buffer size: 8192 KB, stride 256, time 0.025840 s, latency 24.64 ns
Buffer size: 16384 KB, stride 256, time 0.027442 s, latency 26.17 ns           

L3 的訪存延遲基本上是符合預期的,但是 L1 和 L2 明顯偏大。

如果測試随機訪存延遲,更加通用的做法是,在将buffer指針串起來時,随機化一下。

// shuffle indices
     for (i = 0; i < size; i++) {
         j = i +  rand() % (size - i);
         if (i != j) {
             tmp = indices[i];
             indices[i] = indices[j];
             indices[j] = tmp;
         }
     }           

可以看到,測試結果與 stride 為 256 基本上是一樣的。

Buffer size: 1 KB, stride 64, time 0.003942 s, latency 3.76 ns
Buffer size: 2 KB, stride 64, time 0.003925 s, latency 3.74 ns
Buffer size: 4 KB, stride 64, time 0.003928 s, latency 3.75 ns
Buffer size: 8 KB, stride 64, time 0.003931 s, latency 3.75 ns
Buffer size: 16 KB, stride 64, time 0.003932 s, latency 3.75 ns
Buffer size: 32 KB, stride 64, time 0.004276 s, latency 4.08 ns
Buffer size: 64 KB, stride 64, time 0.007465 s, latency 7.12 ns
Buffer size: 128 KB, stride 64, time 0.007470 s, latency 7.12 ns
Buffer size: 256 KB, stride 64, time 0.007521 s, latency 7.17 ns
Buffer size: 512 KB, stride 64, time 0.009340 s, latency 8.91 ns
Buffer size: 1024 KB, stride 64, time 0.015230 s, latency 14.53 ns
Buffer size: 2048 KB, stride 64, time 0.027567 s, latency 26.29 ns
Buffer size: 4096 KB, stride 64, time 0.027853 s, latency 26.56 ns
Buffer size: 8192 KB, stride 64, time 0.029945 s, latency 28.56 ns
Buffer size: 16384 KB, stride 64, time 0.034878 s, latency 33.26 ns           

4. thinking with compiler: register keyword

解決掉 L3 偏小的問題後,我們繼續看 L1 和 L2 偏大的原因。為了找出偏大的原因,我們先反彙編可執行程式,看看執行的彙編指令是否是我們想要的:

objdump -D -S mem-lat > mem-lat.s           

為卡片添加間距

删除卡片

  • -D: Display assembler contents of all sections.
  • -S:Intermix source code with disassembly. (gcc編譯時需使用-g,生成調式資訊)

生成的彙編檔案 mem-lat.s:

char **p = (char **)mem;
  400b3a:   48 8b 45 c8             mov    -0x38(%rbp),%rax
  400b3e:   48 89 45 d0             mov    %rax,-0x30(%rbp) // push stack

//...   
    HUNDRED;
  400b85:   48 8b 45 d0             mov    -0x30(%rbp),%rax
  400b89:   48 8b 00                mov    (%rax),%rax
  400b8c:   48 89 45 d0             mov    %rax,-0x30(%rbp)
  400b90:   48 8b 45 d0             mov    -0x30(%rbp),%rax
  400b94:   48 8b 00                mov    (%rax),%rax           

首先,變量 mem 指派給變量 p,變量 p 壓入棧-0x30(%rbp)。

char **p = (char **)mem;
  400b3a:   48 8b 45 c8             mov    -0x38(%rbp),%rax
  400b3e:   48 89 45 d0             mov    %rax,-0x30(%rbp)           

訪存的邏輯:

HUNDRED;  // p = (char **)*p
  400b85:   48 8b 45 d0             mov    -0x30(%rbp),%rax
  400b89:   48 8b 00                mov    (%rax),%rax
  400b8c:   48 89 45 d0             mov    %rax,-0x30(%rbp)           
  • 先從棧中讀取指針變量 p 的值到rax寄存器(變量 p 的類型為char **,是一個二級指針,也就是說,指針 p 指向一個char *的變量,即 p 的值也是一個位址)。下圖中變量 p 的值為 0x2000。
  • 将rax寄存器指向變量的值讀入rax寄存器,對應單目運算*p。下圖中位址 0x2000的值為 0x3000,rax 更新為 0x3000。
  • 将rax寄存器指派給變量p。下圖中變量p的值更新為0x3000。
揭秘 cache 通路延遲背後的計算機原理
揭秘 cache 通路延遲背後的計算機原理

根據反彙編的結果可以看到,期望的 1 條 move 指令被編譯成了 3 條,cache 的延遲也就增加了 3 倍。

C 語言的 register 關鍵字,可以讓編譯器将變量儲存到寄存器中,進而避免每次從棧中讀取的開銷。

It's a hint to the compiler that the variable will be heavily used and that you recommend it be kept in a processor register if possible.

我們在聲明 p 時,加上 register 關鍵字。

register char **p = (char **)mem;           
// L1
Buffer size: 1 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 2 KB, stride 64, time 0.000029 s, latency 0.03 ns
Buffer size: 4 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 8 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 16 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 32 KB, stride 64, time 0.000030 s, latency 0.03 ns
// L2
Buffer size: 64 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 128 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 256 KB, stride 64, time 0.000029 s, latency 0.03 ns
Buffer size: 512 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 1024 KB, stride 64, time 0.000030 s, latency 0.03 ns
// L3
Buffer size: 2048 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 4096 KB, stride 64, time 0.000029 s, latency 0.03 ns
Buffer size: 8192 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 16384 KB, stride 64, time 0.000030 s, latency 0.03 ns           

訪存延遲全部變為不足 1 ns,明顯不符合預期。

5. thinking with compiler: Touch it!

重新反彙編,看看哪裡出了問題,編譯代碼如下:

for (i = 0; i < tmp; ++i) {
   40155e:   48 c7 45 f8 00 00 00    movq   $0x0,-0x8(%rbp)
   401565:   00
   401566:   eb 05                   jmp    40156d <main+0x37e>
   401568:   48 83 45 f8 01          addq   $0x1,-0x8(%rbp)
   40156d:   48 8b 45 f8             mov    -0x8(%rbp),%rax
   401571:   48 3b 45 b0             cmp    -0x50(%rbp),%rax
   401575:   72 f1                   jb     401568 <main+0x379>
         HUNDRED;
     }
     gettimeofday (&tv2, &tz);
   401577:   48 8d 95 78 ff ff ff    lea    -0x88(%rbp),%rdx
   40157e:   48 8d 45 80             lea    -0x80(%rbp),%rax
   401582:   48 89 d6                mov    %rdx,%rsi
   401585:   48 89 c7                mov    %rax,%rdi
   401588:   e8 e3 fa ff ff          callq  401070 <gettimeofday@plt>           

HUNDRED 宏沒有産生任何彙編代碼。涉及到變量 p 的語句,并沒有實際作用,隻是資料讀取,大機率被編譯器優化掉了。

register char **p = (char **) mem;
     tmp = count / 100;

     gettimeofday (&tv1, &tz);
     for (i = 0; i < tmp; ++i) {
         HUNDRED;
     }
     gettimeofday (&tv2, &tz);

     /* touch pointer p to prevent compiler optimization */
     char **touch = p;           

反彙編驗證一下:

HUNDRED;
   401570:   48 8b 1b                mov    (%rbx),%rbx
   401573:   48 8b 1b                mov    (%rbx),%rbx
   401576:   48 8b 1b                mov    (%rbx),%rbx
   401579:   48 8b 1b                mov    (%rbx),%rbx
   40157c:   48 8b 1b                mov    (%rbx),%rbx           

HUNDRED 宏産生的彙編代碼隻有操作寄存器 rbx 的 mov 指令,進階。

延遲的測試結果如下:

// L1
Buffer size: 1 KB, stride 64, time 0.001687 s, latency 1.61 ns
Buffer size: 2 KB, stride 64, time 0.001684 s, latency 1.61 ns
Buffer size: 4 KB, stride 64, time 0.001682 s, latency 1.60 ns
Buffer size: 8 KB, stride 64, time 0.001693 s, latency 1.61 ns
Buffer size: 16 KB, stride 64, time 0.001683 s, latency 1.61 ns
Buffer size: 32 KB, stride 64, time 0.001783 s, latency 1.70 ns
// L2
Buffer size: 64 KB, stride 64, time 0.005896 s, latency 5.62 ns
Buffer size: 128 KB, stride 64, time 0.005915 s, latency 5.64 ns
Buffer size: 256 KB, stride 64, time 0.005955 s, latency 5.68 ns
Buffer size: 512 KB, stride 64, time 0.007856 s, latency 7.49 ns
Buffer size: 1024 KB, stride 64, time 0.014929 s, latency 14.24 ns
// L3
Buffer size: 2048 KB, stride 64, time 0.026970 s, latency 25.72 ns
Buffer size: 4096 KB, stride 64, time 0.026968 s, latency 25.72 ns
Buffer size: 8192 KB, stride 64, time 0.028823 s, latency 27.49 ns
Buffer size: 16384 KB, stride 64, time 0.033325 s, latency 31.78 ns           

L1 延遲 1.61 ns,L2 延遲 5.62 ns,終于,符合預期!

寫在最後

本文的思路和代碼參考自 lmbench[3],和團隊内其他同學的工具 mem-lat。最後給自己挖個坑,在随機化 buffer 指針時,沒有考慮硬體 TLB miss 的影響,如果有讀者有興趣,待日後有空補充。

參考文獻:

[1] https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)

[2]https://software.intel.com/content/www/us/en/develop/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors.html

[3]McVoy L W, Staelin C. lmbench: Portable Tools for Performance Analysis[C]//USENIX annual technical conference. 1996: 279-294.

原文連結:https://developer.aliyun.com/article/859576?utm_content=g_1000322648

本文為阿裡雲原創内容,未經允許不得轉載。

繼續閱讀