CPU的cache往往是分多級的金字塔模型,L1最靠近CPU,通路延遲最小,但cache的容量也最小。首先,根據wikichip[1],獲得CPU 各級cache延遲的基準值(以skylake為例):
Cache Latency
CPU Frequency:2654MHz (0.3768 nanosec/clock)
Cache/Latency | Size | Cycle | Nanosecond |
---|---|---|---|
L1 | 32 KB/core | 4 | 1.5072 |
L2 | 1024 KB/core | 14 | 5.2752 |
L3 | 1.375 MB/core(33 MB/socket) | 50-70 | 18.84-26.37 |
Wikichip提供了不同CPU型号的cache延遲,機關一般為cycle,通過簡單的運算,轉換為ns。
設計實驗
1. Naive thinking
![](https://img.laitimes.com/img/9ZDMuAjOiMmIsIjOiQnIsIyZuBnL0gDOidTY1MDMzIWNtYWM2gTLlFWY00SMkNTZtYTMyE2M2kTOvwlcvRXakV2dl52Lc12bj5ycj5Wd5lGbh5SdvtWYpp2ZuFGa61ibj1ycz9mLn1WatITY0F2Lc9CX6MHc0RHaiojIsJye.png)
申請一個buffer,buffer size為cache對應的大小,第一次周遊進行預熱,将資料全部加載到cache中。第二次周遊統計耗時,計算每次read的延遲平均值。
代碼實作mem-lat.c如下:
#define ONE p = (char **)*p;
#define FIVE ONE ONE ONE ONE ONE
#define TEN FIVE FIVE
#define FIFTY TEN TEN TEN TEN TEN
#define HUNDRED FIFTY FIFTY
int main()
{
//...
char* mem = mmap(NULL, memsize, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
// trick3: init pointer chasing, per stride=8 byte
size = memsize / stride;
indices = malloc(size * sizeof(int));
for (i = 0; i < size; i++)
indices[i] = i;
// trick 2: fill mem with pointer references
for (i = 0; i < size - 1; i++)
*(char **)&mem[indices[i]*stride]= (char*)&mem[indices[i+1]*stride];
*(char **)&mem[indices[size-1]*stride]= (char*)&mem[indices[0]*stride];
char **p = (char **) mem;
tmp = count / 100;
gettimeofday (&tv1, &tz);
for (i = 0; i < tmp; ++i) {
HUNDRED; //trick 1
}
gettimeofday (&tv2, &tz);
//...
}
這裡用到了3個小技巧:
- HUNDRED宏:通過宏展開,盡可能避免其他指令對訪存的幹擾。
- 二級指針:通過二級指針将buffer串起來,避免訪存時計算偏移。
- char和char*為8位元組,是以,stride為8。
測試結果如下:
//L1
Buffer size: 1 KB, stride 8, time 0.003921 s, latency 3.74 ns
Buffer size: 2 KB, stride 8, time 0.003928 s, latency 3.75 ns
Buffer size: 4 KB, stride 8, time 0.003935 s, latency 3.75 ns
Buffer size: 8 KB, stride 8, time 0.003926 s, latency 3.74 ns
Buffer size: 16 KB, stride 8, time 0.003942 s, latency 3.76 ns
Buffer size: 32 KB, stride 8, time 0.003963 s, latency 3.78 ns
//L2
Buffer size: 64 KB, stride 8, time 0.004043 s, latency 3.86 ns
Buffer size: 128 KB, stride 8, time 0.004054 s, latency 3.87 ns
Buffer size: 256 KB, stride 8, time 0.004051 s, latency 3.86 ns
Buffer size: 512 KB, stride 8, time 0.004049 s, latency 3.86 ns
Buffer size: 1024 KB, stride 8, time 0.004110 s, latency 3.92 ns
//L3
Buffer size: 2048 KB, stride 8, time 0.004126 s, latency 3.94 ns
Buffer size: 4096 KB, stride 8, time 0.004161 s, latency 3.97 ns
Buffer size: 8192 KB, stride 8, time 0.004313 s, latency 4.11 ns
Buffer size: 16384 KB, stride 8, time 0.004272 s, latency 4.07 ns
相比基準值,L1延遲偏大,L2和L3延遲偏小,不符合預期。
2. Thinking with hardware: cache line
現代處理器,記憶體以cache line為粒度,組織在cache中。訪存的讀寫粒度都是一個cache line,最常見的緩存線大小是64位元組。
如果我們簡單的以8位元組為粒度,順序讀取128KB的buffer,假設資料命中的是L2,那麼資料就會被緩存到L1,一個cache line其他的訪存操作都隻會命中L1,進而導緻我們測量的L2延遲明顯偏小。
本文測試的CPU,cacheline大小64位元組,隻需将stride設為64。
//L1
Buffer size: 1 KB, stride 64, time 0.003933 s, latency 3.75 ns
Buffer size: 2 KB, stride 64, time 0.003930 s, latency 3.75 ns
Buffer size: 4 KB, stride 64, time 0.003925 s, latency 3.74 ns
Buffer size: 8 KB, stride 64, time 0.003931 s, latency 3.75 ns
Buffer size: 16 KB, stride 64, time 0.003935 s, latency 3.75 ns
Buffer size: 32 KB, stride 64, time 0.004115 s, latency 3.92 ns
//L2
Buffer size: 64 KB, stride 64, time 0.007423 s, latency 7.08 ns
Buffer size: 128 KB, stride 64, time 0.007414 s, latency 7.07 ns
Buffer size: 256 KB, stride 64, time 0.007437 s, latency 7.09 ns
Buffer size: 512 KB, stride 64, time 0.007429 s, latency 7.09 ns
Buffer size: 1024 KB, stride 64, time 0.007650 s, latency 7.30 ns
Buffer size: 2048 KB, stride 64, time 0.007670 s, latency 7.32 ns
//L3
Buffer size: 4096 KB, stride 64, time 0.007695 s, latency 7.34 ns
Buffer size: 8192 KB, stride 64, time 0.007786 s, latency 7.43 ns
Buffer size: 16384 KB, stride 64, time 0.008172 s, latency 7.79 ns
雖然相比方案1,L2和L3的延遲有所增大,但還是不符合預期。
3. Thinking with hardware: prefetch
現代處理器,通常支援預取(prefetch)。資料預取通過将代碼中後續可能使用到的資料提前加載到cache中,減少CPU等待資料從記憶體中加載的時間,提升cache命中率,進而提升軟體的運作效率。
Intel處理器支援4種硬體預取[2],可以通過MSR控制關閉和打開:
Prefetcher | Bit# in MSR 0x1A4 | Description |
---|---|---|
L2 hardware prefetcher | Fetches additional lines of code or data into the L2 cache | |
L2 adjacent cache line prefetcher | 1 | Fetches the cache line that comprises a cache line pair (128 bytes) |
DCU prefetcher | 2 | Fetches the next cache line into L1-D cache |
DCU IP prefetcher | 3 | Uses sequential load history (based on Instruction Pointer of previous loads) to determine whether to prefetch additional lines |
這裡我們簡單的将stride設為128和256,避免硬體預取。測試的L3訪存延遲明顯增大:
// stride 128
Buffer size: 1 KB, stride 256, time 0.003927 s, latency 3.75 ns
Buffer size: 2 KB, stride 256, time 0.003924 s, latency 3.74 ns
Buffer size: 4 KB, stride 256, time 0.003928 s, latency 3.75 ns
Buffer size: 8 KB, stride 256, time 0.003923 s, latency 3.74 ns
Buffer size: 16 KB, stride 256, time 0.003930 s, latency 3.75 ns
Buffer size: 32 KB, stride 256, time 0.003929 s, latency 3.75 ns
Buffer size: 64 KB, stride 256, time 0.007534 s, latency 7.19 ns
Buffer size: 128 KB, stride 256, time 0.007462 s, latency 7.12 ns
Buffer size: 256 KB, stride 256, time 0.007479 s, latency 7.13 ns
Buffer size: 512 KB, stride 256, time 0.007698 s, latency 7.34 ns
Buffer size: 512 KB, stride 128, time 0.007597 s, latency 7.25 ns
Buffer size: 1024 KB, stride 128, time 0.009169 s, latency 8.74 ns
Buffer size: 2048 KB, stride 128, time 0.010008 s, latency 9.55 ns
Buffer size: 4096 KB, stride 128, time 0.010008 s, latency 9.55 ns
Buffer size: 8192 KB, stride 128, time 0.010366 s, latency 9.89 ns
Buffer size: 16384 KB, stride 128, time 0.012031 s, latency 11.47 ns
// stride 256
Buffer size: 512 KB, stride 256, time 0.007698 s, latency 7.34 ns
Buffer size: 1024 KB, stride 256, time 0.012654 s, latency 12.07 ns
Buffer size: 2048 KB, stride 256, time 0.025210 s, latency 24.04 ns
Buffer size: 4096 KB, stride 256, time 0.025466 s, latency 24.29 ns
Buffer size: 8192 KB, stride 256, time 0.025840 s, latency 24.64 ns
Buffer size: 16384 KB, stride 256, time 0.027442 s, latency 26.17 ns
L3的訪存延遲基本上是符合預期的,但是L1和L2明顯偏大。
如果測試随機訪存延遲,更加通用的做法是,在将buffer指針串起來時,随機化一下。
// shuffle indices
for (i = 0; i < size; i++) {
j = i + rand() % (size - i);
if (i != j) {
tmp = indices[i];
indices[i] = indices[j];
indices[j] = tmp;
}
}
可以看到,測試結果與stride為256基本上是一樣的。
Buffer size: 1 KB, stride 64, time 0.003942 s, latency 3.76 ns
Buffer size: 2 KB, stride 64, time 0.003925 s, latency 3.74 ns
Buffer size: 4 KB, stride 64, time 0.003928 s, latency 3.75 ns
Buffer size: 8 KB, stride 64, time 0.003931 s, latency 3.75 ns
Buffer size: 16 KB, stride 64, time 0.003932 s, latency 3.75 ns
Buffer size: 32 KB, stride 64, time 0.004276 s, latency 4.08 ns
Buffer size: 64 KB, stride 64, time 0.007465 s, latency 7.12 ns
Buffer size: 128 KB, stride 64, time 0.007470 s, latency 7.12 ns
Buffer size: 256 KB, stride 64, time 0.007521 s, latency 7.17 ns
Buffer size: 512 KB, stride 64, time 0.009340 s, latency 8.91 ns
Buffer size: 1024 KB, stride 64, time 0.015230 s, latency 14.53 ns
Buffer size: 2048 KB, stride 64, time 0.027567 s, latency 26.29 ns
Buffer size: 4096 KB, stride 64, time 0.027853 s, latency 26.56 ns
Buffer size: 8192 KB, stride 64, time 0.029945 s, latency 28.56 ns
Buffer size: 16384 KB, stride 64, time 0.034878 s, latency 33.26 ns
4. Thinking with compiler: register keyword
解決掉L3偏小的問題後,我們繼續看L1和L2偏大的原因。為了找出偏大的原因,我們先反彙編可執行程式,看看執行的彙編指令是否是我們想要的:
objdump -D -S mem-lat > mem-lat.s
-
: Display assembler contents of all sections.-D
-
:Intermix source code with disassembly. (gcc編譯時需使用-S
,生成調式資訊)-g
生成的彙編檔案mem-lat.s:
char **p = (char **)mem;
400b3a: 48 8b 45 c8 mov -0x38(%rbp),%rax
400b3e: 48 89 45 d0 mov %rax,-0x30(%rbp) // push stack
//...
HUNDRED;
400b85: 48 8b 45 d0 mov -0x30(%rbp),%rax
400b89: 48 8b 00 mov (%rax),%rax
400b8c: 48 89 45 d0 mov %rax,-0x30(%rbp)
400b90: 48 8b 45 d0 mov -0x30(%rbp),%rax
400b94: 48 8b 00 mov (%rax),%rax
首先,變量mem指派給變量p,變量p壓入棧
-0x30(%rbp)
。
char **p = (char **)mem;
400b3a: 48 8b 45 c8 mov -0x38(%rbp),%rax
400b3e: 48 89 45 d0 mov %rax,-0x30(%rbp)
訪存的邏輯:
HUNDRED; // p = (char **)*p
400b85: 48 8b 45 d0 mov -0x30(%rbp),%rax
400b89: 48 8b 00 mov (%rax),%rax
400b8c: 48 89 45 d0 mov %rax,-0x30(%rbp)
- 先從棧中讀取指針變量p的值到
寄存器(變量p的類型為rax
,是一個二級指針,也就是說,指針p指向一個char **
的變量,即p的值也是一個位址)。下圖中變量p的值為0x2000。char *
- 将
寄存器指向變量的值讀入rax
寄存器,對應單目運算rax
。下圖中位址0x2000的值為0x3000,rax更新為0x3000。*p
-
寄存器指派給變量p。下圖中變量p的值更新為0x3000。rax
根據反彙編的結果可以看到,期望的1條move指令被編譯成了3條,cache的延遲也就增加了3倍。
C語言的register關鍵字,可以讓編譯器将變量儲存到寄存器中,進而避免每次從棧中讀取的開銷。
It's a hint to the compiler that the variable will be heavily used and that you recommend it be kept in a processor register if possible.
我們在聲明p時,加上register關鍵字。
register char **p = (char **)mem;
// L1
Buffer size: 1 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 2 KB, stride 64, time 0.000029 s, latency 0.03 ns
Buffer size: 4 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 8 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 16 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 32 KB, stride 64, time 0.000030 s, latency 0.03 ns
// L2
Buffer size: 64 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 128 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 256 KB, stride 64, time 0.000029 s, latency 0.03 ns
Buffer size: 512 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 1024 KB, stride 64, time 0.000030 s, latency 0.03 ns
// L3
Buffer size: 2048 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 4096 KB, stride 64, time 0.000029 s, latency 0.03 ns
Buffer size: 8192 KB, stride 64, time 0.000030 s, latency 0.03 ns
Buffer size: 16384 KB, stride 64, time 0.000030 s, latency 0.03 ns
訪存延遲全部變為不足1 ns,明顯不符合預期。
5. thinking with compiler: Touch it!
重新反彙編,看看哪裡出了問題,編譯代碼如下:
for (i = 0; i < tmp; ++i) {
40155e: 48 c7 45 f8 00 00 00 movq $0x0,-0x8(%rbp)
401565: 00
401566: eb 05 jmp 40156d <main+0x37e>
401568: 48 83 45 f8 01 addq $0x1,-0x8(%rbp)
40156d: 48 8b 45 f8 mov -0x8(%rbp),%rax
401571: 48 3b 45 b0 cmp -0x50(%rbp),%rax
401575: 72 f1 jb 401568 <main+0x379>
HUNDRED;
}
gettimeofday (&tv2, &tz);
401577: 48 8d 95 78 ff ff ff lea -0x88(%rbp),%rdx
40157e: 48 8d 45 80 lea -0x80(%rbp),%rax
401582: 48 89 d6 mov %rdx,%rsi
401585: 48 89 c7 mov %rax,%rdi
401588: e8 e3 fa ff ff callq 401070 <gettimeofday@plt>
HUNDRED宏沒有産生任何彙編代碼。涉及到變量p的語句,并沒有實際作用,隻是資料讀取,大機率被編譯器優化掉了。
register char **p = (char **) mem;
tmp = count / 100;
gettimeofday (&tv1, &tz);
for (i = 0; i < tmp; ++i) {
HUNDRED;
}
gettimeofday (&tv2, &tz);
/* touch pointer p to prevent compiler optimization */
char **touch = p;
反彙編驗證一下:
HUNDRED;
401570: 48 8b 1b mov (%rbx),%rbx
401573: 48 8b 1b mov (%rbx),%rbx
401576: 48 8b 1b mov (%rbx),%rbx
401579: 48 8b 1b mov (%rbx),%rbx
40157c: 48 8b 1b mov (%rbx),%rbx
HUNDRED宏産生的彙編代碼隻有操作寄存器rbx的mov指令,進階。
延遲的測試結果如下:
// L1
Buffer size: 1 KB, stride 64, time 0.001687 s, latency 1.61 ns
Buffer size: 2 KB, stride 64, time 0.001684 s, latency 1.61 ns
Buffer size: 4 KB, stride 64, time 0.001682 s, latency 1.60 ns
Buffer size: 8 KB, stride 64, time 0.001693 s, latency 1.61 ns
Buffer size: 16 KB, stride 64, time 0.001683 s, latency 1.61 ns
Buffer size: 32 KB, stride 64, time 0.001783 s, latency 1.70 ns
// L2
Buffer size: 64 KB, stride 64, time 0.005896 s, latency 5.62 ns
Buffer size: 128 KB, stride 64, time 0.005915 s, latency 5.64 ns
Buffer size: 256 KB, stride 64, time 0.005955 s, latency 5.68 ns
Buffer size: 512 KB, stride 64, time 0.007856 s, latency 7.49 ns
Buffer size: 1024 KB, stride 64, time 0.014929 s, latency 14.24 ns
// L3
Buffer size: 2048 KB, stride 64, time 0.026970 s, latency 25.72 ns
Buffer size: 4096 KB, stride 64, time 0.026968 s, latency 25.72 ns
Buffer size: 8192 KB, stride 64, time 0.028823 s, latency 27.49 ns
Buffer size: 16384 KB, stride 64, time 0.033325 s, latency 31.78 ns
L1延遲1.61 ns,L2延遲5.62 ns,終于,符合預期!
寫在最後
本文的思路和代碼參考自lmbench[3],和團隊内大佬雛燕的工具mem-lat[4]。最後給自己挖個坑,在随機化buffer指針時,沒有考慮硬體TLB miss的影響,如果有讀者有興趣,待日後有空補充。
參考文獻:
[1]
https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server))
[2]
https://software.intel.com/content/www/us/en/develop/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors.html[3]McVoy L W, Staelin C. lmbench: Portable Tools for Performance Analysis[C]//USENIX annual technical conference. 1996: 279-294.
[4]
http://gitlab.alibaba-inc.com/qos/tools/blob/master/memory/mem-lat.c