Author: Shisoyang
In the pursuit of efficient code, we inevitably run into performance bottlenecks. To understand, explain, and improve a piece of inefficient code, we need to understand how the hardware works. We might search for an introduction to the architecture, read optimization guides, or work through computer science textbooks (e.g., on computer organization). But such material is often long-winded and overly detailed, and while reading it we can get lost in the minutiae and fail to apply the knowledge in practice.
The purpose of this article is to introduce common optimization details and the related hardware knowledge through a series of runnable benchmarks, building a simple and effective mental model of the hardware for the reader.
Cache
The first thing to talk about is the cache. Let's start with a classic example from CSAPP:
pub fn row_major_traversal(arr: &mut Vec<Vec<usize>>) {
    let n = arr.len();
    for i in 0..n {
        assert!(arr[i].len() == n);
        for j in 0..n {
            arr[i][j] += j;
        }
    }
}

pub fn column_major_traversal(arr: &mut Vec<Vec<usize>>) {
    let n = arr.len();
    for i in 0..n {
        assert!(arr[i].len() == n);
        for j in 0..n {
            arr[j][i] += j;
        }
    }
}
The two functions above traverse a 2D array of the same size by row and by column, respectively.
We benchmark these two functions:
In the figure above, the vertical axis is the average running time and the horizontal axis is the array size (e.g., 2000.0 means a 2000 x 2000 array). We can see that traversing the array by rows is about 10 times faster than traversing it by columns.
In the modern storage hierarchy, a cache sits between the CPU and main memory. Going from the CPU's registers to the cache to main memory, reads and writes get progressively slower.
When the CPU reads a piece of data, it first tries the cache. If a cache miss occurs, the data is loaded from main memory into the cache and then read. Crucially, the CPU always reads whole cache lines: when it fetches one value, the values adjacent to it in memory are loaded into the cache as well. A two-dimensional array is laid out row by row in memory; two adjacent rows are placed end to end. So when arr[i][j] is read, neighboring elements such as arr[i][j + 1] and arr[i][j + 2] are loaded into the cache too, and on the next iteration arr[i][j + 1] can be taken straight from the cache, which is very fast. When the array is traversed by columns, however, arr[i][j] and arr[i + 1][j] are no longer adjacent in memory; they are a whole row's length apart. As a result, arr[i + 1][j] is not loaded into the cache when arr[i][j] is read, the next iteration misses the cache, and read speed drops sharply.
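The layout argument above can be made concrete with a minimal sketch (`flat_row_major` is a hypothetical helper, not from the article): storing the matrix as a single flat `Vec` makes the row-major contiguity explicit, since element (i, j) lives at offset i * n + j.

```rust
// A flat n x n matrix: logically adjacent elements are also adjacent in
// memory, so a row-major walk touches consecutive cache lines.
fn flat_row_major(arr: &mut [usize], n: usize) {
    for i in 0..n {
        for j in 0..n {
            // element (i, j) lives at offset i * n + j
            arr[i * n + j] += j;
        }
    }
}

fn main() {
    let n = 4;
    let mut arr = vec![0usize; n * n];
    flat_row_major(&mut arr, n);
    // each row becomes 0, 1, 2, 3
    assert_eq!(&arr[0..4], &[0, 1, 2, 3]);
    println!("{:?}", &arr[0..4]);
}
```

A flat layout also avoids the extra pointer indirection of `Vec<Vec<usize>>`, which is itself cache-unfriendly since each inner row is a separate heap allocation.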
prefetcher
What if we no longer iterated through the array in some order, but at random?
pub fn row_major_traversal(arr: &mut Vec<Vec<usize>>) {
    let n = arr.len();
    for i in 0..n {
        assert!(arr[i].len() == n);
        let ri: usize = rand::random();
        std::hint::black_box(ri);
        for j in 0..n {
            arr[i][j] += j;
        }
    }
}

pub fn column_major_traversal(arr: &mut Vec<Vec<usize>>) {
    let n = arr.len();
    for i in 0..n {
        assert!(arr[i].len() == n);
        let ri: usize = rand::random();
        std::hint::black_box(ri);
        for j in 0..n {
            arr[j][i] += j;
        }
    }
}

pub fn random_access(arr: &mut Vec<Vec<usize>>) {
    let n = arr.len();
    for i in 0..n {
        assert!(arr[i].len() == n);
        for j in 0..n {
            let ri: usize = rand::random();
            arr[j][ri % n] += j;
        }
    }
}
Theoretically, both random access and column-major traversal cause frequent cache misses, so the two should be similarly efficient. Let's benchmark:
As you can see, random_access is about 2 times slower than column_major. The reason is that between the cache and the CPU there is also a prefetcher.
We can refer to the information on Wikipedia:
Cache prefetching can be accomplished either by hardware or by software.
- Hardware based prefetching is typically accomplished by having a dedicated hardware mechanism in the processor that watches the stream of instructions or data being requested by the executing program, recognizes the next few elements that the program might need based on this stream and prefetches into the processor's cache.
- Software based prefetching is typically accomplished by having the compiler analyze the code and insert additional "prefetch" instructions in the program during compilation itself.
random_access defeats the prefetching mechanism, reducing efficiency further.
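Prefetching can also be requested explicitly in software. As a hedged sketch (x86_64 only; the function name `sum_with_prefetch` and the lookahead distance of 16 are my own choices, not from the article), the `_mm_prefetch` intrinsic asks the CPU to pull a future element into the cache early:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

fn sum_with_prefetch(arr: &[usize]) -> usize {
    let mut sum = 0;
    for i in 0..arr.len() {
        // hint the CPU to pull arr[i + 16] into the cache ahead of time;
        // whether this beats the hardware prefetcher is workload-dependent
        #[cfg(target_arch = "x86_64")]
        if i + 16 < arr.len() {
            unsafe { _mm_prefetch::<_MM_HINT_T0>(&arr[i + 16] as *const usize as *const i8) };
        }
        sum += arr[i];
    }
    sum
}

fn main() {
    let arr: Vec<usize> = (0..1000).collect();
    println!("{}", sum_with_prefetch(&arr));
}
```

For a plain sequential walk like this the hardware prefetcher already does the job; manual prefetching tends to pay off only for access patterns the hardware cannot predict, such as pointer chasing.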
cache associativity
What if we iterated over an array in different steps?
pub fn iter_with_step(arr: &mut Vec<usize>, step: usize) {
    let n = arr.len();
    let mut i = 0;
    for _ in 0..1000000 {
        unsafe { *arr.get_unchecked_mut(i) += 1; }
        i = (i + step) % n;
    }
}
The steps are:
let steps = [
    1,
    2,
    4,
    7, 8, 9,
    15, 16, 17,
    31, 32, 33,
    61, 64, 67,
    125, 128, 131,
    253, 256, 259,
    509, 512, 515,
    1019, 1024, 1031,
];
We test:
You can see that whenever the step is a power of 2, there is a spike in running time, a performance glitch. Why is that? To answer this question, we need to brush up on some computer organization knowledge.
The cache is much smaller than main memory, which means we need some way to map different main-memory locations into the cache. In general, there are 3 different mapping schemes.
Fully associative mapping
Fully associative mapping allows a line in main memory to map to any line in the cache. This is very flexible, but it slows down cache lookup.
Direct mapping
Direct mapping dictates that a line in main memory can map to only one specific line in the cache. Lookup is fast, but flexibility is low, and conflicting lines frequently evict each other, causing frequent cache misses.
Set-associative mapping
Set-associative mapping tries to combine the advantages of the other two: the cache's lines are grouped into sets, a line in main memory can only map to one specific set, and within a set the mapping is fully associative. If each set holds n cache lines, we call this mapping n-way set associative.
Back to a real CPU, for example the AMD Ryzen 7 4700U.
We can see that the L1 cache size is 4 x 32 KB (128 KB), it is 8-way set associative, and the cache line size is 64 bytes. That is, the cache holds a total of 4 x 32 x 1024 bytes / 64 bytes = 2048 lines, divided into 2048 / 8 = 256 sets. So when the traversal step is a power of 2, the accessed data is far more likely to fall into the same set, causing more frequent cache misses and thus reduced efficiency.
(Note: my CPU is an Intel i7-10750H, for which the calculated number of L1 sets is 384, which I cannot explain well theoretically.)
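The set arithmetic above can be simulated directly. A minimal sketch (assuming the 64-byte line and 256-set figures just computed; `sets_touched` is a hypothetical helper): count how many distinct sets a given stride actually visits.

```rust
use std::collections::HashSet;

const LINE_BYTES: usize = 64; // cache line size assumed above
const SETS: usize = 256;      // number of sets assumed above

// Count how many distinct cache sets are visited when stepping through
// usize elements with the given stride.
fn sets_touched(step: usize, accesses: usize) -> usize {
    let mut seen = HashSet::new();
    let mut index = 0usize;
    for _ in 0..accesses {
        let byte_addr = index * std::mem::size_of::<usize>();
        seen.insert((byte_addr / LINE_BYTES) % SETS);
        index += step;
    }
    seen.len()
}

fn main() {
    // a large power-of-two stride cycles through very few sets...
    let pow2 = sets_touched(1024, 4096);
    // ...while a nearby non-power-of-two stride spreads out much more
    let odd = sets_touched(1031, 4096);
    println!("step 1024 -> {pow2} sets, step 1031 -> {odd} sets");
    assert!(pow2 < odd);
}
```

With a stride of 1024 usizes (8192 bytes) every access lands on a line index that is a multiple of 128, so only 2 of the 256 sets are ever used, and those few sets thrash.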
false sharing
Let's look at another set of benchmarks.
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

fn increase(v: &AtomicUsize) {
    for _ in 0..10000 {
        v.fetch_add(1, Ordering::Relaxed);
    }
}

pub fn share() {
    let v = AtomicUsize::new(0);
    thread::scope(|s| {
        let ta = s.spawn(|| increase(&v));
        let tb = s.spawn(|| increase(&v));
        let tc = s.spawn(|| increase(&v));
        let td = s.spawn(|| increase(&v));
        ta.join().unwrap();
        tb.join().unwrap();
        tc.join().unwrap();
        td.join().unwrap();
    });
}

pub fn false_share() {
    let a = AtomicUsize::new(0);
    let b = AtomicUsize::new(0);
    let c = AtomicUsize::new(0);
    let d = AtomicUsize::new(0);
    thread::scope(|s| {
        let ta = s.spawn(|| increase(&a));
        let tb = s.spawn(|| increase(&b));
        let tc = s.spawn(|| increase(&c));
        let td = s.spawn(|| increase(&d));
        ta.join().unwrap();
        tb.join().unwrap();
        tc.join().unwrap();
        td.join().unwrap();
    });
}
In the share function, four threads contend for the atomic variable v at the same time. In the false_share function, the four threads operate on different atomic variables, so in theory there is no contention between them, and false_share should be more efficient than share. But the benchmark results were unexpected:
You can see that false_share is even less efficient than share.
As mentioned earlier, when the CPU reads data, it loads it from main memory into the cache one cache line at a time, and the cache line size on the author's machine is 64 bytes. In the false_share function, the four atomic variables may be laid out on the stack like this:
a, b, c, and d sit in the same cache line, which means the four threads are still effectively contending with each other, resulting in false sharing.
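We can check this layout claim empirically. A sketch (stack layout is not guaranteed by the compiler, so treat this as an illustration; `span_of_four` is a hypothetical helper): measure the byte span covered by four adjacent atomic locals.

```rust
use std::sync::atomic::AtomicUsize;

// Returns the number of bytes spanned by four AtomicUsize locals,
// from the lowest address to the end of the highest one.
fn span_of_four() -> usize {
    let a = AtomicUsize::new(0);
    let b = AtomicUsize::new(0);
    let c = AtomicUsize::new(0);
    let d = AtomicUsize::new(0);
    let addrs = [
        &a as *const AtomicUsize as usize,
        &b as *const AtomicUsize as usize,
        &c as *const AtomicUsize as usize,
        &d as *const AtomicUsize as usize,
    ];
    let min = *addrs.iter().min().unwrap();
    let max = *addrs.iter().max().unwrap();
    max - min + std::mem::size_of::<AtomicUsize>()
}

fn main() {
    // with 8-byte atomics packed together this is typically well under
    // 64 bytes, i.e. all four variables share one cache line
    println!("four atomics span {} bytes", span_of_four());
}
```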
How do we solve this problem? We can use #[repr(align(64))] (spelled differently in different programming languages) to tell the compiler to align each atomic variable to the cache line size, so that the four atomic variables no longer share a cache line.
fn increase_2(v: &AlignAtomicUsize) {
    for _ in 0..10000 {
        v.v.fetch_add(1, Ordering::Relaxed);
    }
}

#[repr(align(64))]
struct AlignAtomicUsize {
    v: AtomicUsize,
}

impl AlignAtomicUsize {
    pub fn new(val: usize) -> Self {
        Self { v: AtomicUsize::new(val) }
    }
}

pub fn non_share() {
    let a = AlignAtomicUsize::new(0);
    let b = AlignAtomicUsize::new(0);
    let c = AlignAtomicUsize::new(0);
    let d = AlignAtomicUsize::new(0);
    thread::scope(|s| {
        let ta = s.spawn(|| increase_2(&a));
        let tb = s.spawn(|| increase_2(&b));
        let tc = s.spawn(|| increase_2(&c));
        let td = s.spawn(|| increase_2(&d));
        ta.join().unwrap();
        tb.join().unwrap();
        tc.join().unwrap();
        td.join().unwrap();
    });
}
We ran the benchmark again, and this time the results matched our prediction:
It can be seen that non_share is nearly twice as efficient as the other two.
pipeline
Let's look at another benchmark:
pub trait Get {
    fn get(&self) -> i32;
}

struct Foo { /* ... */ }
struct Bar { /* ... */ }

impl Get for Foo { /* ... */ }
impl Get for Bar { /* ... */ }

let mut arr: Vec<Box<dyn Get>> = vec![];
for _ in 0..10000 {
    arr.push(Box::new(Foo::new()));
}
for _ in 0..10000 {
    arr.push(Box::new(Bar::new()));
}
// at this point the elements of the array are ordered
arr.iter().filter(|v| v.get() > 0).count();
// shuffle the array
arr.shuffle(&mut rand::thread_rng());
// test again
arr.iter().filter(|v| v.get() > 0).count();
The test results were:
As you can see, the difference in efficiency between the sorted and the shuffled version is about 2.67 times. The cause is frequent branch prediction failures.
In a CPU, the execution of each instruction is divided into multiple stages, and modern architectures have a structure called a pipeline that executes instructions at different stages simultaneously.
For a pipeline to work efficiently, it needs to fetch the following instructions B, C, D, etc. in advance while instruction A is executing. In straight-line code the pipeline works well, but when we reach a branch we run into a problem:
Should the pipeline fetch code A or code B? No one knows, not even the CPU, until the branch statement has executed. If we want the pipeline to keep working efficiently, only one option remains: guess. As long as we guess well enough, our efficiency approaches that of branch-free code. The step of "guessing" has a technical name too: **branch prediction**. Modern computers, with the help of compilers and prediction algorithms, can achieve a success rate of 99% on some branches (as shown in the figure below).
A failed branch prediction is costly, however. First, the instructions in the pipeline must be flushed, because they are not the ones that will actually execute next. Second, the correct instructions must be loaded into the pipeline one by one. Finally, each instruction still takes multiple stages to execute.
In the test code, shuffling the array makes branch prediction fail frequently, which ultimately reduces execution efficiency.
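When a branch is truly unpredictable, one common mitigation is to remove it altogether. A hedged sketch (simplified to a plain `i32` predicate rather than the article's virtual `get` calls, and the function names are my own): `(v > 0) as usize` compiles down to a flag-setting instruction rather than a conditional jump, so shuffled data costs nothing extra.

```rust
fn count_positive_branchy(arr: &[i32]) -> usize {
    let mut n = 0;
    for &v in arr {
        // this branch is predictable only if the data is sorted
        if v > 0 {
            n += 1;
        }
    }
    n
}

fn count_positive_branchless(arr: &[i32]) -> usize {
    // the comparison result is used as an integer directly,
    // leaving nothing for the branch predictor to guess
    arr.iter().map(|&v| (v > 0) as usize).sum()
}

fn main() {
    let arr = [3, -1, 5, -2, 7, 0];
    assert_eq!(count_positive_branchy(&arr), 3);
    assert_eq!(count_positive_branchless(&arr), 3);
    println!("{}", count_positive_branchless(&arr));
}
```

Note that the virtual calls in the benchmark above are indirect branches, which the same predictor hardware must also guess; the branchless trick applies only when the computation can be expressed as data rather than control flow.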
Data dependency
Let's take another look at the code:
pub fn dependent(a: &mut Vec<i32>, b: &mut Vec<i32>, c: &Vec<i32>) {
    assert!(a.len() == 100000);
    assert!(b.len() == 100000);
    assert!(c.len() == 100000);
    for i in 0..=99998 {
        a[i] += b[i];
        b[i + 1] += c[i];
    }
    a[99999] += b[99999];
}

pub fn independent(a: &mut Vec<i32>, b: &mut Vec<i32>, c: &Vec<i32>) {
    assert!(a.len() == 100000);
    assert!(b.len() == 100000);
    assert!(c.len() == 100000);
    a[0] += b[0];
    for i in 0..=99998 {
        b[i + 1] += c[i];
        a[i + 1] += b[i + 1];
    }
}
In this piece of code, we traverse the arrays in two different ways that produce the same final result. If we draw the data-flow diagram, it looks like this:
In the diagram above, we use arrows to represent dependencies (a[0] -> b[0] means the result of a[0] depends on b[0]), black arrows to indicate operations outside the loop, and different colors to distinguish operations in different iterations. We can see that in dependent, arrows of different colors appear in the same data stream (e.g., the red and blue arrows in a[1] -> b[1] -> c[0]), which means iteration n + 1 depends on the result of iteration n; this does not happen in independent.
What are the implications? Let's put it to the test:
As you can see, there is a difference in efficiency of almost 3 times. There are two reasons for this.
First, the dependency between iterations lowers the efficiency of the pipeline and limits the CPU's instruction-level parallelism.
Second, this dependency between iterations prevents the compiler's vectorization optimization. Let's look at equivalent C++ code (rustc 1.71's optimizer is not powerful enough to vectorize independent, which makes me slightly sad):
#include <vector>

using i32 = int;

template<typename T>
using Vec = std::vector<T>;

void dependent(Vec<i32> &a, Vec<i32> &b, Vec<i32> &c) {
    for (int i = 0; i < 9999; i++) {
        a[i] += b[i];
        b[i + 1] += c[i];
    }
    a[9999] += b[9999];
}

void independent(Vec<i32> &a, Vec<i32> &b, Vec<i32> &c) {
    a[0] += b[0];
    for (int i = 0; i < 9999; i++) {
        b[i + 1] += c[i];
        a[i + 1] += b[i + 1];
    }
}
Look at the generated assembly:
dependent(...):
        mov     rax, rdx
        mov     rdx, QWORD PTR [rsi]
        mov     rcx, QWORD PTR [rdi]
        mov     rdi, QWORD PTR [rax]
        xor     eax, eax
.L2:
        mov     esi, DWORD PTR [rdx+rax]
        add     DWORD PTR [rcx+rax], esi
        mov     esi, DWORD PTR [rdi+rax]
        add     DWORD PTR [rdx+4+rax], esi
        add     rax, 4
        cmp     rax, 39996
        jne     .L2
        mov     eax, DWORD PTR [rdx+39996]
        add     DWORD PTR [rcx+39996], eax
        ret

independent(...):
        mov     rax, QWORD PTR [rdi]
        mov     rcx, rdx
        mov     rdx, QWORD PTR [rsi]
        lea     rdi, [rax+4]
        mov     esi, DWORD PTR [rdx]
        add     DWORD PTR [rax], esi
        lea     r8, [rdx+4]
        mov     rsi, QWORD PTR [rcx]
        lea     rcx, [rdx+20]
        cmp     rdi, rcx
        lea     rdi, [rax+20]
        setnb   cl
        cmp     r8, rdi
        setnb   dil
        or      ecx, edi
        mov     rdi, rdx
        sub     rdi, rsi
        cmp     rdi, 8
        seta    dil
        test    cl, dil
        je      .L9
        mov     rcx, rax
        sub     rcx, rsi
        cmp     rcx, 8
        jbe     .L9
        mov     ecx, 4
.L7:
        movdqu  xmm0, XMMWORD PTR [rsi-4+rcx]
        movdqu  xmm2, XMMWORD PTR [rdx+rcx]
        paddd   xmm0, xmm2
        movups  XMMWORD PTR [rdx+rcx], xmm0
        movdqu  xmm3, XMMWORD PTR [rax+rcx]
        paddd   xmm0, xmm3
        movups  XMMWORD PTR [rax+rcx], xmm0
        add     rcx, 16
        cmp     rcx, 39988
        jne     .L7
        movq    xmm0, QWORD PTR [rsi+39984]
        movq    xmm1, QWORD PTR [rdx+39988]
        paddd   xmm0, xmm1
        movq    QWORD PTR [rdx+39988], xmm0
        movq    xmm1, QWORD PTR [rax+39988]
        paddd   xmm1, xmm0
        movq    QWORD PTR [rax+39988], xmm1
        mov     ecx, DWORD PTR [rdx+39996]
        add     ecx, DWORD PTR [rsi+39992]
        mov     DWORD PTR [rdx+39996], ecx
        add     DWORD PTR [rax+39996], ecx
        ret
.L9:
        mov     ecx, 4
.L6:
        mov     edi, DWORD PTR [rdx+rcx]
        add     edi, DWORD PTR [rsi-4+rcx]
        mov     DWORD PTR [rdx+rcx], edi
        add     DWORD PTR [rax+rcx], edi
        add     rcx, 4
        cmp     rcx, 40000
        jne     .L6
        ret
As you can see, the independent function is successfully vectorized.
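The same dependency-breaking idea can also be applied by hand inside a single loop. A sketch (the four-accumulator split is a standard trick, not from the article; function names are my own): a one-accumulator sum forms a single long dependency chain, while four independent accumulators let additions overlap in the pipeline.

```rust
fn sum_single(arr: &[u64]) -> u64 {
    let mut s = 0u64;
    for &v in arr {
        // each add must wait for the previous one: one long dependency chain
        s = s.wrapping_add(v);
    }
    s
}

fn sum_four_accumulators(arr: &[u64]) -> u64 {
    let mut acc = [0u64; 4];
    let chunks = arr.chunks_exact(4);
    let rem = chunks.remainder();
    for c in chunks {
        // the four adds are independent of each other and can execute
        // in parallel on different ALU ports
        for k in 0..4 {
            acc[k] = acc[k].wrapping_add(c[k]);
        }
    }
    let tail: u64 = rem.iter().sum();
    acc.iter().fold(tail, |a, &b| a.wrapping_add(b))
}

fn main() {
    let arr: Vec<u64> = (1..=1001).collect();
    assert_eq!(sum_single(&arr), 501501);
    assert_eq!(sum_four_accumulators(&arr), 501501);
}
```

This is essentially what the compiler's vectorizer does automatically when the dependency structure permits, as the independent assembly above shows.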
Source: WeChat public account: Tencent Technical Engineering
Source: https://mp.weixin.qq.com/s/Ol9J1ZWevHSjP2ZIyidK-g