性能優化篇（4）：NEON優化案例——圖像顔色轉換之RGB到BGR（aarch64版）

Author:stormQ

Sunday, 1. December 2019 1 07:13AM

目錄
- 為什麼要交換圖像顔色通道
- 實作一個基本的圖像顔色通道交換函數
- 利用 NEON 加速圖像顔色通道交換

為什麼要交換圖像顔色通道

對于隻有

、

三種通道的圖像來說，OpenCV預設的通道排列方式為

BGR

而非常見的

RGB

。在一個應用内，如果同時存在“有些圖像處理函數以

RGB

圖像作為輸入輸出，而有些圖像處理函數以

BGR

圖像作為輸入輸出”，那麼在調用這些函數前需要進行圖像顔色通道轉換，以滿足接口的輸入要求。由于圖像顔色通道轉換比較耗時，甚至可能成為性能瓶頸。是以，研究如何加速圖像顔色通道交換就非常有意義了。

實作一個基本的圖像顔色通道交換函數

如何将一個大小為

1920x1080

的

RGB

圖像轉換成相同大小的

BGR

圖像呢？

首先實作一個最簡單的轉換函數

simple_rgb2bgr()

，完整實作如下：

#define CHANNELS 3

void simple_rgb2bgr(uint8_t *img, int height, int width)
{
    const int total_bytes = height * width * CHANNELS;
    for (int i = 0; i < total_bytes; i += CHANNELS)
    {
        // swap R and B channel
        const auto r_channel = *(img + i);
        *(img + i) = *(img + i + 2);
        *(img + i + 2) = r_channel;
    }
}

simple_rgb2bgr()

函數将一個二維的圖像看作是一維的，意味着隻需要一個循環就可以周遊所有的像素。這種用法的好處在于，遵循空間局部性原理以高效利用

cache

進而改善程式性能，而不必關心二維圖像在記憶體中的存儲順序——是行主序還是列主序。如果用兩個循環的話會比較繁瑣，移植性也不好。

上述函數在

Jetson TX2

上運作時會耗時 3~4ms 左右，在其他機器上運作時耗時會有所不同。如果該耗時不可接受，那麼隻能想辦法對其進行加速。常見的加速思路有以下幾種：1）任務級并行（利用多線程）；2）資料級并行（利用 GPU）；3）利用 SIMD。這裡我們隻研究第三種加速方式——利用 SIMD，它沒有“任務級并行所需要的正确的排程協作和占用更多的 CPU 資源”，也沒有“資料級并行所帶來的 GPU 與 CPU 之間資料傳輸開銷”。

利用 NEON 加速圖像顔色通道交換

首先，實作一個基本的利用

NEON

加速圖像顔色通道交換的函數

rgb2bgr_with_neon()

，完整實作如下：

#include "arm_neon.h"

void rgb2bgr_with_neon(uint8_t *img, int height, int width)
{
    const int total_bytes = height * width * CHANNELS;
    const int stride_bytes = 48;

    for (int i = 0; i < total_bytes; i += stride_bytes)
    {
        uint8_t *target = img + i;

        // swap R and B channel with NEON
        uint8x16x3_t a = vld3q_u8(target);
        uint8x16x3_t b;
        b.val[0] = a.val[2];
        b.val[1] = a.val[1];
        b.val[2] = a.val[0];
        vst3q_u8(target, b);
    }
}

上述函數沒有考慮圖像像素個數非整除時的情形，這個放到後面進行。該函數的實作用到了兩個

NEON Intrinsics

，分别是

vld3q_u8

和

vst3q_u8

，兩者都定義在

arm_neon.h

檔案中。

vld3q_u8

的函數原型為

uint8x16x3_t vld3q_u8 (const uint8_t * __a)

，作用為以步長為 3 交叉地加載資料到三個連續的 128-bit 的向量寄存器。具體地：将記憶體位址

__a

、

__a+3

、…、

__a+45

處的内容分别指派給向量寄存器

Vn

的

lane[0]

、

lane[1]

、…、

lane[15]

，将記憶體位址

__a+1

、

__a+4

、…、

__a+46

處的内容分别指派給向量寄存器

Vn+1

的

lane[0]

、

lane[1]

、…、

lane[15]

，将記憶體位址

__a+2

、

__a+5

、…、

__a+47

處的内容分别指派給向量寄存器

Vn+2

的

lane[0]

、

lane[1]

、…、

lane[15]

。也就是說，

vld3q_u8

在上述函數中的作用為：将連續 16 個像素的 R 通道： R0、R1、… 、R15 加載到向量寄存器

Vn

，将連續 16 個像素的 G 通道： G0、G1、… 、G15 加載到向量寄存器

Vn+1

，将連續 16 個像素的 B 通道： B0、B1、… 、B15 加載到向量寄存器

Vn+2

。

vst3q_u8

的函數原型為

void vst3q_u8 (uint8_t * __a, uint8x16x3_t val)

，作用為以步長為 3 交叉地存儲資料到記憶體中。

由于每一次疊代可以同時處理 16 個連續的像素，這些像素占用 48 位元組（48（bytes） = 16（像素個數） * 3（通道數））。是以，變量

stride_bytes

的值為 48。

由于

aarch64

指令集中沒有

ARM

指令集中的

VSWP

指令。是以，這裡引入一個類型為

uint8x16x3_t

的臨時向量

用于交換

和

通道。 g++ 帶優化選項

-Og

編譯時，語句

b.val[0] = a.val[2]; b.val[1] = a.val[1]; b.val[2] = a.val[0];

對應三條

MOV

指令。

在通道交換完成後，用語句

vst3q_u8(target, b);

将交換後的結果寫入記憶體。這樣，一次交換操作就完成了。

現在考慮圖像像素個數非整除時的情形，完整實作如下：

void rgb2bgr_with_neon(uint8_t *img, int height, int width)
{
    const int total_bytes = height * width * CHANNELS;
    const int stride_bytes = 48;
    const int left_bytes = total_bytes % stride_bytes;
    const int multiply_bytes = total_bytes - left_bytes;

    for (int i = 0; i < multiply_bytes; i += stride_bytes)
    {
        uint8_t *target = img + i;

        // swap R and B channel with NEON
        uint8x16x3_t a = vld3q_u8(target);
        uint8x16x3_t b;
        b.val[0] = a.val[2];
        b.val[1] = a.val[1];
        b.val[2] = a.val[0];
        vst3q_u8(target, b);
    }

    // handling non-multiply array lengths
    for (int i = multiply_bytes; i < total_bytes; i += CHANNELS)
    {
        const auto r_channel = *(img + i);
        *(img + i) = *(img + i + 2);
        *(img + i + 2) = r_channel;
    }
}

為了驗證

rgb2bgr_with_neon()

函數的正确性，這裡引入兩個函數：

init()

和

check()

。

init()

函數用于初始化圖像，

check()

函數用于檢查交換後的圖像是否正确。完整實作如下：

#include <iostream>

#define RED 255
#define GREEN 125
#define BLUE 80

#define CHECK(img, height, width)               \
if (check(img, height, width))                  \
{                                               \
    std::cout << "Correct:img.\n";              \
}                                               \
else                                            \
{                                               \
    std::cout << "Error:img.\n";                \
}

void init(uint8_t *img, int height, int width)
{
    const int total_bytes = height * width * CHANNELS;
    for (int i = 0; i < total_bytes; )
    {
        *(img + i++) = RED;
        *(img + i++) = GREEN;
        *(img + i++) = BLUE;
    }
}

bool check(const uint8_t *img, int height, int width)
{
    const int total_bytes = height * width * CHANNELS;
    for (int i = 0; i < total_bytes; i += CHANNELS)
    {
        if (*(img + i) != BLUE || *(img + i + 1) != GREEN || 
                *(img + i + 2) != RED)
        {
            return false;
        }
    }
    return true;
}

完整的程式為

main.cpp

：

#include "arm_neon.h"
#include <iostream>

#define HEIGHT 1080
#define WIDTH 1920
#define CHANNELS 3
#define RED 255
#define GREEN 125
#define BLUE 80

#define CHECK(img, height, width)               \
if (check(img, height, width))                  \
{                                               \
    std::cout << "Correct:img.\n";              \
}                                               \
else                                            \
{                                               \
    std::cout << "Error:img.\n";                \
}

void init(uint8_t *img, int height, int width)
{
    const int total_bytes = height * width * CHANNELS;
    for (int i = 0; i < total_bytes; )
    {
        *(img + i++) = RED;
        *(img + i++) = GREEN;
        *(img + i++) = BLUE;
    }
}

bool check(const uint8_t *img, int height, int width)
{
    const int total_bytes = height * width * CHANNELS;
    for (int i = 0; i < total_bytes; i += CHANNELS)
    {
        if (*(img + i) != BLUE || *(img + i + 1) != GREEN || 
                *(img + i + 2) != RED)
        {
            return false;
        }
    }
    return true;
}

void simple_rgb2bgr(uint8_t *img, int height, int width)
{
    const int total_bytes = height * width * CHANNELS;
    for (int i = 0; i < total_bytes; i += CHANNELS)
    {
        // swap R and B channel
        const auto r_channel = *(img + i);
        *(img + i) = *(img + i + 2);
        *(img + i + 2) = r_channel;
    }
}

void rgb2bgr_with_neon(uint8_t *img, int height, int width)
{
    const int total_bytes = height * width * CHANNELS;
    const int stride_bytes = 48;
    const int left_bytes = total_bytes % stride_bytes;
    const int multiply_bytes = total_bytes - left_bytes;

    for (int i = 0; i < multiply_bytes; i += stride_bytes)
    {
        uint8_t *target = img + i;

        // swap R and B channel with NEON
        uint8x16x3_t a = vld3q_u8(target);
        uint8x16x3_t b;
        b.val[0] = a.val[2];
        b.val[1] = a.val[1];
        b.val[2] = a.val[0];
        vst3q_u8(target, b);
    }

    // handling non-multiply array lengths
    for (int i = multiply_bytes; i < total_bytes; i += CHANNELS)
    {
        const auto r_channel = *(img + i);
        *(img + i) = *(img + i + 2);
        *(img + i + 2) = r_channel;
    }
}

uint8_t *rgb_img1 = nullptr;
uint8_t *rgb_img2 = nullptr;

int main()
{
    rgb_img1 = new uint8_t[HEIGHT * WIDTH * CHANNELS];
    rgb_img2 = new uint8_t[HEIGHT * WIDTH * CHANNELS];

    init(rgb_img1, HEIGHT, WIDTH);
    init(rgb_img2, HEIGHT, WIDTH);

    simple_rgb2bgr(rgb_img1, HEIGHT, WIDTH);
    rgb2bgr_with_neon(rgb_img2, HEIGHT, WIDTH);

    CHECK(rgb_img1, HEIGHT, WIDTH);
    CHECK(rgb_img2, HEIGHT, WIDTH);
    
    delete [] rgb_img2;
    delete [] rgb_img1;
    return 0;
}

編譯程式（On Jetson TX2）：

g++ -std=c++11 -g -Og -o main_Og main.cpp

統計函數耗時（On Jetson TX2）：

啟動程式方式	第一次執行耗時(us)	第二次執行耗時(us)	第三次執行耗時(us)	第四次執行耗時(us)	第五次執行耗時(us)
./main_Og	simple_rgb2bgr:4303 rgb2bgr_with_neon:1021	simple_rgb2bgr:4469 rgb2bgr_with_neon:1162	simple_rgb2bgr:3920 rgb2bgr_with_neon:1017	simple_rgb2bgr:4248 rgb2bgr_with_neon:1019	simple_rgb2bgr:3966 rgb2bgr_with_neon:1018

從統計結果中可以看出，

rgb2bgr_with_neon()

函數的執行速度比

simple_rgb2bgr()

函數快 3 到 4 倍。

那麼 NEON 版的實作為什麼能達到這樣一個加速呢？下面從 cache 的角度進行分析。

統計 cache 性能資料：

--------------------------------------------------------------------------------
I1 cache:         16384 B, 64 B, 4-way associative
D1 cache:         16384 B, 64 B, 4-way associative
LL cache:         262144 B, 64 B, 8-way associative
Command:          ./main_Og
Data file:        cachegrind.out.22890
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:     
User annotated:   main.cpp
Auto-annotation:  on

--------------------------------------------------------------------------------
         Ir  I1mr  ILmr         Dr    D1mr    DLmr         Dw    D1mw    DLmw 
--------------------------------------------------------------------------------
139,881,731 1,705 1,509 17,482,366 405,359 398,173 17,179,116 196,982 195,984  PROGRAM TOTALS

--------------------------------------------------------------------------------
        Ir I1mr ILmr         Dr    D1mr    DLmr         Dw    D1mw    DLmw  file:function
--------------------------------------------------------------------------------
66,355,214    3    3 12,441,600 194,402 194,402          0       0       0  /home/test/main.cpp:check(unsigned char const*, int, int)
49,766,412    1    1          0       0       0 12,441,600 194,401 194,401  /home/test/main.cpp:init(unsigned char*, int, int)
20,736,006    2    2  4,147,200  97,201  97,201  4,147,200       0       0  /home/test/main.cpp:simple_rgb2bgr(unsigned char*, int, int)
   648,017    3    3          0       0       0          0       0       0  /home/test/main.cpp:rgb2bgr_with_neon(unsigned char*, int, int)
   648,000    0    0    388,800  97,201  97,201    388,800       0       0  /usr/lib/gcc/aarch64-linux-gnu/5/include/arm_neon.h:rgb2bgr_with_neon(unsigned char*, int, int)
   548,801   12   11    149,830   2,085   1,743     50,617      36      24  /build/glibc-BinVK7/glibc-2.23/elf/dl-lookup.c:_dl_lookup_symbol_x
   541,307   42   41    192,949   5,227   1,153     94,348      53      33  /build/glibc-BinVK7/glibc-2.23/elf/dl-lookup.c:do_lookup_x
   211,533   28   28     48,140   3,391   2,964     20,629   1,748     887  /build/glibc-BinVK7/glibc-2.23/elf/../sysdeps/aarch64/dl-machine.h:_dl_relocate_object

--------------------------------------------------------------------------------
-- User-annotated source: main.cpp
--------------------------------------------------------------------------------
  No information has been collected for main.cpp

--------------------------------------------------------------------------------
-- Auto-annotated source: /home/test/main.cpp
--------------------------------------------------------------------------------
        Ir I1mr ILmr        Dr    D1mr    DLmr        Dw   D1mw   DLmw 

-- line 20 ----------------------------------------
         .    .    .         .       .       .         .      .      .  }                                               \
         .    .    .         .       .       .         .      .      .  else                                            \
         .    .    .         .       .       .         .      .      .  {                                               \
         .    .    .         .       .       .         .      .      .      std::cout << "Error:img.\n";                \
         .    .    .         .       .       .         .      .      .  }
         .    .    .         .       .       .         .      .      .  
         .    .    .         .       .       .         .      .      .  void init(uint8_t *img, int height, int width)
         .    .    .         .       .       .         .      .      .  {
         4    0    0         0       0       0         0      0      0      const int total_bytes = height * width * CHANNELS;
12,441,608    1    1         0       0       0         0      0      0      for (int i = 0; i < total_bytes; )
         .    .    .         .       .       .         .      .      .      {
12,441,600    0    0         0       0       0 4,147,200 64,801 64,801          *(img + i++) = RED;
12,441,600    0    0         0       0       0 4,147,200 64,800 64,800          *(img + i++) = GREEN;
12,441,600    0    0         0       0       0 4,147,200 64,800 64,800          *(img + i++) = BLUE;
         .    .    .         .       .       .         .      .      .      }
         .    .    .         .       .       .         .      .      .  }
         .    .    .         .       .       .         .      .      .  
         .    .    .         .       .       .         .      .      .  bool check(const uint8_t *img, int height, int width)
         .    .    .         .       .       .         .      .      .  {
         4    0    0         0       0       0         0      0      0      const int total_bytes = height * width * CHANNELS;
16,588,806    3    3         0       0       0         0      0      0      for (int i = 0; i < total_bytes; i += CHANNELS)
         .    .    .         .       .       .         .      .      .      {
41,472,000    0    0 8,294,400 129,602 129,602         0      0      0          if (*(img + i) != BLUE || *(img + i + 1) != GREEN || 
 8,294,400    0    0 4,147,200  64,800  64,800         0      0      0                  *(img + i + 2) != RED)
         .    .    .         .       .       .         .      .      .          {
         .    .    .         .       .       .         .      .      .              return false;
         .    .    .         .       .       .         .      .      .          }
         .    .    .         .       .       .         .      .      .      }
         4    0    0         0       0       0         0      0      0      return true;
         .    .    .         .       .       .         .      .      .  }
         .    .    .         .       .       .         .      .      .  
         .    .    .         .       .       .         .      .      .  void simple_rgb2bgr(uint8_t *img, int height, int width)
         .    .    .         .       .       .         .      .      .  {
         2    1    1         0       0       0         0      0      0      const int total_bytes = height * width * CHANNELS;
 8,294,404    0    0         0       0       0         0      0      0      for (int i = 0; i < total_bytes; i += CHANNELS)
         .    .    .         .       .       .         .      .      .      {
         .    .    .         .       .       .         .      .      .          // swap R and B channel
 4,147,200    0    0 2,073,600  32,401  32,401         0      0      0          const auto r_channel = *(img + i);
 6,220,800    1    1 2,073,600  64,800  64,800 2,073,600      0      0          *(img + i) = *(img + i + 2);
 2,073,600    0    0         0       0       0 2,073,600      0      0          *(img + i + 2) = r_channel;
         .    .    .         .       .       .         .      .      .      }
         .    .    .         .       .       .         .      .      .  }
         .    .    .         .       .       .         .      .      .  
         .    .    .         .       .       .         .      .      .  void rgb2bgr_with_neon(uint8_t *img, int height, int width)
         .    .    .         .       .       .         .      .      .  {
         2    1    1         0       0       0         0      0      0      const int total_bytes = height * width * CHANNELS;
         .    .    .         .       .       .         .      .      .      const int stride_bytes = 48;
         7    0    0         0       0       0         0      0      0      const int left_bytes = total_bytes % stride_bytes;
         1    1    1         0       0       0         0      0      0      const int multiply_bytes = total_bytes - left_bytes;
         .    .    .         .       .       .         .      .      .  
   518,404    0    0         0       0       0         0      0      0      for (int i = 0; i < multiply_bytes; i += stride_bytes)
         .    .    .         .       .       .         .      .      .      {
   129,600    0    0         0       0       0         0      0      0          uint8_t *target = img + i;
         .    .    .         .       .       .         .      .      .  
         .    .    .         .       .       .         .      .      .          // swap R and B channel with NEON
         .    .    .         .       .       .         .      .      .          uint8x16x3_t a = vld3q_u8(target);
         .    .    .         .       .       .         .      .      .          uint8x16x3_t b;
         .    .    .         .       .       .         .      .      .          b.val[0] = a.val[2];
         .    .    .         .       .       .         .      .      .          b.val[1] = a.val[1];
         .    .    .         .       .       .         .      .      .          b.val[2] = a.val[0];
         .    .    .         .       .       .         .      .      .          vst3q_u8(target, b);
         .    .    .         .       .       .         .      .      .      }
         .    .    .         .       .       .         .      .      .  
         .    .    .         .       .       .         .      .      .      // handling non-multiply array lengths
         3    1    1         0       0       0         0      0      0      for (int i = multiply_bytes; i < total_bytes; i += CHANNELS)
         .    .    .         .       .       .         .      .      .      {
         .    .    .         .       .       .         .      .      .          const auto r_channel = *(img + i);
         .    .    .         .       .       .         .      .      .          *(img + i) = *(img + i + 2);
         .    .    .         .       .       .         .      .      .          *(img + i + 2) = r_channel;
         .    .    .         .       .       .         .      .      .      }
         .    .    .         .       .       .         .      .      .  }
         .    .    .         .       .       .         .      .      .  
         .    .    .         .       .       .         .      .      .  uint8_t *rgb_img1 = nullptr;
         .    .    .         .       .       .         .      .      .  uint8_t *rgb_img2 = nullptr;
         .    .    .         .       .       .         .      .      .  
         .    .    .         .       .       .         .      .      .  int main()
         7    1    1         1       0       0         5      0      0  {
         7    1    1         0       0       0         1      1      1      rgb_img1 = new uint8_t[HEIGHT * WIDTH * CHANNELS];
         3    0    0         0       0       0         1      0      0      rgb_img2 = new uint8_t[HEIGHT * WIDTH * CHANNELS];
         .    .    .         .       .       .         .      .      .  
         4    0    0         1       0       0         0      0      0      init(rgb_img1, HEIGHT, WIDTH);
         4    1    1         1       1       1         0      0      0      init(rgb_img2, HEIGHT, WIDTH);
         .    .    .         .       .       .         .      .      .  
         6    1    1         1       1       1         0      0      0      simple_rgb2bgr(rgb_img1, HEIGHT, WIDTH);
         6    0    0         1       1       1         0      0      0      rgb2bgr_with_neon(rgb_img2, HEIGHT, WIDTH);
         .    .    .         .       .       .         .      .      .  
        12    1    1         1       1       1         0      0      0      CHECK(rgb_img1, HEIGHT, WIDTH);
        14    1    1         1       1       1         0      0      0      CHECK(rgb_img2, HEIGHT, WIDTH);
         .    .    .         .       .       .         .      .      .      
         5    1    1         1       1       1         0      0      0      delete [] rgb_img2;
         5    0    0         1       0       0         0      0      0      delete [] rgb_img1;
         .    .    .         .       .       .         .      .      .      return 0;
        27    4    3        11       3       1         5      0      0  }

--------------------------------------------------------------------------------
-- Auto-annotated source: /usr/lib/gcc/aarch64-linux-gnu/5/include/arm_neon.h
--------------------------------------------------------------------------------
     Ir I1mr ILmr      Dr   D1mr   DLmr      Dw D1mw DLmw 

-- line 16243 ----------------------------------------
      .    .    .       .      .      .       .    .    .    return ret;
      .    .    .       .      .      .       .    .    .  }
      .    .    .       .      .      .       .    .    .  
      .    .    .       .      .      .       .    .    .  __extension__ static __inline uint8x16x3_t __attribute__ ((__always_inline__))
      .    .    .       .      .      .       .    .    .  vld3q_u8 (const uint8_t * __a)
      .    .    .       .      .      .       .    .    .  {
      .    .    .       .      .      .       .    .    .    uint8x16x3_t ret;
      .    .    .       .      .      .       .    .    .    __builtin_aarch64_simd_ci __o;
129,600    0    0 388,800 97,201 97,201       0    0    0    __o = __builtin_aarch64_ld3v16qi ((const __builtin_aarch64_simd_qi *) __a);
      .    .    .       .      .      .       .    .    .    ret.val[0] = (uint8x16_t) __builtin_aarch64_get_qregciv16qi (__o, 0);
      .    .    .       .      .      .       .    .    .    ret.val[1] = (uint8x16_t) __builtin_aarch64_get_qregciv16qi (__o, 1);
      .    .    .       .      .      .       .    .    .    ret.val[2] = (uint8x16_t) __builtin_aarch64_get_qregciv16qi (__o, 2);
      .    .    .       .      .      .       .    .    .    return ret;
      .    .    .       .      .      .       .    .    .  }
      .    .    .       .      .      .       .    .    .  
      .    .    .       .      .      .       .    .    .  __extension__ static __inline uint16x8x3_t __attribute__ ((__always_inline__))
      .    .    .       .      .      .       .    .    .  vld3q_u16 (const uint16_t * __a)
-- line 16259 ----------------------------------------
-- line 23620 ----------------------------------------
      .    .    .       .      .      .       .    .    .    __o = __builtin_aarch64_set_qregciv2di (__o, (int64x2_t) val.val[2], 2);
      .    .    .       .      .      .       .    .    .    __builtin_aarch64_st3v2di ((__builtin_aarch64_simd_di *) __a, __o);
      .    .    .       .      .      .       .    .    .  }
      .    .    .       .      .      .       .    .    .  
      .    .    .       .      .      .       .    .    .  __extension__ static __inline void __attribute__ ((__always_inline__))
      .    .    .       .      .      .       .    .    .  vst3q_u8 (uint8_t * __a, uint8x16x3_t val)
      .    .    .       .      .      .       .    .    .  {
      .    .    .       .      .      .       .    .    .    __builtin_aarch64_simd_ci __o;
129,600    0    0       0      0      0       0    0    0    __o = __builtin_aarch64_set_qregciv16qi (__o, (int8x16_t) val.val[0], 0);
129,600    0    0       0      0      0       0    0    0    __o = __builtin_aarch64_set_qregciv16qi (__o, (int8x16_t) val.val[1], 1);
129,600    0    0       0      0      0       0    0    0    __o = __builtin_aarch64_set_qregciv16qi (__o, (int8x16_t) val.val[2], 2);
129,600    0    0       0      0      0 388,800    0    0    __builtin_aarch64_st3v16qi ((__builtin_aarch64_simd_qi *) __a, __o);
      .    .    .       .      .      .       .    .    .  }
      .    .    .       .      .      .       .    .    .  
      .    .    .       .      .      .       .    .    .  __extension__ static __inline void __attribute__ ((__always_inline__))
      .    .    .       .      .      .       .    .    .  vst3q_u16 (uint16_t * __a, uint16x8x3_t val)
      .    .    .       .      .      .       .    .    .  {
      .    .    .       .      .      .       .    .    .    __builtin_aarch64_simd_ci __o;
      .    .    .       .      .      .       .    .    .    __o = __builtin_aarch64_set_qregciv8hi (__o, (int16x8_t) val.val[0], 0);
      .    .    .       .      .      .       .    .    .    __o = __builtin_aarch64_set_qregciv8hi (__o, (int16x8_t) val.val[1], 1);
-- line 23639 ----------------------------------------

--------------------------------------------------------------------------------
The following files chosen for auto-annotation could not be found:
--------------------------------------------------------------------------------
  /build/glibc-BinVK7/glibc-2.23/elf/dl-lookup.c
  /build/glibc-BinVK7/glibc-2.23/elf/../sysdeps/aarch64/dl-machine.h

--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw 
--------------------------------------------------------------------------------
99    1    1 97   96   98 99   99   99  percentage of events annotated

分析統計結果：

函數名稱	記憶體讀操作數量(Dr列)	一級資料緩存讀不命中次數(D1mr列)	最後一級資料緩存讀不命中次數(DLmr列)	記憶體寫操作數量(Dw列)	一級資料緩存寫不命中次數(D1mw列)	最後一級資料緩存寫不命中次數(DLmw列)
simple_rgb2bgr	4,147,200	97,201	97,201	4,147,200
rgb2bgr_with_neon	388,800	97,201	97,201	388,800

可以看出，

rgb2bgr_with_neon

函數的記憶體讀操作數量和記憶體寫操作數量分别為

simple_rgb2bgr

函數的 0.09375 倍，其他的都是相同的。是以，從 cache 的角度來看，NEON 大大減少了記憶體讀寫操作數量，進而改善程式性能。

如果你覺得本文對你有所幫助，歡迎關注公衆号，支援一下！

性能優化篇（4）：NEON優化案例——圖像顔色轉換之RGB到BGR（aarch64版）

性能優化篇（4）：NEON優化案例——圖像顔色轉換之RGB到BGR（aarch64版）

為什麼要交換圖像顔色通道

實作一個基本的圖像顔色通道交換函數

利用 NEON 加速圖像顔色通道交換

繼續閱讀

Android代碼記憶體優化建議-Android資源篇 Android資源優化

Android 記憶體優化 - 禁用DrawingCache減少記憶體消耗

記憶體洩露（七）-- 性能優化的幫助工具Allocation Tracker(Android Studio)

聲學研究：基于SEA模型的整車聲學包優化汽車NVH問題是各大汽車公司關注的重點。對于低頻噪聲分析，廣泛采用有限元分析方法

Android 過度繪制優化

Android性能優化-過度繪制

Go性能調優及相關工具使用（四）——性能調優工具pprof的使用

ORACLE 雜談

MySQL性能優化全攻略

前端性能優化（performance）

前端頁面性能優化，MeterSphere開源持續測試平台釋出v2.10.5 LTS

了解Linux記憶體性能名額前言Linux記憶體性能名額有哪些Linux記憶體是怎麼工作的記憶體性能名額總結參考連結

SQL性能優化前期準備-清除緩存、開啟IO統計

Cesium格式3dtile制作工具

實習心得（二）--關于段錯誤，記憶體洩露，性能瓶頸

遊戲性能優化（基礎）