Optimizations in the compiler
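Comparison of different compilers
The table below compares the optimizations performed by different compilers.
It must be emphasized that a compiler may behave differently on different test examples, so the table should be taken as a guide only.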
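Optimization method | Gnu | Clang | Microsoft | Intel |
---|---|---|---|---|
General optimizations | | | | |
Function inlining | x | x | x | x |
Constant folding | x | x | x | x |
Constant propagation | x | x | x | x |
Constant propagation through loop | x | x | - | - |
Pointer elimination | x | x | x | x |
Common subexpression elimination | x | x | x | x |
Register variables | x | x | x | x |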
Fused multiply and add | x | x | x | x |
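Live range analysis | x | x | x | x |
Join identical branches | x | x | x | x |
Eliminate jumps | x | x | x | x |
Tail calls | x | x | x | x |
Remove branch that is always false | x | x | x | x |
Loop unrolling, array loops | x | x | x | x |
Loop unrolling, structures | x | x | x | - |
Loop invariant code motion | x | x | x | x |
Induction variables for array elements | x | x | x | x |
Induction variables for integer expressions | - | x | 1 | x |
Induction variables for floating point expressions | - | - | - | x |
Multiple accumulators, integer | - | x | x | - |
Multiple accumulators, floating point | - | x | x | - |
Devirtualization | x | x | x | x |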
Profile-guided optimization | x | x | x | x |
Whole program optimization | x | x | x | x |
Integer algebra reductions | | | | |
a+b = b+a, a*b = b*a (commutative) | x | x | x | x |
(a+b)+c = a+(b+c), (a*b)*c = a*(b*c) (associative) | - | x | x | - |
a*b + a*c = a*(b+c) (distributive) | x | x | x | x |
a+b+c+d = (a+b)+(c+d) (improves parallelism) | - | - | - | x |
a*b*c*d = (a*b)*(c*d) (improves parallelism) | - | x | - | x |
x*x*x*x*x*x*x*x = ((x²)²)² | x | x | - | x |
a+a+a+a = a*4 | x | x | x | x |
a*x*x*x + b*x*x + c*x + d = ((a*x+b)*x+c)*x + d | x | x | x | x |
-(-a) = a | x | x | x | x |
a-(-b) = a+b | x | x | x | x |
a-a = 0 | x | x | x | x |
a+0 = a | x | x | x | x |
a*0 = 0 | x | x | x | x |
a*1 = a | x | x | x | x |
(-a)*(-b) = a*b | x | x | x | x |
a/a = 1 | x | x | x | - |
a/1 = a | x | x | x | x |
0/a = 0 | x | x | x | - |
Multiply by constant = shift and add | x | x | x | x |
Divide by constant = multiply and shift | x | x | x | x |
Divide by power of 2 = shift | x | x | x | x |
(-a == -b) = (a == b) | x | x | x | - |
(a+c == b+c) = (a==b) | - | x | x | x |
!(a < b) = (a >= b) | x | x | x | x |
(a<b && b<c && a<c) == (a<b && b<c) | x | - | - | - |
Floating point algebra reductions | | | | |
a+b = b+a, a*b = b*a (commutative) | x | x | x | x |
(a+b)+c = a+(b+c) (associative) | x | x | - | x |
(a*b)*c = a*(b*c) (associative) | x | x | - | - |
a*b + a*c = a*(b+c) (distributive) | x | x | x | x |
a+b+c+d = (a+b)+(c+d), a*b*c*d = (a*b)*(c*d) | x | x | - | - |
a*x*x*x + b*x*x + c*x + d = ((a*x+b)*x+c)*x + d | x | x | x | x |
x*x*x*x*x*x*x*x = ((x²)²)² | x | x | - | - |
a+a+a+a = a*4 | x | x | x | - |
-(-a) = a | x | x | x | x |
a-(-b) = a+b | x | x | x | x |
a-a = 0 | x | x | x | x |
a+0 = a | x | x | x | x |
a*0 = 0 | x | x | x | x |
a*1 = a | x | x | x | x |
(-a)*(-b) = a*b | x | x | x | x |
a/a = 1 | x | x | - | - |
a/1 = a | x | x | x | x |
0/a = 0 | x | x | x | - |
(-a == -b) = (a == b) | x | x | - | - |
(-a > -b) = (a < b) | x | x | x | - |
Divide by constant = multiply by reciprocal | x | x | x | x |
Boolean algebra reductions | | | | |
Boolean operations without branching | x | x | - | rarely |
a && b = b && a, a\|\|b = b\|\|a (commutative) | x | x | - | x |
a && b && c = a && (b && c) (associative) | - | - | - | x |
(a&&b)\|\|(a&&c) = a&&(b\|\|c) (distributive) | x | x | - | - |
(a\|\|b)&&(a\|\|c) = a\|\|(b&&c) (distributive) | x | x | - | - |
!(!a) = a | x | x | x | x |
!a && !b = !(a \|\| b) (De Morgan) | x | x | - | - |
a && !a = false, a \|\| !a = true | x | x | x | x |
a && true = a, a \|\| false = a | x | x | x | x |
a && false = false, a \|\| true = true | x | x | x | x |
a && a = a | x | x | x | x |
(a&&b) \|\| (a&&!b) = a | x | - | x | x |
(a&&b) \|\| (!a&&c) = a ? b : c | - | x | - | - |
(a&&b) \|\| (!a&&c) \|\| (b&&c) = a ? b : c | x | - | x | x |
(a&&b) \|\| (a&&b&&c) = a&&b | x | x | x | x |
(a&&!b) \|\| (!a&&b) = a XOR b | x | x | - | - |
Bitwise algebra reductions in vector registers | | | | |
a & b = b & a, a\|b = b\|a (commutative) | x | x | - | - |
a & b & c = a & (b & c) (associative) | x | x | - | - |
(a&b)\|(a&c) = a&(b\|c) (distributive) | x | x | - | - |
(a\|b)&(a\|c) = a\|(b&c) (distributive) | x | x | - | - |
Ternary logic instruction | - | - | - | x |
~(~a) = a | x | x | - | - |
~a & ~b = ~(a \| b) | x | x | - | - |
a & ~a = false, a \| ~a = true | x | x | - | - |
a & true = a, a \| false = a | x | x | - | - |
a & false = false | x | x | x | x |
a \| true = true | x | x | x | - |
a & a = a, a \| a = a | x | x | - | x |
(a&b) \| (a&~b) = a | x | x | - | - |
(a&b) \| (~a&c) = a ? b : c | x | - | - | - |
(a&b) \| (~a&c) \| (b&c) = a ? b : c | - | - | - | - |
(a&b) \| (a&b&c) = a&b | x | x | - | - |
(a&~b) \| (~a&b) = a ^ b | x | x | - | - |
~a ^ ~b = a ^ b | x | x | - | - |
a << b << c = a << (b+c) | - | - | - | - |
Integer vector algebra reductions | | | | |
a+b = b+a, a*b = b*a (commutative) | x | x | - | - |
(a+b)+c = a+(b+c), (a*b)*c = a*(b*c) (associative) | x | x | - | - |
a*b + a*c = a*(b+c) (distributive) | x | x | - | - |
a+b+c+d = (a+b)+(c+d) | - | - | - | - |
x*x*x*x*x*x*x*x = ((x²)²)² | x | x | - | - |
a+a+a+a = a*4 | - | x | - | - |
a*x*x*x + b*x*x + c*x + d = ((a*x+b)*x+c)*x + d | x | x | - | - |
-(-a) = a | x | x | - | - |
a-(-b) = a+b | x | x | - | - |
a-a = 0 | x | x | - | x |
a+0 = a | x | x | - | - |
a*0 = 0 | x | x | - | x |
a*1 = a | x | x | - | - |
(-a)*(-b) = a*b | x | x | - | - |
Multiply by power of 2 = shift | x | x | x | x |
(-a == -b) = (a == b) | - | x | - | - |
(a+c == b+c) = (a == b) | - | x | - | - |
!(a < b) = (a >= b) | - | - | - | - |
(a<b && b<c && a<c) == (a<b && b<c) | - | - | - | - |
Floating point vector algebra reductions | | | | |
a+b = b+a, a*b = b*a (commutative) | x | x | - | x |
(a+b)+c = a+(b+c), (a*b)*c = a*(b*c) (associative) | x | x | - | - |
a*b + a*c = a*(b+c) (distributive) | x | x | - | - |
a+b+c+d = (a+b)+(c+d) | - | - | - | - |
x*x*x*x*x*x*x*x = ((x²)²)² | x | x | - | - |
a+a+a+a = a*4 | - | x | - | 2*a+a+a |
a*x*x*x + b*x*x + c*x + d = ((a*x+b)*x+c)*x + d | x | x | - | x |
-(-a) = a | x | x | - | - |
a-(-b) = a+b | - | - | - | - |
a-a = 0 | x | x | - | x |
a+0 = a | x | x | x | x |
a*0 = 0 | x | x | - | x |
a*1 = a | x | x | - | x |
(-a)*(-b) = a*b | - | - | - | - |
a/a = 1 | x | x | - | - |
a/1 = a | - | x | - | - |
0/a = 0 | x | x | - | - |
Divide by constant = multiply by reciprocal | - | - | - | - |
(-a == -b) = (a == b) | - | - | - | - |
!(a < b) = (a >= b) | - | - | - | - |
General vector optimizations | | | | |
Automatic vectorization | x | x | 256 bit | x |
Merge broadcast into instruction | - | x | - | x |
Merge blend into masked instruction | x | x | - | x |
Merge conditional zero into masked instruction | x | - | - | x |
Merge boolean AND into masked compare | x | x | - | x |
Eliminate mask that is all true | x | x | x | x |
Eliminate mask that is all false | x | x | - | x |
Table 8.1. Comparison of optimizations in different C++ compilers
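To make a few of the row labels concrete, here is a minimal sketch of what two of the integer reductions in Table 8.1 look like when written out by hand. The function names are hypothetical and are not taken from the test code behind the table. A compiler marked x in the corresponding row is expected to turn the first form into something like the second on its own.

```cpp
// Illustrative only: hand-written "before" and "after" forms of two
// integer algebra reductions listed in the table. All names are hypothetical.
// The functions are constexpr only so that the results can be checked
// at compile time; this requires C++14 or later.

// x*x*x*x*x*x*x*x as written: 7 multiplications.
constexpr long long pow8_naive(long long x) {
    return x * x * x * x * x * x * x * x;
}

// ((x^2)^2)^2: the same value with 3 multiplications.
constexpr long long pow8_squared(long long x) {
    long long x2 = x * x;    // x^2
    long long x4 = x2 * x2;  // x^4
    return x4 * x4;          // x^8
}

// a*x*x*x + b*x*x + c*x + d as written: 6 multiplications, 3 additions.
constexpr long long poly_naive(long long a, long long b, long long c,
                               long long d, long long x) {
    return a * x * x * x + b * x * x + c * x + d;
}

// Horner form ((a*x+b)*x+c)*x + d: 3 multiplications, 3 additions.
constexpr long long poly_horner(long long a, long long b, long long c,
                                long long d, long long x) {
    return ((a * x + b) * x + c) * x + d;
}
```

Comparing the assembly output of the two forms at full optimization is one way to check a row of the table against a particular compiler version.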
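The tests were carried out by compiling test code for 64-bit Windows with all relevant optimization options turned on, including relaxed floating point precision. The following compiler versions were tested: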
Gnu C++ v.7.4.0 (2019, Cygwin64).
Clang C++ v.5.0.1 (2019, Cygwin64).
Microsoft C++ Compiler v.19.21.27702 (Visual Studio 2019).
Intel C++ Compiler v.19.0.4.245 for Intel64, 2019.
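The Clang and Gnu compilers performed best in these tests, while the Microsoft compiler performed only moderately on vector code. For automatic vectorization, the current Microsoft compiler uses 256-bit vectors rather than 512-bit vectors. The Intel compiler can use 512-bit vectors automatically, but only if the option `/Qopt-zmm-usage:high` is specified.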
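As a rough illustration of what the vector width means for automatic vectorization, the sketch below adds 16 `float` values. Assuming a 32-bit `float`, sixteen of them fill exactly one 512-bit register, or two 256-bit registers, so a compiler using 512-bit vectors can in principle handle the whole loop with a single vector addition. The `Vec16` type and the `add16` function are hypothetical names for this example. Whether a given compiler actually vectorizes the loop depends on its version and options.

```cpp
// Illustrative only: a fixed-size loop that is a natural candidate for
// automatic vectorization. Vec16, add16 and kLanes are hypothetical names.
// constexpr is used only so the result can be checked at compile time
// (C++14 or later).

constexpr int kLanes = 512 / 32;  // 16 floats of 32 bits fill one 512-bit vector

struct Vec16 {
    float v[kLanes];
};

// Element-wise sum. There is no dependency between iterations, so the
// compiler is free to process several elements per instruction.
constexpr Vec16 add16(const Vec16& a, const Vec16& b) {
    Vec16 r{};
    for (int i = 0; i < kLanes; ++i) {
        r.v[i] = a.v[i] + b.v[i];
    }
    return r;
}
```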
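The Clang compiler tends to unroll loops excessively. Excessive loop unrolling can reduce performance because the unrolled code fills up the micro-op cache or the loopback buffer in the CPU.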
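To show what loop unrolling does to the code, the sketch below gives a simple summation loop together with a version unrolled by a factor of four by hand. The function names are hypothetical. The unrolled version spends fewer loop-control instructions per element, but its body is roughly four times as large. That growth in code size is the cost that matters for the micro-op cache and loopback buffer when a compiler unrolls much further than this.

```cpp
#include <cstddef>

// Illustrative only: a rolled loop and a hand-unrolled equivalent.
// Unsigned arithmetic is used so that reordering the additions can never
// change the result. constexpr is only for compile-time checking (C++14).

// One addition and one loop test per element.
constexpr unsigned sum_rolled(const unsigned* a, std::size_t n) {
    unsigned s = 0;
    for (std::size_t i = 0; i < n; ++i) {
        s += a[i];
    }
    return s;
}

// Unrolled by four with four independent accumulators:
// one loop test per four elements, but a four times larger loop body.
constexpr unsigned sum_unrolled4(const unsigned* a, std::size_t n) {
    unsigned s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    // Remainder loop for the last n % 4 elements.
    for (; i < n; ++i) {
        s0 += a[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```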