Performance AVX mat4 inv的实施比SSE慢
我在SSE2和AVX中实现了4x4矩阵求逆。两者都比普通实现快。但如果启用了AVX(-mavx),则SSE2实现的运行速度比手动AVX实现快。似乎编译器使我的SSE2实现与AVX更加友好:( 在我的AVX实现中,乘法更少,加法更少…所以我希望AVX可以比SSE更快。也许像Performance AVX mat4 inv的实施比SSE慢,performance,matrix,intel,sse,avx,Performance,Matrix,Intel,Sse,Avx,我在SSE2和AVX中实现了4x4矩阵求逆。两者都比普通实现快。但如果启用了AVX(-mavx),则SSE2实现的运行速度比手动AVX实现快。似乎编译器使我的SSE2实现与AVX更加友好:( 在我的AVX实现中,乘法更少,加法更少…所以我希望AVX可以比SSE更快。也许像\u mm256\u permute2f128\u ps,\u mm256\u permutevar\u ps/\u mm256\u permute\u ps这样的指令会使AVX变慢?我没有尝试将SSE/XMM寄存器加载到AVX
\u mm256\u permute2f128\u ps
,\u mm256\u permutevar\u ps/\u mm256\u permute\u ps
这样的指令会使AVX变慢?我没有尝试将SSE/XMM寄存器加载到AVX/YMM寄存器
如何使我的AVX实现比SSE更快
我的CPU:Intel(R)Core(TM)i7-3615QM CPU@2.30GHz(常春藤桥)
初始矩阵:
Matrix (float4x4):
|0.0714 -0.6589 0.7488 2.0000|
|0.9446 0.2857 0.1613 4.0000|
|-0.3202 0.6958 0.6429 6.0000|
|0.0000 0.0000 0.0000 1.0000|
测试代码:
start = clock();
for (int i = 0; i < 1000000; i++) {
glm_mat4_inv_sse2(m, m);
// glm_mat4_inv_avx(m, m);
// glm_mat4_inv(m, m)
}
end = clock();
total = (float)(end - start) / CLOCKS_PER_SEC;
printf("%f secs\n\n", total);
SSE2实现(启用AVX)输出(使用锁销;选项-O3-mavx):
AVX实现输出(使用godbolt;选项-O3-mavx):
编辑:
我在macOS(MacBook Pro(Retina,2012年年中)15')上使用Xcode(版本10.0(10A255))来构建和运行带有-O3优化选项的测试。它使用clang编译测试代码。我在godbolt中使用GCC 8.2来查看asm(很抱歉),但程序集输出似乎类似
我是通过启用cglm选项启用shuffd的:cglm_USE_INT_DOMAIN。我在查看asm时忘记禁用它
#ifdef CGLM_USE_INT_DOMAIN
# define glmm_shuff1(xmm, z, y, x, w) \
_mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(xmm), \
_MM_SHUFFLE(z, y, x, w)))
#else
# define glmm_shuff1(xmm, z, y, x, w) \
_mm_shuffle_ps(xmm, xmm, _MM_SHUFFLE(z, y, x, w))
#endif
整个测试代码(标题除外):
新版本(SSE2版本是这样使用的):
新版本(快速版):
现在它比以前更好了,但仍然不比常春藤网桥CPU上的SSE快。我更新了测试结果。您的CPU是Intel IvyBridge Sandybridge/IvyBridge在不同的端口上具有每时钟1个mul并增加吞吐量,因此它们不会相互竞争 但对于256位洗牌和所有FP洗牌(甚至128位
shufps
),它的每时钟洗牌吞吐量只有1.但是,对于整数洗牌,它有2个时钟的吞吐量,我注意到您的编译器正在使用pshufd
作为FP指令之间的复制和洗牌。在为SSE2编译时,这是一个实实在在的胜利,特别是在VEX编码不可用的情况下(因此,它通过替换movaps xmm0、xmm1
/shufps xmm0、xmm0、65
或任何东西来保存movaps
)即使在AVX可用的情况下,您的编译器也在这样做,因此它本可以使用vshufps xmm0、xmm1、xmm1、65
,但出于微体系结构的原因,它要么聪明地选择了vpshufd
,要么它很幸运,要么它的启发式/指令成本模型就是根据这一点设计的。(我怀疑它是叮当作响的,但您没有在问题中说,也没有显示您编译的C源代码)
在Haswell和更高版本中(支持AVX2,因此支持每个整数洗牌的256位版本),所有洗牌只能在端口5上运行。但在仅支持AVX1的IvB中,只有FP洗牌才能达到256位。整数洗牌始终只有128位,并且可以在端口1或端口5上运行,因为这两个端口上都有128位洗牌执行单元。()
因为asm很长,所以我没有详细研究过它,但是如果使用更宽的向量来节省加法/乘法的开销,那么速度会更慢 此外,因为您的所有洗牌都变成FP洗牌,所以它们只在端口5上运行,而没有利用端口1。我怀疑洗牌太多,与端口0(FP乘法)或端口1(FP加法)相比,这是一个瓶颈 顺便说一句,Haswell和更高版本有两个FMA单元,p0和p1上各有一个,因此乘法运算的吞吐量是FMA单元的两倍。Skylake和更高版本也在这些FMA单元上运行FP add运算,因此它们每个时钟的吞吐量都是2倍。(如果你能有效地使用实际的FMA指令,你可以完成两倍的工作。) 此外,您的基准测试的是延迟,而不是吞吐量,因为输入和输出都是相同的
m
。不过,可能有足够的指令级并行性来限制混洗吞吐量
像vperm2f128
和vinsertf128
这样的通道交叉混洗在IvB上有两个周期延迟,而通道内混洗(包括所有128位混洗)只有一个周期延迟。Intel的指南声称不同的数字IIRC,但Agner Fog的实际测量结果是在依赖链中发现的两个周期。(这可能是1个周期+某种旁路延迟)。在Haswell和更高版本上,车道交叉洗牌是3个周期延迟
同样相关:有时,你可以使用一个未对齐的负载减少洗牌量,该负载在一个有用的点上分成128位的一半,然后用于车道洗牌。这对AVX1可能有用,因为它缺少
vpermps
或其他粒度小于128位的车道交叉洗牌。回答得好!我添加了更多细节到问题的底部,在使用clang测试时,我使用GCC查看asm,对此表示抱歉。是否可以在另一个端口(例如整型域)上使用\u mm256\u permutevar\u ps
,\u mm256\u shuffle\u ps
,使其速度更快(我将查看《英特尔Intrinsics指南》)?你有什么建议,我应该保留AVX版本,还是放弃AVX版本,将来使用SSE版本?@recp:不,没有任何英特尔CPU的每时钟吞吐量超过1个,用于任何256位洗牌。快速128位洗牌是Nehalem通过IvB提供的一项功能。构建宽洗牌单元需要大量晶体管,而英特尔削减了n为Haswell洗牌硬件。@recp:如果可能的话,您应该在Haswell和/或Skylake上进行基准测试。我还没有尝试手动或使用IACA()对您的代码进行静态分析要查看哪一个有更多的总洗牌,但如果它们在Haswell上的速度大致相同,则使用IvB上速度最快的一个。或者Haswell的AVX2+FMA版本可能会让您在跨车道256位洗牌时加速,但请确保您尝试在中分别添加FMA和AVX2
glm_mat4_inv_sse2:
movaps xmm8, XMMWORD PTR [rdi+32]
movaps xmm2, XMMWORD PTR [rdi+16]
movaps xmm5, XMMWORD PTR [rdi+48]
movaps xmm6, XMMWORD PTR [rdi]
movaps xmm4, xmm8
movaps xmm13, xmm8
movaps xmm11, xmm8
shufps xmm11, xmm2, 170
shufps xmm4, xmm5, 238
movaps xmm3, xmm11
movaps xmm1, xmm8
pshufd xmm12, xmm4, 127
shufps xmm13, xmm2, 255
movaps xmm0, xmm13
movaps xmm9, xmm8
pshufd xmm4, xmm4, 42
shufps xmm9, xmm2, 85
shufps xmm1, xmm5, 153
movaps xmm7, xmm9
mulps xmm0, xmm4
pshufd xmm10, xmm1, 42
movaps xmm1, xmm11
shufps xmm5, xmm8, 0
mulps xmm3, xmm12
pshufd xmm5, xmm5, 128
mulps xmm7, xmm12
mulps xmm1, xmm10
subps xmm3, xmm0
movaps xmm0, xmm13
mulps xmm0, xmm10
mulps xmm13, xmm5
subps xmm7, xmm0
movaps xmm0, xmm9
mulps xmm0, xmm4
subps xmm0, xmm1
movaps xmm1, xmm8
movaps xmm8, xmm11
shufps xmm1, xmm2, 0
mulps xmm8, xmm5
movaps xmm11, xmm7
mulps xmm4, xmm1
mulps xmm5, xmm9
movaps xmm9, xmm2
mulps xmm12, xmm1
shufps xmm9, xmm6, 85
pshufd xmm9, xmm9, 168
mulps xmm1, xmm10
movaps xmm10, xmm2
shufps xmm10, xmm6, 0
pshufd xmm10, xmm10, 168
subps xmm4, xmm8
mulps xmm7, xmm10
movaps xmm8, xmm2
shufps xmm2, xmm6, 255
shufps xmm8, xmm6, 170
pshufd xmm8, xmm8, 168
pshufd xmm2, xmm2, 168
mulps xmm11, xmm8
subps xmm12, xmm13
movaps xmm13, XMMWORD PTR .LC0[rip]
subps xmm1, xmm5
movaps xmm5, xmm3
mulps xmm5, xmm9
mulps xmm3, xmm10
subps xmm5, xmm11
movaps xmm11, xmm0
mulps xmm11, xmm2
mulps xmm0, xmm10
addps xmm5, xmm11
movaps xmm11, xmm12
mulps xmm11, xmm8
mulps xmm12, xmm9
xorps xmm5, xmm13
subps xmm3, xmm11
movaps xmm11, xmm4
mulps xmm4, xmm9
subps xmm7, xmm12
mulps xmm11, xmm2
mulps xmm2, xmm1
mulps xmm1, xmm8
subps xmm0, xmm4
addps xmm3, xmm11
movaps xmm11, XMMWORD PTR .LC1[rip]
addps xmm2, xmm7
addps xmm0, xmm1
movaps xmm1, xmm5
xorps xmm3, xmm11
xorps xmm2, xmm13
shufps xmm1, xmm3, 0
xorps xmm0, xmm11
movaps xmm4, xmm2
shufps xmm4, xmm0, 0
shufps xmm1, xmm4, 136
mulps xmm1, xmm6
pshufd xmm4, xmm1, 27
addps xmm1, xmm4
pshufd xmm4, xmm1, 65
addps xmm1, xmm4
movaps xmm4, XMMWORD PTR .LC2[rip]
divps xmm4, xmm1
mulps xmm5, xmm4
mulps xmm3, xmm4
mulps xmm2, xmm4
mulps xmm0, xmm4
movaps XMMWORD PTR [rsi], xmm5
movaps XMMWORD PTR [rsi+16], xmm3
movaps XMMWORD PTR [rsi+32], xmm2
movaps XMMWORD PTR [rsi+48], xmm0
ret
.LC0:
.long 0
.long 2147483648
.long 0
.long 2147483648
.LC1:
.long 2147483648
.long 0
.long 2147483648
.long 0
.LC2:
.long 1065353216
.long 1065353216
.long 1065353216
.long 1065353216
glm_mat4_inv_sse2:
vmovaps xmm9, XMMWORD PTR [rdi+32]
vmovaps xmm6, XMMWORD PTR [rdi+48]
vmovaps xmm2, XMMWORD PTR [rdi+16]
vmovaps xmm7, XMMWORD PTR [rdi]
vshufps xmm5, xmm9, xmm6, 238
vpshufd xmm13, xmm5, 127
vpshufd xmm5, xmm5, 42
vshufps xmm1, xmm9, xmm6, 153
vshufps xmm11, xmm9, xmm2, 170
vshufps xmm12, xmm9, xmm2, 255
vmulps xmm3, xmm11, xmm13
vpshufd xmm1, xmm1, 42
vmulps xmm0, xmm12, xmm5
vshufps xmm10, xmm9, xmm2, 85
vshufps xmm6, xmm6, xmm9, 0
vpshufd xmm6, xmm6, 128
vmulps xmm8, xmm10, xmm13
vmulps xmm4, xmm10, xmm5
vsubps xmm3, xmm3, xmm0
vmulps xmm0, xmm12, xmm1
vsubps xmm8, xmm8, xmm0
vmulps xmm0, xmm11, xmm1
vsubps xmm4, xmm4, xmm0
vshufps xmm0, xmm9, xmm2, 0
vmulps xmm9, xmm12, xmm6
vmulps xmm13, xmm0, xmm13
vmulps xmm5, xmm0, xmm5
vmulps xmm0, xmm0, xmm1
vsubps xmm12, xmm13, xmm9
vmulps xmm9, xmm11, xmm6
vmovaps xmm13, XMMWORD PTR .LC0[rip]
vmulps xmm6, xmm10, xmm6
vshufps xmm10, xmm2, xmm7, 85
vpshufd xmm10, xmm10, 168
vsubps xmm5, xmm5, xmm9
vshufps xmm9, xmm2, xmm7, 170
vpshufd xmm9, xmm9, 168
vsubps xmm1, xmm0, xmm6
vmulps xmm11, xmm8, xmm9
vshufps xmm0, xmm2, xmm7, 0
vshufps xmm2, xmm2, xmm7, 255
vmulps xmm6, xmm3, xmm10
vpshufd xmm2, xmm2, 168
vpshufd xmm0, xmm0, 168
vmulps xmm3, xmm3, xmm0
vmulps xmm8, xmm8, xmm0
vmulps xmm0, xmm4, xmm0
vsubps xmm6, xmm6, xmm11
vmulps xmm11, xmm4, xmm2
vaddps xmm6, xmm6, xmm11
vmulps xmm11, xmm12, xmm9
vmulps xmm12, xmm12, xmm10
vxorps xmm6, xmm6, xmm13
vsubps xmm3, xmm3, xmm11
vmulps xmm11, xmm5, xmm2
vmulps xmm5, xmm5, xmm10
vsubps xmm8, xmm8, xmm12
vmulps xmm2, xmm1, xmm2
vmulps xmm1, xmm1, xmm9
vaddps xmm3, xmm3, xmm11
vmovaps xmm11, XMMWORD PTR .LC1[rip]
vsubps xmm0, xmm0, xmm5
vaddps xmm2, xmm8, xmm2
vxorps xmm3, xmm3, xmm11
vaddps xmm0, xmm0, xmm1
vshufps xmm1, xmm6, xmm3, 0
vxorps xmm2, xmm2, xmm13
vxorps xmm0, xmm0, xmm11
vshufps xmm4, xmm2, xmm0, 0
vshufps xmm1, xmm1, xmm4, 136
vmulps xmm1, xmm1, xmm7
vpshufd xmm4, xmm1, 27
vaddps xmm1, xmm1, xmm4
vpshufd xmm4, xmm1, 65
vaddps xmm1, xmm1, xmm4
vmovaps xmm4, XMMWORD PTR .LC2[rip]
vdivps xmm1, xmm4, xmm1
vmulps xmm6, xmm6, xmm1
vmulps xmm3, xmm3, xmm1
vmulps xmm2, xmm2, xmm1
vmulps xmm1, xmm0, xmm1
vmovaps XMMWORD PTR [rsi], xmm6
vmovaps XMMWORD PTR [rsi+16], xmm3
vmovaps XMMWORD PTR [rsi+32], xmm2
vmovaps XMMWORD PTR [rsi+48], xmm1
ret
.LC0:
.long 0
.long 2147483648
.long 0
.long 2147483648
.LC1:
.long 2147483648
.long 0
.long 2147483648
.long 0
.LC2:
.long 1065353216
.long 1065353216
.long 1065353216
.long 1065353216
glm_mat4_inv_avx:
vmovaps ymm3, YMMWORD PTR [rdi]
vmovaps ymm1, YMMWORD PTR [rdi+32]
vmovdqa ymm2, YMMWORD PTR .LC1[rip]
vmovdqa ymm0, YMMWORD PTR .LC0[rip]
vperm2f128 ymm6, ymm3, ymm3, 3
vperm2f128 ymm5, ymm1, ymm1, 0
vperm2f128 ymm1, ymm1, ymm1, 17
vmovdqa ymm10, YMMWORD PTR .LC4[rip]
vpermilps ymm9, ymm5, ymm0
vpermilps ymm7, ymm1, ymm2
vperm2f128 ymm8, ymm6, ymm6, 0
vpermilps ymm1, ymm1, ymm0
vpermilps ymm5, ymm5, ymm2
vpermilps ymm0, ymm8, ymm0
vmulps ymm4, ymm7, ymm9
vpermilps ymm8, ymm8, ymm2
vpermilps ymm11, ymm6, 1
vmulps ymm2, ymm5, ymm1
vmulps ymm7, ymm0, ymm7
vmulps ymm1, ymm8, ymm1
vmulps ymm0, ymm0, ymm5
vmulps ymm5, ymm8, ymm9
vmovdqa ymm9, YMMWORD PTR .LC3[rip]
vmovdqa ymm8, YMMWORD PTR .LC2[rip]
vsubps ymm4, ymm4, ymm2
vsubps ymm7, ymm7, ymm1
vperm2f128 ymm2, ymm4, ymm4, 0
vperm2f128 ymm4, ymm4, ymm4, 17
vshufps ymm1, ymm2, ymm4, 77
vpermilps ymm1, ymm1, ymm9
vsubps ymm5, ymm0, ymm5
vpermilps ymm0, ymm2, ymm8
vmulps ymm0, ymm0, ymm11
vperm2f128 ymm1, ymm1, ymm2, 0
vshufps ymm2, ymm2, ymm4, 74
vpermilps ymm4, ymm6, 90
vmulps ymm1, ymm1, ymm4
vpermilps ymm2, ymm2, ymm10
vpermilps ymm6, ymm6, 191
vmovaps ymm11, YMMWORD PTR .LC5[rip]
vperm2f128 ymm2, ymm2, ymm2, 0
vperm2f128 ymm4, ymm3, ymm3, 0
vpermilps ymm12, ymm4, YMMWORD PTR .LC7[rip]
vmulps ymm2, ymm2, ymm6
vinsertf128 ymm6, ymm7, xmm5, 1
vperm2f128 ymm5, ymm7, ymm5, 49
vshufps ymm7, ymm6, ymm5, 77
vpermilps ymm9, ymm7, ymm9
vsubps ymm0, ymm0, ymm1
vpermilps ymm1, ymm4, YMMWORD PTR .LC6[rip]
vpermilps ymm4, ymm4, YMMWORD PTR .LC8[rip]
vaddps ymm2, ymm0, ymm2
vpermilps ymm0, ymm6, ymm8
vshufps ymm6, ymm6, ymm5, 74
vpermilps ymm6, ymm6, ymm10
vmulps ymm1, ymm1, ymm0
vmulps ymm0, ymm12, ymm9
vmulps ymm6, ymm4, ymm6
vxorps ymm2, ymm2, ymm11
vdpps ymm3, ymm3, ymm2, 255
vsubps ymm0, ymm1, ymm0
vdivps ymm2, ymm2, ymm3
vaddps ymm0, ymm0, ymm6
vxorps ymm0, ymm0, ymm11
vdivps ymm0, ymm0, ymm3
vperm2f128 ymm5, ymm2, ymm2, 3
vshufps ymm1, ymm2, ymm5, 68
vshufps ymm2, ymm2, ymm5, 238
vperm2f128 ymm4, ymm0, ymm0, 3
vshufps ymm6, ymm0, ymm4, 68
vshufps ymm0, ymm0, ymm4, 238
vshufps ymm3, ymm1, ymm6, 136
vshufps ymm1, ymm1, ymm6, 221
vinsertf128 ymm1, ymm3, xmm1, 1
vshufps ymm3, ymm2, ymm0, 136
vshufps ymm0, ymm2, ymm0, 221
vinsertf128 ymm0, ymm3, xmm0, 1
vmovaps YMMWORD PTR [rsi], ymm1
vmovaps YMMWORD PTR [rsi+32], ymm0
vzeroupper
ret
.LC0:
.long 2
.long 1
.long 1
.long 0
.long 0
.long 0
.long 0
.long 0
.LC1:
.long 3
.long 3
.long 2
.long 3
.long 2
.long 1
.long 1
.long 1
.LC2:
.long 0
.long 0
.long 1
.long 2
.long 0
.long 0
.long 1
.long 2
.LC3:
.long 0
.long 1
.long 1
.long 2
.long 0
.long 1
.long 1
.long 2
.LC4:
.long 0
.long 2
.long 3
.long 3
.long 0
.long 2
.long 3
.long 3
.LC5:
.long 0
.long 2147483648
.long 0
.long 2147483648
.long 2147483648
.long 0
.long 2147483648
.long 0
.LC6:
.long 1
.long 0
.long 0
.long 0
.long 1
.long 0
.long 0
.long 0
.LC7:
.long 2
.long 2
.long 1
.long 1
.long 2
.long 2
.long 1
.long 1
.LC8:
.long 3
.long 3
.long 3
.long 2
.long 3
.long 3
.long 3
.long 2
#ifdef CGLM_USE_INT_DOMAIN
# define glmm_shuff1(xmm, z, y, x, w) \
_mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(xmm), \
_MM_SHUFFLE(z, y, x, w)))
#else
# define glmm_shuff1(xmm, z, y, x, w) \
_mm_shuffle_ps(xmm, xmm, _MM_SHUFFLE(z, y, x, w))
#endif
#include <cglm/cglm.h>
#include <sys/time.h>
#include <time.h>
int
main(int argc, const char * argv[]) {
CGLM_ALIGN(32) mat4 m = GLM_MAT4_IDENTITY_INIT;
double start, end, total;
/* generate invertible matrix */
glm_translate(m, (vec3){1,2,3});
glm_rotate(m, M_PI_2, (vec3){1,2,3});
glm_translate(m, (vec3){1,2,3});
glm_mat4_print(m, stderr);
start = clock();
for (int i = 0; i < 1000000; i++) {
glm_mat4_inv_sse2(m, m);
// glm_mat4_inv_avx(m, m);
// glm_mat4_inv(m, m);
}
end = clock();
total = (float)(end - start) / CLOCKS_PER_SEC;
printf("%f secs\n\n", total);
glm_mat4_print(m, stderr);
}
r1 = _mm256_div_ps(r1, y4);
r2 = _mm256_div_ps(r2, y4);
y5 = _mm256_div_ps(_mm256_set1_ps(1.0f), y4);
r1 = _mm256_mul_ps(r1, y5);
r2 = _mm256_mul_ps(r2, y5);
y5 = _mm256_rcp_ps(y4);
r1 = _mm256_mul_ps(r1, y5);
r2 = _mm256_mul_ps(r2, y5);