SIMD instructions for floating-point equality comparison (with NaN == NaN)


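Which instructions are used to compare two 128-bit vectors, each consisting of 4 × 32-bit floating-point values?

Is there an instruction that treats a NaN value on both sides as equal? If not, how large is the performance impact of a workaround that provides reflexivity (i.e. NaN equals NaN)?

I have heard that ensuring reflexivity would have a significant performance impact compared with IEEE semantics, where NaN does not equal itself, and I would like to know how large that impact would be.

I know that you typically want epsilon comparisons rather than exact equality when dealing with floating-point values. But this question is about exact equality comparisons, which you could use, for example, to eliminate duplicate values in a hash set.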

Requirements

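- `+0` and `-0` must compare as equal.
- `NaN` must compare equal to itself.
- Different representations of NaN should compare equal, but that requirement may be sacrificed if the performance impact is too large.
- The result should be a boolean: `true` if all four float elements are the same in both vectors, and `false` if at least one element differs. `true` is represented by the scalar integer `1` and `false` by `0`.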

Test cases

(NaN, 0, 0, 0) == (NaN, 0, 0, 0) // for all representations of NaN
(-0,  0, 0, 0) == (+0,  0, 0, 0) // equal despite different bitwise representations
(1,   0, 0, 0) == (1,   0, 0, 0)
(0,   0, 0, 0) != (1,   0, 0, 0) // at least one different element => not equal 
(1,   0, 0, 0) != (0,   0, 0, 0)
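
As a point of reference, the requirements and test cases above can be captured in a scalar C sketch. The function names `reflexive_eq` and `vec4_reflexive_eq` are hypothetical, and this is only a plain-C oracle to check SIMD versions against, not a proposed implementation.

```c
#include <math.h>     /* isnan, NAN */
#include <stdbool.h>

/* Hypothetical helper: reflexive equality for a single element.
   IEEE == already treats -0 and +0 as equal; the extra term makes
   any NaN encoding equal to any other NaN encoding. */
bool reflexive_eq(float a, float b)
{
    return a == b || (isnan(a) && isnan(b));
}

/* Returns 1 if all four elements match reflexively, otherwise 0. */
int vec4_reflexive_eq(const float x[4], const float y[4])
{
    for (int i = 0; i < 4; i++) {
        if (!reflexive_eq(x[i], y[i]))
            return 0;
    }
    return 1;
}
```

This version assumes the compiler is not run with options such as `-ffast-math`, which can make `isnan` unreliable.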

My ideas for an implementation

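I think it might be possible to combine two NotLessThan comparisons (`CMPNLTPS`?) using AND to achieve the desired result: the assembler equivalent of `AllTrue(!(x < y) and !(y < x))` or `AllFalse((x < y) or (y < x))`.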
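
The NotLessThan idea can be sketched with SSE intrinsics as follows. The function name `all_not_less_either_way` is hypothetical. One caveat is worth checking before relying on it: `CMPNLTPS` is also true for unordered operands, so this combination reports a NaN as equal to any ordinary number, which is looser than the requirements ask for.

```c
#include <math.h>       /* NAN, used when building test vectors */
#include <xmmintrin.h>  /* SSE: _mm_cmpnlt_ps, _mm_and_ps, _mm_movemask_ps */

/* Hypothetical name: AllTrue(!(x < y) && !(y < x)) built from two CMPNLTPS. */
int all_not_less_either_way(__m128 x, __m128 y)
{
    __m128 nlt_xy = _mm_cmpnlt_ps(x, y);  /* all-ones where !(x < y), including unordered */
    __m128 nlt_yx = _mm_cmpnlt_ps(y, x);  /* all-ones where !(y < x), including unordered */
    __m128 both   = _mm_and_ps(nlt_xy, nlt_yx);
    return _mm_movemask_ps(both) == 0xF;  /* 1 only if all four lanes are set */
}
```

With this sketch, `(NaN, 0, 0, 0)` compares equal to itself, but it also compares equal to `(1, 0, 0, 0)`.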
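Background

The background to this question is Microsoft's plan to add a Vector type to .NET. I am arguing for a reflexive `.Equals` method there, and I need a clearer picture of how large the performance impact of a reflexive Equals would be compared with an IEEE equals. See the discussion on programmers.se for the details.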
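Here is one possible solution. It is not very efficient, however, requiring 6 instructions:

#include <stdbool.h>
#include <smmintrin.h>                          // SSE4.1, for _mm_testz_si128

__m128 v0, v1;                                  // float vectors

__m128 v0nan = _mm_cmpeq_ps(v0, v0);            // test v0 for NaNs (0 where NaN)
__m128 v1nan = _mm_cmpeq_ps(v1, v1);            // test v1 for NaNs (0 where NaN)
__m128 vnan  = _mm_or_ps(v0nan, v1nan);         // combine: 0 only where both are NaN
__m128 vcmp  = _mm_cmpneq_ps(v0, v1);           // compare floats: all-ones where not IEEE-equal
vcmp = _mm_and_ps(vcmp, vnan);                  // combine with the NaN test
bool cmp = _mm_testz_si128(_mm_castps_si128(vcmp), _mm_castps_si128(vcmp)); // true if all equal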

Note that all the logic above is inverted, which may make the code a little hard to follow (the ORs are effectively ANDs, and vice versa).
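Even AVX `VCMPPS` (with its greatly enhanced choice of predicates) doesn't give us a single-instruction predicate for this. You have to do at least two compares and combine the results. It's not too bad, though.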

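- Different NaN encodings are not equal: effectively 2 extra insns (adding 2 uops). Without AVX: one extra `movaps` on top of that.
- Different NaN encodings are equal: effectively 4 extra insns (adding 4 uops). Without AVX: two extra `movaps` insns.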


An IEEE compare-and-branch is 3 uops: `cmpeqps` / `movmskps` / test-and-branch. Intel and AMD both macro-fuse the test-and-branch into a single uop/m-op.
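
For comparison with the reflexive variants, the plain IEEE baseline can be sketched in C intrinsics. The function name `ieee_all_equal` is hypothetical; the two intrinsics correspond to `cmpeqps` and `movmskps`, and the final comparison stands in for the compare-and-branch.

```c
#include <math.h>       /* NAN, used when building test vectors */
#include <xmmintrin.h>  /* SSE: _mm_cmpeq_ps, _mm_movemask_ps */

/* Hypothetical name: 1 if all four lanes are IEEE-equal, otherwise 0.
   A NaN in either operand makes that lane compare unequal. */
int ieee_all_equal(__m128 a, __m128 b)
{
    __m128 eq = _mm_cmpeq_ps(a, b);       /* cmpeqps: all-ones where ordered and equal */
    return _mm_movemask_ps(eq) == 0xF;    /* movmskps, then check all four sign bits */
}
```

Under these semantics `-0` and `+0` compare equal, but `(NaN, 0, 0, 0)` does not compare equal to itself.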
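With AVX512: bitwise-NaN is probably just one extra instruction, since a normal vector compare-and-branch probably uses `vcmpEQ_OQps` / `ktest same,same` / `jcc`, so combining two different mask registers is free (just change the args to `ktest`). The only cost is the extra `vpcmpeqd k2, xmm0, xmm1`.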

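AVX512 any-NaN is just two extra instructions (2x `VFPCLASSPS`, with the second one using the result of the first as a zero-mask; see below). Again, `ktest` with two different args then sets the flags.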

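My best idea so far: `ieee_equal || bitwise_equal`

If we give up on considering different NaN encodings equal to each other:

- Bitwise equal catches two identical NaNs.
- IEEE equal catches the `+0 == -0` case.

There are no cases where either compare gives a false positive (since `ieee_equal` is false when either operand is NaN: we want just equal, not equal-or-unordered. AVX `vcmpps` provides both options, while SSE only provides a plain equal operation.)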
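We want to know when all elements are equal, so we should start with inverted comparisons. It's easier to check for at least one non-zero element than to check for all elements being non-zero (i.e. horizontal AND is hard, horizontal OR is easy: `pmovmskb` / `test`, or `ptest`). Taking the opposite sense of a comparison is free (`jnz` instead of `jz`). This is the same trick that Paul R used.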
; inputs in xmm0, xmm1
movaps    xmm2, xmm0    ; unneeded with 3-operand AVX instructions

cmpneqps  xmm2, xmm1    ; 0:A and B are ordered and equal.  -1:not ieee_equal.  predicate=NEQ_UQ in VEX encoding expanded notation
pcmpeqd   xmm0, xmm1    ; -1:bitwise equal  0:otherwise

; xmm0   xmm2
;   0      0   -> equal   (ieee_equal only)
;   0     -1   -> unequal (neither)
;  -1      0   -> equal   (bitwise equal and ieee_equal)
;  -1     -1   -> equal   (bitwise equal only: only happens when both are NaN)

andnps    xmm0, xmm2    ; NOT(xmm0) AND xmm2
; xmm0 elements are -1 where  (not bitwise equal) AND (not IEEE equal).
; xmm0 all-zero iff every element was bitwise or IEEE equal, or both
movmskps  eax, xmm0
test      eax, eax      ; it's too bad movmsk doesn't set EFLAGS according to the result
jz no_differences
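
The assembly sequence above maps fairly directly onto SSE2 intrinsics. This is a sketch under that assumption: the function name `reflexive_equal_bitwise_nan` is hypothetical, and a compiler is free to schedule or select instructions differently from the hand-written version.

```c
#include <math.h>       /* NAN, used when building test vectors */
#include <emmintrin.h>  /* SSE2: _mm_cmpeq_epi32 and the cast intrinsics */

/* Hypothetical name: 1 if every lane is IEEE-equal or bitwise-equal.
   NaNs with different encodings still compare unequal here. */
int reflexive_equal_bitwise_nan(__m128 a, __m128 b)
{
    /* cmpneqps: all-ones where not IEEE-equal (unequal or unordered) */
    __m128 neq = _mm_cmpneq_ps(a, b);
    /* pcmpeqd: all-ones where the 32-bit patterns are identical */
    __m128 biteq = _mm_castsi128_ps(
        _mm_cmpeq_epi32(_mm_castps_si128(a), _mm_castps_si128(b)));
    /* andnps: NOT(biteq) AND neq, set only where neither test passed */
    __m128 differ = _mm_andnot_ps(biteq, neq);
    /* movmskps + test: no lane may be flagged as different */
    return _mm_movemask_ps(differ) == 0;
}
```

Two NaNs built from the same constant compare equal with this function, while NaNs with different payload bits do not.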

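For double precision, the `...PD` forms and `pcmpeqq` work the same way.

If the not-equal code goes on to find out which element isn't equal, a bit-scan on the `movmskps` result will give you the position of the first difference.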
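
The bit-scan step might look like the following sketch. The function name `first_difference` is hypothetical, and `__builtin_ctz` is a GCC/Clang builtin assumed here as the bit-scan; other compilers would need their own equivalent.

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical name: index (0..3) of the first lane that is neither
   IEEE-equal nor bitwise-equal, or -1 if there is no such lane. */
int first_difference(__m128 a, __m128 b)
{
    __m128 neq = _mm_cmpneq_ps(a, b);
    __m128 biteq = _mm_castsi128_ps(
        _mm_cmpeq_epi32(_mm_castps_si128(a), _mm_castps_si128(b)));
    int mask = _mm_movemask_ps(_mm_andnot_ps(biteq, neq)); /* one bit per lane */
    if (mask == 0)
        return -1;                          /* no difference found */
    return __builtin_ctz((unsigned)mask);   /* lowest set bit = first differing lane */
}
```

Lane 0 is the lowest element of the vector, i.e. the first argument of `_mm_setr_ps`.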
With SSE4.1 `PTEST` you can replace `andnps` / `movmskps` / test-and-branch with:
ptest    xmm0, xmm2   ; CF =  0 == (NOT(xmm0) AND xmm2).
jc no_differences

I expect that's the first time most people have ever seen the CF result of `PTEST` be useful for anything. :)
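
In intrinsics, the CF result of `PTEST` is exposed as `_mm_testc_si128`. The sketch below assumes GCC or Clang (for the `target` attribute, which enables SSE4.1 for this one function) and uses the hypothetical name `reflexive_equal_ptest`.

```c
#include <math.h>       /* NAN, used when building test vectors */
#include <smmintrin.h>  /* SSE4.1: _mm_testc_si128 */

/* Hypothetical name: same test as the andnps/movmskps version, using PTEST.
   The target attribute is a GCC/Clang extension (an assumption of this sketch). */
__attribute__((target("sse4.1")))
int reflexive_equal_ptest(__m128 a, __m128 b)
{
    /* pcmpeqd: all-ones where bitwise equal */
    __m128i biteq = _mm_cmpeq_epi32(_mm_castps_si128(a), _mm_castps_si128(b));
    /* cmpneqps: all-ones where not IEEE-equal */
    __m128i neq = _mm_castps_si128(_mm_cmpneq_ps(a, b));
    /* ptest: returns CF, which is 1 when (NOT biteq AND neq) is all zero */
    return _mm_testc_si128(biteq, neq);
}
```

The return value is already the 0/1 integer the question asks for, which is the `setcc` case where `PTEST` saves a uop.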
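It's still three uops on Intel and AMD CPUs ((2 for `ptest` + 1 for `jcc`) vs (`pandn` + `movmsk` + fused test-and-branch)), but fewer instructions. It is more efficient if you're going to `setcc` or `cmovcc` instead of `jcc`, since those can't macro-fuse with `test`.

To also treat different NaN encodings as equal, test `ieee_equal || (A is NaN && B is NaN)`: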
; inputs in xmm0, xmm1
movaps      xmm2, xmm0
cmpunordps  xmm2, xmm2      ; find NaNs in A (-1:NaN  0:anything else)
movaps      xmm3, xmm1
cmpunordps  xmm3, xmm3      ; find NaNs in B
andps       xmm2, xmm3      ; xmm2 = (-1:both NaN  0:anything else)
; now in the same boat as before: xmm2 is set for elements we want to consider equal, even though they're not IEEE equal

cmpeqps     xmm0, xmm1      ; -1:ieee_equal  0:unordered or unequal
; xmm0   xmm2 
;  -1      0     -> equal   (ieee_equal)
;  -1     -1     -> equal   (ieee_equal and both NaN (impossible))
;   0      0     -> unequal (neither)
;   0     -1     -> equal   (both NaN)

orps        xmm0, xmm2      ; 0: unequal.  -1:reflexive_equal
movmskps    eax, xmm0
cmp         eax, 0xF        ; all four elements must be reflexive_equal (test/jnz would fire if any one was)
je   equal_reflexive
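
The any-NaN-encoding version can be sketched with SSE intrinsics as follows. The function name `reflexive_equal_any_nan` is hypothetical. Note that this sketch requires all four lanes to be set in the final mask, since the requirement is that every element matches.

```c
#include <math.h>       /* NAN, used when building test vectors */
#include <emmintrin.h>  /* SSE2 (only the test vectors need the integer casts) */

/* Hypothetical name: 1 if every lane is IEEE-equal or both lanes are NaN
   (regardless of NaN encoding), otherwise 0. */
int reflexive_equal_any_nan(__m128 a, __m128 b)
{
    __m128 a_nan   = _mm_cmpunord_ps(a, a);      /* all-ones where a is NaN */
    __m128 b_nan   = _mm_cmpunord_ps(b, b);      /* all-ones where b is NaN */
    __m128 both    = _mm_and_ps(a_nan, b_nan);   /* all-ones where both are NaN */
    __m128 ieee_eq = _mm_cmpeq_ps(a, b);         /* all-ones where ordered and equal */
    __m128 equal   = _mm_or_ps(ieee_eq, both);   /* reflexive equality per lane */
    return _mm_movemask_ps(equal) == 0xF;        /* every lane must be equal */
}
```

Unlike the bitwise variant, NaNs with different payload bits compare equal here.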

; inputs in A:xmm0 B:xmm1
movaps      xmm2, xmm0
cmpordps    xmm2, xmm2      ; find NaNs in A.  (0: NaN.  -1: anything else).  Same as cmpeqps since src and dest are the same.
; cmpunordps wouldn't be useful: NaN stays NaN, while other values are zeroed.  (This could be useful if ORPS didn't exist)

; integer -1 (all-ones) is a NaN encoding, but all-zeros is 0.0
cmpunordps  xmm2, xmm1
; A:NaN B:0   ->  0   unord 0   -> false
; A:0   B:NaN ->  NaN unord NaN -> true

; A:0   B:0   ->  NaN unord 0   -> true
; A:NaN B:NaN ->  0   unord NaN -> true

; Desired:   0 where A and B are both NaN.

; I think this works

;  0x81 = CLASS_QNAN|CLASS_SNAN (first and last bits of the imm8)
VFPCLASSPS    k1,     zmm0, 0x81 ; k1 = 1:NaN in A.   0:non-NaN
VFPCLASSPS    k2{k1}, zmm1, 0x81 ; k2 = 1:NaNs in BOTH
;; where A doesn't have a NaN, k2 will be zero because of the zeromask
;; where B doesn't have a NaN, k2 will be zero because that's the FPCLASS result
;; so k2 is like the bitwise-equal result from pcmpeqd: it's an override for ieee_equal

vcmpNEQ_UQps  k3, zmm0, zmm1
;; k3= 0 only where IEEE equal (because of cmpneqps normal operation)

;  k2   k3   ; same logic table as the pcmpeqd bitwise-NaN version
;  0    0    ->  equal   (ieee equal)
;  0    1    ->  unequal (neither)
;  1    0    ->  equal   (ieee equal and both-NaN (impossible))
;  1    1    ->  equal   (both NaN)

;  not(k2) AND k3 is true only when the element is unequal (bitwise and ieee)

KTESTW        k2, k3    ; same as PTEST: set CF from 0 == (NOT(k2) AND k3)
jc .reflexive_equal

;;; Demonstrate that it's hard (probably impossible) to avoid using any k... instructions
vcmpneq_uqps  k1,    zmm0, zmm1   ; 0:ieee equal   1:unequal or unordered

vfpclassps    k2{k1}, zmm0, 0x81   ; 0:ieee equal or A is NaN.  1:unequal
vfpclassps    k2{k2}, zmm1, 0x81   ; 0:ieee equal | A is NaN | B is NaN.  1:unequal
;; This is just a slow way to do vcmpneq_Oqps: ordered and unequal.

vfpclassps    k3{k1}, zmm0, ~0x81  ; 0:ieee equal or A is not NaN.  1:unequal and A is NaN
vfpclassps    k3{k3}, zmm1, ~0x81  ; 0:ieee equal | A is not NaN | B is not NaN.  1:unequal & A is NaN & B is NaN
;; nope, mixes the conditions the wrong way.
;; The bits that remain set don't have any information from vcmpneqps left: both-NaN is always ieee-unequal.

VFPCLASSPS      k1,     zmm0, 0x81 ; k1 = set where there are NaNs in A
VFPCLASSPS      k2{k1}, zmm1, 0x81 ; k2 = set where there are NaNs in BOTH
;; where A doesn't have a NaN, k2 will be zero because of the zeromask
;; where B doesn't have a NaN, k2 will be zero because that's the FPCLASS result
vcmpEQ_OQps     k3, zmm0, zmm1
;; k3= 1 only where IEEE equal and ordered (cmpeqps normal operation)

;  k3   k2
;  1    0    ->  equal   (ieee equal)
;  1    1    ->  equal   (ieee equal and both-NaN (impossible))
;  0    0    ->  unequal (neither)
;  0    1    ->  equal   (both NaN)

KORTESTW        k3, k2  ; CF = set iff k3|k2 is all-ones.
jc .reflexive_equal
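
The two mask-register sequences above (`KTESTW` with `NEQ_UQ`, and `KORTESTW` with `EQ_OQ`) can be checked against each other with a scalar model. This is not AVX512 code: it is a plain-C sketch in which a `uint16_t` stands in for a 16-lane mask register, and all names are hypothetical.

```c
#include <stdint.h>

/* Scalar model of a 16-lane mask register: bit i describes element i. */
typedef uint16_t mask16;

/* Models the CF result of KTESTW k2, k3: set when (NOT k2 AND k3) == 0. */
int ktest_cf(mask16 k2, mask16 k3)
{
    return (uint16_t)(~k2 & k3) == 0;
}

/* Models the CF result of KORTESTW k3, k2: set when (k3 OR k2) is all-ones. */
int kortest_cf(mask16 k3, mask16 k2)
{
    return (uint16_t)(k3 | k2) == 0xFFFF;
}

/* both_nan: lanes where A and B are both NaN (the VFPCLASSPS pair).
   ieee_eq:  lanes where A and B are ordered and equal (EQ_OQ).
   NEQ_UQ is the exact complement of EQ_OQ, so it is derived here. */
int reflexive_equal_ktest(mask16 both_nan, mask16 ieee_eq)
{
    mask16 neq_uq = (mask16)~ieee_eq;
    return ktest_cf(both_nan, neq_uq);
}

int reflexive_equal_kortest(mask16 both_nan, mask16 ieee_eq)
{
    return kortest_cf(ieee_eq, both_nan);
}
```

Both forms accept a lane when it is IEEE-equal or both-NaN, so they should agree for every pair of masks.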

;inputs in xmm0:A  xmm1:B
movaps    xmm2, xmm0
pcmpeqd   xmm2, xmm1     ; xmm2=bitwise_equal.  (0:unequal -1:equal)

por       xmm0, xmm1
paddD     xmm0, xmm0     ; left-shift by 1 (one byte shorter than pslld xmm0, 1, and can run on more ports).

; xmm0=all-zero only in the +/- 0 case (where A and B are IEEE equal)

; xmm2     xmm0          desired result (0 means "no difference found")
;  -1       0        ->      0          ; bitwise equal and +/-0 equal
;  -1     non-zero   ->      0          ; just bitwise equal
;   0       0        ->      0          ; just +/-0 equal
;   0     non-zero   ->      non-zero   ; neither

ptest     xmm2, xmm0         ; CF = ( (not(xmm2) AND xmm0) == 0)
jc  reflexive_equal
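
The integer-only sequence above can be sketched with SSE2 intrinsics. The function name `reflexive_equal_integer` is hypothetical, and because `PTEST` needs SSE4.1, this sketch substitutes an SSE2 all-zero check for the final step. Like the assembly, it treats identical NaN encodings as equal and different NaN encodings as unequal.

```c
#include <math.h>       /* NAN, used when building test vectors */
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Hypothetical name: 1 if every lane is bitwise-equal or a +0/-0 pair. */
int reflexive_equal_integer(__m128 a, __m128 b)
{
    __m128i ai = _mm_castps_si128(a);
    __m128i bi = _mm_castps_si128(b);
    __m128i biteq   = _mm_cmpeq_epi32(ai, bi);        /* pcmpeqd: all-ones where bitwise equal */
    __m128i merged  = _mm_or_si128(ai, bi);           /* por */
    __m128i shifted = _mm_add_epi32(merged, merged);  /* paddd: shifts out the sign bit */
    /* Non-zero only in lanes that are neither bitwise equal nor both +/-0. */
    __m128i differ  = _mm_andnot_si128(biteq, shifted);
    /* SSE2 stand-in for PTEST's CF: is every lane of differ zero? */
    __m128i is_zero = _mm_cmpeq_epi32(differ, _mm_setzero_si128());
    return _mm_movemask_epi8(is_zero) == 0xFFFF;
}
```

The `+0`/`-0` case works because OR-ing the two values and shifting out the sign bit leaves zero only when both inputs had no bits set other than the sign.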

; inputs in xmm0, xmm1
movaps   xmm2, xmm0
cmpeqps  xmm2, xmm1    ; -1:ieee_equal.  EQ_OQ predicate in the expanded notation for VEX encoding
pcmpeqd  xmm0, xmm1    ; -1:bitwise equal
orps     xmm0, xmm2
; xmm0 = -1:(where an element is bitwise or ieee equal)   0:elsewhere

movmskps eax, xmm0
test     eax, eax
jnz at_least_one_equal
; else  all different

// UNFINISHED start of an idea
bitdiff = _mm_xor_si128(A, B);
signbitdiff = _mm_srai_epi32(bitdiff, 31);   // broadcast the diff in sign bit to the whole vector
signbitdiff = _mm_srli_epi32(bitdiff, 1);    // zero the sign bit
something = _mm_and_si128(bitdiff, signbitdiff);