SIMD instructions for floating-point equality comparison (with NaN == NaN)


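Which instructions would be used for comparing two 128-bit vectors consisting of 4 * 32-bit floating-point values?

Is there an instruction that considers a NaN value on both sides as equal? If not, how big is the performance impact of a workaround that provides reflexivity (i.e. NaN equals NaN)?

I heard that ensuring reflexivity would have a significant performance impact compared with IEEE semantics, where NaN doesn't equal itself, and I'm wondering how big that impact would be.

I know that you typically want to use epsilon comparisons instead of exact equality when dealing with floating-point values. But this question is about exact equality comparisons, which you could for example use to eliminate duplicate values from a hash set.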

Requirements

  • +0 and -0 must compare as equal.
  • NaN must compare equal with itself.
  • Different representations of NaN should be equal, but that requirement may be sacrificed if the performance impact is too big.
  • The result should be a boolean: true if all four float elements are the same in both vectors, and false if at least one element differs, where true is represented by the scalar integer 1 and false by 0.

Test cases

(NaN, 0, 0, 0) == (NaN, 0, 0, 0) // for all representations of NaN
(-0,  0, 0, 0) == (+0,  0, 0, 0) // equal despite different bitwise representations
(1,   0, 0, 0) == (1,   0, 0, 0)
(0,   0, 0, 0) != (1,   0, 0, 0) // at least one different element => not equal 
(1,   0, 0, 0) != (0,   0, 0, 0)
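The requirements and test cases above can be pinned down with a scalar reference implementation. This is only a sketch for checking vectorized versions against; the function names are made up for illustration, and it assumes the "all NaN representations are equal" variant of the requirements.

```c
#include <math.h>

/* Hypothetical helper: reflexive equality for one pair of floats.
   Any NaN equals any other NaN; +0 equals -0. */
static int reflexive_equal1(float x, float y)
{
    if (isnan(x) || isnan(y))
        return isnan(x) && isnan(y);
    return x == y;  /* IEEE equality already treats +0 and -0 as equal */
}

/* Returns 1 if all four elements are reflexively equal, 0 otherwise. */
int reflexive_equal4_ref(const float a[4], const float b[4])
{
    for (int i = 0; i < 4; i++)
        if (!reflexive_equal1(a[i], b[i]))
            return 0;
    return 1;
}
```

A SIMD implementation should agree with this function on every input, which makes it a convenient oracle for randomized testing.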
My implementation idea

I think it might be possible to combine two NotLessThan comparisons (CMPNLTPS?) using and to achieve the desired result: the assembler equivalent of AllTrue(!(x < y) and !(y < x)) or AllFalse((x < y) or (x > y)).

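Background

The background for this question is Microsoft's plan to add a Vector type to .NET. I'm arguing for a reflexive .Equals method there, and need a clearer picture of how big the performance impact of this reflexive Equals would be compared with an IEEE equals. See programmers.se for the details.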

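Here is one possible solution. It is not very efficient, however, requiring 6 instructions:

// requires <smmintrin.h> (SSE4.1) for _mm_testz_si128
__m128 v0, v1;                            // float vectors
__m128 v0nan = _mm_cmpeq_ps(v0, v0);      // test v0 for NaNs (all-ones where NOT NaN)
__m128 v1nan = _mm_cmpeq_ps(v1, v1);      // test v1 for NaNs (all-ones where NOT NaN)
__m128 vnan = _mm_or_ps(v0nan, v1nan);    // combine: zero only where both are NaN
__m128 vcmp = _mm_cmpneq_ps(v0, v1);      // compare floats (all-ones where unequal or unordered)
vcmp = _mm_and_ps(vcmp, vnan);            // combine with the NaN test
bool cmp = _mm_testz_si128(_mm_castps_si128(vcmp), _mm_castps_si128(vcmp)); // true if all equal
Note that all the logic above is inverted, which may make the code a little hard to follow (ORs are effectively ANDs, and vice versa).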
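The same inverted logic can be sketched as a self-contained function that needs only SSE/SSE2. This is an illustrative variant, not the answer's exact code: the function name and the array-based interface are assumptions, and _mm_movemask_ps stands in for the SSE4.1 _mm_testz_si128.

```c
#include <emmintrin.h>  /* SSE2; also provides the SSE float intrinsics */

/* Returns 1 if every lane is IEEE-equal or NaN on both sides, else 0. */
int reflexive_equal4_inverted(const float *a, const float *b)
{
    __m128 v0 = _mm_loadu_ps(a);
    __m128 v1 = _mm_loadu_ps(b);
    __m128 v0ord = _mm_cmpeq_ps(v0, v0);            /* all-ones where a is NOT NaN */
    __m128 v1ord = _mm_cmpeq_ps(v1, v1);            /* all-ones where b is NOT NaN */
    __m128 not_both_nan = _mm_or_ps(v0ord, v1ord);  /* zero only where both are NaN */
    __m128 differ = _mm_cmpneq_ps(v0, v1);          /* all-ones where unequal or unordered */
    differ = _mm_and_ps(differ, not_both_nan);      /* forgive lanes where both are NaN */
    return _mm_movemask_ps(differ) == 0;            /* no lane flagged => vectors are equal */
}
```

Because the mask marks differences rather than matches, the final check is a simple "is the mask zero" test.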

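Even AVX VCMPPS, with its greatly enhanced choice of predicates, doesn't give us a single-instruction predicate for this. You have to do at least two compares and combine the results. It's not too bad, though.

  • Different NaN encodings are not equal: effectively 2 extra insns (adding 2 uops). Without AVX: one extra movaps beyond that.

  • Different NaN encodings are equal: effectively 4 extra insns (adding 4 uops). Without AVX: two extra movaps insns.

An IEEE compare-and-branch is 3 uops: cmpeqps / movmskps / test-and-branch. Intel and AMD both macro-fuse the test-and-branch into a single uop/m-op.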
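For reference, the plain IEEE baseline that these costs are measured against might look like this in C intrinsics. The function name and the array interface are hypothetical; the two intrinsics correspond to cmpeqps and movmskps.

```c
#include <xmmintrin.h>  /* SSE */

/* IEEE semantics: 1 only if every lane is ordered and equal.
   A NaN in any lane makes the result 0, even when compared with itself. */
int ieee_equal4(const float *a, const float *b)
{
    __m128 eq = _mm_cmpeq_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)); /* cmpeqps */
    return _mm_movemask_ps(eq) == 0xF;  /* movmskps, then compare against "all four lanes set" */
}
```

This version handles +0 == -0 correctly but is not reflexive, which is exactly the gap the extra instructions have to close.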

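With AVX512: bitwise-NaN is probably just one extra instruction, since a normal vector compare and branch probably uses vcmpEQ_OQps / ktest same,same / jcc, so combining two different mask registers is free (just change the args to ktest). The only cost is the extra vpcmpeqd k2, xmm0, xmm1.

AVX512 any-NaN is just two extra instructions (2x VFPCLASSPS, with the second one using the result of the first as a zeromask; see below). Again, ktest with two different args then sets the flags.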


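My best idea so far: ieee_equal || bitwise_equal

If we give up on considering different NaN encodings equal to each other:

  • Bitwise equal catches two identical NaNs.
  • IEEE equal catches the +0 == -0 case.

There are no cases where either compare gives a false positive (since ieee_equal is false when either operand is NaN: we want just equal, not equal-or-unordered. AVX vcmpps provides both options, while SSE only provides a plain equal operation).

We want to know when all elements are equal, so we should start with inverted comparisons. It's easier to check for at least one non-zero element than to check for all elements being non-zero (i.e. horizontal AND is hard, horizontal OR is easy: pmovmskb / test, or ptest). Taking the opposite sense of a comparison is free (jnz instead of jz). This is the same trick that Paul R used.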

; inputs in xmm0, xmm1
movaps    xmm2, xmm0    ; unneeded with 3-operand AVX instructions

cmpneqps  xmm2, xmm1    ; 0:A and B are ordered and equal.  -1:not ieee_equal.  predicate=NEQ_UQ in VEX encoding expanded notation
pcmpeqd   xmm0, xmm1    ; -1:bitwise equal  0:otherwise

; xmm0   xmm2
;   0      0   -> equal   (ieee_equal only)
;   0     -1   -> unequal (neither)
;  -1      0   -> equal   (bitwise equal and ieee_equal)
;  -1     -1   -> equal   (bitwise equal only: only happens when both are NaN)

andnps    xmm0, xmm2    ; NOT(xmm0) AND xmm2
; xmm0 elements are -1 where  (not bitwise equal) AND (not IEEE equal).
; xmm0 all-zero iff every element was bitwise or IEEE equal, or both
movmskps  eax, xmm0
test      eax, eax      ; it's too bad movmsk doesn't set EFLAGS according to the result
jz no_differences
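For double precision, …PS and pcmpeqQ work the same way.

If the not-equal code goes on to find out which element isn't equal, a bit-scan on the movmskps result will give you the position of the first difference.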
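The bitwise-NaN sequence and the bit-scan remark can be sketched in C intrinsics as follows. Both function names are hypothetical, and the scan is written as a portable loop; a real implementation would likely use a bit-scan instruction (bsf / tzcnt) through a compiler builtin.

```c
#include <emmintrin.h>  /* SSE2 */

/* Returns a 4-bit mask: bit i is set where lane i is neither
   IEEE-equal nor bitwise-equal. Zero means "no differences". */
int reflexive_diff_mask4(const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 not_ieee = _mm_cmpneq_ps(va, vb);                 /* cmpneqps */
    __m128 bitwise  = _mm_castsi128_ps(
        _mm_cmpeq_epi32(_mm_castps_si128(va),
                        _mm_castps_si128(vb)));              /* pcmpeqd */
    __m128 differ   = _mm_andnot_ps(bitwise, not_ieee);      /* andnps: NOT(bitwise) AND not_ieee */
    return _mm_movemask_ps(differ);                          /* movmskps */
}

/* Index of the first differing lane, or -1 if the vectors are equal. */
int first_difference4(const float *a, const float *b)
{
    int mask = reflexive_diff_mask4(a, b);
    if (mask == 0)
        return -1;
    int i = 0;
    while (!(mask & 1)) {  /* portable stand-in for a bit-scan */
        mask >>= 1;
        i++;
    }
    return i;
}
```

Note that two NaNs with different bit patterns are reported as a difference here, matching the relaxed requirement this version targets.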

With SSE4.1 PTEST, you can replace andnps / movmskps / test-and-branch with:

ptest    xmm0, xmm2   ; CF =  0 == (NOT(xmm0) AND xmm2).
jc no_differences
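I expect this is the first time most people have ever seen the CF result of PTEST be useful for anything. :)

It's still three uops on Intel and AMD CPUs ((2 ptest + 1 jcc) vs. (pandn + movmsk + fused test&branch)), but fewer instructions. It is more efficient if you're going to setcc or cmovcc.

; any-NaN version: NaNs compare equal regardless of their encoding
; inputs in xmm0, xmm1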
movaps      xmm2, xmm0
cmpunordps  xmm2, xmm2      ; find NaNs in A (-1:NaN  0:anything else)
movaps      xmm3, xmm1
cmpunordps  xmm3, xmm3      ; find NaNs in B
andps       xmm2, xmm3      ; xmm2 = (-1:both NaN  0:anything else)
; now in the same boat as before: xmm2 is set for elements we want to consider equal, even though they're not IEEE equal

cmpeqps     xmm0, xmm1      ; -1:ieee_equal  0:unordered or unequal
; xmm0   xmm2 
;  -1      0     -> equal   (ieee_equal)
;  -1     -1     -> equal   (ieee_equal and both NaN (impossible))
;   0      0     -> unequal (neither)
;   0     -1     -> equal   (both NaN)

orps        xmm0, xmm2      ; 0: unequal.  -1:reflexive_equal
movmskps    eax, xmm0
cmp         eax, 0xF        ; all four lanes must be set; a non-zero mask only means at least one lane is equal
je   equal_reflexive
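A C-intrinsics sketch of the any-NaN sequence above, where different NaN encodings compare equal. The function names are made up for illustration, and float_from_bits is a hypothetical helper included only so that specific NaN encodings can be constructed for testing.

```c
#include <emmintrin.h>  /* SSE2; also provides the SSE float intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret a 32-bit pattern as a float. */
float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Returns 1 if every lane is IEEE-equal or NaN on both sides, else 0. */
int reflexive_equal4_any_nan(const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 a_nan    = _mm_cmpunord_ps(va, va);     /* all-ones where a is NaN */
    __m128 b_nan    = _mm_cmpunord_ps(vb, vb);     /* all-ones where b is NaN */
    __m128 both_nan = _mm_and_ps(a_nan, b_nan);    /* all-ones where both are NaN */
    __m128 ieee_eq  = _mm_cmpeq_ps(va, vb);        /* all-ones where ordered and equal */
    __m128 equal    = _mm_or_ps(ieee_eq, both_nan);
    return _mm_movemask_ps(equal) == 0xF;          /* all four lanes must be set */
}
```

Since the mask here marks matches rather than differences, the final check compares against 0xF instead of testing for zero.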
; inputs in A:xmm0 B:xmm1
movaps      xmm2, xmm0
cmpordps    xmm2, xmm2      ; find NaNs in A.  (0: NaN.  -1: anything else).  Same as cmpeqps since src and dest are the same.
; cmpunordps wouldn't be useful: NaN stays NaN, while other values are zeroed.  (This could be useful if ORPS didn't exist)

; integer -1 (all-ones) is a NaN encoding, but all-zeros is 0.0
cmpunordps  xmm2, xmm1
; A:NaN B:0   ->  0   unord 0   -> false
; A:0   B:NaN ->  NaN unord NaN -> true

; A:0   B:0   ->  NaN unord 0   -> true
; A:NaN B:NaN ->  0   unord NaN -> true

; Desired:   0 where A and B are both NaN.
; I think this works

; AVX512 any-NaN version
;  0x81 = CLASS_QNAN|CLASS_SNAN (first and last bits of the imm8)
VFPCLASSPS    k1,     zmm0, 0x81 ; k1 = 1:NaN in A.   0:non-NaN
VFPCLASSPS    k2{k1}, zmm1, 0x81 ; k2 = 1:NaNs in BOTH
;; where A doesn't have a NaN, k2 will be zero because of the zeromask
;; where B doesn't have a NaN, k2 will be zero because that's the FPCLASS result
;; so k2 is like the bitwise-equal result from pcmpeqd: it's an override for ieee_equal

vcmpNEQ_UQps  k3, zmm0, zmm1
;; k3= 0 only where IEEE equal (because of cmpneqps normal operation)

;  k2   k3   ; same logic table as the pcmpeqd bitwise-NaN version
;  0    0    ->  equal   (ieee equal)
;  0    1    ->  unequal (neither)
;  1    0    ->  equal   (ieee equal and both-NaN (impossible))
;  1    1    ->  equal   (both NaN)

;  not(k2) AND k3 is true only when the element is unequal (bitwise and ieee)

KTESTW        k2, k3    ; same as PTEST: set CF from 0 == (NOT(k2) AND k3)
jc .reflexive_equal
;;; Demonstrate that it's hard (probably impossible) to avoid using any k... instructions
vcmpneq_uqps  k1,    zmm0, zmm1   ; 0:ieee equal   1:unequal or unordered

vfpclassps    k2{k1}, zmm0, 0x81   ; 0:ieee equal or A is NaN.  1:unequal
vfpclassps    k2{k2}, zmm1, 0x81   ; 0:ieee equal | A is NaN | B is NaN.  1:unequal
;; This is just a slow way to do vcmpneq_Oqps: ordered and unequal.

vfpclassps    k3{k1}, zmm0, ~0x81  ; 0:ieee equal or A is not NaN.  1:unequal and A is NaN
vfpclassps    k3{k3}, zmm1, ~0x81  ; 0:ieee equal | A is not NaN | B is not NaN.  1:unequal & A is NaN & B is NaN
;; nope, mixes the conditions the wrong way.
;; The bits that remain set don't have any information from vcmpneqps left: both-NaN is always ieee-unequal.
; AVX512 any-NaN version with non-inverted masks and KORTESTW
VFPCLASSPS      k1,     zmm0, 0x81 ; k1 = set where there are NaNs in A
VFPCLASSPS      k2{k1}, zmm1, 0x81 ; k2 = set where there are NaNs in BOTH
;; where A doesn't have a NaN, k2 will be zero because of the zeromask
;; where B doesn't have a NaN, k2 will be zero because that's the FPCLASS result
vcmpEQ_OQps     k3, zmm0, zmm1
;; k3= 1 only where IEEE equal and ordered (cmpeqps normal operation)

;  k3   k2
;  1    0    ->  equal   (ieee equal)
;  1    1    ->  equal   (ieee equal and both-NaN (impossible))
;  0    0    ->  unequal (neither)
;  0    1    ->  equal   (both NaN)

KORTESTW        k3, k2  ; CF = set iff k3|k2 is all-ones.
jc .reflexive_equal
; integer-only version: elements are equal if bitwise equal, or if both are +/-0
; inputs in xmm0:A  xmm1:B
movaps    xmm2, xmm0
pcmpeqd   xmm2, xmm1     ; xmm2=bitwise_equal.  (0:unequal -1:equal)

por       xmm0, xmm1
paddD     xmm0, xmm0     ; left-shift by 1 (one byte shorter than pslld xmm0, 1, and can run on more ports).

; xmm0=all-zero only in the +/- 0 case (where A and B are IEEE equal)

; xmm2     xmm0          desired result (0 means "no difference found")
;  -1       0        ->      0          ; bitwise equal and +/-0 equal
;  -1     non-zero   ->      0          ; just bitwise equal
;   0       0        ->      0          ; just +/-0 equal
;   0     non-zero   ->      non-zero   ; neither

ptest     xmm2, xmm0         ; CF = ( (not(xmm2) AND xmm0) == 0)
jc  reflexive_equal
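The integer-only sequence above can be sketched in C intrinsics using just SSE2. The function name is hypothetical, and a compare-against-zero plus _mm_movemask_epi8 stands in for the SSE4.1 ptest.

```c
#include <emmintrin.h>  /* SSE2 */

/* Returns 1 if every lane is bitwise equal, or is +/-0 on both sides. */
int equal4_bitwise_or_zero(const float *a, const float *b)
{
    __m128i va = _mm_castps_si128(_mm_loadu_ps(a));
    __m128i vb = _mm_castps_si128(_mm_loadu_ps(b));
    __m128i bitwise = _mm_cmpeq_epi32(va, vb);            /* pcmpeqd: all-ones where bitwise equal */
    __m128i merged  = _mm_or_si128(va, vb);               /* por */
    __m128i no_sign = _mm_add_epi32(merged, merged);      /* paddd: shifts the sign bit out */
    /* no_sign is zero only where both inputs are +0 or -0 */
    __m128i differ  = _mm_andnot_si128(bitwise, no_sign); /* non-zero lane = real difference */
    /* SSE2 stand-in for ptest: check that every lane of differ is zero */
    __m128i is_zero = _mm_cmpeq_epi32(differ, _mm_setzero_si128());
    return _mm_movemask_epi8(is_zero) == 0xFFFF;
}
```

Like the bitwise-NaN floating-point version, this treats identical NaN bit patterns as equal and different NaN encodings as unequal.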
; variant that checks whether at least one element is equal (bitwise or IEEE)
; inputs in xmm0, xmm1
movaps   xmm2, xmm0
cmpeqps  xmm2, xmm1    ; -1:ieee_equal.  EQ_OQ predicate in the expanded notation for VEX encoding
pcmpeqd  xmm0, xmm1    ; -1:bitwise equal
orps     xmm0, xmm2
; xmm0 = -1:(where an element is bitwise or ieee equal)   0:elsewhere

movmskps eax, xmm0
test     eax, eax
jnz at_least_one_equal
; else  all different
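For comparison, the at-least-one-equal sequence above might be written in C intrinsics like this. The function name is an assumption for illustration; here the non-inverted masks are the natural choice, because "any lane set" is the cheap horizontal-OR question.

```c
#include <emmintrin.h>  /* SSE2 */

/* Returns 1 if at least one lane is IEEE-equal or bitwise-equal, else 0. */
int any_equal4(const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 ieee_eq = _mm_cmpeq_ps(va, vb);                   /* cmpeqps */
    __m128 bitwise = _mm_castsi128_ps(
        _mm_cmpeq_epi32(_mm_castps_si128(va),
                        _mm_castps_si128(vb)));              /* pcmpeqd */
    __m128 equal = _mm_or_ps(ieee_eq, bitwise);              /* orps */
    return _mm_movemask_ps(equal) != 0;                      /* movmskps / test / jnz */
}
```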
// UNFINISHED start of an idea
bitdiff = _mm_xor_si128(A, B);
signbitdiff = _mm_srai_epi32(bitdiff, 31);   // broadcast the diff in sign bit to the whole vector
signbitdiff = _mm_srli_epi32(bitdiff, 1);    // zero the sign bit
something = _mm_and_si128(bitdiff, signbitdiff);