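How do I implement atoi using SIMD?

I would like to try writing an atoi implementation using SIMD instructions, to be included in a C++ JSON reader/writer library. The library currently has some SSE2 and SSE4.2 optimizations in other places.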


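If it is a speed gain, multiple atoi results can be computed in parallel. The strings originally come from a buffer of JSON data, so a multi-atoi function will have to do any required swizzling.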

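The algorithm I came up with is the following:

  • I can initialize a vector of length N in the following fashion: [10^N..10^1]
  • I convert each character in the buffer to an integer and place the results in another vector
  • I take each number in the significant-digits vector, multiply it by the matching number in the digits vector, and sum the results

I am targeting the x86 and x86-64 architectures.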
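The weighted-sum idea can be checked with a scalar sketch before any SIMD is involved. The function name `weighted_sum_atoi` is hypothetical, and the input is assumed to contain only ASCII digits. Note that for the sum to equal the number, the weights in this sketch run from 10^(N-1) down to 10^0 rather than from 10^N to 10^1.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical helper: converts an all-digit string by multiplying each digit
// with its place value and summing the products (scalar, no SIMD).
std::uint32_t weighted_sum_atoi(std::string_view digits) {
    const std::size_t n = digits.size();

    // Build the weights 10^(n-1) ... 10^0, filling from right to left.
    std::vector<std::uint32_t> weights(n);
    std::uint32_t w = 1;
    for (std::size_t i = n; i-- > 0;) {
        weights[i] = w;
        w *= 10;  // unsigned arithmetic, so wrapping is well defined
    }

    // Element-wise multiply, then horizontal sum.
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint32_t d = static_cast<std::uint32_t>(digits[i] - '0');
        sum += d * weights[i];
    }
    return sum;
}
```

Calling `weighted_sum_atoi("12345")` yields 12345. A SIMD version would replace the two loops with a vector multiply and a horizontal add.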

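    I know that AVX2 supports three-operand fused multiply-add, so I would be able to perform Sum = Number * Significant Digit + Sum.
    That is as far as I have gotten.
    Is my algorithm correct? Is there a better way?

    Is there a reference implementation of atoi using any SIMD instruction set?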

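    I would approach the problem like this:

  1. Initialize the accumulator to 0.
  2. Load the next four characters of the string into an SSE register.
  3. Subtract the value '0' from each character.
  4. Find the first value in the vector whose unsigned value is greater than 9.
  5. If such a value was found, shift the components of the vector to the right so that the value found in the previous step is just shifted out.
  6. Load a vector containing powers of ten (1000, 100, 10, 1) and multiply by it.
  7. Compute the sum of all entries in the vector.
  8. Multiply the accumulator by an appropriate value (depending on the number of shifts in step 5) and add the vector sum. You could use an FMA instruction for this, but I do not know whether such an instruction exists for integers.
  9. If no value greater than 9 was found in step 4, go to step 2.
  10. Return the accumulator.

    You could simplify the algorithm by zeroing out all entries from the first invalid one onward in step 5 instead of shifting, and then dividing by an appropriate power of ten at the end.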


    Please keep in mind that this algorithm reads past the end of the string and is therefore not a drop-in replacement for atoi.
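As a rough illustration, the four-characters-per-step scheme can be modelled in scalar C++. This is only a sketch under assumptions: the name `atoi_blocks_of_four` is hypothetical, the input is passed as a `std::string_view` so the model never reads past the end (unlike the SSE version described), and the "shift right" of step 5 is expressed by picking the weight 10^(valid-1-i) for each digit instead of physically shifting lanes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical scalar model of the four-characters-per-step scheme.
std::uint32_t atoi_blocks_of_four(std::string_view s) {
    static constexpr std::uint32_t kPow10[5] = {1, 10, 100, 1000, 10000};
    std::uint32_t acc = 0;  // step 1
    std::size_t pos = 0;
    for (;;) {
        // Steps 2-4: take up to four characters, subtract '0', and stop at the
        // first value that is greater than 9 when viewed as unsigned.
        std::uint32_t lane[4] = {0, 0, 0, 0};
        std::size_t valid = 0;
        while (valid < 4 && pos + valid < s.size()) {
            const std::uint32_t d =
                static_cast<std::uint32_t>(static_cast<unsigned char>(s[pos + valid])) -
                static_cast<std::uint32_t>('0');
            if (d > 9) break;
            lane[valid++] = d;
        }
        // Steps 5-7: weight the valid digits by 10^(valid-1) ... 10^0 and sum.
        std::uint32_t block = 0;
        for (std::size_t i = 0; i < valid; ++i) {
            block += lane[i] * kPow10[valid - 1 - i];
        }
        // Step 8: scale the accumulator by the number of digits just consumed.
        acc = acc * kPow10[valid] + block;
        // Steps 9-10: a short block means a terminator or invalid character.
        if (valid < 4) return acc;
        pos += 4;
    }
}
```

For example, `atoi_blocks_of_four("1234567")` processes "1234" and then "567", returning 1234 * 1000 + 567.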

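    The algorithm and its implementation are now finished. The implementation is complete and (moderately) tested (updated for lower constant memory usage and to tolerate a leading plus character).

    The code and its throughput and latency analysis reports are shown first, followed by the properties of the code: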

    .intel_syntax noprefix
    .data
      .align 64
        ddqDigitRange: .byte  '0','9',0,0,0,0,0,0,0,0,0,0,0,0,0,0
        ddqShuffleMask:.byte  15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 
        ddqFactor1:    .word  1,10,100,1000, 1,10,100,1000  
        ddqFactor2:    .long  1,10000,100000000,0
    .text    
    _start:
       mov   esi, lpInputNumberString
       /* (**A**) indicate negative number in EDX */
       mov   eax, -1
       xor   ecx, ecx
       xor   edx, edx
       mov   bl,  byte ptr [esi]
       cmp   bl,  '-'
       cmove edx, eax
       cmp   bl,  '+'
       cmove ecx, eax
       sub   esi, edx
       sub   esi, ecx
       /* (**B**)remove leading zeros */
       xor   eax,eax               /* return value ZERO */
      remove_leading_zeros:
       inc   esi
       cmp   byte ptr [esi-1], '0'  /* skip leading zeros */
      je remove_leading_zeros
       cmp   byte ptr [esi-1], 0    /* catch empty string/number */
      je FINISH
       dec   esi
       /* check for valid digit-chars and invert from front to back */
       pxor      xmm2, xmm2         
       movdqa    xmm0, xmmword ptr [ddqDigitRange]
       movdqu    xmm1, xmmword ptr [esi]
       pcmpistri xmm0, xmm1, 0b00010100 /* (**C**) imm8=Unsigned bytes, Ranges, Negative Polarity(-), returns strlen() of the valid digit chars in ECX */
      jo FINISH             /* if first char is invalid return 0 - prevent processing empty string - 0 is still in EAX */
       mov al , '0'         /* value to subtract from chars */
       sub ecx, 16          /* len-16=negative to zero for shuffle mask */
       movd      xmm0, ecx
       pshufb    xmm0, xmm2 /* broadcast CL to all 16 BYTEs */
       paddb     xmm0, xmmword ptr [ddqShuffleMask] /* Generate permute mask for PSHUFB - all bytes < 0 have highest bit set means place gets zeroed */
       pshufb    xmm1, xmm0 /* (**D**) permute - now BYTE 0 holds the 10^0 digit, BYTE 1 the 10^1 digit, BYTE 2 the 10^2 digit, ... */
       movd      xmm0, eax                         /* AL='0' from above */
       pshufb    xmm0, xmm2                        /* broadcast AL to XMM0 */
       psubusb   xmm1, xmm0                        /* (**1**) */
       movdqa    xmm0, xmm1
       punpcklbw xmm0, xmm2                        /* (**2**) */
       punpckhbw xmm1, xmm2
       pmaddwd   xmm0, xmmword ptr [ddqFactor1]    /* (**3**) */
       pmaddwd   xmm1, xmmword ptr [ddqFactor1]
       phaddd    xmm0, xmm1                        /* (**4**) */
       pmulld    xmm0, xmmword ptr [ddqFactor2]    /* (**5**) */
       pshufd    xmm1, xmm0, 0b11101110            /* (**6**) */
       paddd     xmm0, xmm1
       pshufd    xmm1, xmm0, 0b01010101            /* (**7**) */
       paddd     xmm0, xmm1
       movd      eax, xmm0
       /* negate if negative number */              
       add       eax, edx                          /* (**8**) */
       xor       eax, edx
      FINISH:
       /* EAX is return (u)int value */
    
    Throughput Analysis Report
    --------------------------
    Block Throughput: 16.10 Cycles       Throughput Bottleneck: InterIteration
    
    Port Binding In Cycles Per Iteration:
    ---------------------------------------------------------------------------------------
    |  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
    ---------------------------------------------------------------------------------------
    | Cycles | 9.5    0.0  | 10.0 | 4.5    4.5  | 4.5    4.5  | 0.0  | 11.1 | 11.4 | 0.0  |
    ---------------------------------------------------------------------------------------
    
    N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
    D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
    F - Macro Fusion with the previous instruction occurred
    * - instruction micro-ops not bound to a port
    ^ - Micro Fusion happened
    # - ESP Tracking sync uop was issued
    @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
    ! - instruction not supported, was not accounted in Analysis
    
    | Num Of |                    Ports pressure in cycles                     |    |
    |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
    ---------------------------------------------------------------------------------
    |   0*   |           |     |           |           |     |     |     |     |    | xor eax, eax
    |   0*   |           |     |           |           |     |     |     |     |    | xor ecx, ecx
    |   0*   |           |     |           |           |     |     |     |     |    | xor edx, edx
    |   1    |           | 0.1 |           |           |     |     | 0.9 |     |    | dec eax
    |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | mov bl, byte ptr [esi]
    |   1    |           |     |           |           |     |     | 1.0 |     | CP | cmp bl, 0x2d
    |   2    | 0.1       | 0.2 |           |           |     |     | 1.8 |     | CP | cmovz edx, eax
    |   1    | 0.1       | 0.5 |           |           |     |     | 0.4 |     | CP | cmp bl, 0x2b
    |   2    | 0.5       | 0.2 |           |           |     |     | 1.2 |     | CP | cmovz ecx, eax
    |   1    | 0.2       | 0.5 |           |           |     |     | 0.2 |     | CP | sub esi, edx
    |   1    | 0.2       | 0.5 |           |           |     |     | 0.3 |     | CP | sub esi, ecx
    |   0*   |           |     |           |           |     |     |     |     |    | xor eax, eax
    |   1    | 0.3       | 0.1 |           |           |     |     | 0.6 |     | CP | inc esi
    |   2^   | 0.3       |     | 0.5   0.5 | 0.5   0.5 |     |     | 0.6 |     |    | cmp byte ptr [esi-0x1], 0x30
    |   0F   |           |     |           |           |     |     |     |     |    | jz 0xfffffffb
    |   2^   | 0.6       |     | 0.5   0.5 | 0.5   0.5 |     |     | 0.4 |     |    | cmp byte ptr [esi-0x1], 0x0
    |   0F   |           |     |           |           |     |     |     |     |    | jz 0x8b
    |   1    | 0.1       | 0.9 |           |           |     |     |     |     | CP | dec esi
    |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | movdqa xmm0, xmmword ptr [0x80492f0]
    |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | movdqu xmm1, xmmword ptr [esi]
    |   0*   |           |     |           |           |     |     |     |     |    | pxor xmm2, xmm2
    |   3    | 2.0       | 1.0 |           |           |     |     |     |     | CP | pcmpistri xmm0, xmm1, 0x14
    |   1    |           |     |           |           |     |     | 1.0 |     |    | jo 0x6e
    |   1    |           | 0.4 |           |           |     | 0.1 | 0.5 |     |    | mov al, 0x30
    |   1    | 0.1       | 0.5 |           |           |     | 0.1 | 0.3 |     | CP | sub ecx, 0x10
    |   1    |           |     |           |           |     | 1.0 |     |     | CP | movd xmm0, ecx
    |   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufb xmm0, xmm2
    |   2^   |           | 1.0 | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | paddb xmm0, xmmword ptr [0x80492c0]
    |   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufb xmm1, xmm0
    |   1    |           |     |           |           |     | 1.0 |     |     |    | movd xmm0, eax
    |   1    |           |     |           |           |     | 1.0 |     |     |    | pshufb xmm0, xmm2
    |   1    |           | 1.0 |           |           |     |     |     |     | CP | psubusb xmm1, xmm0
    |   0*   |           |     |           |           |     |     |     |     | CP | movdqa xmm0, xmm1
    |   1    |           |     |           |           |     | 1.0 |     |     | CP | punpcklbw xmm0, xmm2
    |   1    |           |     |           |           |     | 1.0 |     |     |    | punpckhbw xmm1, xmm2
    |   2^   | 1.0       |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | pmaddwd xmm0, xmmword ptr [0x80492d0]
    |   2^   | 1.0       |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | pmaddwd xmm1, xmmword ptr [0x80492d0]
    |   3    |           | 1.0 |           |           |     | 2.0 |     |     | CP | phaddd xmm0, xmm1
    |   3^   | 2.0       |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | pmulld xmm0, xmmword ptr [0x80492e0]
    |   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufd xmm1, xmm0, 0xee
    |   1    |           | 1.0 |           |           |     |     |     |     | CP | paddd xmm0, xmm1
    |   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufd xmm1, xmm0, 0x55
    |   1    |           | 1.0 |           |           |     |     |     |     | CP | paddd xmm0, xmm1
    |   1    | 1.0       |     |           |           |     |     |     |     | CP | movd eax, xmm0
    |   1    |           |     |           |           |     |     | 1.0 |     | CP | add eax, edx
    |   1    |           |     |           |           |     |     | 1.0 |     | CP | xor eax, edx
    Total Num Of Uops: 51
    
    Latency Analysis Report
    ---------------------------
    Latency: 64 Cycles
    
    N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
    D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
    F - Macro Fusion with the previous instruction occurred
    * - instruction micro-ops not bound to a port
    ^ - Micro Fusion happened
    # - ESP Tracking sync uop was issued
    @ - Intel(R) AVX to Intel(R) SSE code switch, dozens of cycles penalty is expected
    ! - instruction not supported, was not accounted in Analysis
    
    The Resource delay is counted since all the sources of the instructions are ready
    and until the needed resource becomes available
    
    | Inst |                 Resource Delay In Cycles                  |    |
    | Num  | 0  - DV | 1  | 2  - D  | 3  - D  | 4  | 5  | 6  | 7  | FE |    |
    -------------------------------------------------------------------------
    |  0   |         |    |         |         |    |    |    |    |    |    | xor eax, eax
    |  1   |         |    |         |         |    |    |    |    |    |    | xor ecx, ecx
    |  2   |         |    |         |         |    |    |    |    |    |    | xor edx, edx
    |  3   |         |    |         |         |    |    |    |    |    |    | dec eax
    |  4   |         |    |         |         |    |    |    |    | 1  | CP | mov bl, byte ptr [esi]
    |  5   |         |    |         |         |    |    |    |    |    | CP | cmp bl, 0x2d
    |  6   |         |    |         |         |    |    |    |    |    | CP | cmovz edx, eax
    |  7   |         |    |         |         |    |    |    |    |    | CP | cmp bl, 0x2b
    |  8   |         |    |         |         |    |    | 1  |    |    | CP | cmovz ecx, eax
    |  9   |         |    |         |         |    |    |    |    |    | CP | sub esi, edx
    | 10   |         |    |         |         |    |    |    |    |    | CP | sub esi, ecx
    | 11   |         |    |         |         |    |    |    |    | 3  |    | xor eax, eax
    | 12   |         |    |         |         |    |    |    |    |    | CP | inc esi
    | 13   |         |    |         |         |    |    |    |    |    |    | cmp byte ptr [esi-0x1], 0x30
    | 14   |         |    |         |         |    |    |    |    |    |    | jz 0xfffffffb
    | 15   |         |    |         |         |    |    |    |    |    |    | cmp byte ptr [esi-0x1], 0x0
    | 16   |         |    |         |         |    |    |    |    |    |    | jz 0x8b
    | 17   |         |    |         |         |    |    |    |    |    | CP | dec esi
    | 18   |         |    |         |         |    |    |    |    | 4  |    | movdqa xmm0, xmmword ptr [0x80492f0]
    | 19   |         |    |         |         |    |    |    |    |    | CP | movdqu xmm1, xmmword ptr [esi]
    | 20   |         |    |         |         |    |    |    |    | 5  |    | pxor xmm2, xmm2
    | 21   |         |    |         |         |    |    |    |    |    | CP | pcmpistri xmm0, xmm1, 0x14
    | 22   |         |    |         |         |    |    |    |    |    |    | jo 0x6e
    | 23   |         |    |         |         |    |    |    |    | 6  |    | mov al, 0x30
    | 24   |         |    |         |         |    |    |    |    |    | CP | sub ecx, 0x10
    | 25   |         |    |         |         |    |    |    |    |    | CP | movd xmm0, ecx
    | 26   |         |    |         |         |    |    |    |    |    | CP | pshufb xmm0, xmm2
    | 27   |         |    |         |         |    |    |    |    | 7  | CP | paddb xmm0, xmmword ptr [0x80492c0]
    | 28   |         |    |         |         |    |    |    |    |    | CP | pshufb xmm1, xmm0
    | 29   |         |    |         |         |    | 1  |    |    |    |    | movd xmm0, eax
    | 30   |         |    |         |         |    | 1  |    |    |    |    | pshufb xmm0, xmm2
    | 31   |         |    |         |         |    |    |    |    |    | CP | psubusb xmm1, xmm0
    | 32   |         |    |         |         |    |    |    |    |    | CP | movdqa xmm0, xmm1
    | 33   |         |    |         |         |    |    |    |    |    | CP | punpcklbw xmm0, xmm2
    | 34   |         |    |         |         |    |    |    |    |    |    | punpckhbw xmm1, xmm2
    | 35   |         |    |         |         |    |    |    |    | 9  | CP | pmaddwd xmm0, xmmword ptr [0x80492d0]
    | 36   |         |    |         |         |    |    |    |    | 9  |    | pmaddwd xmm1, xmmword ptr [0x80492d0]
    | 37   |         |    |         |         |    |    |    |    |    | CP | phaddd xmm0, xmm1
    | 38   |         |    |         |         |    |    |    |    | 10 | CP | pmulld xmm0, xmmword ptr [0x80492e0]
    | 39   |         |    |         |         |    |    |    |    |    | CP | pshufd xmm1, xmm0, 0xee
    | 40   |         |    |         |         |    |    |    |    |    | CP | paddd xmm0, xmm1
    | 41   |         |    |         |         |    |    |    |    |    | CP | pshufd xmm1, xmm0, 0x55
    | 42   |         |    |         |         |    |    |    |    |    | CP | paddd xmm0, xmm1
    | 43   |         |    |         |         |    |    |    |    |    | CP | movd eax, xmm0
    | 44   |         |    |         |         |    |    |    |    |    | CP | add eax, edx
    | 45   |         |    |         |         |    |    |    |    |    | CP | xor eax, edx
    
    Resource Conflict on Critical Paths: 
    -----------------------------------------------------------------
    |  Port  | 0  - DV | 1  | 2  - D  | 3  - D  | 4  | 5  | 6  | 7  |
    -----------------------------------------------------------------
    | Cycles | 0    0  | 0  | 0    0  | 0    0  | 0  | 0  | 1  | 0  |
    -----------------------------------------------------------------
    
    List Of Delays On Critical Paths
    -------------------------------
    6 --> 8 1 Cycles Delay On Port6
    
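    • Works for int and uint, from MIN_INT=-2147483648 to MAX_INT=2147483647 and from MIN_UINT=0 to MAX_UINT=4294967295
    • A leading '-' character indicates a negative number (as is sensible); a leading '+' character is ignored
    • Leading zeros (with or without a sign character) are ignored
    • Overflow is ignored: bigger numbers just wrap around
    • Zero-length strings result in the value 0 = -0
    • Invalid characters are recognized, and the conversion ends at the first invalid character
    • At least 16 bytes after the last leading zero must be accessible; the possible security implications of reading past EOS are left to the caller
    • Only SSE4.2 is needed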
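The sign handling and the wraparound behaviour listed above follow from steps (**A**) and (**8**) of the listing: a mask of all ones or all zeros is prepared from the first character, and `add eax, edx` followed by `xor eax, edx` negates the result only when the mask is all ones. A minimal C++ sketch of that trick, with the hypothetical name `apply_sign` and unsigned 32-bit arithmetic standing in for EAX and EDX:

```cpp
#include <cstdint>

// Hypothetical helper mirroring steps (A) and (8) of the listing.
std::uint32_t apply_sign(char first_char, std::uint32_t magnitude) {
    // All ones for '-', zero otherwise (the listing uses CMOVE into EDX).
    const std::uint32_t mask = (first_char == '-') ? 0xFFFFFFFFu : 0u;
    // add eax, edx ; xor eax, edx
    // With mask == 0xFFFFFFFF this computes ~(magnitude - 1), which is the
    // two's-complement negation; with mask == 0 the value is unchanged.
    return (magnitude + mask) ^ mask;
}
```

With this model, `apply_sign('-', 0)` stays 0, which matches the "0 = -0" property, and `apply_sign('-', 2147483648u)` returns the bit pattern of MIN_INT.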
    About this implementation:

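    • This code sample can be run with the GNU Assembler (as) using .intel_syntax noprefix at the beginning
    • The data footprint of the constants is 64 bytes (4*128-bit XMM), equalling one cache line
    • The code footprint is 46 instructions with 51 micro-ops and 64 cycles of latency
    • There is one loop for the removal of leading zeros; otherwise there are no jumps except for error handling, so...
    • The time complexity is O(1)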
    算法的方法:

    - Pointer to number string is expected in ESI
    - Check if first char is '-', then indicate if negative number in EDX (**A**)
    - Check for leading zeros and EOS (**B**)
    - Check string for valid digits and get strlen() of valid chars (**C**)
    - Reverse string so that power of 
      10^0 is always at BYTE 15
      10^1 is always at BYTE 14
      10^2 is always at BYTE 13
      10^3 is always at BYTE 12
      10^4 is always at BYTE 11 
      ... 
      and mask out all following chars (**D**)
    - Subtract saturated '0' from each of the 16 possible chars (**1**)
    - Take 16 consecutive byte-values and split them to WORDs 
      in two XMM-registers (**2**)
      P O N M L K J I  | H G F E D C B A ->
        H   G   F   E  |   D   C   B   A (XMM0)
        P   O   N   M  |   L   K   J   I (XMM1)
    - Multiply each WORD by its place-value modulo 10000 (1,10,100,1000)
      (factors smaller than MAX_WORD, 4 factors per QWORD/halfXMM)
      (**2**) so we can horizontally combine twice before another multiply.
      The PMADDWD instruction can do this and the next step:
    - Horizontally add adjacent WORDs to DWORDs (**3**)
      H*1000+G*100  F*10+E*1  |  D*1000+C*100  B*10+A*1 (XMM0)
      P*1000+O*100  N*10+M*1  |  L*1000+K*100  J*10+I*1 (XMM1)
    - Horizontally add adjacent DWORDs from XMM0 and XMM1 to XMM0 (**4**)
      xmmDst[31-0]   = xmm0[63-32]  + xmm0[31-0]
      xmmDst[63-32]  = xmm0[127-96] + xmm0[95-64]
      xmmDst[95-64]  = xmm1[63-32]  + xmm1[31-0]
      xmmDst[127-96] = xmm1[127-96] + xmm1[95-64]
    - Values in XMM0 are multiplied with the factors (**5**)
      P*1000+O*100+N*10+M*1 (DWORD factor 1000000000000 = too big for DWORD, but possibly useful for QWORD number strings)
      L*1000+K*100+J*10+I*1 (DWORD factor 100000000)
      H*1000+G*100+F*10+E*1 (DWORD factor 10000)
      D*1000+C*100+B*10+A*1 (DWORD factor 1)
    - The last step is adding these four DWORDs together with 2*PHADDD emulated by 2*(PSHUFD+PADDD)
      - xmm0[31-0]  = xmm0[63-32]  + xmm0[31-0]   (**6**)
        xmm0[63-32] = xmm0[127-96] + xmm0[95-64]
          (the upper QWORD contains the same and is ignored)
      - xmm0[31-0]  = xmm0[63-32]  + xmm0[31-0]   (**7**)
    - If the number is negative (indicated in EDX by 000...0=pos or 111...1=neg), negate it(**8**)
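
The SIMD part of the approach above (steps D and 1 to 7) can be sketched with C++ intrinsics. This is a minimal sketch under stated assumptions, not the answer's implementation: the function name `parse_digits_sse` and the local buffer `buf` are hypothetical, validation and sign handling (steps A to C and 8) are assumed to have happened already, the digit count is limited to 1..9 so the result fits in 32 bits, and the `target` attribute assumes GCC or Clang on x86.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <immintrin.h>

// Hypothetical helper: converts `len` (1..9) ASCII digits at `s` to an
// unsigned value. The caller is assumed to have validated the digits.
__attribute__((target("sse4.1")))
inline std::uint32_t parse_digits_sse(const char* s, int len) {
    assert(len >= 1 && len <= 9);
    char buf[16] = {0};
    std::memcpy(buf, s, static_cast<std::size_t>(len));  // avoid reading past the string
    __m128i chars = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf));

    // (D) Reverse the digits so the units digit lands in the lowest byte.
    // Mask byte i becomes len-1-i; negative bytes make PSHUFB write zero.
    const __m128i rev = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                      7, 6, 5, 4, 3, 2, 1, 0);
    __m128i mask = _mm_add_epi8(rev, _mm_set1_epi8(static_cast<char>(len - 16)));
    __m128i digits = _mm_shuffle_epi8(chars, mask);

    // (1) Saturating subtract of '0'; the zeroed padding bytes stay 0.
    digits = _mm_subs_epu8(digits, _mm_set1_epi8('0'));

    // (2) Widen the 16 bytes to 2 x 8 WORDs.
    const __m128i zero = _mm_setzero_si128();
    __m128i lo = _mm_unpacklo_epi8(digits, zero);
    __m128i hi = _mm_unpackhi_epi8(digits, zero);

    // (3) Multiply by 1,10,100,1000 and add adjacent WORDs into DWORDs.
    const __m128i f1 = _mm_setr_epi16(1, 10, 100, 1000, 1, 10, 100, 1000);
    lo = _mm_madd_epi16(lo, f1);
    hi = _mm_madd_epi16(hi, f1);

    // (4) Add adjacent DWORDs: each lane now holds a 4-digit group.
    __m128i sums = _mm_hadd_epi32(lo, hi);

    // (5) Scale the groups; the top group is dropped (factor 0).
    const __m128i f2 = _mm_setr_epi32(1, 10000, 100000000, 0);
    sums = _mm_mullo_epi32(sums, f2);

    // (6), (7) Horizontal sum of the four DWORDs into lane 0.
    sums = _mm_add_epi32(sums, _mm_shuffle_epi32(sums, 0xEE));
    sums = _mm_add_epi32(sums, _mm_shuffle_epi32(sums, 0x55));
    return static_cast<std::uint32_t>(_mm_cvtsi128_si32(sums));
}
```

Copying into a local 16-byte buffer trades a little speed for not reading past the end of the input; a JSON parser that guarantees padding after its buffer could load directly instead.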
    
    And a sample implementation in GNU Assembler with Intel syntax:

    .intel_syntax noprefix
    .data
      .align 64
        ddqDigitRange: .byte  '0','9',0,0,0,0,0,0,0,0,0,0,0,0,0,0
        ddqShuffleMask:.byte  15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0 
        ddqFactor1:    .word  1,10,100,1000, 1,10,100,1000  
        ddqFactor2:    .long  1,10000,100000000,0
    .text    
    _start:
       mov   esi, lpInputNumberString
       /* (**A**) indicate negative number in EDX */
       mov   eax, -1
       xor   ecx, ecx
       xor   edx, edx
       mov   bl,  byte ptr [esi]
       cmp   bl,  '-'
       cmove edx, eax
       cmp   bl,  '+'
       cmove ecx, eax
       sub   esi, edx
       sub   esi, ecx
       /* (**B**)remove leading zeros */
       xor   eax,eax               /* return value ZERO */
      remove_leading_zeros:
       inc   esi
       cmp   byte ptr [esi-1], '0'  /* skip leading zeros */
      je remove_leading_zeros
       cmp   byte ptr [esi-1], 0    /* catch empty string/number */
      je FINISH
       dec   esi
       /* check for valid digit-chars and invert from front to back */
       pxor      xmm2, xmm2         
       movdqa    xmm0, xmmword ptr [ddqDigitRange]
       movdqu    xmm1, xmmword ptr [esi]
       pcmpistri xmm0, xmm1, 0b00010100 /* (**C**) imm8=Unsigned bytes, Ranges, Negative Polarity(-), returns strlen() in ECX */
      jo FINISH             /* if first char is invalid return 0 - prevent processing empty string - 0 is still in EAX */
       mov al , '0'         /* value to subtract from chars */
       sub ecx, 16          /* len-16=negative to zero for shuffle mask */
       movd      xmm0, ecx
       pshufb    xmm0, xmm2 /* broadcast CL to all 16 BYTEs */
       paddb     xmm0, xmmword ptr [ddqShuffleMask] /* Generate permute mask for PSHUFB - all bytes < 0 have highest bit set means place gets zeroed */
       pshufb    xmm1, xmm0 /* (**D**) permute - now from highest to lowest BYTE are factors 10^0, 10^1, 10^2, ... */
       movd      xmm0, eax                         /* AL='0' from above */
       pshufb    xmm0, xmm2                        /* broadcast AL to XMM0 */
       psubusb   xmm1, xmm0                        /* (**1**) */
       movdqa    xmm0, xmm1
       punpcklbw xmm0, xmm2                        /* (**2**) */
       punpckhbw xmm1, xmm2
       pmaddwd   xmm0, xmmword ptr [ddqFactor1]    /* (**3**) */
       pmaddwd   xmm1, xmmword ptr [ddqFactor1]
       phaddd    xmm0, xmm1                        /* (**4**) */
       pmulld    xmm0, xmmword ptr [ddqFactor2]    /* (**5**) */
       pshufd    xmm1, xmm0, 0b11101110            /* (**6**) */
       paddd     xmm0, xmm1
       pshufd    xmm1, xmm0, 0b01010101            /* (**7**) */
       paddd     xmm0, xmm1
       movd      eax, xmm0
       /* negate if negative number */              
       add       eax, edx                          /* (**8**) */
       xor       eax, edx
      FINISH:
       /* EAX is return (u)int value */
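
The scalar preamble of the listing above (steps A to C) can be modelled in portable C++ to make the register roles easier to follow. This is only an illustrative sketch: the struct `NumberPrefix` and the function `scan_prefix` are hypothetical names, and the digit-counting loop stands in for what PCMPISTRI computes in a single instruction.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type, for illustration only.
struct NumberPrefix {
    std::int32_t sign_mask;  // 0 = non-negative, -1 (all ones) = negative, like EDX
    const char*  digits;     // first significant digit, like ESI after step B
    int          count;      // leading valid digits (0..16), like ECX after PCMPISTRI
};

inline NumberPrefix scan_prefix(const char* s) {
    assert(s != nullptr);
    NumberPrefix p{0, s, 0};

    // (A) An optional sign: '-' sets the mask, '+' is skipped.
    if (*s == '-') {
        p.sign_mask = -1;
        ++s;
    } else if (*s == '+') {
        ++s;
    }

    // (B) Skip leading zeros.
    while (*s == '0') {
        ++s;
    }
    p.digits = s;

    // (C) Count consecutive characters in the range '0'..'9', at most 16.
    while (p.count < 16 && s[p.count] >= '0' && s[p.count] <= '9') {
        ++p.count;
    }
    return p;
}
```

A count of 0 corresponds to the two early exits to FINISH in the listing (empty number or invalid first character), where the result stays 0.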
    
    The result of the Intel IACA latency analysis for Haswell, 32-bit:

    Throughput Analysis Report
    --------------------------
    Block Throughput: 16.10 Cycles       Throughput Bottleneck: InterIteration
    
    Port Binding In Cycles Per Iteration:
    ---------------------------------------------------------------------------------------
    |  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
    ---------------------------------------------------------------------------------------
    | Cycles | 9.5    0.0  | 10.0 | 4.5    4.5  | 4.5    4.5  | 0.0  | 11.1 | 11.4 | 0.0  |
    ---------------------------------------------------------------------------------------
    
    N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
    D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
    F - Macro Fusion with the previous instruction occurred
    * - instruction micro-ops not bound to a port
    ^ - Micro Fusion happened
    # - ESP Tracking sync uop was issued
    @ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
    ! - instruction not supported, was not accounted in Analysis
    
    | Num Of |                    Ports pressure in cycles                     |    |
    |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
    ---------------------------------------------------------------------------------
    |   0*   |           |     |           |           |     |     |     |     |    | xor eax, eax
    |   0*   |           |     |           |           |     |     |     |     |    | xor ecx, ecx
    |   0*   |           |     |           |           |     |     |     |     |    | xor edx, edx
    |   1    |           | 0.1 |           |           |     |     | 0.9 |     |    | dec eax
    |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | mov bl, byte ptr [esi]
    |   1    |           |     |           |           |     |     | 1.0 |     | CP | cmp bl, 0x2d
    |   2    | 0.1       | 0.2 |           |           |     |     | 1.8 |     | CP | cmovz edx, eax
    |   1    | 0.1       | 0.5 |           |           |     |     | 0.4 |     | CP | cmp bl, 0x2b
    |   2    | 0.5       | 0.2 |           |           |     |     | 1.2 |     | CP | cmovz ecx, eax
    |   1    | 0.2       | 0.5 |           |           |     |     | 0.2 |     | CP | sub esi, edx
    |   1    | 0.2       | 0.5 |           |           |     |     | 0.3 |     | CP | sub esi, ecx
    |   0*   |           |     |           |           |     |     |     |     |    | xor eax, eax
    |   1    | 0.3       | 0.1 |           |           |     |     | 0.6 |     | CP | inc esi
    |   2^   | 0.3       |     | 0.5   0.5 | 0.5   0.5 |     |     | 0.6 |     |    | cmp byte ptr [esi-0x1], 0x30
    |   0F   |           |     |           |           |     |     |     |     |    | jz 0xfffffffb
    |   2^   | 0.6       |     | 0.5   0.5 | 0.5   0.5 |     |     | 0.4 |     |    | cmp byte ptr [esi-0x1], 0x0
    |   0F   |           |     |           |           |     |     |     |     |    | jz 0x8b
    |   1    | 0.1       | 0.9 |           |           |     |     |     |     | CP | dec esi
    |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | movdqa xmm0, xmmword ptr [0x80492f0]
    |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | movdqu xmm1, xmmword ptr [esi]
    |   0*   |           |     |           |           |     |     |     |     |    | pxor xmm2, xmm2
    |   3    | 2.0       | 1.0 |           |           |     |     |     |     | CP | pcmpistri xmm0, xmm1, 0x14
    |   1    |           |     |           |           |     |     | 1.0 |     |    | jo 0x6e
    |   1    |           | 0.4 |           |           |     | 0.1 | 0.5 |     |    | mov al, 0x30
    |   1    | 0.1       | 0.5 |           |           |     | 0.1 | 0.3 |     | CP | sub ecx, 0x10
    |   1    |           |     |           |           |     | 1.0 |     |     | CP | movd xmm0, ecx
    |   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufb xmm0, xmm2
    |   2^   |           | 1.0 | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | paddb xmm0, xmmword ptr [0x80492c0]
    |   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufb xmm1, xmm0
    |   1    |           |     |           |           |     | 1.0 |     |     |    | movd xmm0, eax
    |   1    |           |     |           |           |     | 1.0 |     |     |    | pshufb xmm0, xmm2
    |   1    |           | 1.0 |           |           |     |     |     |     | CP | psubusb xmm1, xmm0
    |   0*   |           |     |           |           |     |     |     |     | CP | movdqa xmm0, xmm1
    |   1    |           |     |           |           |     | 1.0 |     |     | CP | punpcklbw xmm0, xmm2
    |   1    |           |     |           |           |     | 1.0 |     |     |    | punpckhbw xmm1, xmm2
    |   2^   | 1.0       |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | pmaddwd xmm0, xmmword ptr [0x80492d0]
    |   2^   | 1.0       |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | pmaddwd xmm1, xmmword ptr [0x80492d0]
    |   3    |           | 1.0 |           |           |     | 2.0 |     |     | CP | phaddd xmm0, xmm1
    |   3^   | 2.0       |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     | CP | pmulld xmm0, xmmword ptr [0x80492e0]
    |   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufd xmm1, xmm0, 0xee
    |   1    |           | 1.0 |           |           |     |     |     |     | CP | paddd xmm0, xmm1
    |   1    |           |     |           |           |     | 1.0 |     |     | CP | pshufd xmm1, xmm0, 0x55
    |   1    |           | 1.0 |           |           |     |     |     |     | CP | paddd xmm0, xmm1
    |   1    | 1.0       |     |           |           |     |     |     |     | CP | movd eax, xmm0
    |   1    |           |     |           |           |     |     | 1.0 |     | CP | add eax, edx
    |   1    |           |     |           |           |     |     | 1.0 |     | CP | xor eax, edx
    Total Num Of Uops: 51
    
    Latency Analysis Report
    ---------------------------
    Latency: 64 Cycles
    
    N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
    D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
    F - Macro Fusion with the previous instruction occurred
    * - instruction micro-ops not bound to a port
    ^ - Micro Fusion happened
    # - ESP Tracking sync uop was issued
    @ - Intel(R) AVX to Intel(R) SSE code switch, dozens of cycles penalty is expected
    ! - instruction not supported, was not accounted in Analysis
    
    The Resource delay is counted since all the sources of the instructions are ready
    and until the needed resource becomes available
    
    | Inst |                 Resource Delay In Cycles                  |    |
    | Num  | 0  - DV | 1  | 2  - D  | 3  - D  | 4  | 5  | 6  | 7  | FE |    |
    -------------------------------------------------------------------------
    |  0   |         |    |         |         |    |    |    |    |    |    | xor eax, eax
    |  1   |         |    |         |         |    |    |    |    |    |    | xor ecx, ecx
    |  2   |         |    |         |         |    |    |    |    |    |    | xor edx, edx
    |  3   |         |    |         |         |    |    |    |    |    |    | dec eax
    |  4   |         |    |         |         |    |    |    |    | 1  | CP | mov bl, byte ptr [esi]
    |  5   |         |    |         |         |    |    |    |    |    | CP | cmp bl, 0x2d
    |  6   |         |    |         |         |    |    |    |    |    | CP | cmovz edx, eax
    |  7   |         |    |         |         |    |    |    |    |    | CP | cmp bl, 0x2b
    |  8   |         |    |         |         |    |    | 1  |    |    | CP | cmovz ecx, eax
    |  9   |         |    |         |         |    |    |    |    |    | CP | sub esi, edx
    | 10   |         |    |         |         |    |    |    |    |    | CP | sub esi, ecx
    | 11   |         |    |         |         |    |    |    |    | 3  |    | xor eax, eax
    | 12   |         |    |         |         |    |    |    |    |    | CP | inc esi
    | 13   |         |    |         |         |    |    |    |    |    |    | cmp byte ptr [esi-0x1], 0x30
    | 14   |         |    |         |         |    |    |    |    |    |    | jz 0xfffffffb
    | 15   |         |    |         |         |    |    |    |    |    |    | cmp byte ptr [esi-0x1], 0x0
    | 16   |         |    |         |         |    |    |    |    |    |    | jz 0x8b
    | 17   |         |    |         |         |    |    |    |    |    | CP | dec esi
    | 18   |         |    |         |         |    |    |    |    | 4  |    | movdqa xmm0, xmmword ptr [0x80492f0]
    | 19   |         |    |         |         |    |    |    |    |    | CP | movdqu xmm1, xmmword ptr [esi]
    | 20   |         |    |         |         |    |    |    |    | 5  |    | pxor xmm2, xmm2
    | 21   |         |    |         |         |    |    |    |    |    | CP | pcmpistri xmm0, xmm1, 0x14
    | 22   |         |    |         |         |    |    |    |    |    |    | jo 0x6e
    | 23   |         |    |         |         |    |    |    |    | 6  |    | mov al, 0x30
    | 24   |         |    |         |         |    |    |    |    |    | CP | sub ecx, 0x10
    | 25   |         |    |         |         |    |    |    |    |    | CP | movd xmm0, ecx
    | 26   |         |    |         |         |    |    |    |    |    | CP | pshufb xmm0, xmm2
    | 27   |         |    |         |         |    |    |    |    | 7  | CP | paddb xmm0, xmmword ptr [0x80492c0]
    | 28   |         |    |         |         |    |    |    |    |    | CP | pshufb xmm1, xmm0
    | 29   |         |    |         |         |    | 1  |    |    |    |    | movd xmm0, eax
    | 30   |         |    |         |         |    | 1  |    |    |    |    | pshufb xmm0, xmm2
    | 31   |         |    |         |         |    |    |    |    |    | CP | psubusb xmm1, xmm0
    | 32   |         |    |         |         |    |    |    |    |    | CP | movdqa xmm0, xmm1
    | 33   |         |    |         |         |    |    |    |    |    | CP | punpcklbw xmm0, xmm2
    | 34   |         |    |         |         |    |    |    |    |    |    | punpckhbw xmm1, xmm2
    | 35   |         |    |         |         |    |    |    |    | 9  | CP | pmaddwd xmm0, xmmword ptr [0x80492d0]
    | 36   |         |    |         |         |    |    |    |    | 9  |    | pmaddwd xmm1, xmmword ptr [0x80492d0]
    | 37   |         |    |         |         |    |    |    |    |    | CP | phaddd xmm0, xmm1
    | 38   |         |    |         |         |    |    |    |    | 10 | CP | pmulld xmm0, xmmword ptr [0x80492e0]
    | 39   |         |    |         |         |    |    |    |    |    | CP | pshufd xmm1, xmm0, 0xee
    | 40   |         |    |         |         |    |    |    |    |    | CP | paddd xmm0, xmm1
    | 41   |         |    |         |         |    |    |    |    |    | CP | pshufd xmm1, xmm0, 0x55
    | 42   |         |    |         |         |    |    |    |    |    | CP | paddd xmm0, xmm1
    | 43   |         |    |         |         |    |    |    |    |    | CP | movd eax, xmm0
    | 44   |         |    |         |         |    |    |    |    |    | CP | add eax, edx
    | 45   |         |    |         |         |    |    |    |    |    | CP | xor eax, edx
    
    Resource Conflict on Critical Paths: 
    -----------------------------------------------------------------
    |  Port  | 0  - DV | 1  | 2  - D  | 3  - D  | 4  | 5  | 6  | 7  |
    -----------------------------------------------------------------
    | Cycles | 0    0  | 0  | 0    0  | 0    0  | 0  | 0  | 1  | 0  |
    -----------------------------------------------------------------
    
    List Of Delays On Critical Paths
    -------------------------------
    6 --> 8 1 Cycles Delay On Port6
    
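    An alternative suggested in the comments by Peter Cordes is to replace the last two add+xor instructions with a single imul. This concentration of opcodes is likely to be superior. Unfortunately, IACA does not support that instruction and emits a "! - instruction not supported, was not accounted in Analysis" note. Nevertheless, although I like the reduction in opcodes and the change from (2 uops, 2c latency) to (1 uop, 3c latency; "worse latency, but still one m-op on AMD"), I prefer to leave the choice to the implementer. I have not checked whether the following code is sufficient for parsing any number. It is mentioned only for completeness, and other parts of the code may need to be modified (especially the handling of positive numbers).

    The alternative may be replacing the last two lines with: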

      ...
      /* negate if negative number */              
       imul eax, edx
      FINISH:
      /* EAX is return (u)int value */
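
The caveat about positive numbers can be made concrete in C++. In the listing, EDX is 0 for positive input, so multiplying by it directly would turn every positive result into 0. The sketch below contrasts the add+xor form with a multiply-based form; the function names are hypothetical, and OR-ing the mask with 1 to obtain a +1/-1 multiplier is one conceivable adjustment that is assumed here, not something taken from the answer.

```cpp
#include <cassert>
#include <cstdint>

// add + xor: sign_mask is 0 (keep the value) or -1 (negate), as in EDX.
inline std::int32_t apply_sign_add_xor(std::uint32_t magnitude, std::int32_t sign_mask) {
    assert(sign_mask == 0 || sign_mask == -1);
    const std::uint32_t m = static_cast<std::uint32_t>(sign_mask);
    // (x + 0) ^ 0 == x, and (x - 1) ^ ~0 == ~(x - 1) == -x in two's complement.
    return static_cast<std::int32_t>((magnitude + m) ^ m);
}

// imul: the multiplier has to be +1 or -1, so a mask of 0 must first
// become 1. Assumption: this is done here by OR-ing the mask with 1.
inline std::int32_t apply_sign_imul(std::uint32_t magnitude, std::int32_t sign_mask) {
    assert(sign_mask == 0 || sign_mask == -1);
    const std::uint32_t multiplier = static_cast<std::uint32_t>(sign_mask) | 1u;
    return static_cast<std::int32_t>(magnitude * multiplier);
}
```

Both functions do the arithmetic in unsigned 32-bit, which wraps around the same way the EAX register does, and convert to a signed value only at the end.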
    

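    If you are trying to do this with x86 SIMD instructions, I suggest tagging the question accordingly, so that people who follow the corresponding tag queues and know some useful techniques will see your post.

    Related SSE string-parsing question: (packed compare -> shuffle-mask lookup). That may not be needed here, since you only need to find the end of one string.

    @FUZxxl Most questions I have seen tag SIMD together with C, because that is what they use to implement the SIMD operations.

    @the_drow: Do you intend to target C with SSE intrinsics, or to write the whole function in asm (e.g. in Intel syntax (NASM/YASM) or AT&T syntax (gcc-style))? Those are your two good options. Either way, see the links. Inline asm is a third option, but it is a bad choice. Also, I noticed stgatilov's SSE IPv4 address parser I linke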