Linux: optimizing a loop that prints a counter (tags: linux, assembly, optimization, nasm, x86-64, x86)

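I have a very small loop program that prints the numbers from 5000000 down to 1. I want to make it run as fast as possible.

I am learning Linux x86-64 assembly using NASM.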

global  main
extern  printf
main:
    push    rbx                     
    mov rax,5000000d
print:
    push    rax                     
    push    rcx                     
    mov     rdi, format             
    mov     rsi, rax                
    call    printf                  
    pop     rcx                     
    pop     rax                     
    dec     rax                     
    jnz     print                   
    pop     rbx                     
    ret

format:
db  "%ld", 10, 0

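You are effectively printing a fixed string. I would pre-generate that string as one long constant.

The program then becomes a single call to write (or a short loop that handles incomplete writes).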
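The "short loop that handles incomplete writes" could look roughly like the following C sketch. The function name write_all is an illustrative choice, not something from the answer; the only real API used is POSIX write.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write all len bytes of buf to fd, retrying after short writes and EINTR.
   Returns 0 on success, -1 on error (errno is left as set by write).
   write_all is a hypothetical helper name used only for this sketch. */
int write_all(int fd, const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before anything was written: retry */
            return -1;
        }
        buf += n;                  /* skip the part the kernel accepted */
        len -= (size_t)n;
    }
    return 0;
}
```

With the pre-generated text in a static array, the whole program reduces to one write_all(1, text, sizeof text - 1) call on standard output.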

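The calls to printf completely dominate the run time of even this very inefficient loop. (Did you notice that you push/pop rcx even though you never use it anywhere? That is probably a leftover from earlier use.)

To learn more about writing efficient x86 asm, see the linked optimization guide. (The same author's microarchitecture guide is also worth reading if you want to get into the details of specific CPUs and how they differ: what is optimal on one microarchitecture may not be on another. For example, IMUL r64 has much better throughput and latency on Intel CPUs than on AMD, whereas CMOV and ADC are 2 uops with 2-cycle latency on Intel before Broadwell, versus 1 on AMD, because 3-input ALU m-ops (FLAGS plus two registers) are not a problem for AMD.) See also the other links in the tag wiki.

Purely optimizing the loop without changing the 5M calls to printf is useful only as an example of how to write a loop properly, not as a way to actually speed up this code. But let's start with that: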

; trivial fixes to loop efficiently while calling the same slow function
global  main
extern  printf
main:
    push    rbx
    mov     ebx, 5000000         ; don't waste a REX prefix for constants that fit in 32 bits
.print:
    ;; removed the push/pops from inside the loop.
    ; Use call-preserved regs instead of saving/restoring stuff inside a loop yourself.
    mov     edi, format          ; static data / code has a 32-bit address in a non-PIE executable; in a PIE build use  lea rdi, [rel format]  instead
    mov     esi, ebx
    xor     eax, eax             ; The x86-64 SysV ABI requires al = number of FP args passed in FP registers for variadic functions
    call    printf                  
    dec     ebx
    jnz     .print

    pop     rbx                ; restore rbx, the one call-preserved reg we actually used.
    xor     eax,eax            ; successful exit status.
    ret

section .rodata       ; it's usually best to put constant data in a separate section of the text segment, not right next to code.
format:
db  "%ld", 10, 0

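To speed this up, we should take advantage of the redundancy in converting consecutive integers to strings. Since "5000000\n" is only 8 bytes long (including the newline), the string representation fits in a 64-bit register.

We can store that string into a buffer and advance a pointer by the string length. (The string gets shorter for smaller numbers, so keep the current string length in a register and update it in the special-case branch where the length changes.)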
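A minimal C sketch of that store-then-advance idea follows. The helper names put_overlapping and pack8 are made up for illustration. Each call stores a full 8 bytes but advances by only the string length, so the next store overwrites the unused tail; the buffer therefore needs a few spare bytes after the last string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store all 8 bytes of s at p, but advance by only len bytes.
   The memcpy typically compiles to a single 64-bit store.
   (put_overlapping is a hypothetical name for this sketch.) */
static char *put_overlapping(char *p, uint64_t s, size_t len)
{
    assert(len >= 1 && len <= 8);
    memcpy(p, &s, 8);
    return p + len;
}

/* Pack a text of at most 8 bytes into a 64-bit value, first character in the
   lowest-addressed byte (the low byte on little-endian x86). Unused bytes are 0. */
static uint64_t pack8(const char *text)
{
    uint64_t s = 0;
    size_t n = strlen(text);
    assert(n <= 8);
    memcpy(&s, text, n);
    return s;
}
```

Calling put_overlapping with pack8("10\n") and length 3, then pack8("9\n") and length 2, leaves "10\n9\n" at the start of the buffer, followed by padding that a later store or the final length passed to write simply ignores.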

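We can decrement the string representation in place, which avoids ever (re)doing the divide-by-10 process that converts an integer to a decimal string.

Carry/borrow does not naturally propagate between the decimal digits inside a register, and the instruction that would do this adjustment is not available in 64-bit mode (and it only works on AX, not even EAX, and is slow), so we have to do it ourselves. We decrement by 1 every time, so we know exactly what will happen. We can handle the least-significant digit by unrolling 10 times, so no branch is needed for it.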
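The in-register decrement with manual borrow can be modelled in C as below. This is only a sketch under stated assumptions: a little-endian machine, a fixed width of 7 digits plus newline (so numbers below 1000000 keep leading zeros, unlike the shrinking string the answer describes), and invented names store_ten, decrement_from and fill_countdown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { LEN = 8 };   /* 7 digits plus '\n'; fixed width is a simplification */

/* Little-endian assumption: character i of the string occupies
   bits 8*i .. 8*i+7 of the integer. */
static uint64_t digit_one(int i) { return UINT64_C(1) << (8 * i); }

/* Store ten strings whose last digit (character 6) runs from '9' down to '0'.
   No borrow handling is needed here: '9' minus 9 is still '0'. */
static char *store_ten(char *out, uint64_t s)
{
    for (int k = 0; k < 10; k++) {       /* a compiler can fully unroll this */
        memcpy(out, &s, LEN);
        out += LEN;
        s -= digit_one(6);
    }
    return out;
}

/* Decrement the digit at character i, borrowing by hand from the more
   significant digits (lower character indexes) when it drops below '0'. */
static uint64_t decrement_from(uint64_t s, int i)
{
    for (; i >= 0; i--) {
        s -= digit_one(i);
        if (s & (digit_one(i) << 4))     /* bit 4 still set: byte is '0'..'9' */
            return s;
        s += 10 * digit_one(i);          /* 0x2F ('0' - 1) becomes '9'; keep borrowing */
    }
    return s;                            /* counted below zero: caller error */
}

/* Emit blocks*10 lines counting down from start, which must end in "9\n".
   Returns the number of bytes written to out. */
size_t fill_countdown(char *out, const char *start, int blocks)
{
    assert(start[6] == '9' && start[7] == '\n');
    uint64_t s;
    memcpy(&s, start, LEN);
    char *p = out;
    while (blocks-- > 0) {
        p = store_ten(p, s);
        s = decrement_from(s, 5);        /* tens digit and above, once per ten lines */
    }
    return (size_t)(p - out);
}
```

The borrow path runs once per ten lines at most, and its deeper iterations run ten times less often per digit, which is why it can afford to be a plain loop while store_ten stays branch-free.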

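Also note that because we want the digits in printing order, carry goes in the wrong direction anyway, since x86 is little-endian. If there were a good way to take advantage of having the string in the other byte order, we could use BSWAP or MOVBE. (But note that MOVBE r64 is 3 fused-domain uops on Skylake, 2 of them ALU uops. BSWAP r64 is also 2 uops.)

Perhaps we should run odd and even counters in parallel in the two halves of an XMM vector register. But that stops working well once the string is shorter than 8B. When storing one number-string at a time, we can easily overlap the stores. Still, we could do the carry propagation in vector registers and store the two halves separately with MOVQ and MOVHPS. Or, since 4/5ths of the numbers from 0 to 5M have 7 digits, it is probably worth having code for the special case where we can store a whole 16B vector of two numbers.

A better way to handle shorter strings: use SSSE3 PSHUFB to shuffle the two strings so they are left-packed in a vector register, then store both at once with a single MOVUPS. The shuffle mask only needs to be updated when the string length (number of digits) changes, so the infrequently executed carry-handling special-case code can do that, too.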

Vectorizing the hot part of the loop should be very simple and cheap, and should roughly double performance.
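To make the PSHUFB left-packing idea concrete, here is a portable scalar model in C rather than real SSSE3 code. It mimics the byte-shuffle semantics (a mask byte with its high bit set produces zero, otherwise the low 4 bits select a source byte) and builds the mask for two strings of equal length sitting in the two 8-byte halves of a 16-byte vector. The names pshufb_model and make_pack_mask are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PSHUFB on one 16-byte vector: each result byte is zero if
   the mask byte has its high bit set, otherwise the source byte selected by
   the low 4 bits of the mask byte. */
static void pshufb_model(uint8_t dst[16], const uint8_t src[16],
                         const uint8_t mask[16])
{
    uint8_t tmp[16];                     /* allows dst to alias src */
    for (int i = 0; i < 16; i++)
        tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];
}

/* Build the mask that left-packs two len-byte strings held in bytes 0..7 and
   8..15. It only has to be rebuilt when the number of digits changes. */
static void make_pack_mask(uint8_t mask[16], int len)
{
    assert(len >= 1 && len <= 8);
    for (int i = 0; i < 16; i++) {
        if (i < len)
            mask[i] = (uint8_t)i;                /* first string stays in place */
        else if (i < 2 * len)
            mask[i] = (uint8_t)(8 + i - len);    /* second string moves down */
        else
            mask[i] = 0x80;                      /* zero the unused tail */
    }
}
```

With len = 3, the halves "42\n" and "41\n" come out as the 6 bytes "42\n41\n" at the start of the result, so one 16-byte store followed by a pointer advance of 2*len covers both numbers.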

;;; Optimized version: keep the string data in a register and modify it
;;; instead of doing the whole int->string conversion every time.

section  .bss
printbuf:  resb 1024*128 + 4096     ;  Buffer size ~= half L2 cache size on Intel SnB-family.  Or use a giant buffer that we write() once.  Or maybe vmsplice to give it away to the kernel, since we only run once.

section  .text                      ; switch back from .bss before emitting code
global  main
extern  printf
main:
    push    rbx

    ; use some REX-only regs for values that we're always going to use a REX prefix with anyway for 64-bit operand size.
    mov     rdx, `5000000\n`   ; (NASM string constants as integers work like little-endian, so DL = '5' = 0x35 and the high byte holds '\n' = 10).  Note that YASM doesn't support back-ticks for C-style backslash processing.
    mov     r9, 1<<48         ; decrement by 1 in the 2nd-last byte (byte 6, bits 48..55): LSB of the decimal string.  Bits 56..63 hold the '\n'.
    ;xor     r9d, r9d
    ;bts      r9, 48           ; IDK if this code-size optimization outside the loop would help or not.

    mov     eax, 8            ; string length.
    mov     edi, printbuf

.storeloop:

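    ;;  rdx = "????x9\n".  We compute the start value for the next iteration, i.e. counter -= 10 in rdx.
    ;;  (The initial "5000000\n" ends in '0', not '9', so it has to be stored on its own before entering this loop with "4999999\n".)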

    mov     r8, rdx
    ;;  r8 = rdx.  We modify it to have each last digit from 9 down to 0 in sequence, and store those strings in the buffer.
    ;;  The string could be any length, always with the first ASCII digit in the low byte; our other constants are adjusted correctly for it
    ;; narrower than 8B means that our stores overlap, but that's fine.

    ;; Starting from here to compute the next unrolled iteration's starting value takes the `sub r8, r9` instructions off the critical path, vs. if we started from r8 at the bottom of the loop.  This gives out-of-order execution more to play with.
    ;;  It means each loop iteration's sequence of subs and stores are a separate dependency chain (except for the store addresses, but OOO can get ahead on those because we only pointer-increment every 2 stores).

    mov     [rdi], r8
    sub     r8, r9             ; r8 = "xxx8\n"

    mov     [rdi + rax], r8    ; defer p += len by using a 2-reg addressing mode
    sub     r8, r9             ; r8 = "xxx7\n"

    lea     edi, [rdi + rax*2]  ; if we had len*3 in another reg, we could defer this longer
           ;; our static buffer is guaranteed to be in the low 31 bits of address space so we can safely save a REX prefix on the LEA here.  Normally you shouldn't truncate pointers to 32-bits, but you asked for the fastest possible.  This won't hurt, and might help on some CPUs, especially with possible decode bottlenecks.

    ;; repeat that block 3 more times.
    ;; using a short inner loop for the 9..0 last digit might be a win on some CPUs (like maybe Core2), depending on their front-end loop-buffer capabilities if the frontend is a bottleneck at all here.

    ;; anyway, then for the last one:
    mov     [rdi], r8             ; r8 = "xxx1\n"
    sub     r8, r9
    mov     [rdi + rax], r8       ; r8 = "xxx0\n"

    lea     edi, [rdi + rax*2]


    ;; compute next iteration's RDX.  It's probably a win to interleave some of this into the loop body, but out-of-order execution should do a reasonably good job here.
    mov     rcx, r9
    shr     rcx, 8      ; maybe hoist this constant out, too
    ; rcx = 1 in the second-lowest digit
    sub     rdx, rcx

    ; detect carry when '0' (0x30) - 1 = 0x2F by checking the low bit of the high nibble in that byte.
    shl     rcx, 4      ; bit 4 (0x10) of that byte: set for '0'..'9' (0x30..0x39), clear for 0x2F
    test    rdx, rcx
    jz      .carry_second_digit
    ; .carry_second_digit is some complicated code to propagate carry as far as it needs to go, up to the most-significant digit.
    ; when it's done, it re-enters the loop at the top, with eax and r9 set appropriately.
    ; it only runs once per 100 numbers, so it doesn't have to be super-fast

    ; maybe only do buffer-length checks in the carry-handling branch,
    ; in which case the jz .carry  can be  jnz .storeloop
    cmp     edi, esi              ; } while(p < endp);  esi = end-of-buffer pointer, which must be set up before the loop (not shown here)
    jb      .storeloop

    ; write() system call on the buffer.
    ; Maybe need a loop around this instead of doing all 5M integer-strings in one giant buffer.

    pop     rbx
    xor     eax,eax            ; successful exit status.
    ret
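Whether one giant buffer is practical depends on the total output size. The following C sketch (the function name countdown_bytes is invented here) adds up the bytes needed to print every number from n down to 1, each followed by a newline.

```c
#include <assert.h>
#include <stdint.h>

/* Total bytes needed to print n, n-1, ..., 1, each followed by '\n'.
   countdown_bytes is a hypothetical helper for this sketch. */
uint64_t countdown_bytes(uint64_t n)
{
    assert(n < UINT64_C(1000000000000000000));   /* keeps lo * 10 from overflowing */
    uint64_t total = 0;
    uint64_t lo = 1;                              /* smallest number with `digits` digits */
    for (unsigned digits = 1; lo <= n; digits++) {
        uint64_t hi = lo * 10 - 1;                /* largest number with that many digits */
        if (hi > n)
            hi = n;
        total += (hi - lo + 1) * (digits + 1);    /* +1 for the newline */
        lo *= 10;
    }
    return total;
}
```

For n = 5000000 this gives 38888896 bytes, a little under 39 MB, of which 4000001 lines are the 8-byte 7-digit case. That is far larger than the 128 KiB buffer reserved above, which is why the comment suggests a loop around the write call.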