Assembly language (x86): How to create a loop to calculate the Fibonacci sequence

assembly x86

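I am programming assembly language (x86) in MASM using Visual Studio 2013 Ultimate. I am trying to use an array to calculate a Fibonacci sequence of n elements. In other words, I am trying to go to an array element, fetch the two elements before it, add them, and store the result in another array.

I am having trouble setting up the index registers to make this work.

My program is set up like this: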

TITLE fibonacci.asm

INCLUDE Irvine32.inc

.data
    fibInitial  BYTE 0, 1, 2, 3, 4, 5, 6
    fibComputed BYTE 5 DUP(0)

.code
main PROC

    MOVZX si, fibInitial
    MOVZX di, fibComputed
    MOV   cl, LENGTHOF fibInitial

L1:
    MOV   ax, [si - 1]
    MOV   dx, [si - 2]
    MOV   bp, ax + dx
    MOV   dl, TYPE fibInitial
    MOVZX si, dl
    MOV   [edi], bp
    MOV   dh, TYPE fibComputed
    MOVZX di, dl
    loop L1

exit
main ENDP
END main

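I cannot compile this code because, for the line MOV ebp, ax + dx, there is an error message that says "error A2031: must be index or base register". However, I am sure there are other logic errors that I am overlooking.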

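Related: a code golf that prints the first 1000 digits of Fib(10**9). It uses an extended-precision adc loop and converts binary to a string. The inner loop is optimized for speed, the other parts for size.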

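Computing an element only requires keeping two pieces of state: the current element and the previous one. I have no idea what you are trying to do with fibInitial, other than counting its length. This isn't Perl, where you would write for $n (0..5).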

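I know you are just learning asm, but I am still going to talk about performance. There is not much reason to learn asm otherwise: if you don't need performance, let a compiler generate the asm for you from C source. See also the other links at [link].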

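Using registers for your state simplifies the problem of needing to look at a[-1] while calculating a[1]. You start with curr=1, prev=0, and begin by storing a[0] = curr. To produce the "modern" sequence that starts with zero, start with curr=0, prev=1.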
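The two-variable idea can be sketched in C before worrying about registers. This is only an illustration of the state handling described above; the function name fib_fill and the first_is_zero flag are made up for this example and are not part of the answer's asm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill buf[0..count-1] with Fibonacci numbers, keeping only two state
 * variables (curr and prev) instead of re-reading earlier array elements.
 * first_is_zero != 0 gives 0 1 1 2 3 ..., otherwise 1 1 2 3 5 ...
 * (fib_fill and first_is_zero are hypothetical names for this sketch.) */
void fib_fill(uint32_t *buf, size_t count, int first_is_zero)
{
    assert(buf != NULL || count == 0);

    uint32_t curr = first_is_zero ? 0u : 1u;
    uint32_t prev = first_is_zero ? 1u : 0u;

    for (size_t i = 0; i < count; i++) {
        buf[i] = curr;                /* a[i] = curr                  */
        uint32_t next = curr + prev;  /* wraps modulo 2^32 on overflow */
        prev = curr;
        curr = next;
    }
}
```

Calling fib_fill(buf, 8, 0) on an 8-element array should leave 1 1 2 3 5 8 13 21 in it. In asm, curr and prev would simply live in two registers for the whole loop.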

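Lucky for you, I was recently thinking about an efficient loop for Fibonacci code, so I took the time to write up a complete function. See below for the unrolled and vectorized versions (vectorizing saves on store instructions, and also makes 64-bit integers fast even when compiling for a 32-bit CPU).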

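AMD CPUs can fuse cmp/branch, but not dec/branch. Intel CPUs can also macro-fuse dec/jnz (or the signed less-than-zero / greater-than-zero conditions). dec/inc don't update the carry flag, so you can't use them with the unsigned above/below branches ja/jb. I think the idea was that you could do an adc (add with carry) in a loop, using inc/dec for the loop counter so as not to disturb the carry flag, but partial-flag slowdowns make that a poor choice on some CPUs.

lea ecx, [eax+edx] needs an extra byte (an address-size prefix), which is why I used a 32-bit destination with a 64-bit address. (Those are the default operand and address sizes for lea in 64-bit mode.) This has no direct effect on speed, only an indirect one through code size.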

An alternative loop body could be:

    mov  ecx, eax      ; tmp=curr.  This stays true after every iteration
.loop:

    mov  [rdi], ecx
    add  ecx, edx      ; tmp+=prev  ;; shorter encoding than lea
    mov  edx, eax      ; prev=curr
    mov  eax, ecx      ; curr=tmp

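Unrolling the loop to do more iterations means less shuffling. Instead of using mov instructions, you just keep track of which register holds which variable, i.e. you handle the assignments with a sort of register renaming.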

.loop:     ;; on entry:       ; curr:eax  prev:edx
    mov  [rdi], eax             ; store curr
    add  edx, eax             ; curr:edx  prev:eax
.oddentry:
    mov  [rdi + 4], edx         ; store curr
    add  eax, edx             ; curr:eax  prev:edx

    ;; we're back to our starting state, so we can loop
    add  rdi, 8
    cmp  rdi, rsi
    jb   .loop
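To make the "register renaming" point concrete, here is a rough C sketch of the same unroll-by-two pattern, assuming the 1 1 2 3 starting values. The variables a and b take turns being curr, so no temporary and no copies are needed. The name fib_fill_unroll2 is hypothetical, and the trailing if is the kind of odd-count cleanup that comes with unrolling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unrolled-by-two fill of buf[0..count-1] with 1 1 2 3 5 ...
 * (fib_fill_unroll2 is a hypothetical name for this sketch.) */
void fib_fill_unroll2(uint32_t *buf, size_t count)
{
    assert(buf != NULL || count == 0);

    uint32_t a = 1;  /* curr on loop entry (eax in the asm above) */
    uint32_t b = 0;  /* prev on loop entry (edx in the asm above) */
    size_t i = 0;

    for (; i + 2 <= count; i += 2) {
        buf[i] = a;      /* store curr                    */
        b += a;          /* now b is curr and a is prev   */
        buf[i + 1] = b;  /* store curr                    */
        a += b;          /* back to a = curr and b = prev */
    }
    if (i < count)       /* one leftover element when count is odd */
        buf[i] = a;
}
```

With count = 5 this should store 1 1 2 3 5: two trips through the unrolled body plus the cleanup store.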

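The catch with unrolling is that you need to clean up any odd iterations that are left over. Power-of-two unroll factors can make the cleanup loop slightly easier, but adding 12 is no faster than adding 16. (See the previous revision of this post for a silly unroll-by-3 version that used lea to produce curr+prev in a third register, because I failed to realize that you don't actually need a temporary. Thanks to rcgldr for catching that.)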

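See below for a complete working unrolled version that handles any count.

Test frontend (new in this version: a canary element to detect asm bugs that write past the end of the buffer).

Unrolled version: thanks again to rcgldr for getting me thinking about how to handle odd vs. even counts in the loop setup, rather than with a cleanup iteration at the end.

I went for branchless setup code, which adds 4 * (count % 2) to the starting pointer. That can be zero, but adding zero is cheaper than branching to decide whether we should add at all. The Fibonacci sequence overflows a register very quickly, so keeping the prologue code tight and efficient matters, not just the code inside the loop. (If we are optimizing at all, we want to optimize for many calls with short lengths.)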

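To produce the sequence that starts with zero, the setup computes curr = count & 1, buf += curr and prev = 1 ^ curr, instead of the current: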

curr = 1;
prev = count & 1;
buf += count & 1;

We can also save a mov instruction in both versions by using esi to hold prev, now that prev depends on count.

  ;; loop prologue for sequence starting with 1 1 2 3
  ;; (using different regs and optimized for size by using fewer immediates)
    mov    eax, 1               ; current = 1
    cmp    esi, eax
    jb     .early_out           ; count below 1
    mov    [rdi], eax
    je     .early_out           ; count == 1, flags still set from cmp

    lea    rdx, [rdi + rsi*4]   ; endp
    and    esi, eax             ; prev = count & 1
    lea    rdi, [rdi + rsi*4]   ; buf += count & 1
  ;; eax:curr esi:prev    rdx:endp  rdi:buf
  ;; end of old code

  ;; loop prologue for sequence starting with 0 1 1 2
    cmp    esi, 1
    jb     .early_out           ; count below 1, no stores
    mov    dword [rdi], 0       ; store first element (operand size must be explicit with an immediate)
    je     .early_out           ; count == 1, flags still set from cmp

    lea    rdx, [rdi + rsi*4]   ; endp
    mov    eax, 1               ; prev = 1
    and    esi, eax             ; curr = count&1
    lea    rdi, [rdi + rsi*4]   ; buf += count&1
    xor    eax, esi             ; prev = 1^curr
    ;; ESI:curr EAX:prev  (opposite of other setup)
  ;;
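The branchless odd/even setup can be modeled in C to check the bookkeeping. This is a sketch under the assumption that the loop body is the unroll-by-two pattern shown earlier and that the sequence starts with 0 1 1 2. The name fib_fill_setup is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill buf[0..count-1] with 0 1 1 2 3 ... using an unroll-by-two loop and
 * no cleanup iteration: an odd count is absorbed by the setup instead.
 * (fib_fill_setup is a hypothetical name for this sketch.) */
void fib_fill_setup(uint32_t *buf, size_t count)
{
    assert(buf != NULL || count == 0);

    if (count < 1)
        return;                     /* no stores at all              */
    buf[0] = 0;                     /* first element is always 0     */
    if (count == 1)
        return;

    uint32_t *endp = buf + count;
    uint32_t curr = (uint32_t)(count & 1);  /* 1 if count is odd      */
    uint32_t prev = 1u ^ curr;
    buf += count & 1;               /* skip buf[0] when count is odd */

    /* From here the number of remaining elements is even, so storing
     * two per iteration lands exactly on endp. */
    do {
        buf[0] = curr;
        prev += curr;               /* prev now holds the newer value */
        buf[1] = prev;
        curr += prev;               /* and curr is the newest again   */
        buf += 2;
    } while (buf < endp);
}
```

For an even count the loop starts at buf[0] with curr=0, prev=1. For an odd count it starts at buf[1] with curr=1, prev=0, and the unconditional store of 0 to buf[0] covers the skipped slot.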



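Vectorized:
The Fibonacci sequence isn't particularly parallelizable. There's no simple way to get F(i+4) from F(i) and F(i-4), or anything like that. What we can do with vectors is reduce the number of stores to memory. Start with: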
a = [f3 f2 f1 f0 ]   -> store this to buf
b = [f2 f1 f0 f-1]

Then b+=a; a+=b; b+=a; a+=b produces the next four elements:
a = [f7 f6 f5 f4 ]   -> store this to buf
b = [f6 f5 f4 f3 ]

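This is less silly when working with two 64-bit integers packed into a 128-bit vector. Even in 32-bit code, you can use SSE to do 64-bit integer math.
A previous version of this answer had an unfinished packed-32-bit vector version that didn't properly handle count%4 != 0. To load the first four values of the sequence, I used pmovzxbd, so I didn't need 16 bytes of data when 4 bytes would do. Getting the first -1 .. 1 values of the sequence into vector registers is a lot easier, because there is only one non-zero value to load and shuffle around.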
;void fib64_sse(uint64_t *dest, uint32_t count);
; using SSE for fewer but larger stores, and for 64bit integers even in 32bit mode
global fib64_sse
fib64_sse:
    mov eax, 1
    movd    xmm1, eax               ; xmm1 = [0 1] = [f0 f-1]
    pshufd  xmm0, xmm1, 11001111b   ; xmm0 = [1 0] = [f1 f0]

    sub esi, 2
    jae .entry  ; make the common case faster with fewer branches
    ;; could put the handling for count==0 and count==1 right here, with its own ret

    jmp .cleanup
align 16
.loop:                          ; do {
    paddq   xmm0, xmm1          ; xmm0 = [ f3 f2 ]
.entry:
    ;; xmm1: [ f0 f-1 ]         ; on initial entry, count already decremented by 2
    ;; xmm0: [ f1 f0  ]
    paddq   xmm1, xmm0          ; xmm1 = [ f4 f3 ]  (or [ f2 f1 ] on first iter)
    movdqu  [rdi], xmm0         ; store 2nd last compute result, ready for cleanup of odd count
        add     rdi, 16         ;   buf += 2
    sub esi, 2
        jae   .loop             ; } while((count-=2) >= 0);
    .cleanup:
    ;; esi is -2 if count was even (nothing left to store), or -1 if count was odd

    ;; xmm1: [ f_rc   f_rc-1 ]  ; rc = count Rounded down to even: count & ~1
    ;; xmm0: [ f_rc-1 f_rc-2 ]  ; the pair that was stored last (when count >= 2)
    cmp esi, -1
    jne   .out  ; this could be a test on the Parity flag, with no extra cmp, if we wanted to be really hard to read and need a big comment explaining the logic
    ;; odd count: the one remaining element, f_rc, is in the high half of xmm1
    movhps  [rdi], xmm1         ; store the high 64b of xmm1.  There is no integer version of this insn, but that doesn't matter
    .out:
        ret
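As a way to check the lane bookkeeping in fib64_sse, here is a scalar C model of the same scheme. It is only a sketch: the vec2 struct stands in for one 128-bit register with two 64-bit lanes, and the names vec2, paddq_model and fib64_model are invented for this example (the asm does not call anything like them).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models one xmm register as two 64-bit lanes (hypothetical helper type). */
typedef struct { uint64_t lo, hi; } vec2;

/* Lane-wise 64-bit add, like the paddq instruction. */
static vec2 paddq_model(vec2 x, vec2 y)
{
    return (vec2){ x.lo + y.lo, x.hi + y.hi };
}

/* Scalar model of fib64_sse: writes count elements of 0 1 1 2 3 ... */
void fib64_model(uint64_t *dest, uint32_t count)
{
    assert(dest != NULL || count == 0);

    vec2 v1 = { 1, 0 };  /* xmm1 = [f0 f-1]: hi = f0 = 0, lo = f-1 = 1 */
    vec2 v0 = { 0, 1 };  /* xmm0 = [f1 f0] : hi = f1 = 1, lo = f0 = 0  */

    for (uint32_t i = 0; i < count / 2; i++) {
        if (i > 0)
            v0 = paddq_model(v0, v1);  /* top of .loop, skipped on first entry */
        v1 = paddq_model(v1, v0);      /* v1 is now one pair ahead of v0      */
        dest[0] = v0.lo;               /* movdqu [rdi], xmm0                  */
        dest[1] = v0.hi;
        dest += 2;
    }
    if (count & 1)
        dest[0] = v1.hi;               /* movhps [rdi], xmm1                  */
}
```

For count = 5 the loop stores the pairs (0, 1) and (1, 2), and the odd-count tail stores 3 from the high lane of the second vector, which matches what the asm does with movhps.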

  ;; optimized for code size, NOT speed.  Prob. could be smaller, esp. if we want to keep the loop start aligned, and jump between before and after it.
  ;; most of the savings are from avoiding mov reg, imm32,
  ;; and from counting down the loop counter, instead of checking an end-pointer.
  ;; loop prologue for sequence starting with 0 1 1 2
    xor    edx, edx
    cmp    esi, 1
    jb     .early_out         ; count below 1, no stores
    mov    [rdi], edx         ; store first element
    je     .early_out         ; count == 1, flags still set from cmp

    xor    eax, eax  ; movzx after setcc would be faster, but one more byte
    shr    esi, 1             ; two counts per iteration, divide by two
  ;; shift sets CF = the last bit shifted out
    setc   al                 ; curr =   count&1
    setnc  dl                 ; prev = !(count&1)

    lea    rdi, [rdi + rax*4] ; buf+= count&1

  ;; extra uop or partial register stall internally when reading eax after writing al, on Intel (except P4 & silvermont)
  ;; EAX:curr EDX:prev  (same as 1 1 2 setup)
  ;; even count: loop starts at buf[0], with curr=0, prev=1
  ;; odd  count: loop starts at buf[1], with curr=1, prev=0

  .loop:
       ...
    dec  esi                  ; 1B smaller than 64b cmp, needs count/2 in esi
    jnz .loop
  .early_out:
    ret

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdlib.h>

#ifdef USE32
void fib(uint32_t *buf, uint32_t count);
typedef uint32_t buftype_t;
#define FMTx PRIx32
#define FMTu PRIu32
#define FIB_FN fib
#define CANARY 0xdeadbeefUL
#else
void fib64_sse(uint64_t *buf, uint32_t count);
typedef uint64_t buftype_t;
#define FMTx PRIx64
#define FMTu PRIu64
#define FIB_FN fib64_sse
#define CANARY 0xdeadbeefdeadc0deULL
#endif

#define xstr(s) str(s)
#define str(s) #s

int main(int argc, const char *argv[]) {
    uint32_t count = 15;
    if (argc > 1) {
        count = atoi(argv[1]);
    }
    int benchmark = argc > 2;

    buftype_t buf[count+1]; // allocated on the stack
    // Fib overflows uint32 at count = 48, so it's not like a lot of space is useful

    buf[count] = CANARY;
    // uint32_t count = sizeof(buf)/sizeof(buf[0]);
    if (benchmark) {
        int64_t reps = 1000000000 / count;
        for (int i=0 ; i<=reps ; i++)
            FIB_FN(buf, count);

    } else {
        FIB_FN(buf, count);
        for (uint32_t i = 0 ; i < count ; i++){   // i must start at 0
            printf("%" FMTu " ", buf[i]);
        }
        putchar('\n');
    }
    if (buf[count] != CANARY) {
        printf(xstr(FIB_FN) " wrote past the end of buf: sentinel = %" FMTx "\n", buf[count]);
    }
}

/* lucas sequence method */
uint64_t fibl(int n) {
    uint64_t a, b, p, q, qq, aq;
    a = q = 1;
    b = p = 0;
    while(1){
        if(n & 1) {
            aq = a*q;
            a = b*q + aq + a*p;
            b = b*p + aq;
        }
        n >>= 1;
        if(n == 0)
            break;
        qq = q*q;
        q = 2*p*q + qq;
        p = p*p + qq;
    }
    return b;
}

.386
.model flat, stdcall
.stack 4096
ExitProcess proto, dwExitCode:dword

.data
    fib word 1, 1, 5 dup(?) ; two seed values, followed by room for the 5 elements to compute
.code
main proc
    mov esi, offset fib            ; point esi at the first element of the array
    mov ecx, (lengthof fib) - 2    ; one iteration per element still to compute; loop decrements the full ecx

L1: ;start the loop
    mov ax, [esi]                  ; load the older of the two previous elements
    add ax, [esi + type fib]       ; add the newer one, which gives the next element in the series
    mov [esi + 2 * type fib], ax   ; store the sum two elements further on, i.e. the next Fibonacci number
    add esi, type fib              ; advance to the next element
    loop L1                        ; repeat until ecx reaches 0

    invoke ExitProcess, 0
main endp
end main
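For comparison, the array-based approach of this last program can be written in C as below. It is a sketch of the same idea, not a translation of the MASM: each new element is the sum of the two before it, and the loop runs n - 2 times because the first two elements are seeds. The name fib_array is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* fib[0] and fib[1] must already hold the seed values; fills fib[2..n-1].
 * (fib_array is a hypothetical name for this sketch.) */
void fib_array(uint16_t *fib, size_t n)
{
    assert(fib != NULL || n == 0);

    /* i + 2 < n keeps the store at fib[i + 2] inside the array */
    for (size_t i = 0; i + 2 < n; i++)
        fib[i + 2] = (uint16_t)(fib[i] + fib[i + 1]);
}
```

With seeds 1, 1 and n = 7 the array should end up as 1 1 2 3 5 8 13. Running the loop n times instead of n - 2 would write two elements past the end of the array.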