Assembly language (x86): how to create a loop to calculate the Fibonacci sequence


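I am programming assembly language (x86) in MASM using Visual Studio 2013 Ultimate. I am trying to use an array to calculate a Fibonacci sequence of n elements. In other words, I am trying to go to an array element, obtain the two elements before it, add them, and store the result in another array.

I am having trouble setting up the index registers to make this work.

My program is set up like this: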

TITLE fibonacci.asm

INCLUDE Irvine32.inc

.data
    fibInitial  BYTE 0, 1, 2, 3, 4, 5, 6
    fibComputed BYTE 5 DUP(0)

.code
main PROC

    MOVZX si, fibInitial
    MOVZX di, fibComputed
    MOV   cl, LENGTHOF fibInitial

L1:
    MOV   ax, [si - 1]
    MOV   dx, [si - 2]
    MOV   bp, ax + dx
    MOV   dl, TYPE fibInitial
    MOVZX si, dl
    MOV   [edi], bp
    MOV   dh, TYPE fibComputed
    MOVZX di, dl
    loop L1

exit
main ENDP
END main

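I cannot compile this code: for the line MOV ebp, ax + dx the assembler reports "error A2031: must be index or base register". I am also sure there are other logic errors that I am overlooking.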

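Related: Code golf, print the first 1000 digits of Fib(10**9): an x86 asm answer that uses an extended-precision adc loop and converts binary to a string. The inner loop is optimized for speed, the other parts for size.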


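Computing each element only requires keeping two pieces of state: the current element and the previous one. I have no idea what you are trying to do with fibInitial, other than counting its length. This isn't Perl, where you would write for $n (0..5).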

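I know you are only just learning asm, but I am still going to talk about performance, because there isn't much other reason to learn asm. If you don't need performance, let a compiler generate the asm for you from C source. See also the other links at [link not preserved].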

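Keeping your state in registers simplifies the problem of needing to look at a[-1] while computing a[1]. Start with curr=1, prev=0, and store a[0] = curr first. To produce the "modern" sequence that starts with zero, start with curr=0, prev=1 instead.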
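
The two-register idea can be sketched in C before worrying about instruction selection. This is only an illustration: the function name fib_fill and the choice to pass the seeds as parameters are assumptions made for the example, not part of the original answer.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill buf[0..count-1] using only two pieces of state.
 * Seeds (curr=1, prev=0) give 1 1 2 3 ...
 * Seeds (curr=0, prev=1) give 0 1 1 2 ...
 * fib_fill is a hypothetical name used for this sketch. */
void fib_fill(uint32_t *buf, size_t count, uint32_t curr, uint32_t prev)
{
    for (size_t i = 0; i < count; i++) {
        buf[i] = curr;                /* a[i] = curr */
        uint32_t next = curr + prev;  /* wraps silently on overflow */
        prev = curr;
        curr = next;
    }
}
```

No element is ever read back from the array, which is what removes the need to index a[-1].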

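Luckily for you, I had recently been thinking about an efficient loop for Fibonacci code, so I took the time to write a complete function. Unrolled and vectorized versions appear below (the vectorized one saves store instructions, and also makes 64-bit integers fast even when compiling for a 32-bit CPU).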

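AMD CPUs can fuse cmp with a branch, but not dec with a branch. Intel CPUs can also macro-fuse dec/jnz (or branches on signed less-than-zero / greater-than-zero). dec and inc do not update the carry flag, so you cannot use them with the unsigned above/below branches ja/jb. I think the idea is that you can run adc (add with carry) in a loop and use inc/dec for the loop counter so that the counter does not disturb the carry flag.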
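
As a rough illustration of the carry chain that such an adc loop implements, here is the same multi-word addition written in C. The function name add_limbs and the choice of 32-bit limbs stored least-significant first are assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* dst += src over n 32-bit limbs, least-significant limb first.
 * Returns the carry out of the top limb (0 or 1).
 * add_limbs is a hypothetical helper, not code from the answer. */
uint32_t add_limbs(uint32_t *dst, const uint32_t *src, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* widen to 64 bits so the carry out of each limb is visible */
        uint64_t sum = (uint64_t)dst[i] + src[i] + carry;
        dst[i] = (uint32_t)sum;
        carry = (uint32_t)(sum >> 32);
    }
    return carry;
}
```

In asm the variable carry lives in the carry flag between iterations, which is why the loop counter must not overwrite that flag.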

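lea ecx, [eax + edx] needs an extra byte (an address-size prefix), which is why I used a 32-bit destination with a 64-bit address. (Those are the default operand sizes for lea in 64-bit mode.) This has no direct effect on speed, only an indirect one through code size.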

An alternative loop body could be:

    mov  ecx, eax      ; tmp=curr.  This stays true after every iteration
.loop:

    mov  [rdi], ecx
    add  ecx, edx      ; tmp+=prev  ;; shorter encoding than lea
    mov  edx, eax      ; prev=curr
    mov  eax, ecx      ; curr=tmp
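Unrolling the loop to do more iterations means less shuffling. Instead of mov instructions, you just keep track of which register holds which variable, i.e. you handle the assignments with a kind of register renaming.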

.loop:     ;; on entry:       ; curr:eax  prev:edx
    mov  [rdi], eax             ; store curr
    add  edx, eax             ; curr:edx  prev:eax
.oddentry:
    mov  [rdi + 4], edx         ; store curr
    add  eax, edx             ; curr:eax  prev:edx

    ;; we're back to our starting state, so we can loop
    add  rdi, 8
    cmp  rdi, rsi
    jb   .loop
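The problem with unrolling is that you have to clean up any odd iterations that are left over. Power-of-two unroll factors can make the cleanup loop slightly easier, but adding 12 isn't any faster than adding 16. (See an earlier revision of this post for a silly unroll-by-3 version that used lea to produce curr+prev in a third register, because I had not realized that you don't actually need a temporary. Thanks to rcgldr for catching that.)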
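
The unroll-by-two loop above, with a trailing cleanup step for odd counts, might look like this in C. It is a sketch only: fib_unroll2 is a hypothetical name, and the variables a and b stand in for eax and edx.

```c
#include <stddef.h>
#include <stdint.h>

/* Unroll by two: a and b swap the roles of curr and prev on every
 * half-iteration, so no copies between them are needed.
 * Produces 1 1 2 3 5 ...  (fib_unroll2 is a hypothetical name.) */
void fib_unroll2(uint32_t *buf, size_t count)
{
    uint32_t a = 1;           /* curr on loop entry */
    uint32_t b = 0;           /* prev on loop entry */
    size_t i = 0;

    for (; i + 2 <= count; i += 2) {
        buf[i] = a;           /* store curr              */
        b += a;               /* now curr:b  prev:a      */
        buf[i + 1] = b;       /* store curr              */
        a += b;               /* back to curr:a  prev:b  */
    }
    if (i < count)            /* cleanup: one leftover element when count is odd */
        buf[i] = a;
}
```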

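See below for a complete working unrolled version that handles any count.

Test front end (new in this version: a canary element that detects asm bugs which write past the end of the buffer).

Unrolled version. Thanks again to rcgldr for getting me to think about handling odd vs. even counts in the loop setup, rather than with a cleanup iteration at the end.

I went for branchless setup code that adds 4 * (count % 2) to the starting pointer. That can be zero, but adding zero is cheaper than branching to decide whether to add at all. The Fibonacci sequence overflows a register very quickly, so keeping the prologue tight and efficient matters, not just the code inside the loop. (If we are optimizing at all, we want to optimize for many calls with short lengths.)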
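
To put a number on "very quickly", the sketch below counts how many terms of the 1 1 2 3 ... sequence fit below a given limit before the next addition would exceed it. The helper name fib_terms_that_fit is hypothetical.

```c
#include <stdint.h>

/* Count the terms of 1 1 2 3 5 ... that are <= max.
 * fib_terms_that_fit is a hypothetical helper for this sketch. */
unsigned fib_terms_that_fit(uint64_t max)
{
    uint64_t curr = 1, prev = 0;
    unsigned n = 0;
    for (;;) {
        n++;                      /* curr itself fits */
        if (curr > max - prev)    /* curr + prev would exceed max */
            break;
        uint64_t next = curr + prev;
        prev = curr;
        curr = next;
    }
    return n;
}
```

With UINT32_MAX as the limit this counts 47 terms, which agrees with the comment in the test front end that a count of 48 overflows uint32; with UINT64_MAX it counts 93.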

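To start the sequence with 0, the setup has to change (see the second prologue in the listing that follows). The current setup is: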

curr = 1;
prev = count & 1;
buf += count & 1;
We can also save a mov instruction in both versions by using esi to hold prev, now that prev depends on count.
  ;; loop prologue for sequence starting with 1 1 2 3
  ;; (using different regs and optimized for size by using fewer immediates)
    mov    eax, 1               ; current = 1
    cmp    esi, eax
    jb     .early_out           ; count below 1
    mov    [rdi], eax
    je     .early_out           ; count == 1, flags still set from cmp

    lea    rdx, [rdi + rsi*4]   ; endp
    and    esi, eax             ; prev = count & 1
    lea    rdi, [rdi + rsi*4]   ; buf += count & 1
  ;; eax:curr esi:prev    rdx:endp  rdi:buf
  ;; end of old code

  ;; loop prologue for sequence starting with 0 1 1 2
    cmp    esi, 1
    jb     .early_out           ; count below 1, no stores
    mov    dword [rdi], 0       ; store first element (operand size must be explicit for an immediate store)
    je     .early_out           ; count == 1, flags still set from cmp

    lea    rdx, [rdi + rsi*4]   ; endp
    mov    eax, 1               ; prev = 1
    and    esi, eax             ; curr = count&1
    lea    rdi, [rdi + rsi*4]   ; buf += count&1
    xor    eax, esi             ; prev = 1^curr
    ;; ESI:curr EAX:prev  (opposite of other setup)
  ;;
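
The second prologue (sequence starting with 0) together with the unroll-by-two loop can be modeled in C as follows. This is a sketch under assumptions: fib_from_zero is a hypothetical name, and x and y stand in for the two registers that trade the curr and prev roles.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchless odd/even setup for the sequence 0 1 1 2 3 ...
 * An odd count skips one slot (already filled by the first store), so the
 * unrolled loop always has an even number of elements left to write.
 * fib_from_zero is a hypothetical name for this sketch. */
void fib_from_zero(uint32_t *buf, size_t count)
{
    if (count < 1)
        return;                         /* no stores at all */
    buf[0] = 0;                         /* store first element */
    if (count == 1)
        return;

    uint32_t *endp = buf + count;
    uint32_t x = (uint32_t)(count & 1); /* curr = count & 1 */
    buf += x;                           /* buf += count & 1 */
    uint32_t y = 1 ^ x;                 /* prev = 1 ^ curr  */

    do {
        buf[0] = x;                     /* store curr          */
        y += x;                         /* now curr:y  prev:x  */
        buf[1] = y;                     /* store curr          */
        x += y;                         /* back to curr:x      */
        buf += 2;
    } while (buf < endp);
}
```

For an even count the loop starts at buf[0] with curr=0, prev=1; for an odd count it starts at buf[1] with curr=1, prev=0.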


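Vectorized: the Fibonacci sequence isn't particularly parallelizable. There is no simple way to get F(i+4) from F(i) and F(i-4), or anything like that. What we can do with vectors is make fewer stores to memory. Start with: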

a = [f3 f2 f1 f0 ]   -> store this to buf
b = [f2 f1 f0 f-1]
Then a += b; b += a; a += b; b += a; produces:

a = [f7 f6 f5 f4 ]   -> store this to buf
b = [f6 f5 f4 f3 ]
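This is less silly when working with two 64-bit integers packed into a 128-bit vector. Even in 32-bit code you can use SSE to do 64-bit integer math.

An earlier version of this answer had an unfinished packed-32-bit vector version that did not handle count % 4 != 0 correctly. To load the first four values of the sequence I used pmovzxbd, so that I did not need 16 bytes of data when 4 bytes would do. Getting the -1 .. 1 values of the sequence into vector registers is much easier, because there is only one non-zero value to load and shuffle around.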

;void fib64_sse(uint64_t *dest, uint32_t count);
; using SSE for fewer but larger stores, and for 64bit integers even in 32bit mode
global fib64_sse
fib64_sse:
    mov eax, 1
    movd    xmm1, eax               ; xmm1 = [0 1] = [f0 f-1]
    pshufd  xmm0, xmm1, 11001111b   ; xmm0 = [1 0] = [f1 f0]

    sub esi, 2
    jae .entry  ; make the common case faster with fewer branches
    ;; could put the handling for count==0 and count==1 right here, with its own ret

    jmp .cleanup
align 16
.loop:                          ; do {
    paddq   xmm0, xmm1          ; xmm0 = [ f3 f2 ]
.entry:
    ;; xmm1: [ f0 f-1 ]         ; on initial entry, count already decremented by 2
    ;; xmm0: [ f1 f0  ]
    paddq   xmm1, xmm0          ; xmm1 = [ f4 f3 ]  (or [ f2 f1 ] on first iter)
    movdqu  [rdi], xmm0         ; store 2nd last compute result, ready for cleanup of odd count
        add     rdi, 16         ;   buf += 2
    sub esi, 2
        jae   .loop             ; } while((count-=2) >= 0);
    .cleanup:
    ;; esi is -2 if count was even (including count=0), -1 if count was odd

    ;; xmm1: [ f_rc   f_rc-1 ]  ; rc = count Rounded down to even: count & ~1
    ;; xmm0: [ f_rc-1 f_rc-2 ]  ; the pair already stored (still [ f1 f0 ] if count < 2)
    cmp esi, -1
    jne   .out  ; this could be a test on the Parity flag, with no extra cmp, if we wanted to be really hard to read and need a big comment explaining the logic
    ;; odd count: f(rc), in the high half of xmm1, is the one element left to store
    movhps  [rdi], xmm1         ; store the high 64b of xmm1.  There is no integer version of this insn, but that doesn't matter
    .out:
        ret
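
For readers who find the lane bookkeeping hard to follow, here is a scalar C model of the same two-lane scheme. It is a sketch, not a line-by-line translation of fib64_sse: the struct vec2, the helper add2, and the name fib64_pairs are all hypothetical, and the loop is arranged so that the odd-count cleanup reads the low lane of a instead of the high lane of b.

```c
#include <stdint.h>

/* Stands in for one xmm register holding two 64-bit lanes. */
typedef struct { uint64_t lo, hi; } vec2;

/* Lane-wise add, the role paddq plays in the asm. */
static vec2 add2(vec2 x, vec2 y)
{
    vec2 r = { x.lo + y.lo, x.hi + y.hi };
    return r;
}

/* Writes 0 1 1 2 3 ... into dest, two elements per "vector" store.
 * fib64_pairs is a hypothetical name for this sketch. */
void fib64_pairs(uint64_t *dest, uint32_t count)
{
    vec2 b = { 1, 0 };   /* [ f0 f-1 ]: lo = f(-1) = 1, hi = f(0) = 0 */
    vec2 a = { 0, 1 };   /* [ f1 f0  ]: lo = f(0)  = 0, hi = f(1) = 1 */

    for (uint32_t pairs = count / 2; pairs > 0; pairs--) {
        dest[0] = a.lo;  /* one 16-byte store in the SSE version */
        dest[1] = a.hi;
        dest += 2;
        b = add2(b, a);  /* b advances by two terms */
        a = add2(a, b);  /* a advances by two terms */
    }
    if (count & 1)
        *dest = a.lo;    /* odd count: one more element */
}
```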
Setup steps for the sequence starting with 0, with the matching instruction next to each step:
curr = count & 1;   // and esi, 1
buf += curr;        // lea rdi, [rdi + rsi*4]
prev = 1 ^ curr;    // xor eax, esi
  ;; optimized for code size, NOT speed.  Prob. could be smaller, esp. if we want to keep the loop start aligned, and jump between before and after it.
  ;; most of the savings are from avoiding mov reg, imm32,
  ;; and from counting down the loop counter, instead of checking an end-pointer.
  ;; loop prologue for sequence starting with 0 1 1 2
    xor    edx, edx
    cmp    esi, 1
    jb     .early_out         ; count below 1, no stores
    mov    [rdi], edx         ; store first element
    je     .early_out         ; count == 1, flags still set from cmp

    xor    eax, eax  ; movzx after setcc would be faster, but one more byte
    shr    esi, 1             ; two counts per iteration, divide by two
  ;; shift sets CF = the last bit shifted out
    setc   al                 ; curr =   count&1
    setnc  dl                 ; prev = !(count&1)

    lea    rdi, [rdi + rax*4] ; buf+= count&1

  ;; extra uop or partial register stall internally when reading eax after writing al, on Intel (except P4 & silvermont)
  ;; EAX:curr EDX:prev  (same as 1 1 2 setup)
  ;; even count: loop starts at buf[0], with curr=0, prev=1
  ;; odd  count: loop starts at buf[1], with curr=1, prev=0

  .loop:
       ...
    dec  esi                  ; 1B smaller than 64b cmp, needs count/2 in esi
    jnz .loop
  .early_out:
    ret
Test front end in C, including the canary element:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdlib.h>

#ifdef USE32
void fib(uint32_t *buf, uint32_t count);
typedef uint32_t buftype_t;
#define FMTx PRIx32
#define FMTu PRIu32
#define FIB_FN fib
#define CANARY 0xdeadbeefUL
#else
void fib64_sse(uint64_t *buf, uint32_t count);
typedef uint64_t buftype_t;
#define FMTx PRIx64
#define FMTu PRIu64
#define FIB_FN fib64_sse
#define CANARY 0xdeadbeefdeadc0deULL
#endif

#define xstr(s) str(s)
#define str(s) #s

int main(int argc, const char *argv[]) {
    uint32_t count = 15;
    if (argc > 1) {
        count = atoi(argv[1]);
    }
    int benchmark = argc > 2;

    buftype_t buf[count+1]; // allocated on the stack
    // Fib overflows uint32 at count = 48, so it's not like a lot of space is useful

    buf[count] = CANARY;
    // uint32_t count = sizeof(buf)/sizeof(buf[0]);
    if (benchmark) {
        int64_t reps = 1000000000 / count;
        for (int i=0 ; i<=reps ; i++)
            FIB_FN(buf, count);
    } else {
        FIB_FN(buf, count);
        for (uint32_t i = 0 ; i < count ; i++){   // i must start at 0; it was left uninitialized
            printf("%" FMTu " ", buf[i]);
        }
        putchar('\n');
    }
    if (buf[count] != CANARY) {
        printf(xstr(FIB_FN) " wrote past the end of buf: sentinel = %" FMTx "\n", buf[count]);
    }
}
/* lucas sequence method */
uint64_t fibl(int n) {
    uint64_t a, b, p, q, qq, aq;
    a = q = 1;
    b = p = 0;
    while(1){
        if(n & 1) {
            aq = a*q;
            a = b*q + aq + a*p;
            b = b*p + aq;
        }
        n >>= 1;
        if(n == 0)
            break;
        qq = q*q;
        q = 2*p*q + qq;
        p = p*p + qq;
    }
    return b;
}

A simpler MASM version that builds the sequence in place in a single array:

.386
.model flat, stdcall
.stack 4096
ExitProcess proto, dwExitCode:dword

.data
    fib word 1, 1, 5 dup(?)       ; array sized for the number of terms wanted; the first two terms are seeded
.code
main proc
    mov esi, offset fib           ; esi points at the first element of the array
    mov ecx, (lengthof fib) - 2   ; loop uses all of ecx; two terms already exist, so compute only the rest

L1:
    mov ax, [esi]                 ; ax = fib[i]
    add ax, [esi + type fib]      ; ax += fib[i+1], which gives the next term
    mov [esi + 2 * type fib], ax  ; fib[i+2] = fib[i] + fib[i+1]
    add esi, type fib             ; advance to the next element
    loop L1                       ; decrement ecx and repeat while it is non-zero

    invoke ExitProcess, 0
main endp
end main