Assembly language (x86): How to create a loop to calculate the Fibonacci sequence
I am programming assembly language (x86) in MASM using Visual Studio 2013 Ultimate. I am trying to use an array to calculate a Fibonacci sequence for n elements. In other words, I am trying to go to an array element, obtain the two elements before it, add them up, and store the result in another array.
I am having trouble setting up the index registers to make this work.
I have my program set up like this:
TITLE fibonacci.asm
INCLUDE Irvine32.inc
.data
fibInitial BYTE 0, 1, 2, 3, 4, 5, 6
fibComputed BYTE 5 DUP(0)
.code
main PROC
MOVZX si, fibInitial
MOVZX di, fibComputed
MOV cl, LENGTHOF fibInitial
L1:
MOV ax, [si - 1]
MOV dx, [si - 2]
MOV bp, ax + dx
MOV dl, TYPE fibInitial
MOVZX si, dl
MOV [edi], bp
MOV dh, TYPE fibComputed
MOVZX di, dl
loop L1
exit
main ENDP
END main
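I can't compile this code, because for the line
MOV ebp, ax + dx
there is an error message that says "error A2031: must be index or base register". However, I'm sure there are other logic errors I'm overlooking.
Related: Code golf printing the first 1000 digits of Fib(10**9): an extended-precision adc loop, plus converting binary to a string. The inner loop is optimized for speed, the other parts for size.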
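Computing one element only requires keeping two pieces of state: the current element and the previous one. I have no idea what you're trying to do with fibInitial, other than counting its length. This isn't perl, where you'd write for $n (0..5).
I know you're only just learning asm, but I'm still going to talk about performance. There isn't much reason to learn asm otherwise: if you don't need performance, let a compiler generate the asm for you from C source. See also the other links.
Using registers for your state simplifies the problem of needing to look at a[-1] while computing a[1]. You start with curr=1, prev=0, and begin with a[0] = curr. To produce the "modern" sequence that starts with zero, start with curr=0, prev=1.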
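The two-variable idea can be sketched in C before looking at any asm. This is only an illustration: the function name fib_fill and the zero_based flag are assumptions made here, not part of the original post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill buf[0..count-1] with Fibonacci numbers, keeping only two pieces of
 * state. fib_fill and zero_based are hypothetical names for this sketch.
 * zero_based != 0 gives 0 1 1 2 ..., zero_based == 0 gives 1 1 2 3 ... */
void fib_fill(uint32_t *buf, size_t count, int zero_based)
{
    assert(buf != NULL || count == 0);

    uint32_t curr = zero_based ? 0 : 1;
    uint32_t prev = zero_based ? 1 : 0;

    for (size_t i = 0; i < count; i++) {
        buf[i] = curr;               /* a[i] = curr */
        uint32_t next = curr + prev; /* unsigned add wraps modulo 2^32 */
        prev = curr;
        curr = next;
    }
}
```

Only the starting pair changes between the two variants; the loop body never has to read back from the array.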
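Luckily, I had recently been thinking about an efficient loop for Fibonacci code, so I took the time to write up a complete function. See below for an unrolled and vectorized version (it saves on store instructions, and also makes 64-bit integers fast even when compiling for a 32-bit CPU).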
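AMD CPUs can fuse cmp/branch, but not dec/branch. Intel CPUs can also fuse dec/jnz (or a signed less-than-zero / greater-than-zero branch). dec/inc don't update the carry flag, so you can't use them with the unsigned above/below branches ja/jb. I think the idea is that you could do an adc (add with carry) in a loop, using inc/dec for the loop counter so as not to disturb the carry flag.
lea ecx, [eax+edx] needs an extra byte (an address-size prefix), which is why I used a 32-bit destination with a 64-bit address. (Those are the default operand sizes for lea in 64-bit mode.) This has no direct effect on speed, only an indirect one through code size.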
An alternate loop body could be:
mov ecx, eax ; tmp=curr. This stays true after every iteration
.loop:
mov [rdi], ecx
add ecx, edx ; tmp+=prev ;; shorter encoding than lea
mov edx, eax ; prev=curr
mov eax, ecx ; curr=tmp
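Unrolling the loop to do more iterations means less shuffling. Instead of using mov instructions, you just keep track of which register holds which variable; that is, you handle the assignments with a sort of register renaming.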
.loop: ;; on entry: ; curr:eax prev:edx
mov [rdi], eax ; store curr
add edx, eax ; curr:edx prev:eax
.oddentry:
mov [rdi + 4], edx ; store curr
add eax, edx ; curr:eax prev:edx
;; we're back to our starting state, so we can loop
add rdi, 8
cmp rdi, rsi
jb .loop
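The same role-swapping trick can be written in C to make the bookkeeping easier to follow. This is a rough sketch that assumes an even count; fib_fill_unrolled2 is a name invented here, and a and b stand in for eax and edx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unroll-by-two sketch: a and b trade the roles of curr and prev on every
 * step, so no copy between them is needed. Assumes count is even.
 * fib_fill_unrolled2 is a hypothetical name. */
void fib_fill_unrolled2(uint32_t *buf, size_t count)
{
    assert(count % 2 == 0);

    uint32_t a = 1;   /* curr on entry (eax in the asm) */
    uint32_t b = 0;   /* prev on entry (edx in the asm) */

    for (size_t i = 0; i + 1 < count; i += 2) {
        buf[i] = a;       /* store curr                    */
        b += a;           /* now curr is b and prev is a   */
        buf[i + 1] = b;   /* store curr                    */
        a += b;           /* back to curr in a, prev in b  */
    }
}
```

After two adds the variables are back in their starting roles, which is what lets the loop repeat without any mov-style copies.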
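The problem with unrolling is that you need to clean up any odd iterations that are left over. Power-of-two unroll factors can make the cleanup loop slightly easier, but adding 12 isn't any faster than adding 16. (See the previous revision of this post for a silly unroll-by-3 version that used lea to produce curr+prev in a third register, because I failed to realize you don't actually need a temporary. Thanks to rcgldr for catching that.)
See below for a complete working unrolled version that handles any count.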
Test frontend (new in this version: a canary element to detect asm bugs that write past the end of the buffer).
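Unrolled version
Thanks again to rcgldr for getting me thinking about how to handle odd vs. even counts in the loop setup, rather than with a cleanup iteration at the end.
I went for branchless setup code, which adds 4 * count%2 to the starting pointer. That can be zero, but adding zero is cheaper than branching to decide whether we should. The Fibonacci sequence overflows a register very quickly, so keeping the prologue code tight and efficient matters, not just the code inside the loop. (If we're optimizing at all, we want to optimize for many calls with a short length.)
The sequence starting with zero uses a different setup instead of the current one: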
curr = 1;
prev = count & 1;
buf += count & 1;
We can also save a mov instruction in both versions by using esi to hold prev, now that prev depends on count.
;; loop prologue for sequence starting with 1 1 2 3
;; (using different regs and optimized for size by using fewer immediates)
mov eax, 1 ; current = 1
cmp esi, eax
jb .early_out ; count below 1
mov [rdi], eax
je .early_out ; count == 1, flags still set from cmp
lea rdx, [rdi + rsi*4] ; endp
and esi, eax ; prev = count & 1
lea rdi, [rdi + rsi*4] ; buf += count & 1
;; eax:curr esi:prev rdx:endp rdi:buf
;; end of old code
;; loop prologue for sequence starting with 0 1 1 2
cmp esi, 1
jb .early_out ; count below 1, no stores
mov dword [rdi], 0 ; store first element (the operand size must be given explicitly)
je .early_out ; count == 1, flags still set from cmp
lea rdx, [rdi + rsi*4] ; endp
mov eax, 1 ; prev = 1
and esi, eax ; curr = count&1
lea rdi, [rdi + rsi*4] ; buf += count&1
xor eax, esi ; prev = 1^curr
;; ESI:curr EAX:prev (opposite of other setup)
;;
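A C model of the branchless odd/even setup may help when reading the prologue for the sequence starting with 1 1 2 3. It is only a sketch of the control flow under the assumption that the unrolled loop stores two elements per pass; fib_fill_parity is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branchless parity setup for the sequence 1 1 2 3 ...
 * For an odd count the first element is already stored, so the loop starts
 * one element later with curr=1, prev=1; for an even count it starts at
 * buf[0] with curr=1, prev=0. fib_fill_parity is a hypothetical name. */
void fib_fill_parity(uint32_t *buf, uint32_t count)
{
    assert(buf != NULL || count == 0);

    if (count < 1)
        return;                    /* no stores at all */
    buf[0] = 1;
    if (count == 1)
        return;

    uint32_t *endp = buf + count;
    uint32_t curr = 1;
    uint32_t prev = count & 1;     /* prev = count & 1 */
    buf += count & 1;              /* buf += count & 1 (may add zero) */

    do {                           /* two stores per pass, no cleanup needed */
        buf[0] = curr;
        prev += curr;              /* roles swap: prev now holds curr */
        buf[1] = prev;
        curr += prev;              /* and swap back */
        buf += 2;
    } while (buf < endp);
}
```

Because the remaining element count is always even once the pointer has been adjusted, the loop can exit on a plain end-pointer compare.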
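Vectorized:
The Fibonacci sequence isn't particularly parallelizable. There's no simple way to get F(i+4) from F(i) and F(i-4), or anything like that. What we can do with vectors is reduce the number of stores to memory. Start with: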
a = [f3 f2 f1 f0 ] -> store this to buf
b = [f2 f1 f0 f-1]
Then b+=a; a+=b; b+=a; a+=b; (updating b, the vector one step behind, first) produces:
a = [f7 f6 f5 f4 ] -> store this to buf
b = [f6 f5 f4 f3 ]
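A scalar model of the two vectors shows how four lane-wise adds move both of them forward by four positions. It assumes that b, the vector one step behind, is updated first so that every lane stays on the sequence; LANES, vec_add and fib_vec_step4 are names made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4

/* Lane-wise add: dst[i] += src[i], modelling one packed-add instruction. */
static void vec_add(uint32_t dst[LANES], const uint32_t src[LANES])
{
    for (int i = 0; i < LANES; i++)
        dst[i] += src[i];
}

/* On entry a[i] = F(k+i) and b[i] = F(k+i-1).
 * On exit  a[i] = F(k+i+4) and b[i] = F(k+i+3).
 * fib_vec_step4 is a hypothetical name. */
void fib_vec_step4(uint32_t a[LANES], uint32_t b[LANES])
{
    assert(a != NULL && b != NULL);

    vec_add(b, a);   /* b[i] = F(k+i+1) */
    vec_add(a, b);   /* a[i] = F(k+i+2) */
    vec_add(b, a);   /* b[i] = F(k+i+3) */
    vec_add(a, b);   /* a[i] = F(k+i+4) */
}
```

Index 0 here is the lowest lane, so the vector written as [f3 f2 f1 f0] in the text is the array {f0, f1, f2, f3}.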
This is less silly when working with two 64-bit integers packed into a 128b vector. Even in 32-bit code, you can use SSE to do 64-bit integer math.
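A previous version of this answer has an unfinished packed-32bit vector version that doesn't properly handle count % 4 != 0. To load the first four values of the sequence, I used pmovzxbd, so I didn't need 16B of data when I could use only 4B. Getting the first -1 .. 1 values of the sequence into vector registers is a lot easier, because there's only one non-zero value to load and shuffle around.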
;void fib64_sse(uint64_t *dest, uint32_t count);
; using SSE for fewer but larger stores, and for 64bit integers even in 32bit mode
global fib64_sse
fib64_sse:
mov eax, 1
movd xmm1, eax ; xmm1 = [0 1] = [f0 f-1]
pshufd xmm0, xmm1, 11001111b ; xmm0 = [1 0] = [f1 f0]
sub esi, 2
jae .entry ; make the common case faster with fewer branches
;; could put the handling for count==0 and count==1 right here, with its own ret
jmp .cleanup
align 16
.loop: ; do {
paddq xmm0, xmm1 ; xmm0 = [ f3 f2 ]
.entry:
;; xmm1: [ f0 f-1 ] ; on initial entry, count already decremented by 2
;; xmm0: [ f1 f0 ]
paddq xmm1, xmm0 ; xmm1 = [ f4 f3 ] (or [ f2 f1 ] on first iter)
movdqu [rdi], xmm0 ; store 2nd last compute result, ready for cleanup of odd count
add rdi, 16 ; buf += 2
sub esi, 2
jae .loop ; } while((count-=2) >= 0);
.cleanup:
;; esi is -2 if count was even (including the count=0 special case), -1 if count was odd
;; xmm1: [ f_rc f_rc-1 ] ; rc = count Rounded down to even: count & ~1
;; xmm0: [ f_rc-1 f_rc-2 ] after the loop, or [ f1 f0 ] if we jumped straight here
;; f(rc), the high half of xmm1, is the value we need to store if count was odd
cmp esi, -1
jne .out ; this could be a test on the Parity flag, with no extra cmp, if we wanted to be really hard to read and need a big comment explaining the logic
movhps [rdi], xmm1 ; store the high 64b of xmm1. There is no integer version of this insn, but that doesn't matter
.out:
ret
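The control flow of fib64_sse (enter the loop in the middle, store two values per pass, then store one more if the count was odd) can be modelled in plain C. This is a scalar sketch under the assumption that lo and hi stand for the two 64-bit lanes of an xmm register; pair64 and fib64_pairs are hypothetical names, and no SSE intrinsics are used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* lo/hi model the low and high 64-bit lanes of one xmm register. */
typedef struct { uint64_t lo, hi; } pair64;

/* Scalar model of fib64_sse for the sequence 0 1 1 2 ...
 * fib64_pairs is a hypothetical name. */
void fib64_pairs(uint64_t *dest, uint32_t count)
{
    assert(dest != NULL || count == 0);

    pair64 x1 = { 1, 0 };   /* xmm1 = [ f0 f-1 ]: lo = f(-1) = 1, hi = f0 = 0 */
    pair64 x0 = { 0, 1 };   /* xmm0 = [ f1 f0  ]: lo = f0 = 0,    hi = f1 = 1 */
    uint32_t pairs = count / 2;

    for (uint32_t i = 0; i < pairs; i++) {
        if (i != 0) {        /* the asm skips this add by entering at .entry */
            x0.lo += x1.lo;
            x0.hi += x1.hi;
        }
        x1.lo += x0.lo;      /* paddq xmm1, xmm0 */
        x1.hi += x0.hi;
        dest[0] = x0.lo;     /* movdqu [rdi], xmm0 */
        dest[1] = x0.hi;
        dest += 2;
    }
    if (count & 1)
        dest[0] = x1.hi;     /* movhps [rdi], xmm1 */
}
```

With count = 1 the loop body never runs and the single stored value is the high lane of the initial xmm1, which is f0 = 0.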
Setup for the sequence starting with zero, with the matching instruction next to each line:
curr = count & 1; // and esi, 1
buf += curr; // lea rdi, [rdi + rsi*4]
prev = 1 ^ curr; // xor eax, esi
;; optimized for code size, NOT speed. Prob. could be smaller, esp. if we want to keep the loop start aligned, and jump between before and after it.
;; most of the savings are from avoiding mov reg, imm32,
;; and from counting down the loop counter, instead of checking an end-pointer.
;; loop prologue for sequence starting with 0 1 1 2
xor edx, edx
cmp esi, 1
jb .early_out ; count below 1, no stores
mov [rdi], edx ; store first element
je .early_out ; count == 1, flags still set from cmp
xor eax, eax ; movzx after setcc would be faster, but one more byte
shr esi, 1 ; two counts per iteration, divide by two
;; shift sets CF = the last bit shifted out
setc al ; curr = count&1
setnc dl ; prev = !(count&1)
lea rdi, [rdi + rax*4] ; buf+= count&1
;; extra uop or partial register stall internally when reading eax after writing al, on Intel (except P4 & silvermont)
;; EAX:curr EDX:prev (same as 1 1 2 setup)
;; even count: loop starts at buf[0], with curr=0, prev=1
;; odd count: loop starts at buf[1], with curr=1, prev=0
.loop:
...
dec esi ; 1B smaller than 64b cmp, needs count/2 in esi
jnz .loop
.early_out:
ret
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <stdlib.h>
#ifdef USE32
void fib(uint32_t *buf, uint32_t count);
typedef uint32_t buftype_t;
#define FMTx PRIx32
#define FMTu PRIu32
#define FIB_FN fib
#define CANARY 0xdeadbeefUL
#else
void fib64_sse(uint64_t *buf, uint32_t count);
typedef uint64_t buftype_t;
#define FMTx PRIx64
#define FMTu PRIu64
#define FIB_FN fib64_sse
#define CANARY 0xdeadbeefdeadc0deULL
#endif
#define xstr(s) str(s)
#define str(s) #s
int main(int argc, const char *argv[]) {
uint32_t count = 15;
if (argc > 1) {
count = atoi(argv[1]);
}
int benchmark = argc > 2;
buftype_t buf[count+1]; // allocated on the stack
// Fib overflows uint32 at count = 48, so it's not like a lot of space is useful
buf[count] = CANARY;
// uint32_t count = sizeof(buf)/sizeof(buf[0]);
if (benchmark) {
int64_t reps = 1000000000 / count;
for (int i=0 ; i<=reps ; i++)
FIB_FN(buf, count);
} else {
FIB_FN(buf, count);
for (uint32_t i = 0 ; i < count ; i++){
printf("%" FMTu " ", buf[i]);
}
putchar('\n');
}
if (buf[count] != CANARY) {
printf(xstr(FIB_FN) " wrote past the end of buf: sentinel = %" FMTx "\n", buf[count]);
}
}
/* lucas sequence method */
uint64_t fibl(int n) {
uint64_t a, b, p, q, qq, aq;
a = q = 1;
b = p = 0;
while(1){
if(n & 1) {
aq = a*q;
a = b*q + aq + a*p;
b = b*p + aq;
}
n >>= 1;
if(n == 0)
break;
qq = q*q;
q = 2*p*q + qq;
p = p*p + qq;
}
return b;
}
.386
.model flat, stdcall
.stack 4096
ExitProcess proto, dwExitCode:dword
.data
fib word 1, 1, 5 dup(?) ; array sized for the number of Fibonacci terms you want; the first two are given
.code
main proc
mov esi, offset fib ; point esi at the first element of the array
mov ecx, (lengthof fib) - 2 ; loop counter: each pass fills one new element, and two are already set
L1: ; start the loop
mov ax, [esi] ; load the current element into ax
add ax, [esi + type fib] ; add the following element, giving the next number in the series
mov [esi + 2 * type fib], ax ; store the sum two elements ahead
add esi, type fib ; advance to the next element
loop L1 ; repeat while ecx is not zero
invoke ExitProcess, 0
main endp
end main