Assembly: sum of two arrays, each with n bytes

This is my solution. I would like to know whether it is correct, and what other ways there are to solve it.

include 'emu8086.inc'
org 100h

mov CX,n
lea SI,a
mov AL,0

start_sum_a:         ;sum all the n elements of the first array
add AL, [SI] 
inc SI
loop start_sum_a

mov CX,n
lea SI,b 

start_sum_b:        ;sum all the n elements of the 2nd array to 
add AL, [SI]        ;the first sum
inc SI
loop start_sum_b

call print_num      ;print the sum

ret

a db 1,3,5,7,9,11,13,15,17,18
b db 0,2,4,6,8,10,12,14,16,19
n dw 10

DEFINE_PRINT_NUM
DEFINE_PRINT_NUM_UNS

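I want to know whether it is correct

Your solution looks fine, but I have a hunch that the print_num function defined in emu8086.inc expects the number in the AX register. So it would be better to change the mov AL, 0 instruction to xor ax, ax, which clears the whole AX register and not just its low byte AL.

What other ways are there to solve it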

If you set up a separate pointer for each array, you can do the work in a single loop:

    lea  si, a
    lea  di, b
    mov  cx, n
    xor  ax, ax
start_sum:
    add  al, [si]        ;Element of a array
    add  al, [di]        ;Element of b array
    inc  si
    inc  di
    loop start_sum

However, since the starts of these arrays are a fixed distance of 10 bytes apart in memory, there is a solution that uses only one pointer:

    lea  si, a
    mov  cx, n
    xor  ax, ax
start_sum:
    add  al, [si]        ;Element of a array
    add  al, [si + 10]   ;Element of b array
    inc  si
    loop start_sum

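Finally, since these arrays are adjacent in memory, the loop can be simpler still. Just double the iteration count of one of the loops you proposed, so that a single pass over memory covers a and then b.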
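
A minimal sketch of that doubled-count loop, assuming the same emu8086 setup and data as in the question. The shl cx, 1 step and the label name start_sum are illustrative choices, not necessarily what the original answer used, and the code only works because b is laid out immediately after a.

```asm
include 'emu8086.inc'
org 100h

    lea  si, a           ; b follows a directly, so one pointer covers both
    mov  cx, n
    shl  cx, 1           ; 2*n iterations: n bytes of a plus n bytes of b
    xor  ax, ax          ; clear all of AX, because print_num prints AX
start_sum:
    add  al, [si]        ; 8-bit sum, wraps modulo 256
    inc  si
    loop start_sum

    call print_num       ; with this data AX = 99 + 91 = 190
    ret

a db 1,3,5,7,9,11,13,15,17,18
b db 0,2,4,6,8,10,12,14,16,19
n dw 10

DEFINE_PRINT_NUM
DEFINE_PRINT_NUM_UNS
```

If something is ever declared between a and b, this version silently sums the wrong bytes, so the two-pointer loop is the safer choice when the layout is not under your control.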

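There are many ways to do anything. Some are more efficient than others, and there are different measures of efficiency: code size in instruction bytes, or performance for small arrays or for large arrays. On a real 8086, code size is usually the determining factor for performance, but on modern x86 CPUs that is definitely not the case. See the tag wiki for links to documentation.

You don't need to store 10 in memory; it should be an equ constant. IDK if you're pretending that you're writing a function that doesn't take advantage of everything being an assemble-time constant. If so, be careful how you use the constant: for example, don't write mov di, n + OFFSET a to compute the end pointer at assemble time.

You can do without a separate loop counter, and without increasing the number of instructions in the loop, by counting an index down from the end of the array and using an indexed addressing mode.

Also, since the arrays are adjacent, you can use just one loop that runs from the start of a to the end of b.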

mov   bx, OFFSET a          ; no point in using LEA for this
mov   si, length_ab - 1     ; index of the last element
xor   ax,ax

sum_loop:              ; do {
add   al, [bx+si]
dec   si
jns   sum_loop         ; } while(--si >= 0)   (jg would exit before adding element 0)

jmp   print_num        ; tailcall optimization: print_num will return directly to our caller
;call print_num
;ret

section .rodata
a:  db 1,3,5,7,9,11,13,15,17,18
b:  db 0,2,4,6,8,10,12,14,16,19
end_b:                   ; put a label after the end of b
length_ab equ $ - a      ; this is NASM syntax, IDK if emu8086 accepts it
n equ 10

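Or take advantage of a being static: add al, [a + si]. This may be slower on a real 8086, because it puts an extra 2 bytes of code inside the loop, which the 8086 has to re-fetch on every iteration. On modern CPUs, saving the mov bx, OFFSET a instruction is worth it for total code size. If you use the same pointer several times in a loop, it makes sense to keep it in a register.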

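If you know your sum won't overflow a byte, you can process 2 bytes in parallel with add ax, [si], and finish with add al, ah. But this is definitely a special case: handling the general case, where you have to avoid carry into the next byte, won't work well with words of only 2 bytes. In 16-bit code on a 386 or later, you could use 32-bit registers and mask the odd and even bytes separately.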
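
A rough sketch of the add ax, [si] idea with the question's data, in emu8086 syntax; the label sum_words is made up for illustration. It is only valid here because the bytes at even offsets add up to 85 and the bytes at odd offsets add up to 105, so neither partial sum overflows a byte, and it again assumes that b directly follows a.

```asm
include 'emu8086.inc'
org 100h

    lea  si, a
    mov  cx, n           ; 2*n bytes in total = n words
    xor  ax, ax
sum_words:
    add  ax, [si]        ; AL += byte at [si], AH += byte at [si+1]
                         ; (a carry out of AL would corrupt AH; this data never produces one)
    add  si, 2
    loop sum_words

    add  al, ah          ; combine the two partial sums: 85 + 105 = 190
    mov  ah, 0           ; print_num prints all of AX
    call print_num
    ret

a db 1,3,5,7,9,11,13,15,17,18
b db 0,2,4,6,8,10,12,14,16,19
n dw 10

DEFINE_PRINT_NUM
DEFINE_PRINT_NUM_UNS
```

The loop runs half as many iterations as the byte-at-a-time version, at the cost of the no-overflow precondition on the input.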

On some superscalar CPUs that can only execute one load per clock cycle (such as Intel before Sandy Bridge), this will be faster, letting you add nearly 2 bytes per clock:

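    mov   si, OFFSET a      ; start of a; b follows directly after it
    xor   ax,ax
    xor   dx,dx
sum_loop:               ; do{
    mov   cx, [si]          ; one 16-bit load fetches two bytes
    add   al, cl
    add   dl, ch

    add   si, 2
    cmp   si, OFFSET end_b  ; end_b is the label placed after the end of b
    jb  sum_loop        ; } while (si < end_pointer)

    add   al, dl
    ;; mov ah,0   ; if necessary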

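Yes, this will assemble with NASM in 16-bit mode.

For addition, truncating at the end instead of after every step is fine, because wrapping around and carrying out of the low byte give the same result.

If you couldn't take advantage of a and b being adjacent, you could: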

movdqu  xmm0, [a]
movdqu  xmm1, [b]
paddb   xmm0, xmm1  ; add packed bytes (no carry across byte boundaries)
pslldq  xmm0, 6     ; shift out the high 6 bytes from past the end of a and b (psrldq would drop the first 6 elements instead)

Or even avoid reading past the end of the arrays:

movq    xmm0, [a]
pinsrw  xmm0, [a+8], 4

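I just realized that since you apparently want the sum to wrap at 8 bits, you can use paddb for efficiency. For large arrays, you can accumulate with paddb and do a single psadbw at the end.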

movd    xmm1, [a+16]  ; load last 4 bytes, zeroing the rest of the register
paddb   xmm1, [a]
pxor    xmm0, xmm0    ; xmm0 = 0
psadbw  xmm1, xmm0    ; horizontal sum one vector of byte-sums

movhlps xmm0, xmm1    ; extract high half into a different register
paddw   xmm0, xmm1    ; low word of xmm0 = high-half sum + low-half sum
movd    eax, xmm0     ; the total is in xmm0, not xmm1

movzx    eax, al      ; truncate the sum to 8-bit
jmp    print_num

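Sorry, I forgot to switch between the solution and my statement.

Is your sum supposed to wrap at 255, or should you zero-extend before adding into a 16-bit accumulator?

Yes, of course there are other ways to solve it. For example, with SSE2, load all 16 bytes into an xmm register and use pxor xmm1, xmm1 / psadbw xmm0, xmm1. For a complete solution, see [link missing].

Staying with the 8086, where SSE and even 32-bit registers aren't available: if you know your sum won't overflow a byte, you can do 2 bytes in parallel with add ax, [si] and finish with add al, ah. You can also use SI as the loop condition instead of using CX as a counter. And there's no need to store 10 in memory; it can be an equ constant. Also, since the arrays are adjacent, you can use just one loop that runs from the start of a to the end of b.

Thank you very much, I really appreciate all this effort. It really helped me.

Better to use al and ah, or al and dl, and add them at the end, so your code can run faster on superscalar CPUs. Unrolling with only one accumulator loses a lot of the benefit. Also, if the offset between a and b has to be a runtime variable, put it in a register and use an indexed addressing mode: add al, [si] / add ah, [si+di].

I posted my own answer, since what I was writing in comments should have been posted as an answer.

@PeterCordes, what a great answer. I didn't expect this.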

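There is so much to say about this simple question. Nice! I'm not an expert on SSE, but is movq eax, xmm1 correct? I was kind of expecting movd…

@SepRoland: Thanks, fixed. I guess my head got ahead of my fingers there. Fun fact: some assemblers accept, or may even require, movd for 32-bit or 64-bit data movement between GP and XMM registers, and use movq for the MMX/SSE load/store instruction, which is a separate opcode.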