Assembly: converting an ASM block into a function, and the x86-64 ABI


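I wrote a very nice integer library for big integers, but it is limited to 512 bits (for various reasons it is faster than GMP). I am trying to generalize the library to larger sizes, so I have to loop over an adcq instruction.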

// Long addition in little-endian limb order, using the incq/jnz technique.
// cmp cannot be used here because it would destroy the carry flag.
template<int n>
void test_add(boost::uint64_t*, boost::uint64_t* ){    
    asm volatile (
        "clc                                     \n"
        "movq %0, %%rcx                          \n"
    "loop:                                       \n"
        "movq 8(%%rsi,%%rcx,8), %%rax            \n"  /* original -8(%%rsi,%%rbx,8) */
        "adcq %%rax           , 8(%%rdi,%%rcx,8) \n"  /* original -8(%%rsi,%%rbx,8) */
        "incq %%rcx                              \n"  /* original decq */
    "jnz loop                                    \n"
        :   
        :"g"(n)
        :"rax","rcx","cc","memory"
    );  
}


int main(int argc, char* argv[]) {
    boost::uint64_t c[4],d[4];

    c[0] = -1;
    c[1] = -1;
    c[2] = -1;
    c[3] =  0;

    d[0] = 1;
    d[1] = 0;
    d[2] = 0;
    d[3] = 0;

    test_add<-4>(&d[3],&c[3]); // <-- BigEndian to LittleEndian
}

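In the non-optimized build, lines 20-30 of the generated assembly show the compiler setting up the stack and passing the arguments in rsi and rdi (lines 29-30) before the call. Perfect, just as the ABI specifies.

Now, if I look at the optimized version, I get: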

        .file   "test.cpp"
        .text
        .p2align 4,,15
.globl main
        .type   main, @function
main:
.LFB1:
        .cfi_startproc
        .cfi_personality 0x3,__gxx_personality_v0
#APP
# 14 "test.cpp" 1
        clc
movq $-4, %rcx
loop:
movq 8(%rsi,%rcx,8), %rax
adcq %rax           , 8(%rdi,%rcx,8)
incq %rcx
jnz loop

# 0 "" 2
#NO_APP
        xorl    %eax, %eax
        ret
        .cfi_endproc
.LFE1:
        .size   main, .-main
        .ident  "GCC: (GNU) 4.4.6 20120305 (Red Hat 4.4.6-4)"
        .section        .note.GNU-stack,"",@progbits

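Goodbye ABI; I do not understand. What manages the stack now?

Any ideas from the ASM gurus? I refuse to move the function into a separate file, in keeping with the metaprogramming spirit.

Cheers

------- EDIT:

I found a bug in your solution when I call it inside a simple loop: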

#include <boost/cstdint.hpp> // boost types

template<long n>
void test_add(boost::uint64_t* x, boost::uint64_t const* y){
    boost::uint64_t dummy;
    boost::uint64_t loop_index(n);
    __asm__ __volatile__ (
        "clc\n\t"
        "1:\n\t"
        "movq (%[y],%[counter],8), %[dummy]\n\t"
        "adcq %[dummy], (%[x],%[counter],8)\n\t"
        "incq %[counter]\n\t"
        "jnz 1b\n\t"
        : [dummy] "=&r" (dummy)
        : [x] "r" (x), [y] "r" (y), [counter] "r" (loop_index)
        : "memory", "cc");
}

int main(int argc, char* argv[]) {
    boost::uint64_t c[3], d[3];
    c[0] = -1;
    c[1] = -1;
    c[2] = -1;
    c[3] =  0;
    d[0] = 1;
    d[1] = 0;
    d[2] = 0;
    d[3] = 0;
    for(int i = 0; i < 0xfff; ++i)
        test_add<-4>(&c[4], &d[4]);
    return 0;
}

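This gives the following ASM:

movq $-4, %rdx

You should use constraints to access the arguments. gcc is not required to follow the ABI for internal functions, and even if it does, it does not have to keep the initial state intact by the time your asm block executes. Of course, the whole point of inline asm is to let the compiler inline it, in which case no function call happens at all. (Many people mistakenly believe that inline means "embedded in the C source file" and use it as a convenience feature even when they do not want the code actually inlined.)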

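gcc is also capable of putting things into whichever registers it likes (here you do not particularly care that the counter is rcx). In general it is a good idea to leave as much work as possible to the compiler, so that it can do register allocation, loop unrolling and other optimizations. Unfortunately I could not get gcc to generate ADC, so this time the asm block stays. Using inc is also not recommended because of the partial flags update, but I do not see an obvious workaround right now.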
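For reference, a carry chain can also be written without inline asm by using compiler builtins. The sketch below assumes GCC 5 or later (or Clang), where __builtin_add_overflow is available; the function name add_with_carry and its signature are illustrative and not part of the original library. Whether the compiler turns this into an adc chain depends on the compiler version and flags, so that should be checked in the generated assembly rather than assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: x += y over `limbs` 64-bit limbs, least significant
// limb first. Returns the carry out of the most significant limb.
inline bool add_with_carry(std::uint64_t* x, const std::uint64_t* y,
                           std::size_t limbs) {
    assert(x != nullptr && y != nullptr);
    bool carry = false;
    for (std::size_t i = 0; i < limbs; ++i) {
        std::uint64_t sum;
        // First add the two limbs, then add the incoming carry.
        // At most one of the two additions can overflow.
        bool c1 = __builtin_add_overflow(x[i], y[i], &sum);
        bool c2 = __builtin_add_overflow(sum, static_cast<std::uint64_t>(carry), &sum);
        x[i] = sum;
        carry = c1 || c2;
    }
    return carry;
}
```

Unlike the asm loop, this version indexes upward from the start of the arrays, so it takes plain pointers to the first limb.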

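Finally, with a plain base-plus-index address and a counter running from -4 to -1, passing the address of d[3] accesses the items d[-1] through d[2], which is not what you want. You should pass d[4].

A fixed version could look like this (with named arguments):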

template<long n>
void test_add(boost::uint64_t* x, boost::uint64_t* y){
    boost::uint64_t dummy, dummy2;
    __asm__ __volatile__ (
        "clc\n\t"
        "1:\n\t"
        "movq (%[y],%[counter],8), %[dummy]\n\t"
        "adcq %[dummy], (%[x],%[counter],8)\n\t"
        "incq %[counter]\n\t"
        "jnz 1b\n\t"
        : [dummy] "=&r" (dummy), "=r" (dummy2)
        : [x] "r" (x), [y] "r" (y), [counter] "1" (n)
        : "memory", "cc");
}

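Note that the dummy variables will be optimized away, while still allowing gcc to pick suitable registers instead of being forced to use specific ones.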
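The same loop can also be expressed with a read-write operand instead of a dummy output tied to an input. This is a minimal sketch, assuming GCC or Clang on x86-64; the names add_limbs, tmp and demo_add_limbs are made up for illustration. It also hints at one plausible cause of the failure reported in the edit: there the counter is declared as an input-only operand although incq modifies it, so the compiler is entitled to assume that the register still holds n after the asm block.

```cpp
#include <cassert>
#include <cstdint>

// x += y over -Count limbs, least significant limb first (x86-64 only).
// x_end and y_end point one past the last limb; Count is negative, e.g. -4.
template <long Count>
void add_limbs(std::uint64_t* x_end, const std::uint64_t* y_end) {
    static_assert(Count < 0, "Count must be negative");
    assert(x_end != nullptr && y_end != nullptr);
    std::uint64_t tmp;     // scratch register, chosen by the compiler
    long counter = Count;  // modified by the asm, hence the "+" constraint
    __asm__ __volatile__(
        "clc\n\t"
        "1:\n\t"
        "movq (%[y],%[counter],8), %[tmp]\n\t"
        "adcq %[tmp], (%[x],%[counter],8)\n\t"
        "incq %[counter]\n\t"  // inc leaves the carry flag untouched
        "jnz 1b\n\t"
        : [tmp] "=&r"(tmp), [counter] "+&r"(counter)
        : [x] "r"(x_end), [y] "r"(y_end)
        : "memory", "cc");
}

// Hypothetical check: (2^192 - 1) + 1 must carry into the top limb.
bool demo_add_limbs() {
    std::uint64_t c[4] = {~0ULL, ~0ULL, ~0ULL, 0};
    const std::uint64_t d[4] = {1, 0, 0, 0};
    add_limbs<-4>(c + 4, d + 4);
    return c[0] == 0 && c[1] == 0 && c[2] == 0 && c[3] == 1;
}
```

As in the version with named arguments, the pointers passed in are one past the last limb, and the negative counter walks up to zero.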

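Update: here is a pure C++ version, which the compiler can fully unroll and optimize (including computing things at compile time). Although in the general case the compiler's code is not as efficient as the hand-written one, the optimizations mentioned could make it better in certain cases. Note: since you are using gcc inline asm, your code is already specific to gcc and x86-64, so using __uint128_t is not a further limitation (in fact, this will work on any architecture where gcc supports 128-bit integers).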

template<long n>
void test_add(boost::uint64_t* x, boost::uint64_t* y){
    __uint128_t work = 0;
    for(long i = n; i < 0; i += 1) {
        work = work + x[i] + y[i];
        x[i] = work; // automatic truncation
        work >>= 64;
    }
}
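To see the carry propagation in action, the loop can be exercised on the operands from the question. This is a sketch that restates the function with <cstdint> types so that it stands alone; the names add_portable and demo_add_portable are illustrative, and unsigned __int128 is assumed to be available (GCC or Clang on a 64-bit target).

```cpp
#include <cassert>
#include <cstdint>

// x += y over -n limbs, least significant limb first.
// x and y point one past the last limb; n is negative, e.g. -4.
template <long n>
void add_portable(std::uint64_t* x, const std::uint64_t* y) {
    static_assert(n < 0, "n must be negative");
    assert(x != nullptr && y != nullptr);
    unsigned __int128 work = 0;
    for (long i = n; i < 0; ++i) {
        work = work + x[i] + y[i];                // at most 2^65 - 1, fits easily
        x[i] = static_cast<std::uint64_t>(work);  // keep the low 64 bits
        work >>= 64;                              // carry into the next limb
    }
}

// Hypothetical check with c = 2^192 - 1 and d = 1.
bool demo_add_portable() {
    std::uint64_t c[4] = {~0ULL, ~0ULL, ~0ULL, 0};
    const std::uint64_t d[4] = {1, 0, 0, 0};
    add_portable<-4>(c + 4, d + 4);
    // The three low limbs wrap to zero and the carry lands in c[3].
    return c[0] == 0 && c[1] == 0 && c[2] == 0 && c[3] == 1;
}
```

Passing c + 4 for a four-element array is a one-past-the-end pointer, which is valid to form as long as only indices -4 through -1 are dereferenced.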

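Named arguments are a great move: they really help in pinning down the code generation. 1) If I put the solution inside a loop it is not stable, segfault again :/. All I want to do is: adcq x[0] += y[0]; adcq x[1] += y[1]; adcq x[2] += y[2]; adcq x[3] += y[3]; which is why I use incq. For now I write all the kernels by hand in a .cpp file, up to 512 bits, for x86-64 and power64. That is a lot of lines, and it is not a fully inline solution. 2) For the generic case, I also have one.

I accidentally left out the CLC, sorry. That should not cause a fault, though, and I never got one. Are you sure the fault is there, and not triggered somewhere else by the compiler picking a particular set of registers? Did you use a debugger to pinpoint the problem?

As for the clc, I had already corrected it ^^; still investigating.

If I start from scratch, yes, it works. Now I plug it into my code, and boom. I am investigating. Cheers, and thanks a lot for this elegant solution.

By the way, you proposed d[4]; given that d is defined as boost::uint64_t d[3], that is not possible.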
