Assembly: error A2070 "invalid instruction operands" for MOVSD (SSE2)

I am learning assembly to make my C++ more efficient, and I am trying to write a vector library using SIMD instructions. However, I need to access individual elements from time to time, and I was wondering whether there is an easier way to do it than using VEXTRACTF128 and MOVLPD/MOVHPD:

.Data
vecta STRUCT 16
    x REAL8 ?
    y REAL8 ?
    z REAL8 ?
    w REAL8 ?
vecta ENDS

vectb UNION          ;If I understand correctly this will force anything in a to be in b as well
    a YMMWORD ?      ;since they share the same space
    b vecta {?,?,?,?}
vectb ENDS

.CODE
Somefunc PROC   ;uses _vectorcall convention and has one parameter to be passed in YMM0
    VMOVAPD    [vectb.a], YMM0
    MOVSD      XMM2, [vectb.b.x]  ;this gives the error
    ; make other changes to vectb
    VMOVAPD    YMM0, [vectb.a]
    RET
Somefunc ENDP

assembly x86

I have also set the /arch:SSE2 compiler option, but that does not seem to help. Other things I have tried:

Somefunc PROC
    VMOVAPD [vecta.x],YMM0 ; compiler seems to think this is ok
    MOVSD   XMM2, [vecta.x]; as this line is still the only error
    ...
Somefunc ENDP

and:

It looks like you need to create a variable of type vectb:

.Data
...
...
vectc vectb {?}

.CODE
Somefunc PROC
    VMOVAPD   [vectc.a]  ,   YMM0
    MOVSD     XMM2       ,   [vectc.b.x]
    ...
Somefunc ENDP

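"I am trying to write a vector library using SIMD instructions ... to make my C++ more efficient"

What follows is a code review based on that goal. I hope it helps improve the efficiency and quality of your code.

As Intel explains, mixing VEX-encoded instructions with non-VEX instructions is a serious performance pitfall. Use vmovsd, and the v version of any other 128b operation you want to perform, unless you have run vzeroupper since you last used a 256b instruction.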
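A minimal sketch of the two safe options, assuming a 64-bit MASM (ml64) build and a vector passed in YMM0. The procedure names LowTimesTwoVex and LowTimesTwoLegacy are hypothetical and exist only for illustration; both double the low element and return it as a scalar in XMM0.

```asm
.CODE

; Option 1: stay in VEX encoding for the scalar work.
LowTimesTwoVex PROC
    vaddsd  xmm0, xmm0, xmm0    ; low double = x + x; bits 64..127 come from the first source;
                                ; the upper 128 bits of ymm0 are zeroed
    ret
LowTimesTwoVex ENDP

; Option 2: clear the upper halves first, then use legacy SSE encodings.
LowTimesTwoLegacy PROC
    vzeroupper                  ; zero the upper 128 bits of every YMM register
    addsd   xmm0, xmm0          ; non-VEX instruction, now without the transition penalty
    ret
LowTimesTwoLegacy ENDP

END
```

Option 1 is usually the simpler choice in a function that already uses 256b instructions, because there is no state to keep track of.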

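For more about writing efficient x86 asm, see the guide. It has a lot of good material:

How to decide which instructions to use, based on the performance characteristics of specific microarchitectures
How to rearrange data within vectors. There is a whole set of tables, such as "instructions that combine data from two vectors" or "instructions that can broadcast within a vector"
How to think about asm optimization in terms of dependency chains, latency, and throughput
How to deal with the ABI differences between Windows and everything else
Instruction tables and detailed microarchitecture information

For more links, see the tag wiki.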

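"I need to be able to access individual elements from time to time, and I was wondering whether there is an easier way than using VextractF128 and Movlpd/Movhpd"

Easier, yes, but slower. For maximum performance, you (or the C++ compiler) generally need to use shuffle instructions rather than storing to memory and reloading. movlpd/movhpd only work as stores/loads, not between registers. You can, however, use movhlps to merge the 64 bits from the high element of one register into the low element of another.

Spilling to memory and then reloading and modifying that memory has significant latency (something like 5 cycles per memory round trip). A wide vector load from memory that was just written with several narrow stores will then suffer a store-forwarding failure, costing roughly another 10 cycles of latency.

So even if Somefunc did nothing but store the vector, reload a scalar, store the scalar again, and reload the vector, it would add about 20 cycles of latency (IIRC, on Intel Haswell) to the dependency chain that involves its input/output.

Don't store/reload to get the low element (.x): it is already the low element of the whole vector, so you can use it directly with vmulsd or anything else.

e.g. you should use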

Somefunc PROC   ;uses _vectorcall convention and has one parameter to be passed in YMM0


    ;; VMOVAPD    [vectb.a], YMM0    ; don't do this, it was a bad plan

    ; MOVSD   XMM2, [vectb.b.x]  ;this gives the error
    ;; should be:
    vmovapd    xmm2, xmm0    ; the low element of xmm2 now contains the low element of xmm0.   The high128 of ymm2 is zeroed (instead of preserved like movapd would).
    ; or better: don't even copy it at all.  You can use `xmm0` as a source operand for `v...sd` scalar instructions just fine.


    ;;; Or, if you needed the high double zeroed, use
    vxorps     xmm3, xmm3, xmm3        ; zero ymm3 (not a typo: upper 128 zeroed implicitly).
    vmovsd     xmm2, xmm3, xmm0        ; merge low double of xmm0 into the all-zeros, putting the result in xmm2 while keeping our all-zeros around for future use.

    ;; get  .y:
    vmovhlps   xmm1, xmm3, xmm0        ; merge the high 64b of xmm0 with all-zeros, putting the result in xmm1

    vextractf128  xmm4, ymm0, 1        ; .z in the low element of xmm4 (garbage in the high element)

    vmovhlps   xmm5, xmm3, xmm4        ; .w in the low element, zero in the high element


    ; make other changes to vectb


    ;; re-combine with unpcklpd to combine two scalars into the same vector
    ;; and vinsertf128

    ;; Storing and re-loading is not a good plan for re-combining either.
    ;; VMOVAPD    YMM0, [vectb.a]     ; store-forwarding failure here
    RET
Somefunc ENDP
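The recombination step that the comments only name could look roughly like this. It is a sketch under the assumption that the four scalars ended up in the low elements of xmm2 (.x), xmm1 (.y), xmm4 (.z) and xmm5 (.w), as in the code above; the procedure name Recombine is hypothetical, and the VEX form vunpcklpd is used instead of unpcklpd to avoid mixing encodings.

```asm
.CODE

Recombine PROC
    vunpcklpd    xmm2, xmm2, xmm1        ; xmm2 = { x, y }: low doubles of both sources
    vunpcklpd    xmm4, xmm4, xmm5        ; xmm4 = { z, w }
    vinsertf128  ymm0, ymm2, xmm4, 1     ; ymm0 = { x, y, z, w }: xmm4 becomes the upper 128 bits
    ret
Recombine ENDP

END
```

Everything stays in registers, so there is no store-forwarding penalty when the full vector is used afterwards.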

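Your struct/union declaration: you probably don't need the union. This is assembly language; just set the operand size explicitly to tell MASM that you don't want it to complain about an operand-size mismatch based on how the label was defined.

e.g.

vmovapd ymmword ptr [whatever you want], ymm0

More importantly, using a static buffer like this makes the function non-thread-safe. If you need scratch space, reserve it on the stack. Make it 32B-aligned like this: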

;; Usually compilers will actually align the stack pointer to 32B
;; but if you can spare another integer register, I think you save insns doing this.
lea    rdx, [rsp-32]
sub    rsp, 48           ; assumes RSP was 16B-aligned
and    rdx, -32          ; Same as ~0x1f

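RDX now points to a 32B-aligned block of stack space, located at [rsp] or [rsp+16] if RSP was 16B-aligned beforehand. If you don't know that it was, RDX could end up below RSP, which is unsafe without a red zone. (Windows doesn't have one; everything else does.) In that case, use

sub rsp, 64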
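Put together, the 64-byte variant could be used as follows. This is only a sketch for ml64: the procedure name ScratchDemo is hypothetical, and a real Windows x64 function that adjusts RSP should also declare a FRAME prologue with unwind information, which is left out here for brevity. It spills YMM0 and reloads .y, purely to show the addressing; as discussed above, a shuffle would be faster.

```asm
.CODE

ScratchDemo PROC
    lea     rdx, [rsp-32]
    sub     rsp, 64                      ; enough room whatever the alignment of RSP was
    and     rdx, -32                     ; RDX = 32B-aligned block inside [rsp, rsp+64)

    vmovapd ymmword ptr [rdx], ymm0      ; aligned store with an explicit operand size
    vmovsd  xmm0, real8 ptr [rdx+8]      ; reload .y as a scalar (second REAL8 in the block)

    add     rsp, 64                      ; release the scratch space
    ret
ScratchDemo ENDP

END
```

Because the buffer lives on the stack, each thread that calls the function gets its own copy.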

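I believe that when you move [vectb.b.x] you are passing the wrong length. You could move [vectb.b] (an mword) rather than its subset x (a word or dword).

@DavidBS Thank you, but it still gives an error: "syntax error in expression" [A2009]. I have updated my question to show this.

None of the vector instructions you posted should assemble correctly, because they all use immediate operands. The symbol vectb is a type; it does not refer to a location in memory. So vectb.a and vectb.b.x evaluate to 0, the offset of the member. Post the actual code that reproduces the problem.

@RossRidge This is the actual code; as I mentioned earlier, I am learning.

No, if this were your actual code, the assembler would give many more errors than you describe. Besides an error for every vector instruction, including the one you say the "compiler seems to think this is ok", you would also get errors because your code is missing the .model directive and the END directive. You have not posted code that reproduces the problem you describe.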
