SSE instructions: which CPUs can do atomic 16B memory operations?


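Consider a single-memory-access SSE instruction (a single read or a single write, not read+write) on an x86 CPU. The instruction accesses 16 bytes (128 bits) of memory, and the accessed memory location is aligned to 16 bytes.

The document "Intel® 64 Architecture Memory Ordering White Paper" states that for "Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary" the memory operation appears to execute as a single memory access, regardless of memory type.

The question: do there exist Intel/AMD/etc x86 CPUs which guarantee that reading or writing 16 bytes (128 bits) aligned to a 16-byte boundary executes as a single memory access? If so, which particular type of CPU is it (Core2/Atom/K8/Phenom/...)? If you provide an answer (yes/no) to this question, please also specify the method used to determine the answer: PDF document lookup, brute-force testing, math proof, or whatever other method you used.

This question relates to problems such as:

Update:

I created a simple test program in C that you can run on your computers. Please compile and run it on your Phenom, Athlon, Bobcat, Core2, Atom, Sandy Bridge or whatever SSE2-capable CPU you happen to have. Thanks.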

// Compile with:
//   gcc -o a a.c -pthread -msse2 -std=c99 -Wall -O2
//
// Make sure you have at least two physical CPU cores or hyper-threading.

#include <pthread.h>
#include <emmintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>  // abort()
#include <string.h>

typedef int v4si __attribute__ ((vector_size (16)));
volatile v4si x;

unsigned n1[16] __attribute__((aligned(64)));
unsigned n2[16] __attribute__((aligned(64)));

void* thread1(void *arg) {
        for (int i=0; i<100*1000*1000; i++) {
                int mask = _mm_movemask_ps((__m128)x);
                n1[mask]++;

                x = (v4si){0,0,0,0};
        }
        return NULL;
}

void* thread2(void *arg) {
        for (int i=0; i<100*1000*1000; i++) {
                int mask = _mm_movemask_ps((__m128)x);
                n2[mask]++;

                x = (v4si){-1,-1,-1,-1};
        }
        return NULL;
}

int main() {
        // Check memory alignment
        if ( (((uintptr_t)&x) & 0x0f) != 0 )
                abort();

        memset(n1, 0, sizeof(n1));
        memset(n2, 0, sizeof(n2));

        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        for (unsigned i=0; i<16; i++) {
                for (int j=3; j>=0; j--)
                        printf("%u", (i>>j)&1);  // i is unsigned, so print with %u

                printf("  %10u %10u", n1[i], n2[i]);
                if(i>0 && i<0x0f) {
                        if(n1[i] || n2[i])
                                printf("  Not a single memory access!");
                }

                printf("\n");
        }

        return 0;
}

The manual that now contains the specification from the memory ordering white paper you mention says in section 8.2.3.1, as you noted yourself:

The Intel-64 memory ordering model guarantees that, for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access:
• Instructions that read or write a single byte.
• Instructions that read or write a word (2 bytes) whose address is aligned on a 2 byte boundary.
• Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary.
• Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary.
Any locked instruction (either the XCHG instruction or another read-modify-write instruction with a LOCK prefix) appears to execute as an indivisible and uninterruptible sequence of load(s) followed by store(s) regardless of alignment.

For completeness, .LC3 (referenced by the generated assembly for thread2) is the static data containing the (-1, -1, -1, -1) vector used by thread2:


.LC3:
        .long   -1
        .long   -1
        .long   -1
        .long   -1
        .ident  "GCC: (GNU) 4.4.4 20100726 (Red Hat 4.4.4-13)"
        .section        .note.GNU-stack,"",@progbits

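Also note that this is AT&T ASM syntax, not the Intel syntax Windows programmers might be more familiar with. Finally, this is with march=native, which makes GCC prefer MOVAPS; but it doesn't matter: if I use march=core2 it will use MOVDQA for storing to x, and I can still reproduce the failures.

EDIT: In the last two days I have run several tests on my three PCs and did not reproduce any memory error, so I can't say anything more precise. Maybe this memory error also depends on the OS.

EDIT: I program in Delphi, not in C, but I should be able to understand C. So I translated the code; here are the thread procedures, where the main part is done in assembler: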

procedure TThread1.Execute;
var
  n             :cardinal;
const
  ConstAll0     :array[0..3] of integer =(0,0,0,0);
begin
  for n := 0 to 100000000 do
    asm
      movdqa    xmm0, dqword [x]
      movmskps  eax, xmm0
      inc       dword ptr[n1 + eax*4]
      movdqu    xmm0, dqword [ConstAll0]
      movdqa    dqword [x], xmm0
    end;
end;

{ TThread2 }

procedure TThread2.Execute;
var
  n             :cardinal;
const
  ConstAll1     :array[0..3] of integer =(-1,-1,-1,-1);
begin
  for n := 0 to 100000000 do
    asm
      movdqa    xmm0, dqword [x]
      movmskps  eax, xmm0
      inc       dword ptr[n2 + eax*4]
      movdqu    xmm0, dqword [ConstAll1]
      movdqa    dqword [x], xmm0
    end;
end;

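Result: no errors on my quad-core PC and no errors on my dual-core PC.

PC with Intel Pentium 4 CPU

PC with Intel Core2 Quad CPU Q6600

PC with Intel Core2 Duo CPU P8400

Can you show how the debugger sees your thread procedure code? Please...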
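
There is actually a warning in the Intel Architecture Manual Vol. 3A, Section 8.1.1 (May 2011), under the section on guaranteed atomic operations:

"An x87 instruction or an SSE instruction that accesses data larger than a quadword may be implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g. due to a page-table entry that is marked "not present"). In this case, the effects of the completed accesses may be visible to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section 4.10.4.4), such page faults may occur even if all accesses are to the same page."

Thus SSE instructions are not guaranteed to be atomic, even if the underlying architecture does use single memory accesses (this is one reason why memory fencing was introduced).

Combine that with this statement from the Intel Optimization Manual, Section 13.3 (April 2011):

"AVX and FMA instructions do not introduce any new guaranteed atomic memory operations."

Together with the fact that none of the SIMD load or store operations guarantee atomicity, we can conclude that Intel does not support any form of atomic SIMD (yet).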
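
Since none of the quoted text promises atomic 16-byte SSE accesses, code that needs a consistent 128-bit snapshot has to be built from operations that are guaranteed. Below is a minimal sketch, assuming a single writer thread and C11 `<stdatomic.h>`: a sequence lock that uses only naturally aligned 4- and 8-byte atomics. The names `seq128`, `seq128_init`, `seq128_store` and `seq128_load` are made up for this illustration; they do not come from the manuals or from the test program.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 128-bit value guarded by a sequence counter. */
typedef struct {
    atomic_uint      seq;   /* odd while a write is in progress */
    _Atomic uint64_t lo;
    _Atomic uint64_t hi;
} seq128;

static void seq128_init(seq128 *s) {
    atomic_init(&s->seq, 0u);
    atomic_init(&s->lo, 0);
    atomic_init(&s->hi, 0);
}

/* ASSUMPTION: only one thread ever calls seq128_store. */
static void seq128_store(seq128 *s, uint64_t lo, uint64_t hi) {
    unsigned v = atomic_load_explicit(&s->seq, memory_order_relaxed);
    atomic_store_explicit(&s->seq, v + 1, memory_order_relaxed); /* mark busy */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&s->lo, lo, memory_order_relaxed);
    atomic_store_explicit(&s->hi, hi, memory_order_relaxed);
    atomic_store_explicit(&s->seq, v + 2, memory_order_release); /* publish */
}

/* Retries until both halves were read under the same even sequence number. */
static void seq128_load(seq128 *s, uint64_t *lo, uint64_t *hi) {
    unsigned v1, v2;
    do {
        v1  = atomic_load_explicit(&s->seq, memory_order_acquire);
        *lo = atomic_load_explicit(&s->lo, memory_order_relaxed);
        *hi = atomic_load_explicit(&s->hi, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        v2  = atomic_load_explicit(&s->seq, memory_order_relaxed);
    } while ((v1 & 1u) || v1 != v2);
}
```

With more than one writer, as in the test program, the writers would additionally have to be serialized, for example with a mutex.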
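
As an extra bit, if the memory is split along cache lines or page boundaries (when using things like movdqu, which allow unaligned access), the following processors will not perform atomic accesses, regardless of alignment, but later processors will (again from the Intel Architecture Manual):

Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors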
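
To see whether a given 16-byte access can be split at all, it helps to check its address against the line and page size. Here is a small sketch, assuming 64-byte cache lines and 4 KiB pages (typical for the processors listed, but not stated in the quoted text). The helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u    /* ASSUMPTION: typical x86 line size */
#define PAGE_BYTES       4096u  /* ASSUMPTION: 4 KiB pages */

/* Nonzero if the byte range [addr, addr+len) spans two blocks of size `boundary`. */
static int crosses_boundary(uintptr_t addr, size_t len, uintptr_t boundary) {
    return (addr / boundary) != ((addr + len - 1) / boundary);
}

/* Nonzero if a 16-byte access at p would be split across two cache lines. */
static int splits_cache_line(const void *p) {
    return crosses_boundary((uintptr_t)p, 16, CACHE_LINE_BYTES);
}

/* Nonzero if a 16-byte access at p would be split across two pages. */
static int splits_page(const void *p) {
    return crosses_boundary((uintptr_t)p, 16, PAGE_BYTES);
}
```

A 16-byte access at a 16-byte-aligned address never crosses either boundary, so this check only matters for unaligned instructions such as movdqu.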
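
Section 3.9.1 states: "CMPXCHG16B can be used to perform 16-byte atomic accesses in 64-bit mode (with certain alignment restrictions)."

However, there is no such comment about SSE instructions. In fact, there is a comment in 4.8.3 that the LOCK prefix "causes an invalid-opcode exception when used with 128-bit media instructions". So it seems to me that AMD processors do NOT guarantee atomic 128-bit accesses for SSE instructions, and the only way to do an atomic 128-bit access is to use CMPXCHG16B.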
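
As a rough illustration of the CMPXCHG16B route, here is a sketch for GCC or Clang on x86-64 using inline assembly. It assumes the CPU reports CMPXCHG16B support in CPUID leaf 1, ECX bit 13. The type `u128` and the helpers `cpu_has_cx16`, `cas16`, `load16` and `store16` are names invented for this example. Note that the "load" is really a compare-and-swap, so it needs writable memory and takes the cache line exclusively.

```c
#include <stdint.h>
#include <cpuid.h>  /* GCC/Clang wrapper around the CPUID instruction */

/* CMPXCHG16B requires its memory operand to be 16-byte aligned. */
typedef struct {
    uint64_t lo, hi;
} __attribute__((aligned(16))) u128;

/* Nonzero if CPUID leaf 1 reports CMPXCHG16B (ECX bit 13). */
static int cpu_has_cx16(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 13) & 1;
}

/* Atomically: if (*p == *expected) *p = desired; else *expected = *p.
   Returns nonzero when the swap happened. */
static int cas16(volatile u128 *p, u128 *expected, u128 desired) {
    unsigned char ok;
    __asm__ __volatile__(
        "lock cmpxchg16b %1\n\t"
        "setz %0"
        : "=q"(ok), "+m"(*p), "+a"(expected->lo), "+d"(expected->hi)
        : "b"(desired.lo), "c"(desired.hi)
        : "memory", "cc");
    return ok;
}

/* Atomic 16-byte load: compare against 0 and "replace" 0 with 0.
   Either way `cur` ends up holding the current contents of *p. */
static u128 load16(volatile u128 *p) {
    u128 cur = {0, 0};
    cas16(p, &cur, cur);
    return cur;
}

/* Atomic 16-byte store: retry until the swap succeeds; a failed attempt
   refreshes `cur` with the value currently in memory. */
static void store16(volatile u128 *p, u128 v) {
    u128 cur = {0, 0};
    while (!cas16(p, &cur, v)) {
    }
}
```

A caller would check `cpu_has_cx16()` once at startup and fall back to a lock if it returns 0.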

"" says in 8.1

0000   999998139       1572
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111        1861  999998428

0000   999243100     283087
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111      756900  999716913

0000   999995893       1901
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111        4107  999998099

0000   999998634       5990
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          1  Not a single memory access!
1101           0          0
1110           0          0
1111        1366  999994009
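
The one failing row in these results is mask 1100. The test program prints bit 3 first, so this reading had lanes 3 and 2 equal to -1 and lanes 1 and 0 equal to 0. In other words, the 16-byte value was split at its 8-byte midpoint, which fits with only aligned quadwords being guaranteed. The sketch below reproduces that mask with SSE2 intrinsics. The helper names are made up for illustration, and the "torn" vector is constructed by hand rather than observed.

```c
#include <emmintrin.h>

/* Sign bits of the four 32-bit lanes, as the test program computes them. */
static int lane_mask(__m128i v) {
    return _mm_movemask_ps(_mm_castsi128_ps(v));
}

/* Any mask other than 0000 or 1111 mixes lanes written by both threads. */
static int is_torn(int mask) {
    return mask != 0x0 && mask != 0xF;
}

/* Hand-built example: upper 8 bytes from the all-ones store,
   lower 8 bytes from the all-zero store. */
static int half_torn_mask(void) {
    __m128i v = _mm_set_epi32(-1, -1, 0, 0);  /* lanes 3,2 = -1; lanes 1,0 = 0 */
    return lane_mask(v);                      /* binary 1100 */
}
```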

        .globl  thread2
        .type   thread2, @function
thread2:
.LFB537:
        .cfi_startproc
        movdqa  .LC3(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 5,,24
        .p2align 3
.L11:
        movaps  x(%rip), %xmm0
        incl    %eax
        movaps  %xmm1, x(%rip)
        movmskps        %xmm0, %edx
        movslq  %edx, %rdx
        incl    n2(,%rdx,4)
        cmpl    $1000000000, %eax
        jne     .L11
        xorl    %eax, %eax
        ret
        .cfi_endproc
.LFE537:
        .size   thread2, .-thread2
        .p2align 5,,31
        .globl  thread1
        .type   thread1, @function
thread1:
.LFB536:
        .cfi_startproc
        pxor    %xmm1, %xmm1
        xorl    %eax, %eax
        .p2align 5,,24
        .p2align 3
.L15:
        movaps  x(%rip), %xmm0
        incl    %eax
        movaps  %xmm1, x(%rip)
        movmskps        %xmm0, %edx
        movslq  %edx, %rdx
        incl    n1(,%rdx,4)
        cmpl    $1000000000, %eax
        jne     .L15
        xorl    %eax, %eax
        ret
        .cfi_endproc
