Order and barriers: what is the equivalent instruction on x86 for 'lwsync' on PowerPC?

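My code looks like the following. I found rmb and wmb for reads and writes, but nothing general-purpose. lwsync is available on PowerPC, but what is the replacement on x86? Thanks in advance.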

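#define barrier() __asm__ volatile ("lwsync")
...
    lock();
    if (!pInst)   // no ';' here: a stray semicolon would make the block below run unconditionally
    {
        T* temp = new T;
        barrier();    // order the construction of *temp before the pointer is published
        pInst = temp;
    }
    unlock();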

rmb() and wmb() are Linux kernel functions. There is also mb().

The x86 instructions are lfence, sfence, and mfence, IIRC.
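As a rough illustration of that mapping, here is a minimal sketch of three barrier helpers for GCC or Clang. The names full_barrier, read_barrier and write_barrier are hypothetical (they are not the kernel's macros), and the non-x86 branch is only an assumed fallback so the snippet builds elsewhere.

```cpp
#include <atomic>

// Hypothetical helper names, loosely analogous to mb()/rmb()/wmb().
// Assumes GCC or Clang inline-asm syntax.
#if defined(__x86_64__) || defined(__i386__)
// The "memory" clobber also stops the compiler from reordering
// memory accesses across the instruction.
inline void full_barrier()  { __asm__ __volatile__("mfence" ::: "memory"); }
inline void read_barrier()  { __asm__ __volatile__("lfence" ::: "memory"); }
inline void write_barrier() { __asm__ __volatile__("sfence" ::: "memory"); }
#else
// Assumed fallback for other targets: standard C++ fences.
inline void full_barrier()  { std::atomic_thread_fence(std::memory_order_seq_cst); }
inline void read_barrier()  { std::atomic_thread_fence(std::memory_order_acquire); }
inline void write_barrier() { std::atomic_thread_fence(std::memory_order_release); }
#endif
```

In the question's code, barrier() would then expand to full_barrier() on x86, although the later answer argues that the lock already makes it unnecessary.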

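In the Cilk runtime you will find an interesting file, cilk-sysdep.h, which contains the system-specific mappings for memory barriers. I have extracted the small section that answers your question for x86, i.e. i386.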

file: cilk-sysdep.h (excerpt)

     * We use an xchg instruction to serialize memory accesses, as can
     * be done according to the Intel Architecture Software Developer's
     * Manual, Volume 3: System Programming Guide
     * (http://www.intel.com/design/pro/manuals/243192.htm), page 7-6,
     * "For the P6 family processors, locked operations serialize all
     * outstanding load and store operations (that is, wait for them to
     * complete)." The xchg instruction is a locked operation by
     * default. Note that the recommended memory barrier is the cpuid
     * instruction, which is really slow (~70 cycles). In contrast,
     * xchg is only about 23 cycles (plus a few per write buffer
     * entry?). Still slow, but the best I can find. -KHR
     *
     * Bradley also timed "mfence", and on a Pentium IV xchgl is still quite a bit faster
     *   mfence appears to take about 125 ns on a 2.5GHZ P4
     *   xchgl  appears to take about  90 ns on a 2.5GHZ P4
     * However on an opteron, the performance of mfence and xchgl are both *MUCH MUCH BETTER*.
     *   mfence takes 8ns on a 1.5GHZ AMD64 (maybe this is an 801)
     *   sfence takes 5ns
     *   lfence takes 3ns
     *   xchgl  takes 14ns
     * see mfence-benchmark.c
     */
      int x=0, y;
      __asm__ volatile ("xchgl %0,%1" :"=r" (x) :"m" (y), "0" (x) :"memory");
    }

What I like is that xchgl appears to be faster :) though you should really implement them both and check for yourself.
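One way to check this on your own hardware is a small timing loop. The following is only a sketch, assuming GCC or Clang; fence_mfence, fence_xchg and ns_per_call are hypothetical names, the non-x86 branch is an assumed fallback, and the figures it produces include loop and call overhead, so treat them as a rough comparison rather than a precise latency.

```cpp
#include <atomic>
#include <chrono>

#if defined(__x86_64__) || defined(__i386__)
inline void fence_mfence() { __asm__ __volatile__("mfence" ::: "memory"); }

// xchg with a memory operand is implicitly locked, so it also acts
// as a full barrier. Both operands are read and written, hence "+".
inline void fence_xchg() {
    int reg = 0;
    int mem = 0;
    __asm__ __volatile__("xchgl %0, %1" : "+r"(reg), "+m"(mem) : : "memory");
}
#else
// Assumed fallback so the sketch builds on other targets.
inline void fence_mfence() { std::atomic_thread_fence(std::memory_order_seq_cst); }
inline void fence_xchg()   { std::atomic_thread_fence(std::memory_order_seq_cst); }
#endif

// Average wall-clock nanoseconds per call of `fence`.
template <typename Fence>
double ns_per_call(Fence fence, long iterations) {
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        fence();
    }
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling ns_per_call(fence_mfence, 10000000) and ns_per_call(fence_xchg, 10000000) and comparing the two results gives a first impression on a given CPU.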

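You don't say exactly what lock and unlock are in this code. I presume they are mutex operations. On PowerPC a mutex acquire function will use an isync (without it, the hardware may evaluate your if (!pInst) ahead of the lock()), and the unlock() will contain an lwsync (or a sync if your mutex implementation is ancient).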

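So, assuming all your accesses (both reads and writes) to pInst are guarded by your lock and unlock methods, your barrier use is redundant. The unlock will have a sufficient barrier to ensure that the pInst store is visible before the unlock operation completes (and therefore visible after any subsequent lock acquire, assuming the same lock is used).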
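To make that concrete, here is a minimal sketch of the same lazy initialization with no explicit barrier. It assumes std::mutex stands in for the question's lock()/unlock(); Widget, g_lock, g_inst and get_instance are hypothetical names used only for illustration.

```cpp
#include <mutex>

struct Widget {          // hypothetical stand-in for the question's T
    int value = 42;
};

static std::mutex g_lock;          // plays the role of lock()/unlock()
static Widget* g_inst = nullptr;   // plays the role of pInst

Widget* get_instance() {
    // Every read and write of g_inst happens while holding g_lock.
    std::lock_guard<std::mutex> guard(g_lock);
    if (!g_inst) {
        // No explicit barrier: releasing the mutex orders the
        // construction and the pointer store before any later
        // acquire of the same mutex.
        g_inst = new Widget;
    }
    return g_inst;
}
```

The explicit barrier only becomes relevant if some reader inspects the pointer without taking the lock, which is a different pattern from the one in the question.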

On x86 and x64 your lock() will use some form of LOCK-prefixed instruction, which automatically has bidirectional fencing behaviour.

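Your unlock on x86 and x64 only has to be a store instruction (unless you use some of the special string instructions within your critical section, in which case you will need an SFENCE).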

The manual has good information on all the fences, as well as on the effects of the LOCK prefix (and when it is implied).

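P.S. In your unlock code you will also need something that enforces compiler ordering (so if the unlock is just a store of zero, you will also need something like the GCC-style __asm__ __volatile__("" ::: "memory")).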
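The lock/unlock shape described above can be sketched in portable C++ with std::atomic instead of hand-written asm. This is an illustration, not the asker's actual mutex; SpinLock is a hypothetical class. On x86, mainstream compilers typically emit an xchg (implicitly locked) for the exchange and a plain mov for the release store, and the release ordering also supplies the compiler-level ordering mentioned in the P.S.

```cpp
#include <atomic>

// Hypothetical minimal spinlock, for illustration only.
class SpinLock {
public:
    void lock() {
        // Atomic exchange: acts as the acquire side of the lock.
        while (held_.exchange(true, std::memory_order_acquire)) {
            // spin until the previous value was false
        }
    }

    bool try_lock() {
        // Succeeds only if the lock was free before the exchange.
        return !held_.exchange(true, std::memory_order_acquire);
    }

    void unlock() {
        // Release store: earlier writes in the critical section may
        // not be moved after it by the compiler or the CPU.
        held_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> held_{false};
};
```

A production lock would normally add back-off or block in the kernel rather than spin indefinitely.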

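rmb() and wmb() are macros that expand to assembly code, not functions.

I would just look at how gcc optimizes it without the barrier. If you want to be particularly paranoid, you can use asm volatile("whatever" ::: "memory"), which tells GCC that arbitrary memory addresses may have been clobbered. I think that just emitting the instruction isn't necessarily enough if GCC has cached loads in registers.

Faster on P6? mfence is faster than xchg on AMD64.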
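The effect of the "memory" clobber can be shown with a small sketch, assuming GCC or Clang. The names compiler_barrier, g_ready and wait_for_flag are hypothetical. The empty asm emits no instruction; it only forbids the compiler from keeping g_ready cached in a register across it.

```cpp
// Compiler-only barrier: no CPU instruction is emitted.
inline void compiler_barrier() {
    __asm__ __volatile__("" ::: "memory");
}

int g_ready = 0;  // hypothetical plain (non-atomic) flag

// Polls g_ready up to max_spins times. Returns the iteration at
// which the flag was seen set, or -1 if it never was.
int wait_for_flag(int max_spins) {
    for (int i = 0; i < max_spins; ++i) {
        // Without this, the compiler could load g_ready once and
        // reuse the register value on every iteration.
        compiler_barrier();
        if (g_ready) {
            return i;
        }
    }
    return -1;
}
```

Sharing a plain int between threads like this is formally a data race in ISO C++, so std::atomic is the safer choice in new code; the sketch only illustrates what the clobber does to code generation.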