Strange behavior in sun.misc.Unsafe.compareAndSwap measurement via JMH


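I decided to measure incrementing a counter under different locking strategies, using JMH for the purpose. I use JMH to check throughput and average time, and a simple custom test to check correctness. There are seven strategies: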

Atomic count
ReadWrite locking count
Synchronized with volatile
Synchronized block without volatile
sun.misc.Unsafe.compareAndSwap
sun.misc.Unsafe.getAndAdd
Unsynchronized count
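The Counter implementations themselves are not shown in this post. As a rough sketch only, assuming a common interface with the increment() and getCounter() methods that the code below calls, three of the strategies might look like this. The class bodies and the hammer helper are assumptions made for illustration, not the code from the repo.

```java
import java.util.concurrent.atomic.AtomicLong;

public class CounterSketch {
    /** Hypothetical interface; only the method names are taken from the post. */
    interface Counter {
        void increment();
        long getCounter();
    }

    /** No synchronization: increments can be lost under contention. */
    static class UnsyncCounter implements Counter {
        private long counter;
        public void increment() { counter++; }
        public long getCounter() { return counter; }
    }

    /** Intrinsic lock around a plain (non-volatile) field. */
    static class SyncNoVolatileCounter implements Counter {
        private long counter;
        public synchronized void increment() { counter++; }
        public synchronized long getCounter() { return counter; }
    }

    /** Delegates to java.util.concurrent.atomic.AtomicLong. */
    static class AtomicCounter implements Counter {
        private final AtomicLong counter = new AtomicLong();
        public void increment() { counter.incrementAndGet(); }
        public long getCounter() { return counter.get(); }
    }

    /** Starts the given number of threads, each incrementing perThread times; returns the final count. */
    static long hammer(Counter c, int threads, int perThread) {
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    c.increment();
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return c.getCounter();
    }

    public static void main(String[] args) {
        System.out.println("atomic: " + hammer(new AtomicCounter(), 4, 100_000));
        System.out.println("sync:   " + hammer(new SyncNoVolatileCounter(), 4, 100_000));
        // The unsynchronized counter may report fewer than 400000 increments.
        System.out.println("unsync: " + hammer(new UnsyncCounter(), 4, 100_000));
    }
}
```

The unsynchronized variant is useful as a baseline precisely because it is wrong: any run that loses updates shows what the other strategies are paying to prevent.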

Benchmark code:

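import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@State(Scope.Benchmark)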
@BenchmarkMode({Mode.Throughput, Mode.AverageTime})
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
public class UnsafeCounter_Benchmark {
    public Counter unsync, syncNoV, syncV, lock, atomic, unsafe, unsafeGA;

    @Setup(Level.Iteration)
    public void prepare() {
        unsync = new UnsyncCounter();
        syncNoV = new SyncNoVolatileCounter();
        syncV = new SyncVolatileCounter();
        lock = new LockCounter();
        atomic = new AtomicCounter();
        unsafe = new UnsafeCASCounter();
        unsafeGA = new UnsafeGACounter();
    }

    @Benchmark
    public void unsyncCount() {
        unsyncCounter();
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void unsyncCounter() {
        unsync.increment();
    }

    @Benchmark
    public void syncNoVCount() {
        syncNoVCounter();
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void syncNoVCounter() {
        syncNoV.increment();
    }

    @Benchmark
    public void syncVCount() {
        syncVCounter();
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void syncVCounter() {
        syncV.increment();
    }

    @Benchmark
    public void lockCount() {
        lockCounter();
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void lockCounter() {
        lock.increment();
    }

    @Benchmark
    public void atomicCount() {
        atomicCounter();
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void atomicCounter() {
        atomic.increment();
    }

    @Benchmark
    public void unsafeCount() {
        unsafeCounter();
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void unsafeCounter() {
        unsafe.increment();
    }

    @Benchmark
    public void unsafeGACount() {
        unsafeGACounter();
    }

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    public void unsafeGACounter() {
        unsafeGA.increment();
    }

    public static void main(String[] args) throws RunnerException {
        Options baseOpts = new OptionsBuilder()
                .include(UnsafeCounter_Benchmark.class.getSimpleName())
                .threads(100)
                .jvmArgs("-ea")
                .build();

        new Runner(baseOpts).run();
    }
}

Benchmark results (JDK 8u20):

Most of the measurements are as I expected, except UnsafeCounter_Benchmark.unsafeCount, which uses sun.misc.Unsafe.compareAndSwapLong in a while loop. It is the slowest of the locking strategies:

public void increment() {
    long before = counter;
    while (!unsafe.compareAndSwapLong(this, offset, before, before + 1L)) {
        before = counter;
    }
}

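I thought the low performance was caused by the while loop and by JMH producing higher contention, but when I checked correctness with an Executor, I got the numbers I expected: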
Counter result: UnsyncCounter 97538676
Time passed in ms:259
Counter result: AtomicCounter 100000000
Time passed in ms:1805
Counter result: LockCounter 100000000
Time passed in ms:3904
Counter result: SyncNoVolatileCounter 100000000
Time passed in ms:14227
Counter result: SyncVolatileCounter 100000000
Time passed in ms:19224
Counter result: UnsafeCASCounter 100000000
Time passed in ms:8077
Counter result: UnsafeGACounter 100000000
Time passed in ms:2549

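Correctness test code:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class UnsafeCounter_Test {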
    static class CounterClient implements Runnable {
        private Counter c;
        private int num;

        public CounterClient(Counter c, int num) {
            this.c = c;
            this.num = num;
        }

        @Override
        public void run() {
            for (int i = 0; i < num; i++) {
                c.increment();
            }
        }
    }

    public static void makeTest(Counter counter) throws InterruptedException {
        int NUM_OF_THREADS = 1000;
        int NUM_OF_INCREMENTS = 100000;
        ExecutorService service = Executors.newFixedThreadPool(NUM_OF_THREADS);
        long before = System.currentTimeMillis();
        for (int i = 0; i < NUM_OF_THREADS; i++) {
            service.submit(new CounterClient(counter, NUM_OF_INCREMENTS));
        }
        service.shutdown();
        service.awaitTermination(1, TimeUnit.MINUTES);
        long after = System.currentTimeMillis();
        System.out.println("Counter result: " + counter.getClass().getSimpleName() + " " + counter.getCounter());
        System.out.println("Time passed in ms:" + (after - before));
    }

    public static void main(String[] args) throws InterruptedException {
        makeTest(new UnsyncCounter());
        makeTest(new AtomicCounter());
        makeTest(new LockCounter());
        makeTest(new SyncNoVolatileCounter());
        makeTest(new SyncVolatileCounter());
        makeTest(new UnsafeCASCounter());
        makeTest(new UnsafeGACounter());
    }
}


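I know this is a very poor test, but in this case the CAS counter is twice as fast as the Sync variants and everything goes as expected.
Can someone clarify the behavior described above?
For more information, see the GitHub repo: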
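Thinking out loud: it is remarkable how often people do 90% of the tedious work and leave the remaining 10% (where the fun begins) to someone else! All right, I'll take all the fun.
Let me first repeat the experiment on my i7-4790K, 8u40 EA: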
Benchmark                                 Mode  Samples    Score    Error   Units
UnsafeCounter_Benchmark.atomicCount      thrpt        5   47.669 ± 18.440  ops/us
UnsafeCounter_Benchmark.lockCount        thrpt        5   14.497 ±  7.815  ops/us
UnsafeCounter_Benchmark.syncNoVCount     thrpt        5   11.618 ±  2.130  ops/us
UnsafeCounter_Benchmark.syncVCount       thrpt        5   11.337 ±  4.532  ops/us
UnsafeCounter_Benchmark.unsafeCount      thrpt        5    7.452 ±  1.042  ops/us
UnsafeCounter_Benchmark.unsafeGACount    thrpt        5   43.332 ±  3.435  ops/us
UnsafeCounter_Benchmark.unsyncCount      thrpt        5  102.773 ± 11.943  ops/us

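Indeed, the unsafeCount test looks fishy. In fact, you have to treat all data as suspicious until it is verified. For nanobenchmarks, you have to check the generated code to see whether you are actually measuring what you want to measure. In JMH this is very quick to do with -prof perfasm. If you look at the hottest region of unsafeCount there, you will notice a few funny things: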
  0.12%    0.04%    0x00007fb45518e7d1: mov    0x10(%r10),%rax    
 17.03%   23.44%    0x00007fb45518e7d5: test   %eax,0x17318825(%rip)
  0.21%    0.07%    0x00007fb45518e7db: mov    0x18(%r10),%r11    ; getfield offset
 30.33%   10.77%    0x00007fb45518e7df: mov    %rax,%r8
  0.00%             0x00007fb45518e7e2: add    $0x1,%r8           
  0.01%             0x00007fb45518e7e6: cmp    0xc(%r10),%r12d    ; typecheck 
                    0x00007fb45518e7ea: je     0x00007fb45518e80b ; bail to v-call
  0.83%    0.48%    0x00007fb45518e7ec: lock cmpxchg %r8,(%r10,%r11,1)
 33.27%   25.52%    0x00007fb45518e7f2: sete   %r8b
  0.12%    0.01%    0x00007fb45518e7f6: movzbl %r8b,%r8d          
  0.03%    0.04%    0x00007fb45518e7fa: test   %r8d,%r8d
                    0x00007fb45518e7fd: je     0x00007fb45518e7d1 ; back branch

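Translation: a) the offset field is re-read on every iteration, because the memory effects of CAS imply a volatile read, so the field has to be pessimistically re-read; b) the funny part is that the unsafe field is also re-read for the type check, for the same reason.
This is why high-performance code should look like this: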
--- a/utils bench/src/main/java/org/kirmit/utils/unsafe/concurrency/UnsafeCASCounter.java       
+++ b/utils bench/src/main/java/org/kirmit/utils/unsafe/concurrency/UnsafeCASCounter.java       
@@ -5,13 +5,13 @@ import sun.misc.Unsafe;

 public class UnsafeCASCounter implements Counter {
     private volatile long counter = 0;
-    private final Unsafe unsafe = UnsafeHelper.unsafe;
-    private long offset;
-    {
+    private static final Unsafe unsafe = UnsafeHelper.unsafe;
+    private static final long offset;
+    static {
         try {
             offset = unsafe.objectFieldOffset(UnsafeCASCounter.class.getDeclaredField("counter"));
         } catch (NoSuchFieldException e) {
-            e.printStackTrace();
+            throw new IllegalStateException("Whoops!");
         }
     }
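The same "keep the access handle in a static final" idea can be sketched without sun.misc.Unsafe. The example below is an illustration under assumptions, not the code from the repo: it swaps in the standard AtomicLongFieldUpdater for Unsafe, and the class name FieldUpdaterCasCounter and the runThreads helper are hypothetical. The point it shows is that nothing needed for the CAS lives in an instance field that would have to be re-read inside the retry loop.

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

public class FieldUpdaterCasCounter {
    // static final: initialized once per class, never read from the instance,
    // so the retry loop only has to reload the counter itself.
    private static final AtomicLongFieldUpdater<FieldUpdaterCasCounter> UPDATER =
            AtomicLongFieldUpdater.newUpdater(FieldUpdaterCasCounter.class, "counter");

    // The updater requires the target field to be a volatile long.
    private volatile long counter;

    public void increment() {
        long before = counter;
        while (!UPDATER.compareAndSet(this, before, before + 1L)) {
            before = counter; // lost the race, reload and retry
        }
    }

    public long getCounter() {
        return counter;
    }

    /** Hypothetical helper: increments one shared counter from several threads. */
    static long runThreads(int threads, int perThread) {
        FieldUpdaterCasCounter c = new FieldUpdaterCasCounter();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    c.increment();
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return c.getCounter();
    }

    public static void main(String[] args) {
        System.out.println(runThreads(4, 100_000));
    }
}
```

Whether the generated code for this variant is as tight as the Unsafe version is something to verify with -prof perfasm rather than assume.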

If you do that, unsafeCount performance goes up immediately:
Benchmark                              Mode  Samples   Score    Error   Units
UnsafeCounter_Benchmark.unsafeCount    thrpt        5  9.733 ± 0.673  ops/us

...which, given the error bounds, is now very close to the synchronized tests. If you look at -prof perfasm now, this is the unsafeCount loop:
  0.08%    0.02%    0x00007f7575191900: mov    0x10(%r10),%rax       
 28.09%   28.64%    0x00007f7575191904: test   %eax,0x161286f6(%rip) 
  0.23%    0.08%    0x00007f757519190a: mov    %rax,%r11
                    0x00007f757519190d: add    $0x1,%r11
                    0x00007f7575191911: lock cmpxchg %r11,0x10(%r10)
 47.27%   23.48%    0x00007f7575191917: sete   %r8b
  0.10%             0x00007f757519191b: movzbl %r8b,%r8d        
  0.02%             0x00007f757519191f: test   %r8d,%r8d
                    0x00007f7575191922: je     0x00007f7575191900  

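This loop is very tight, and it seems nothing can make it go faster. We spend most of the time loading the "updated" value and actually CAS-ing it. But we contend a lot! To figure out whether contention is the leading cause, let's add a backoff: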
--- a/utils bench/src/main/java/org/kirmit/utils/unsafe/concurrency/UnsafeCASCounter.java       
+++ b/utils bench/src/main/java/org/kirmit/utils/unsafe/concurrency/UnsafeCASCounter.java       
@@ -20,6 +21,7 @@ public class UnsafeCASCounter implements Counter {
         long before = counter;
         while (!unsafe.compareAndSwapLong(this, offset, before, before + 1L)) {
             before = counter;
+            Blackhole.consumeCPU(1000);
         }
     }

...running it:
Benchmark                                 Mode  Samples    Score    Error   Units
UnsafeCounter_Benchmark.unsafeCount      thrpt        5   99.869 ± 107.933  ops/us

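Voila. We do more work in the loop, but it saves us from a lot of contention. I have tried to explain this before, and it may be worth going back to that write-up to read more about benchmarking methodology, especially when heavy-weight operations are measured. This highlights a pitfall in the entire experiment, not only in unsafeCount.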
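Blackhole.consumeCPU is part of the JMH API, so it is not something production code would normally call. As a sketch of the same backoff idea outside JMH, assuming Java 9 or later for Thread.onSpinWait(), a CAS loop with a bounded exponential spin might look like this. The class name, the runThreads helper, and the MAX_SPINS cap are assumptions chosen for illustration and are not tuned values.

```java
import java.util.concurrent.atomic.AtomicLong;

public class BackoffCasCounter {
    // Arbitrary, untuned cap on the spin length (an assumption for this sketch).
    private static final int MAX_SPINS = 1024;

    private final AtomicLong counter = new AtomicLong();

    public void increment() {
        int spins = 1;
        long before = counter.get();
        while (!counter.compareAndSet(before, before + 1L)) {
            // Lost the race: back off for a while before touching the
            // contended location again, doubling the wait on each failure.
            for (int i = 0; i < spins; i++) {
                Thread.onSpinWait();
            }
            if (spins < MAX_SPINS) {
                spins <<= 1;
            }
            before = counter.get();
        }
    }

    public long getCounter() {
        return counter.get();
    }

    /** Hypothetical helper: increments one shared counter from several threads. */
    static long runThreads(int threads, int perThread) {
        BackoffCasCounter c = new BackoffCasCounter();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    c.increment();
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return c.getCounter();
    }

    public static void main(String[] args) {
        System.out.println(runThreads(4, 100_000));
    }
}
```

The backoff changes only how long a thread waits after a failed CAS, so the final count stays exact. Whether it actually helps throughput depends on the thread count and the hardware, and should be measured.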

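Exercise for the OP and interested readers: explain why unsafeGACount and atomicCount perform so much faster than the other tests. You have the tools now.
P.S. Also, on a machine that has C (C
P.P.S. Time check: 10 minutes to do the profiling and the additional experiments, 20 minutes to write it up. How much time did you waste replicating the results by hand? ;)

Maybe I am misreading the table, but 20 operations per microsecond looks faster than 12 operations per microsecond. Doesn't that make the synchronized variants faster than the unsynchronized one? Also, the 5.9 us/op error on unsafeCount looks rather large compared with the others, which could lead to wrong conclusions.

In theory, the sync variants should be the slowest, as my custom test (the correctness check) shows. But in the JMH version UnsafeCAS is the slowest. It looks a bit suspicious, and I can only guess at the reason.

Hmm, what would be a good alternative to consumeCPU(1000) without the JMH API?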