Ways to improve performance consistency in Java

In the following example, one thread sends "messages" via a ByteBuffer which a consumer thread reads. The best performance is very good, but it is not consistent.

import java.io.IOException;
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

import sun.misc.Unsafe;

public class Main {
    public static void main(String... args) throws IOException {
        for (int i = 0; i < 10; i++)
            doTest();
    }

    public static void doTest() {
        final ByteBuffer writeBuffer = ByteBuffer.allocateDirect(64 * 1024);
        final ByteBuffer readBuffer = writeBuffer.slice();
        final AtomicInteger readCount = new PaddedAtomicInteger();
        final AtomicInteger writeCount = new PaddedAtomicInteger();

        for(int i=0;i<3;i++)
            performTiming(writeBuffer, readBuffer, readCount, writeCount);
        System.out.println();
    }

    private static void performTiming(ByteBuffer writeBuffer, final ByteBuffer readBuffer, final AtomicInteger readCount, final AtomicInteger writeCount) {
        writeBuffer.clear();
        readBuffer.clear();
        readCount.set(0);
        writeCount.set(0);

        Thread t = new Thread(new Runnable() {
            @Override
            public void run() {
                byte[] bytes = new byte[128];
                while (!Thread.interrupted()) {
                    int rc = readCount.get(), toRead;
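                    // spin until the writer has published at least one new message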
                    while ((toRead = writeCount.get() - rc) <= 0) ;
                    for (int i = 0; i < toRead; i++) {
                        byte len = readBuffer.get();
                        if (len == -1) {
                            // rewind.
                            readBuffer.clear();
//                            rc++;
                        } else {
                            int num = readBuffer.getInt();
                            if (num != rc)
                                throw new AssertionError("Expected " + rc + " but got " + num) ;
                            rc++;
                            readBuffer.get(bytes, 0, len - 4);
                        }
                    }
                    readCount.lazySet(rc);
                }
            }
        });
        t.setDaemon(true);
        t.start();
        Thread.yield();
        long start = System.nanoTime();
        int runs = 30 * 1000 * 1000;
        int len = 32;
        byte[] bytes = new byte[len - 4];
        int wc = writeCount.get();
        for (int i = 0; i < runs; i++) {
            if (writeBuffer.remaining() < len + 1) {
                // reader has to catch up.
                while (wc - readCount.get() > 0) ;
                // rewind.
                writeBuffer.put((byte) -1);
                writeBuffer.clear();
            }
            writeBuffer.put((byte) len);
            writeBuffer.putInt(i);
            writeBuffer.put(bytes);
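            // lazySet: ordered store without a full volatile fence; the reader only needs eventual visibility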
            writeCount.lazySet(++wc);
        }
        // reader has to catch up.
        while (wc - readCount.get() > 0) ;
        t.interrupt();
        t.stop();
        long time = System.nanoTime() - start;
        System.out.printf("Message rate was %.1f M/s offsets %d %d %d%n", runs * 1e3 / time
                , addressOf(readBuffer) - addressOf(writeBuffer)
                , addressOf(readCount) - addressOf(writeBuffer)
                , addressOf(writeCount) - addressOf(writeBuffer)
        );
    }

    // assumes -XX:+UseCompressedOops.
    public static long addressOf(Object... o) {
        long offset = UNSAFE.arrayBaseOffset(o.getClass());
        return UNSAFE.getInt(o, offset) * 8L;
    }

    public static final Unsafe UNSAFE = getUnsafe();
    public static Unsafe getUnsafe() {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (Exception e) {
            throw new AssertionError(e);
        }
    }

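    // Padding via a subclass: these long fields are laid out after AtomicInteger's own
    // value field, so the counter is less likely to share a cache line with a neighbour.
    // sum() exists only so the padding fields are not treated as dead and optimised away.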
    private static class PaddedAtomicInteger extends AtomicInteger {
        public long p2, p3, p4, p5, p6, p7;

        public long sum() {
//            return 0;
            return p2 + p3 + p4 + p5 + p6 + p7;
        }
    }
}
Each set of buffers and counters is tested three times, and each set appears to give similar results, so I believe there is something about the way these buffers are laid out in memory that I am not seeing.

Is there anything which might give the higher performance more often? It looks like a cache collision, but I can't see where that could be happening.

BTW: M/s means millions of messages per second. That is more than just about anyone is likely to need, but it would be good to understand how to make it consistently fast.


EDIT: Using synchronized with wait and notify makes the result much more consistent, but not faster.

Message rate was 6.9 M/s
Message rate was 7.8 M/s
Message rate was 7.9 M/s
Message rate was 6.7 M/s
Message rate was 7.5 M/s
Message rate was 7.7 M/s
Message rate was 7.3 M/s
Message rate was 7.9 M/s
Message rate was 6.4 M/s
Message rate was 7.8 M/s
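
For reference, here is a minimal, self-contained sketch of the kind of synchronized/wait/notify handoff that edit refers to. It is my own illustration, not the code that produced the numbers above; the class name WaitNotifyHandoff is hypothetical.

    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch: the reader blocks on a shared monitor instead of busy-spinning, and the
    // writer notifies after publishing. Each notify involves the scheduler, which may
    // explain why this variant was more consistent but not faster.
    class WaitNotifyHandoff {
        private final Object lock = new Object();
        private final AtomicInteger writeCount = new AtomicInteger();
        private final AtomicInteger readCount = new AtomicInteger();

        // writer: publish one message, then wake the reader if it is waiting
        void published() {
            writeCount.incrementAndGet();
            synchronized (lock) {
                lock.notifyAll();
            }
        }

        // reader: wait (instead of spinning) until at least one unread message exists
        int awaitWork() throws InterruptedException {
            synchronized (lock) {
                int toRead;
                while ((toRead = writeCount.get() - readCount.get()) <= 0)
                    lock.wait();
                return toRead;
            }
        }
    }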

EDIT: Using taskset, I can make the performance consistent if I lock the two threads to the same core.

Message rate was 35.1 M/s offsets 136 200 216
Message rate was 34.0 M/s offsets 136 200 216
Message rate was 35.4 M/s offsets 136 200 216

Message rate was 35.6 M/s offsets 136 200 216
Message rate was 37.0 M/s offsets 136 200 216
Message rate was 37.2 M/s offsets 136 200 216

Message rate was 37.1 M/s offsets 136 200 216
Message rate was 35.0 M/s offsets 136 200 216
Message rate was 37.1 M/s offsets 136 200 216

If I use any two logical threads on different cores, I get the inconsistent behaviour

Message rate was 60.2 M/s offsets 136 200 216
Message rate was 68.7 M/s offsets 136 200 216
Message rate was 55.3 M/s offsets 136 200 216

Message rate was 39.2 M/s offsets 136 200 216
Message rate was 39.1 M/s offsets 136 200 216
Message rate was 37.5 M/s offsets 136 200 216

Message rate was 75.3 M/s offsets 136 200 216
Message rate was 73.8 M/s offsets 136 200 216
Message rate was 66.8 M/s offsets 136 200 216

EDIT: Triggering a GC appears to change the behaviour. These show repeated tests on the same buffers + counters, with a manually triggered GC half way through.

faster after GC

Message rate was 27.4 M/s offsets 136 200 216
Message rate was 27.8 M/s offsets 136 200 216
Message rate was 29.6 M/s offsets 136 200 216
Message rate was 27.7 M/s offsets 136 200 216
Message rate was 29.6 M/s offsets 136 200 216
[GC 14312K->1518K(244544K), 0.0003050 secs]
[Full GC 1518K->1328K(244544K), 0.0068270 secs]
Message rate was 34.7 M/s offsets 64 128 144
Message rate was 54.5 M/s offsets 64 128 144
Message rate was 54.1 M/s offsets 64 128 144
Message rate was 51.9 M/s offsets 64 128 144
Message rate was 57.2 M/s offsets 64 128 144

and slower

Message rate was 61.1 M/s offsets 136 200 216
Message rate was 61.8 M/s offsets 136 200 216
Message rate was 60.5 M/s offsets 136 200 216
Message rate was 61.1 M/s offsets 136 200 216
[GC 35740K->1440K(244544K), 0.0018170 secs]
[Full GC 1440K->1302K(244544K), 0.0071290 secs]
Message rate was 53.9 M/s offsets 64 128 144
Message rate was 54.3 M/s offsets 64 128 144
Message rate was 50.8 M/s offsets 64 128 144
Message rate was 56.6 M/s offsets 64 128 144
Message rate was 56.0 M/s offsets 64 128 144
Message rate was 53.6 M/s offsets 64 128 144

EDIT: Using @BegemoT's library to print the core id used, I get the following on a 3.8 GHz i7 (home PC).

Note: the offsets are out by a factor of 8. Because the heap size is small, the JVM does not multiply the reference by 8 the way it does with a larger heap (but one under 32 GB), so the printed offsets are eight times the real byte offsets.

You can see that the same logical threads are being used, yet the performance varies between runs, though not within a run (within a run the same objects are used).


I have found what the problem is. It is a memory layout issue, but I could see a simple way to resolve it. ByteBuffer cannot be extended, so you can't add padding to it; instead, I create an object which I discard:

    final ByteBuffer writeBuffer = ByteBuffer.allocateDirect(64 * 1024);
    final ByteBuffer readBuffer = writeBuffer.slice();
    new PaddedAtomicInteger(); // discarded object, allocated purely as padding before the counters
    final AtomicInteger readCount = new PaddedAtomicInteger();
    final AtomicInteger writeCount = new PaddedAtomicInteger();
Without this extra padding (the object which is never used), the results look like this on a 3.8 GHz i7:

Message rate was 38.5 M/s offsets 3392 3904 4416
Message rate was 54.7 M/s offsets 3392 3904 4416
Message rate was 59.4 M/s offsets 3392 3904 4416

Message rate was 54.3 M/s offsets 1088 1600 2112
Message rate was 56.3 M/s offsets 1088 1600 2112
Message rate was 56.6 M/s offsets 1088 1600 2112

Message rate was 28.0 M/s offsets 1088 1600 2112
Message rate was 28.1 M/s offsets 1088 1600 2112
Message rate was 28.0 M/s offsets 1088 1600 2112

Message rate was 17.4 M/s offsets 1088 1600 2112
Message rate was 17.4 M/s offsets 1088 1600 2112
Message rate was 17.4 M/s offsets 1088 1600 2112

Message rate was 54.5 M/s offsets 1088 1600 2112
Message rate was 54.2 M/s offsets 1088 1600 2112
Message rate was 55.1 M/s offsets 1088 1600 2112

Message rate was 25.5 M/s offsets 1088 1600 2112
Message rate was 25.6 M/s offsets 1088 1600 2112
Message rate was 25.6 M/s offsets 1088 1600 2112

Message rate was 56.6 M/s offsets 1088 1600 2112
Message rate was 54.7 M/s offsets 1088 1600 2112
Message rate was 54.4 M/s offsets 1088 1600 2112

Message rate was 57.0 M/s offsets 1088 1600 2112
Message rate was 55.9 M/s offsets 1088 1600 2112
Message rate was 56.3 M/s offsets 1088 1600 2112

Message rate was 51.4 M/s offsets 1088 1600 2112
Message rate was 56.6 M/s offsets 1088 1600 2112
Message rate was 56.1 M/s offsets 1088 1600 2112

Message rate was 46.4 M/s offsets 1088 1600 2112
Message rate was 46.4 M/s offsets 1088 1600 2112
Message rate was 47.4 M/s offsets 1088 1600 2112
and with the discarded padding object:

Message rate was 54.3 M/s offsets 3392 4416 4928
Message rate was 53.1 M/s offsets 3392 4416 4928
Message rate was 59.2 M/s offsets 3392 4416 4928

Message rate was 58.8 M/s offsets 1088 2112 2624
Message rate was 58.9 M/s offsets 1088 2112 2624
Message rate was 59.3 M/s offsets 1088 2112 2624

Message rate was 59.4 M/s offsets 1088 2112 2624
Message rate was 59.0 M/s offsets 1088 2112 2624
Message rate was 59.8 M/s offsets 1088 2112 2624

Message rate was 59.8 M/s offsets 1088 2112 2624
Message rate was 59.8 M/s offsets 1088 2112 2624
Message rate was 59.2 M/s offsets 1088 2112 2624

Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624

Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.9 M/s offsets 1088 2112 2624
Message rate was 60.6 M/s offsets 1088 2112 2624

Message rate was 59.6 M/s offsets 1088 2112 2624
Message rate was 60.3 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624

Message rate was 60.9 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624
Message rate was 60.5 M/s offsets 1088 2112 2624

Message rate was 60.7 M/s offsets 1088 2112 2624
Message rate was 61.6 M/s offsets 1088 2112 2624
Message rate was 60.8 M/s offsets 1088 2112 2624

Message rate was 60.3 M/s offsets 1088 2112 2624
Message rate was 60.7 M/s offsets 1088 2112 2624
Message rate was 58.3 M/s offsets 1088 2112 2624

Unfortunately, after a GC there is always the risk that the objects will not be laid out optimally. The only way to resolve this may be to add padding to the original class. :(
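
For what that could look like, here is a sketch in the spirit of PaddedAtomicInteger above: a hypothetical holder class whose hot fields are surrounded by otherwise unused long fields. The JVM is free to reorder fields, so this is a best-effort layout hint, not a guarantee.

    // Hypothetical padded holder: the counters are surrounded by unused longs so
    // they are less likely to share a cache line with neighbouring objects.
    class PaddedCounters {
        long p1, p2, p3, p4, p5, p6, p7;      // leading padding
        volatile int readCount;
        volatile int writeCount;
        long q1, q2, q3, q4, q5, q6, q7;      // trailing padding

        // referencing the padding discourages the JIT from treating it as dead
        long padSum() {
            return p1 + p2 + p3 + p4 + p5 + p6 + p7 + q1 + q2 + q3 + q4 + q5 + q6 + q7;
        }
    }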

As a general approach to performance analysis:

  • Try jconsole. Start your application and, while it is running, type jconsole in a separate terminal window. This brings up the Java console GUI, which lets you connect to the running JVM and watch performance metrics, memory usage, thread counts and states, and so on.
  • Basically, you have to work out the correlation between the speed variations and what the JVM is doing. It also helps to open the task manager and check whether the system is simply busy with something else (paging to disk because of low memory, a heavy background task, etc.) and to put it side by side with the jconsole window. A small programmatic alternative is sketched after this list.
  • Another option is to start the JVM with the -Xprof option, which outputs the relative time spent in each method on a per-thread basis, e.g. java -Xprof [your class file].
  • Finally, there is also a commercial profiler, if that matters to you.
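
Since the suggestion above is essentially "watch GC activity while the tests run", here is a minimal sketch of one way to do that programmatically instead of via jconsole. The class name GcSnapshot and the placement of the timing pass are my own assumptions, not part of the original answer.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Hypothetical helper: snapshot GC counts/times before and after a timed run
    // and print the difference, so slow runs can be correlated with GC activity.
    public class GcSnapshot {
        public static void main(String[] args) {
            long[] before = snapshot();
            // ... run one timing pass here ...
            long[] after = snapshot();
            System.out.printf("GCs during run: %d, GC time: %d ms%n",
                    after[0] - before[0], after[1] - before[1]);
        }

        static long[] snapshot() {
            long count = 0, millis = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                count += gc.getCollectionCount();   // may be -1 if undefined; good enough for a sketch
                millis += gc.getCollectionTime();
            }
            return new long[] { count, millis };
        }
    }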

You are busy-waiting. That is always a bad idea in user code.

Reader:

    while ((toRead = writeCount.get() - rc) <= 0) ;

Writer:

    while (wc - readCount.get() > 0) ;
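
If pure spinning is the concern, one common middle ground is to back off after a bounded number of empty polls. This is a sketch of that idea only, using the question's counter names; it is not what the question's code does.

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.locks.LockSupport;

    // Sketch: spin briefly for low latency, then park for about a microsecond at a
    // time so an idle reader does not burn a whole core.
    final class BackoffSpin {
        static int awaitMessages(AtomicInteger writeCount, int rc) {
            int toRead, spins = 0;
            while ((toRead = writeCount.get() - rc) <= 0) {
                if (++spins > 1000)
                    LockSupport.parkNanos(1000L);
            }
            return toRead;
        }
    }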
      
Regarding the edit above ("Triggering a GC appears to change the behaviour"):

A GC means reaching a safepoint, which means all threads have stopped executing bytecode and the GC threads have work to do. This can have various side effects. For example, in the absence of any explicit CPU affinity, your execution may resume on a different core, or cache lines may have been flushed. Can you track which cores your threads are running on?

Which CPUs are these? Have you done anything about power management to stop them dropping into lower P- and/or C-states? Perhaps one thread is being scheduled onto a core that is in a different P-state and therefore shows a different performance profile.

EDIT:

I tried running your test on a workstation running x64 Linux with two slightly older quad-core Xeons (E5504). It is generally consistent within a run (~17-18 M/s), with occasional much slower runs, which seems to correspond to thread migration. I have not profiled this rigorously, so your problem may be CPU-architecture dependent. You mention you are running an i7 at 4.6 GHz; is that a typo? I thought the top-end i7 topped out lower than that (earlier versions were around 3.3 GHz, up to 3.6 GHz with turbo). Either way, are you sure you are not seeing an artifact of turbo mode kicking in and then dropping out? You could try repeating the test with turbo mode disabled to make sure.

A couple of other points:

  • The padding values are all 0; are you sure there is no special handling being given to zeroed values? You could consider using the LogCompilation option (-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation) to understand how the JIT treats that method.
  • There is a free 30-day evaluation available; if this is a cache-line problem, you could use it to pin down the issue on your host.

I am no expert in the area of processor caches, but I suspect your issue is essentially a cache issue or some other memory layout problem. Repeatedly allocating the buffers and counters without cleaning up the old objects may periodically give you a very bad cache layout, which could explain the inconsistent performance.

Using your code and making a couple of mods, I was able to make the performance consistent (my test machine is an Intel Core2 Quad CPU Q6600 @ 2.4 GHz with Win7 x64, so not exactly the same, but hopefully close enough for the results to be relevant). I did this in two different ways, both of which have roughly the same effect.

First, move the creation of the buffers and counters up out of the doTest method so that they are created only once and then reused across the repeated tests. The snippet below shows the test loop with an explicit System.gc() added after each group of timings:

      for ( int i = 0; i < 3; i++ )
          performTiming ( writeBuffer, readBuffer, readCount, writeCount );
      System.out.println ();
      System.gc ();
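
As a sketch of that first modification (my reconstruction, not the answerer's exact code), the buffers and counters become fields of Main that are created once, and doTest() simply reuses them:

    import java.nio.ByteBuffer;
    import java.util.concurrent.atomic.AtomicInteger;

    public class Main {
        // created once, reused by every test run
        static final ByteBuffer writeBuffer = ByteBuffer.allocateDirect(64 * 1024);
        static final ByteBuffer readBuffer  = writeBuffer.slice();
        static final AtomicInteger readCount  = new PaddedAtomicInteger();
        static final AtomicInteger writeCount = new PaddedAtomicInteger();

        public static void doTest() {
            for (int i = 0; i < 3; i++)
                performTiming(writeBuffer, readBuffer, readCount, writeCount);
            System.out.println();
            System.gc();   // as in the loop shown above
        }

        // main(), performTiming(), PaddedAtomicInteger, etc. are unchanged from the question
    }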