Strange performance increase in a simple C# benchmark

Yesterday I found a benchmark that tested several languages (C++, C#, Java, JavaScript) on a method that adds two point structs (double tuples).

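It turned out that the C++ version takes about 1000 ms to execute (1e9 iterations), while C# cannot get below ~3000 ms on the same machine (and performs even worse in x64). To test it myself, I took the C# code (simplified slightly to call only the method where parameters are passed by value) and ran it on an i7-3610QM machine (3.1 GHz single-core boost), 8 GB RAM, Win8.1, using .NET 4.5.2, Release build, 32-bit (x86 WoW64, since my OS is 64-bit). This is the simplified version: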

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

public static class CSharpTest
{
    private const int ITERATIONS = 1000000000;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Point AddByVal(Point a, Point b)
    {
        return new Point(a.X + b.Y, a.Y + b.X);
    }

    public static void Main()
    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms", 
            a.X, a.Y, sw.ElapsedMilliseconds);
    }
}
Running it produces results similar to those in the article:

Result: x=1000000001 y=1000000001, Time elapsed: 3159 ms
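
The listing above uses a Point type whose definition is not shown in this post. As a minimal sketch, assuming it is a plain immutable struct wrapping two double values (the field names _x and _y are hypothetical), it could look like this:

```csharp
// Hypothetical minimal Point: an immutable value type holding two doubles.
// The real definition is not shown in the post; this is only an assumption
// that is sufficient to compile the benchmark listing.
public struct Point
{
    private readonly double _x;
    private readonly double _y;

    public Point(double x, double y)
    {
        _x = x;
        _y = y;
    }

    // Read-only accessors used by AddByVal and by the Console output.
    public double X { get { return _x; } }
    public double Y { get { return _y; } }
}
```

With two doubles the struct occupies 16 bytes, so each by-value call to AddByVal copies 32 bytes of arguments unless the JIT inlines the call.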
First strange observation

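Since the method should be inlined, I wondered how the code would perform if I removed the structs altogether and simply inlined the whole thing by hand: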

public static class CSharpTest
{
    private const int ITERATIONS = 1000000000;

    public static void Main()
    {
        // not using structs at all here
        double ax = 1, ay = 1, bx = 1, by = 1;

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
        {
            ax = ax + by;
            ay = ay + bx;
        }
        sw.Stop();

        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms", 
            ax, ay, sw.ElapsedMilliseconds);
    }
}
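This also means that the benchmark does not seem to measure any struct performance; it really only seems to measure basic double arithmetic (after everything else has been optimized away).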

The weird stuff

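Now comes the weird part. If I merely add another Stopwatch outside the loop (yes, I narrowed it down to this crazy step after several retries), the code runs three times faster: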

public static void Main()
{
    var outerSw = Stopwatch.StartNew();     // <-- added

    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }

    outerSw.Stop();                         // <-- added
}

Result: x=1000000001 y=1000000001, Time elapsed: 961 ms
Output:

Test1: x=1000000001 y=1000000001, Time elapsed: 3242 ms
Test2: x=1000000001 y=1000000001, Time elapsed: 974 ms

Test1: x=1000000001 y=1000000001, Time elapsed: 3251 ms
Test2: x=1000000001 y=1000000001, Time elapsed: 972 ms
You need to run it as a 32-bit Release build on .NET 4.x (the code contains a few checks to ensure this).
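
The checks themselves are not reproduced in this post. A rough sketch of what such guards might look like, using only IntPtr.Size and Environment.Version (the class and method names here are made up for illustration):

```csharp
using System;

// Hypothetical helper; the checks in the original test program are not shown.
public static class EnvironmentCheck
{
    // A 32-bit process has 4-byte pointers.
    public static bool Is32BitProcess()
    {
        return IntPtr.Size == 4;
    }

    // On .NET Framework 4.x the CLR reports major version 4.
    public static bool IsClr4()
    {
        return Environment.Version.Major == 4;
    }

    // Human-readable summary to print before running the benchmark.
    public static string Describe()
    {
        return string.Format("32-bit: {0}, CLR: {1}",
            Is32BitProcess(), Environment.Version);
    }
}
```

Printing EnvironmentCheck.Describe() at the start of Main makes it obvious when a run was accidentally done as a 64-bit process, in which case the slow variant is not expected to reproduce.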

(Update 4)

Following @usr's comments on @Hans' answer, I checked the optimized disassembly for both methods, and they are rather different:

This seems to show that the difference may be due to the compiler acting funny in the first case, rather than to double field alignment.

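Also, if I add two variables (a total offset of 8 bytes), I still get the same speed boost, so it no longer seems to be related to the field alignment mentioned by Hans Passant: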

// this is still fast?
private static void Test3()
{
    var magical_speed_booster_1 = "whatever";
    var magical_speed_booster_2 = "whatever";

    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Test2: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }

    GC.KeepAlive(magical_speed_booster_1);
    GC.KeepAlive(magical_speed_booster_2);
}

There seems to be some bug in the jitter, because the behavior is even more complicated. Consider the following code:

public static void Main()
{
    Test1(true);
    Test1(false);
    Console.ReadLine();
}

public static void Test1(bool warmup)
{
    Point a = new Point(1, 1), b = new Point(1, 1);

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < ITERATIONS; i++)
        a = AddByVal(a, b);
    sw.Stop();

    if (!warmup)
    {
        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }
}
Note that I have removed the a.X and a.Y references from the Console output.


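I have no idea what is going on, but it smells pretty buggy to me, and it is not related to whether there is an outer Stopwatch; the issue seems to be more general.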

Narrowed it down a bit (it only seems to affect the 32-bit CLR 4.0 runtime).

Notice that the placement of var f = Stopwatch.Frequency; makes all the difference.

Slow (2700 ms):

static void Test1()
{
  Point a = new Point(1, 1), b = new Point(1, 1);
  var f = Stopwatch.Frequency;

  var sw = Stopwatch.StartNew();
  for (int i = 0; i < ITERATIONS; i++)
    a = AddByVal(a, b);
  sw.Stop();

  Console.WriteLine("Test1: x={0} y={1}, Time elapsed: {2} ms",
      a.X, a.Y, sw.ElapsedMilliseconds);
}
Fast (800 ms):

static void Test1()
{
  var f = Stopwatch.Frequency;
  Point a = new Point(1, 1), b = new Point(1, 1);

  var sw = Stopwatch.StartNew();
  for (int i = 0; i < ITERATIONS; i++)
    a = AddByVal(a, b);
  sw.Stop();

  Console.WriteLine("Test1: x={0} y={1}, Time elapsed: {2} ms",
      a.X, a.Y, sw.ElapsedMilliseconds);
}
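There is a very simple way to always get the "fast" version of your program: Project > Properties > Build tab, untick the "Prefer 32-bit" option, and make sure the Platform target selection is AnyCPU.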

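You really don't prefer 32-bit; unfortunately, the option is always turned on by default for C# projects. Historically, the Visual Studio toolset worked much better with 32-bit processes, an old problem that Microsoft has been chipping away at. It is time to get that option removed; VS2015 in particular addressed the last few real roadblocks to 64-bit code, with a brand-new x64 jitter and universal support for Edit+Continue.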

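Enough chatter: what you discovered is the importance of alignment for variables. The processor cares about it a great deal. If a variable is misaligned in memory, the processor has to do extra work to shuffle the bytes into the right order. There are two distinct misalignment problems. In the first, the bytes are still inside a single L1 cache line, and it costs an extra cycle to shift them into the right position. The second, extra bad one is the one you found: part of the bytes are in one cache line and part in another. That requires two separate memory accesses and gluing the results together. Three times as slow.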

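The double and long types are the troublemakers in a 32-bit process. They are 64 bits in size, yet the CLR can only guarantee 32-bit alignment, so they can end up misaligned by 4 bytes. This is not a problem in a 64-bit process, where all variables are guaranteed to be aligned to 8. It is also the underlying reason why the C# language cannot promise that they are atomic, and why arrays of double are allocated in the Large Object Heap when they have more than 1000 elements: the LOH provides an alignment guarantee of 8. It also explains why adding a local variable solved the problem: an object reference is 4 bytes, so it moved the double variable by 4, which happened to get it aligned. By accident.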
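
The two misalignment cases can be illustrated with plain address arithmetic. The sketch below is not a measurement; it assumes a 64-byte cache line (typical for the Intel CPUs mentioned in this thread, but an assumption nonetheless) and uses made-up addresses and a hypothetical AlignmentMath helper:

```csharp
// Hypothetical helper illustrating the alignment reasoning with arithmetic only.
public static class AlignmentMath
{
    // Assumption: 64-byte cache lines.
    public const int CacheLineSize = 64;

    // True when the address is a multiple of the requested alignment.
    public static bool IsAligned(long address, int alignment)
    {
        return address % alignment == 0;
    }

    // True when a value of the given size starts in one cache line
    // and ends in the next one.
    public static bool StraddlesCacheLine(long address, int size)
    {
        long firstLine = address / CacheLineSize;
        long lastLine = (address + size - 1) / CacheLineSize;
        return firstLine != lastLine;
    }
}
```

For example, an 8-byte double at address 60 is 4-aligned but not 8-aligned and covers bytes 60 to 67, so StraddlesCacheLine(60, 8) is true. Shifting it by 4 bytes to address 64, which is the effect attributed above to adding one object reference, makes it 8-aligned and keeps it within a single line. A double at address 52 is also misaligned by 4 but stays inside one line, which corresponds to the milder first case.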
[BenchmarkTask(platform: BenchmarkPlatform.X86)]
public class Jit_RegistersVsStack
{
    private const int IterationCount = 100001;

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithoutStopwatch()
    {
        double a = 1, b = 1;
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1  
            // faddp       st(1),st
            a = a + b;
        }
        return string.Format("{0}", a);
    }

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithStopwatch()
    {
        double a = 1, b = 1;
        var sw = new Stopwatch();
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1  
            // fadd        qword ptr [ebp-14h]
            // fstp        qword ptr [ebp-14h]
            a = a + b;
        }
        return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
    }

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithTwoStopwatches()
    {
        var outerSw = new Stopwatch();
        double a = 1, b = 1;
        var sw = new Stopwatch();
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1  
            // faddp       st(1),st
            a = a + b;
        }
        return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
    }
}
BenchmarkDotNet=v0.7.7.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4702MQ CPU @ 2.20GHz, ProcessorCount=8
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit  [RyuJIT]
Type=Jit_RegistersVsStack  Mode=Throughput  Platform=X86  Jit=HostJit  .NET=HostFramework

             Method |   AvrTime |    StdDev |       op/s |
------------------- |---------- |---------- |----------- |
   WithoutStopwatch | 1.0333 ns | 0.0028 ns | 967,773.78 |
      WithStopwatch | 3.4453 ns | 0.0492 ns | 290,247.33 |
 WithTwoStopwatches | 1.0435 ns | 0.0341 ns | 958,302.81 |