Strange performance increase in a simple C# benchmark
Yesterday I found an article that benchmarked several languages (C++, C#, Java, JavaScript) on a method that adds two point structs (double tuples).
As it turned out, the C++ version takes about 1000 ms to execute (1e9 iterations), while C# cannot get under ~3000 ms on the same machine (and performs even worse in x64).
To test it myself, I took the C# code (slightly simplified to call only the method whose parameters are passed by value) and ran it on an i7-3610QM machine (3.1 GHz single-core boost), 8 GB RAM, Win8.1, using .NET 4.5.2, as a 32-bit release build (x86 WoW64, since my OS is 64-bit). This is the simplified version:
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

// Point is the article's struct holding two double fields (X, Y);
// its definition is not shown here.
public static class CSharpTest
{
    private const int ITERATIONS = 1000000000;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static Point AddByVal(Point a, Point b)
    {
        return new Point(a.X + b.Y, a.Y + b.X);
    }

    public static void Main()
    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }
}
Running it produces results similar to those in the article:
Result: x=1000000001 y=1000000001, Time elapsed: 3159 ms
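A single timed run like the one above can be noisy, so it may help to repeat the measurement before comparing numbers. The following is only a minimal sketch under stated assumptions: `RepeatedTiming`, `BestOf` and `sink` are hypothetical names that do not appear in the question, and the loop in `Main` is a stand-in workload, not the original benchmark.

```csharp
using System;
using System.Diagnostics;

public static class RepeatedTiming
{
    // Hypothetical sink field so the loop result is observably used.
    private static double sink;

    // Hypothetical helper: calls the action once untimed (warm-up),
    // then times it `runs` times and returns the fastest run in ms.
    public static long BestOf(int runs, Action action)
    {
        if (runs < 1) throw new ArgumentOutOfRangeException("runs");

        action(); // warm-up so JIT compilation is not included in the timing

        long best = long.MaxValue;
        for (int i = 0; i < runs; i++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            best = Math.Min(best, sw.ElapsedMilliseconds);
        }
        return best;
    }

    public static void Main()
    {
        long ms = BestOf(5, () =>
        {
            // Stand-in workload: plain double additions.
            double x = 1;
            for (int i = 0; i < 100000000; i++)
                x = x + 1;
            sink = x;
        });
        Console.WriteLine("Best of 5: {0} ms", ms);
    }
}
```

Note that wrapping the loop in a lambda changes which method the JIT compiles and how its locals are laid out, so a harness like this may not reproduce the exact effect discussed in this question.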
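First strange observation
Since the method should be inlined, I wondered how the code would perform if I removed the structs altogether and simply inlined the whole thing by hand: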
public static class CSharpTest
{
    private const int ITERATIONS = 1000000000;

    public static void Main()
    {
        // not using structs at all here
        double ax = 1, ay = 1, bx = 1, by = 1;

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
        {
            ax = ax + by;
            ay = ay + bx;
        }
        sw.Stop();

        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
            ax, ay, sw.ElapsedMilliseconds);
    }
}
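This also means that the benchmark does not seem to measure any struct performance; it appears to measure only basic double arithmetic (after everything else has been optimized away).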
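The weird stuff
Now comes the weird part. If I merely add another stopwatch outside the loop (yes, I narrowed it down to this crazy step after several retries), the code runs three times faster: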
public static void Main()
{
    var outerSw = Stopwatch.StartNew(); // <-- added
    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }
    outerSw.Stop(); // <-- added
}
Result: x=1000000001 y=1000000001, Time elapsed: 961 ms
Output:
Test1: x=1000000001 y=1000000001, Time elapsed: 3242 ms
Test2: x=1000000001 y=1000000001, Time elapsed: 974 ms
Test1: x=1000000001 y=1000000001, Time elapsed: 3251 ms
Test2: x=1000000001 y=1000000001, Time elapsed: 972 ms
You need to run it as a 32-bit release build on .NET 4.x (there are a few checks in the code to ensure this).
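The checks mentioned above are not reproduced here. As a rough sketch of what such checks could look like, the snippet below reports the process bitness and CLR version and warns about debug builds or an attached debugger. `EnvironmentCheck` and `Describe` are hypothetical names, and this is an assumption about a reasonable set of checks, not the code actually used in the question.

```csharp
using System;
using System.Diagnostics;

public static class EnvironmentCheck
{
    // Hypothetical helper: summarizes the process the benchmark runs in.
    public static string Describe()
    {
        string bitness = Environment.Is64BitProcess ? "64-bit" : "32-bit";
        return string.Format("{0} process, pointer size {1} bytes, CLR {2}",
            bitness, IntPtr.Size, Environment.Version);
    }

    public static void Main()
    {
        Console.WriteLine(Describe());

        if (Environment.Is64BitProcess)
            Console.WriteLine("Warning: not a 32-bit process; the effect described here may not appear.");
#if DEBUG
        Console.WriteLine("Warning: DEBUG build; timings will not be representative.");
#endif
        if (Debugger.IsAttached)
            Console.WriteLine("Warning: debugger attached; JIT optimizations may be suppressed.");
    }
}
```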
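(Update 4)
Following @usr's comments on @Hans' answer, I checked the optimized disassembly of both methods, and they are rather different:
This seems to indicate that the difference may be due to the compiler acting funny in the first case, rather than to double field alignment.
Also, if I add two variables (a total offset of 8 bytes), I still get the same speed boost, so it no longer seems related to the field alignment mentioned by Hans Passant: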
// this is still fast?
private static void Test3()
{
    var magical_speed_booster_1 = "whatever";
    var magical_speed_booster_2 = "whatever";
    {
        Point a = new Point(1, 1), b = new Point(1, 1);

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < ITERATIONS; i++)
            a = AddByVal(a, b);
        sw.Stop();

        Console.WriteLine("Test2: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }
    GC.KeepAlive(magical_speed_booster_1);
    GC.KeepAlive(magical_speed_booster_2);
}
There seems to be some bug in the jitter, because the behavior is even weirder. Consider the following code:
public static void Main()
{
    Test1(true);
    Test1(false);
    Console.ReadLine();
}

public static void Test1(bool warmup)
{
    Point a = new Point(1, 1), b = new Point(1, 1);

    Stopwatch sw = Stopwatch.StartNew();
    for (int i = 0; i < ITERATIONS; i++)
        a = AddByVal(a, b);
    sw.Stop();

    if (!warmup)
    {
        Console.WriteLine("Result: x={0} y={1}, Time elapsed: {2} ms",
            a.X, a.Y, sw.ElapsedMilliseconds);
    }
}
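Note that I have removed the a.X and a.Y references from the Console output.
I have no idea what is going on, but this smells pretty buggy to me, and it is not related to whether there is an outer Stopwatch; the issue seems to be more general.
Narrowed it down a bit (it seems to affect only the 32-bit CLR 4.0 runtime).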
Note that the placement of var f = Stopwatch.Frequency; makes all the difference.
Slow (2700 ms):
static void Test1()
{
    Point a = new Point(1, 1), b = new Point(1, 1);
    var f = Stopwatch.Frequency;

    var sw = Stopwatch.StartNew();
    for (int i = 0; i < ITERATIONS; i++)
        a = AddByVal(a, b);
    sw.Stop();

    Console.WriteLine("Test1: x={0} y={1}, Time elapsed: {2} ms",
        a.X, a.Y, sw.ElapsedMilliseconds);
}
Fast (800 ms):
static void Test1()
{
    var f = Stopwatch.Frequency;
    Point a = new Point(1, 1), b = new Point(1, 1);

    var sw = Stopwatch.StartNew();
    for (int i = 0; i < ITERATIONS; i++)
        a = AddByVal(a, b);
    sw.Stop();

    Console.WriteLine("Test1: x={0} y={1}, Time elapsed: {2} ms",
        a.X, a.Y, sw.ElapsedMilliseconds);
}
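There is a very simple way to always get the "fast" version of your program: Project > Properties > Build tab, untick the "Prefer 32-bit" option, and make sure the Platform target is AnyCPU.
You really don't prefer 32-bit; unfortunately, the option is always turned on by default for C# projects. Historically, the Visual Studio toolset worked much better with 32-bit processes, an old problem that Microsoft has been chipping away at. It is time to get that option removed; VS2015 in particular addressed the last few real roadblocks to 64-bit code, with a brand-new x64 jitter and universal support for Edit+Continue.
Enough chatter. What you discovered is the importance of alignment for variables. The processor cares about it a great deal. If a variable is misaligned in memory, the processor has to do extra work to shuffle the bytes into the right order. There are two distinct misalignment problems. In the first, the bytes are still inside a single L1 cache line, which costs an extra cycle to shift them into the right position. The second, the extra bad one and the one you found, is where part of the bytes are in one cache line and part in another. That requires two separate memory accesses and gluing the results together. Three times as slow.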
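To make the second case concrete, here is a small arithmetic sketch of when an 8-byte value crosses a cache-line boundary. It assumes a 64-byte cache line, which is typical for current x86 processors but is not stated in the answer; `CacheLineDemo`, `StraddlesCacheLine` and the offsets are hypothetical and only illustrate the arithmetic, not real stack addresses.

```csharp
using System;

public static class CacheLineDemo
{
    // Assumption: 64-byte cache lines (typical for x86, not taken from the answer).
    private const int CacheLineSize = 64;

    // True if the byte range [address, address + size) spans two cache lines.
    public static bool StraddlesCacheLine(long address, int size)
    {
        long firstLine = address / CacheLineSize;
        long lastLine = (address + size - 1) / CacheLineSize;
        return firstLine != lastLine;
    }

    public static void Main()
    {
        // Hypothetical 4-byte-aligned offsets near the end of a cache line.
        foreach (long address in new long[] { 48, 52, 56, 60 })
        {
            Console.WriteLine("double at offset {0}: straddles = {1}",
                address, StraddlesCacheLine(address, sizeof(double)));
        }
    }
}
```

Of these offsets, only 60 makes the double span two lines (bytes 60 to 67); offset 52 is misaligned but stays inside one line, which corresponds to the cheaper first case.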
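The double and long types are the troublemakers in a 32-bit process. They are 64 bits in size, and they can get misaligned by 4 because the CLR can only guarantee 32-bit alignment. This is not a problem in a 64-bit process, where all variables are guaranteed to be aligned to 8. It is also the underlying reason why the C# language cannot promise that they are atomic, and why arrays of double are allocated in the Large Object Heap when they have more than 1000 elements: the LOH provides an alignment guarantee of 8. It also explains why adding a local variable solved the problem: an object reference is 4 bytes, so it moved the double variable by 4, which now made it aligned. By accident.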
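The "moved by 4" argument is plain modular arithmetic, sketched below. The address is a made-up example and `AlignmentDemo` and `IsAligned` are hypothetical names; the real stack layout is chosen by the JIT and cannot be read off from this snippet.

```csharp
using System;

public static class AlignmentDemo
{
    // True if the address is a multiple of the requested alignment.
    public static bool IsAligned(long address, int alignment)
    {
        return address % alignment == 0;
    }

    public static void Main()
    {
        // Hypothetical slot that is 4-byte aligned but not 8-byte aligned.
        long doubleAddress = 0x1004;
        // Size of an object reference in a 32-bit process.
        int referenceSize = 4;

        Console.WriteLine("before: 8-byte aligned = {0}",
            IsAligned(doubleAddress, 8));
        // An extra 4-byte local shifts the slot by 4 (in either direction).
        Console.WriteLine("after:  8-byte aligned = {0}",
            IsAligned(doubleAddress + referenceSize, 8));
    }
}
```

With a guarantee of only 4-byte alignment, any given slot is either already 8-byte aligned or off by exactly 4, so a single extra 4-byte local flips it from one state to the other. That is consistent with the answer's point that the fix happened by accident.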
The effect can be reproduced with a reduced BenchmarkDotNet benchmark; the comments show the x86 instructions for each loop body:
[BenchmarkTask(platform: BenchmarkPlatform.X86)]
public class Jit_RegistersVsStack
{
    private const int IterationCount = 100001;

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithoutStopwatch()
    {
        double a = 1, b = 1;
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1
            // faddp st(1),st
            a = a + b;
        }
        return string.Format("{0}", a);
    }

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithStopwatch()
    {
        double a = 1, b = 1;
        var sw = new Stopwatch();
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1
            // fadd qword ptr [ebp-14h]
            // fstp qword ptr [ebp-14h]
            a = a + b;
        }
        return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
    }

    [Benchmark]
    [OperationsPerInvoke(IterationCount)]
    public string WithTwoStopwatches()
    {
        var outerSw = new Stopwatch();
        double a = 1, b = 1;
        var sw = new Stopwatch();
        for (int i = 0; i < IterationCount; i++)
        {
            // fld1
            // faddp st(1),st
            a = a + b;
        }
        return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
    }
}
BenchmarkDotNet=v0.7.7.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4702MQ CPU @ 2.20GHz, ProcessorCount=8
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Type=Jit_RegistersVsStack Mode=Throughput Platform=X86 Jit=HostJit .NET=HostFramework
Method | AvrTime | StdDev | op/s |
------------------- |---------- |---------- |----------- |
WithoutStopwatch | 1.0333 ns | 0.0028 ns | 967,773.78 |
WithStopwatch | 3.4453 ns | 0.0492 ns | 290,247.33 |
WithTwoStopwatches | 1.0435 ns | 0.0341 ns | 958,302.81 |