Performance difference between compiled and hand-written assembly
Tags: assembly, go, hamming-distance

I've been playing around with assembly language in Go, and I wrote a popcount function as an exercise. I based the native Go version on one existing implementation and the assembly version on another. After benchmarking both functions, I found that the native Go version is around 1.5x to 2x faster than the assembly version, even though the hand-written assembly is almost identical to the output of go tool 6g -S popcount.go.

Output of go test -bench=.:
PASS
BenchmarkPopCount 100000000 19.4 ns/op
BenchmarkPopCount_g 200000000 8.97 ns/op
ok popcount 4.777s
popcount.go
package popcount

// popCount is implemented in assembly; see popcount_amd64.s.
func popCount(i uint32) uint32 // Defined in popcount_amd64.s

// popCount_g is the native Go version.
func popCount_g(i uint32) uint32 {
	i = i - ((i >> 1) & 0x55555555)
	i = (i & 0x33333333) + ((i >> 2) & 0x33333333)
	return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24
}
popcount_test.go
package popcount

import "testing"

func TestPopcount(t *testing.T) {
	for i := uint32(0); i < uint32(100); i++ {
		if popCount(i) != popCount_g(i) {
			t.Fatalf("failed on input = %v", i)
		}
	}
}

func BenchmarkPopCount(b *testing.B) {
	for i := 0; i < b.N; i++ {
		popCount(uint32(i))
	}
}

func BenchmarkPopCount_g(b *testing.B) {
	for i := 0; i < b.N; i++ {
		popCount_g(uint32(i))
	}
}
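
The test above only exercises inputs 0 through 99, so the upper bits are never set. A minimal sketch of a broader check follows, assuming a hypothetical naive reference named popCountNaive and a stand-in popCountSWAR that copies the body of popCount_g (neither name is in the original code):

```go
package main

import "fmt"

// popCountSWAR copies the body of popCount_g from popcount.go.
func popCountSWAR(i uint32) uint32 {
	i = i - ((i >> 1) & 0x55555555)
	i = (i & 0x33333333) + ((i >> 2) & 0x33333333)
	return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24
}

// popCountNaive is a hypothetical reference that counts set bits one at a time.
func popCountNaive(i uint32) uint32 {
	var n uint32
	for ; i != 0; i >>= 1 {
		n += i & 1
	}
	return n
}

// firstMismatch returns the first input on which the two versions disagree.
// The boolean is false when every input agrees.
func firstMismatch(inputs []uint32) (uint32, bool) {
	for _, v := range inputs {
		if popCountSWAR(v) != popCountNaive(v) {
			return v, true
		}
	}
	return 0, false
}

func main() {
	// Edge cases plus a few patterns that set the high bits.
	inputs := []uint32{0, 1, 0x80000000, 0xFFFFFFFF, 0x55555555, 0xAAAAAAAA, 0xDEADBEEF}
	if v, bad := firstMismatch(inputs); bad {
		fmt.Printf("mismatch on input %#x\n", v)
		return
	}
	fmt.Println("all inputs agree")
}
```

The same input list could be dropped into TestPopcount to compare popCount against popCount_g directly.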
Output of go tool 6g -S popcount.go:
"".popCount_g t=1 size=64 value=0 args=0x10 locals=0
0x0000 00000 (popcount.go:5) TEXT "".popCount_g+0(SB),4,$0-16
0x0000 00000 (popcount.go:5) NOP ,
0x0000 00000 (popcount.go:5) NOP ,
0x0000 00000 (popcount.go:5) MOVL "".i+8(FP),BP
0x0004 00004 (popcount.go:5) FUNCDATA $2,gclocals·9308e7ef08d2cc2f72ae1228688dacf9+0(SB)
0x0004 00004 (popcount.go:5) FUNCDATA $3,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB)
0x0004 00004 (popcount.go:6) MOVL BP,BX
0x0006 00006 (popcount.go:6) SHRL $1,BX
0x0008 00008 (popcount.go:6) ANDL $1431655765,BX
0x000e 00014 (popcount.go:6) SUBL BX,BP
0x0010 00016 (popcount.go:7) MOVL BP,AX
0x0012 00018 (popcount.go:7) ANDL $858993459,AX
0x0017 00023 (popcount.go:7) SHRL $2,BP
0x001a 00026 (popcount.go:7) ANDL $858993459,BP
0x0020 00032 (popcount.go:7) ADDL BP,AX
0x0022 00034 (popcount.go:8) MOVL AX,BX
0x0024 00036 (popcount.go:8) SHRL $4,BX
0x0027 00039 (popcount.go:8) ADDL AX,BX
0x0029 00041 (popcount.go:8) ANDL $252645135,BX
0x002f 00047 (popcount.go:8) IMULL $16843009,BX
0x0035 00053 (popcount.go:8) SHRL $24,BX
0x0038 00056 (popcount.go:8) MOVL BX,"".~r1+16(FP)
0x003c 00060 (popcount.go:8) RET ,
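I know that the FUNCDATA lines contain information for the garbage collector, but other than that I don't see any obvious differences.

What could be causing such a large difference in speed between these two functions?

If you look at the pseudo-assembler for the function calls, you will see that the .s (assembler) version is called, with the call overhead that entails, while the .go version is inlined.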
func S() {
	pc := popCount(uint32(0))
	_ = pc
}
"".S t=1 size=48 value=0 args=0x0 locals=0x10
0x0000 00000 (popcount.go:11) TEXT "".S+0(SB),$16-0
0x0000 00000 (popcount.go:11) MOVQ (TLS),CX
0x0009 00009 (popcount.go:11) CMPQ SP,(CX)
0x000c 00012 (popcount.go:11) JHI ,21
0x000e 00014 (popcount.go:11) CALL ,runtime.morestack00_noctxt(SB)
0x0013 00019 (popcount.go:11) JMP ,0
0x0015 00021 (popcount.go:11) SUBQ $16,SP
0x0019 00025 (popcount.go:11) FUNCDATA $2,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB)
0x0019 00025 (popcount.go:11) FUNCDATA $3,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB)
0x0019 00025 (popcount.go:12) MOVL $0,(SP)
0x0020 00032 (popcount.go:12) PCDATA $1,$0
0x0020 00032 (popcount.go:12) CALL ,"".popCount(SB)
0x0025 00037 (popcount.go:12) MOVL 8(SP),BX
0x0029 00041 (popcount.go:12) NOP ,
0x0029 00041 (popcount.go:14) ADDQ $16,SP
0x002d 00045 (popcount.go:14) RET ,
func S_G() {
	pc := popCount_g(uint32(0))
	_ = pc
}
"".S_G t=1 size=64 value=0 args=0x0 locals=0x8
0x0000 00000 (popcount.go:16) TEXT "".S_G+0(SB),4,$8-0
0x0000 00000 (popcount.go:16) SUBQ $8,SP
0x0004 00004 (popcount.go:16) FUNCDATA $2,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB)
0x0004 00004 (popcount.go:16) FUNCDATA $3,gclocals·3280bececceccd33cb74587feedb1f9f+0(SB)
0x0004 00004 (popcount.go:17) MOVL $0,BP
0x0006 00006 (popcount.go:17) MOVL BP,BX
0x0008 00008 (popcount.go:17) SHRL $1,BX
0x000a 00010 (popcount.go:17) ANDL $1431655765,BX
0x0010 00016 (popcount.go:17) SUBL BX,BP
0x0012 00018 (popcount.go:17) MOVL BP,AX
0x0014 00020 (popcount.go:17) ANDL $858993459,AX
0x0019 00025 (popcount.go:17) SHRL $2,BP
0x001c 00028 (popcount.go:17) ANDL $858993459,BP
0x0022 00034 (popcount.go:17) ADDL BP,AX
0x0024 00036 (popcount.go:17) MOVL AX,BX
0x0026 00038 (popcount.go:17) SHRL $4,BX
0x0029 00041 (popcount.go:17) ADDL AX,BX
0x002b 00043 (popcount.go:17) ANDL $252645135,BX
0x0031 00049 (popcount.go:17) IMULL $16843009,BX
0x0037 00055 (popcount.go:17) SHRL $24,BX
0x003a 00058 (popcount.go:19) ADDQ $8,SP
0x003e 00062 (popcount.go:19) RET ,
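
The effect of inlining can be approximated in pure Go. The sketch below is an illustration under stated assumptions, not the original benchmark: it marks one copy of the function with the //go:noinline directive as a stand-in for the assembly routine, and the names popCountInlinable, popCountNoInline, sink, and nsPerOp are hypothetical. A non-inlined Go function is not identical to a call into hand-written assembly, so the absolute numbers will differ from those in the question.

```go
package main

import (
	"fmt"
	"testing"
)

// sink receives every result so the compiler cannot discard the work.
var sink uint32

// popCountInlinable is small enough that the compiler may inline it.
func popCountInlinable(i uint32) uint32 {
	i = i - ((i >> 1) & 0x55555555)
	i = (i & 0x33333333) + ((i >> 2) & 0x33333333)
	return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24
}

// popCountNoInline has the same body but must always be called.
//
//go:noinline
func popCountNoInline(i uint32) uint32 {
	i = i - ((i >> 1) & 0x55555555)
	i = (i & 0x33333333) + ((i >> 2) & 0x33333333)
	return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24
}

// nsPerOp reports fractional nanoseconds per iteration.
func nsPerOp(r testing.BenchmarkResult) float64 {
	if r.N == 0 {
		return 0
	}
	return float64(r.T.Nanoseconds()) / float64(r.N)
}

func main() {
	inlined := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += popCountInlinable(uint32(i))
		}
	})
	called := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += popCountNoInline(uint32(i))
		}
	})
	fmt.Printf("inlinable: %.2f ns/op\n", nsPerOp(inlined))
	fmt.Printf("noinline:  %.2f ns/op\n", nsPerOp(called))
}
```

Accumulating into sink matters here: when an inlined function's result is unused, a newer compiler may remove the computation entirely, which would exaggerate the gap.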
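If you really want a popcnt, it has been a native instruction on x86 processors for a while. Your compiler may have an intrinsic for it, such as Microsoft's __popcnt() (there are also __popcnt16() and __popcnt64()), or GCC's __builtin_popcount() (or __builtin_popcountll() for 64-bit).

One thing to bear in mind is that the linker in Go is much smarter than a traditional C linker. In particular, it can transform instructions, reorder code, and so on, so I suggest you use objdump to look at the final binary rather than comparing the output of the Go assembler.

@NickCraig-Wood Very good suggestion. I'll check it out.

@NickCraig-Wood objdump shows that the assembly is still very similar, but the hand-written assembly version shows callq 420610 to call the function. The native Go version has the entire assembly inside main.main, so I guess the difference is the time needed to allocate more memory, since the extra jump shouldn't take that long. Any ideas?

@NickCraig-Wood Based on peterSO's answer, it looks like the call overhead could cost a few nanoseconds per operation. Ah well, now I need to try to learn how to inline assembly code, if that's possible... Down the rabbit hole I go!

AFAIK, inlining assembly code is not possible :-( It has definitely been discussed on the dev list in the past.

Thanks, you're right (I found the same thing by following @NickCraig-Wood's suggestion to check the final binary with objdump).

Yes, I'm aware of the popcnt instruction on recent Intel and AMD processors. This was really just an exercise in learning the Plan 9-based assembly language that Go uses, and I ran into this performance difference while testing. I get the feeling that some instructions (such as popcnt) are not supported by the Plan 9/Go assembler, and there doesn't seem to be a popcnt or similar function available.

It wasn't supported the last time I checked, but you can encode the instruction yourself and include the literal bytes in hex in your assembly, e.g. BYTE $0xf3; BYTE $0x48; BYTE $0x0f; BYTE $0xb8; BYTE $0x16 // POPCNTQ (SI), DX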