Performance difference between Python einsum and matmul
I am trying to exploit the symmetry of a tensor contraction in Python code: A[a,b] * B[b,c,d] = C[a,c,d], where B[b,c,d] = B[b,d,c] and therefore C[a,c,d] = C[a,d,c]. (The Einstein summation convention is assumed, i.e. the repeated index b is summed over.) I use the following code:
import numpy as np
import time

# A[a,b] * B[b,c,d] = C[a,c,d]
na = nb = nc = nd = 100
A = np.random.random((na,nb))
B = np.random.random((nb,nc,nd))
C = np.zeros((na,nc,nd))
C2 = np.zeros((na,nc,nd))
C3 = np.zeros((na,nc,nd))

# symmetrize B in its last two indices
for c in range(nc):
    for d in range(c):
        B[:,c,d] = B[:,d,c]

start_time = time.time()
C2 = np.einsum('ab,bcd->acd', A, B)
finish_time = time.time()
print('time einsum', finish_time - start_time)

start_time = time.time()
for c in range(nc):
    # c+1 is needed, since range(0) would be skipped
    for d in range(c+1):
        #C3[:,c,d] = np.einsum('ab,b->a', A[:,:], B[:,c,d])
        C3[:,c,d] = np.matmul(A[:,:], B[:,c,d])
for c in range(nc):
    for d in range(c+1, nd):
        C3[:,c,d] = C3[:,d,c]
finish_time = time.time()
print('time partial einsum', finish_time - start_time)

# spot-check a subset of entries against the full einsum result
for a in range(na//10):
    for c in range(nc//10):
        for d in range(nd//10):
            if abs((C3-C2)[a,c,d]) > 1.0e-12:
                print('warning', a, c, d, (C3-C2)[a,c,d])
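As an aside, the explicit symmetrization loop above can be replaced by one vectorized assignment using np.tril_indices. This is a sketch on a fresh array, not part of the original post:

```python
import numpy as np

nb = nc = nd = 100
B = np.random.random((nb, nc, nd))

# Copy the upper triangle (d > c) of the last two axes onto the lower
# triangle, matching the effect of the double loop above.
c_idx, d_idx = np.tril_indices(nc, k=-1)   # pairs with d < c
B[:, c_idx, d_idx] = B[:, d_idx, c_idx]

# B is now symmetric in its last two indices
print(np.allclose(B, B.transpose(0, 2, 1)))
```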
It appears to me that np.matmul is faster than np.einsum. For example, using np.matmul in the loop I get

time einsum 0.07406115531921387
time partial einsum 0.0553278923034668

and using np.einsum in the loop I get

time einsum 0.0751657485961914
time partial einsum 0.11624622344970703
Is this performance difference typical? I have often taken einsum for granted.

As a general rule I expect matmul to be faster, although in simpler cases einsum actually uses matmul. But here are my timings:
In [20]: C2 = np.einsum('ab,bcd->acd', A, B)
In [21]: timeit C2 = np.einsum('ab,bcd->acd', A, B)
126 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Your loop variant with einsum:
In [22]: %%timeit
    ...: for c in range(nc):
    ...:     # c+1 is needed, since range(0) would be skipped
    ...:     for d in range(c+1):
    ...:         C3[:,c,d] = np.einsum('ab,b->a', A[:,:], B[:,c,d])
    ...:         #C3[:,c,d] = np.matmul(A[:,:], B[:,c,d])
    ...:
    ...: for c in range(nc):
    ...:     for d in range(c+1, nd):
    ...:         C3[:,c,d] = C3[:,d,c]
    ...:
128 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And the same with matmul:
In [23]: %%timeit
    ...: for c in range(nc):
    ...:     # c+1 is needed, since range(0) would be skipped
    ...:     for d in range(c+1):
    ...:         #C3[:,c,d] = np.einsum('ab,b->a', A[:,:], B[:,c,d])
    ...:         C3[:,c,d] = np.matmul(A[:,:], B[:,c,d])
    ...:
    ...: for c in range(nc):
    ...:     for d in range(c+1, nd):
    ...:         C3[:,c,d] = C3[:,d,c]
    ...:
81.3 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And a direct matmul on the reshaped array:
In [24]: C4 = np.matmul(A, B.reshape(100,-1)).reshape(100,100,100)
In [25]: np.allclose(C2,C4)
Out[25]: True
In [26]: timeit C4 = np.matmul(A, B.reshape(100,-1)).reshape(100,100,100)
14.9 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
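For reference, np.tensordot expresses the same contraction and effectively performs the same reshape followed by a single dot internally. A sketch with fresh random operands, not from the original answer:

```python
import numpy as np

na = nb = nc = nd = 100
A = np.random.random((na, nb))
B = np.random.random((nb, nc, nd))

# Contract over the shared index b: A[a,b] * B[b,c,d] -> C[a,c,d].
# axes=([1], [0]) pairs axis 1 of A with axis 0 of B.
C_td = np.tensordot(A, B, axes=([1], [0]))
C_es = np.einsum('ab,bcd->acd', A, B)

print(C_td.shape)               # (100, 100, 100)
print(np.allclose(C_td, C_es))  # True, within floating-point tolerance
```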
einsum also has an optimize flag. I thought it only mattered with three or more operands, but it seems to help here as well:
In [27]: timeit C2 = np.einsum('ab,bcd->acd', A, B, optimize=True)
20.3 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
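To inspect what optimize=True decides, np.einsum_path reports the chosen contraction order and an estimated speedup. A sketch; the exact report text varies by NumPy version:

```python
import numpy as np

na = nb = nc = nd = 100
A = np.random.random((na, nb))
B = np.random.random((nb, nc, nd))

# einsum_path returns the contraction path plus a human-readable report
path, info = np.einsum_path('ab,bcd->acd', A, B, optimize='optimal')
print(path)   # e.g. ['einsum_path', (0, 1)] for a two-operand contraction
print(info)   # report with FLOP counts and the chosen intermediate order
```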
Sometimes, when the arrays are very large, some iteration is faster because it reduces memory-management complexity. But I don't think that is worth it here when trying to exploit the symmetry. Other SO answers have shown that in some cases matmul can detect symmetry and use a custom BLAS call, but I don't believe that is the case here (it cannot detect the symmetry in B without expensive comparisons). optimize=True allows np.einsum to dispatch to BLAS functions at the cost of some overhead. The goal of optimize=True is to always do the right thing and use matmul/dot where appropriate. However, it does not perform looped GEMM operations (even higher logic overhead), so a hand-tuned operation can usually beat it.
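One way to hand-tune for the symmetry with a single GEMM, rather than a loop of small ones, is to pack only the independent (c, d) pairs with d ≤ c into one 2-D operand, multiply once, and scatter the result to both triangles. This is a sketch under the post's assumptions (B symmetric in its last two indices), not code from the original answers:

```python
import numpy as np

na = nb = nc = nd = 100
A = np.random.random((na, nb))
B = np.random.random((nb, nc, nd))
B = 0.5 * (B + B.transpose(0, 2, 1))   # make B symmetric in c, d

# Pack the nc*(nc+1)/2 independent (c, d) pairs into one matrix,
# run a single matmul over all of them, then write the result to
# both C[:, c, d] and C[:, d, c].
rows, cols = np.tril_indices(nc)       # pairs with d <= c
packed = B[:, rows, cols]              # shape (nb, n_pairs)
C_packed = A @ packed                  # one GEMM: (na, n_pairs)
C = np.empty((na, nc, nd))
C[:, rows, cols] = C_packed
C[:, cols, rows] = C_packed            # valid because B is symmetric

# check against the full contraction
C_ref = np.einsum('ab,bcd->acd', A, B)
print(np.allclose(C, C_ref))
```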