Performance difference between Python einsum and matmul


I am trying to exploit the symmetry of a tensor contraction in Python,

    A[a,b] B[b,c,d] = C[a,c,d]

where

    B[b,c,d] = B[b,d,c]

and therefore

    C[a,c,d] = C[a,d,c].

(The Einstein summation convention is assumed, i.e. the repeated index b is summed over.)
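That the symmetry of B carries over to C can be checked numerically; a minimal sketch with arbitrary small dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 5))
B = rng.random((5, 6, 6))
# enforce the symmetry B[b,c,d] == B[b,d,c]
B = 0.5 * (B + B.transpose(0, 2, 1))

C = np.einsum('ab,bcd->acd', A, B)
# the symmetry in the last two axes carries over to C
assert np.allclose(C, C.transpose(0, 2, 1))
```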

I use the following code:

import numpy as np
import time

# A[a,b] * B[b,c,d] = C[a,c,d]
na = nb = nc = nd = 100

A = np.random.random((na,nb))
B = np.random.random((nb,nc,nd))
C = np.zeros((na,nc,nd))
C2= np.zeros((na,nc,nd))
C3= np.zeros((na,nc,nd))

# symmetrize B
for c in range(nc):
    for d in range(c):
        B[:,c,d] = B[:,d,c]


start_time = time.time()
C2 = np.einsum('ab,bcd->acd', A, B)
finish_time = time.time()
print('time einsum', finish_time - start_time )


start_time = time.time()
for c in range(nc):
# c+1 is needed, since range(0) will be skipped
    for d in range(c+1):
       #C3[:,c,d] = np.einsum('ab,b->a', A[:,:],B[:,c,d] )
       C3[:,c,d] = np.matmul(A[:,:],B[:,c,d] )
       
for c in range(nc):
    for d in range(c+1,nd):
        C3[:,c,d] = C3[:,d,c] 

finish_time = time.time()
print( 'time partial einsum', finish_time - start_time )


for a in range(int(na/10)):
    for c in range(int(nc/10)):
        for d in range(int(nd/10)):
            if abs((C3-C2)[a,c,d])> 1.0e-12:
                print('warning', a,c,d, (C3-C2)[a,c,d])
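As an aside, the symmetrization double loop above can be vectorized with np.triu_indices; a sketch, assuming the same B[b,c,d] layout with nc == nd:

```python
import numpy as np

nb, nc, nd = 5, 6, 6
B = np.random.random((nb, nc, nd))

# indices (r, s) of the strict upper triangle of the (c, d) plane, s > r
r, s = np.triu_indices(nc, k=1)
# mirror the upper triangle into the lower triangle in one assignment
B[:, s, r] = B[:, r, s]

# B is now symmetric in its last two axes
assert np.allclose(B, B.transpose(0, 2, 1))
```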

It seems to me that np.matmul is faster than np.einsum. For example, using np.matmul I get

time einsum 0.07406115531921387
time partial einsum 0.0553278923034668

and using np.einsum I get

time einsum 0.0751657485961914
time partial einsum 0.11624622344970703

Is the performance difference above typical? I have often taken einsum for granted.

As a general rule, I expect matmul to be faster, though in simpler cases einsum actually uses matmul.

But here are my timings:

In [20]: C2 = np.einsum('ab,bcd->acd', A, B)
In [21]: timeit C2 = np.einsum('ab,bcd->acd', A, B)
126 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Your looped version using einsum:

In [22]: %%timeit
    ...: for c in range(nc):
    ...: # c+1 is needed, since range(0) will be skipped
    ...:     for d in range(c+1):
    ...:        C3[:,c,d] = np.einsum('ab,b->a', A[:,:],B[:,c,d] )
    ...:        #C3[:,c,d] = np.matmul(A[:,:],B[:,c,d] )
    ...: 
    ...: for c in range(nc):
    ...:     for d in range(c+1,nd):
    ...:         C3[:,c,d] = C3[:,d,c]
    ...: 
128 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The same loop with matmul:

In [23]: %%timeit
    ...: for c in range(nc):
    ...: # c+1 is needed, since range(0) will be skipped
    ...:     for d in range(c+1):
    ...:        #C3[:,c,d] = np.einsum('ab,b->a', A[:,:],B[:,c,d] )
    ...:        C3[:,c,d] = np.matmul(A[:,:],B[:,c,d] )
    ...: 
    ...: for c in range(nc):
    ...:     for d in range(c+1,nd):
    ...:         C3[:,c,d] = C3[:,d,c]
    ...: 
81.3 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And a direct matmul on the reshaped array:

In [24]: C4 = np.matmul(A, B.reshape(100,-1)).reshape(100,100,100)
In [25]: np.allclose(C2,C4)
Out[25]: True
In [26]: timeit C4 = np.matmul(A, B.reshape(100,-1)).reshape(100,100,100)
14.9 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
einsum also has an optimize flag. I thought it only mattered with three or more arguments, but it seems to help here as well:

In [27]: timeit C2 = np.einsum('ab,bcd->acd', A, B, optimize=True)
20.3 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
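To see what optimize actually decides, np.einsum_path reports the chosen contraction path and the theoretical speedup; a quick sketch:

```python
import numpy as np

A = np.random.random((100, 100))
B = np.random.random((100, 100, 100))

# path is a list whose first element is the keyword 'einsum_path';
# info is a human-readable report of the chosen contraction order
path, info = np.einsum_path('ab,bcd->acd', A, B, optimize=True)
print(info)
```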

Sometimes, when the arrays are very large, some amount of iteration is faster because it reduces memory-management complexity. But I don't think it is worth it when trying to exploit symmetry. Other SO answers have shown that matmul can, in some cases, detect symmetry and use a custom BLAS call, but I don't believe that is the case here (it cannot detect the symmetry in B without an expensive comparison).

optimize=True allows np.einsum to dispatch to BLAS functions at the cost of some overhead. The goal of optimize=True is to always do the right thing and use matmul/dot where appropriate. However, it does not perform looped GEMM operations (even higher logic overhead), so it can often be beaten by manually restructuring the operation.
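A minimal sketch of that manual restructuring, folding the whole contraction into a single GEMM by flattening the trailing axes of B (the dimensions here are arbitrary):

```python
import numpy as np

A = np.random.random((50, 40))
B = np.random.random((40, 60, 70))

# one BLAS GEMM: view B as a (40, 60*70) matrix, multiply, restore the shape
C_gemm = (A @ B.reshape(40, -1)).reshape(50, 60, 70)

# agrees with the einsum result
C_ref = np.einsum('ab,bcd->acd', A, B)
assert np.allclose(C_gemm, C_ref)
```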