Performance difference between Python einsum and matmul


I am trying to exploit the symmetry of a tensor contraction in Python,

    A[a,b] B[b,c,d] = C[a,c,d]

where

    B[b,c,d] = B[b,d,c]

and therefore

    C[a,c,d] = C[a,d,c].

(The Einstein summation convention is assumed, i.e. the repeated index b is summed over.)
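That the symmetry of B carries over to C can be checked numerically; a minimal sketch with arbitrary small dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 5))
B = rng.random((5, 6, 6))
# enforce the symmetry B[b,c,d] == B[b,d,c]
B = 0.5 * (B + B.transpose(0, 2, 1))

C = np.einsum('ab,bcd->acd', A, B)
# the symmetry in the last two axes carries over to C
assert np.allclose(C, C.transpose(0, 2, 1))
```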

I use the following code:

import numpy as np
import time

# A[a,b] * B[b,c,d] = C[a,c,d]
na = nb = nc = nd = 100

A = np.random.random((na,nb))
B = np.random.random((nb,nc,nd))
C = np.zeros((na,nc,nd))
C2= np.zeros((na,nc,nd))
C3= np.zeros((na,nc,nd))

# symmetrize B
for c in range(nc):
    for d in range(c):
        B[:,c,d] = B[:,d,c]


start_time = time.time()
C2 = np.einsum('ab,bcd->acd', A, B)
finish_time = time.time()
print('time einsum', finish_time - start_time )


start_time = time.time()
for c in range(nc):
# c+1 is needed, since range(0) will be skipped
    for d in range(c+1):
       #C3[:,c,d] = np.einsum('ab,b->a', A[:,:],B[:,c,d] )
       C3[:,c,d] = np.matmul(A[:,:],B[:,c,d] )
       
for c in range(nc):
    for d in range(c+1,nd):
        C3[:,c,d] = C3[:,d,c] 

finish_time = time.time()
print( 'time partial einsum', finish_time - start_time )


for a in range(int(na/10)):
    for c in range(int(nc/10)):
        for d in range(int(nd/10)):
            if abs((C3-C2)[a,c,d])> 1.0e-12:
                print('warning', a,c,d, (C3-C2)[a,c,d])
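As an aside, the symmetrization double loop above can be vectorized with np.triu_indices; a sketch, assuming the same B[b,c,d] layout with nc == nd:

```python
import numpy as np

nb, nc, nd = 5, 6, 6
B = np.random.random((nb, nc, nd))

# indices (r, s) of the strict upper triangle of the (c, d) plane, s > r
r, s = np.triu_indices(nc, k=1)
# mirror the upper triangle into the lower triangle in one assignment
B[:, s, r] = B[:, r, s]

# B is now symmetric in its last two axes
assert np.allclose(B, B.transpose(0, 2, 1))
```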

It seems to me that np.matmul is faster than np.einsum. For example, using np.matmul I get

time einsum 0.07406115531921387
time partial einsum 0.0553278923034668

and using np.einsum I get

time einsum 0.0751657485961914
time partial einsum 0.11624622344970703

Is the performance difference above typical? I have often taken einsum for granted.

As a general rule, I expect matmul to be faster, though in simpler cases einsum actually uses matmul.

But here are my timings:

In [20]: C2 = np.einsum('ab,bcd->acd', A, B)
In [21]: timeit C2 = np.einsum('ab,bcd->acd', A, B)
126 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Your looped version using einsum:

In [22]: %%timeit
    ...: for c in range(nc):
    ...: # c+1 is needed, since range(0) will be skipped
    ...:     for d in range(c+1):
    ...:        C3[:,c,d] = np.einsum('ab,b->a', A[:,:],B[:,c,d] )
    ...:        #C3[:,c,d] = np.matmul(A[:,:],B[:,c,d] )
    ...: 
    ...: for c in range(nc):
    ...:     for d in range(c+1,nd):
    ...:         C3[:,c,d] = C3[:,d,c]
    ...: 
128 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The same loop with matmul:

In [23]: %%timeit
    ...: for c in range(nc):
    ...: # c+1 is needed, since range(0) will be skipped
    ...:     for d in range(c+1):
    ...:        #C3[:,c,d] = np.einsum('ab,b->a', A[:,:],B[:,c,d] )
    ...:        C3[:,c,d] = np.matmul(A[:,:],B[:,c,d] )
    ...: 
    ...: for c in range(nc):
    ...:     for d in range(c+1,nd):
    ...:         C3[:,c,d] = C3[:,d,c]
    ...: 
81.3 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
And a direct matmul on the reshaped array:

In [24]: C4 = np.matmul(A, B.reshape(100,-1)).reshape(100,100,100)
In [25]: np.allclose(C2,C4)
Out[25]: True
In [26]: timeit C4 = np.matmul(A, B.reshape(100,-1)).reshape(100,100,100)
14.9 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
einsum also has an optimize flag. I thought it only mattered with three or more arguments, but it seems to help here as well:

In [27]: timeit C2 = np.einsum('ab,bcd->acd', A, B, optimize=True)
20.3 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
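To see what optimize actually decides, np.einsum_path reports the chosen contraction path and the theoretical speedup; a quick sketch:

```python
import numpy as np

A = np.random.random((100, 100))
B = np.random.random((100, 100, 100))

# path is a list whose first element is the keyword 'einsum_path';
# info is a human-readable report of the chosen contraction order
path, info = np.einsum_path('ab,bcd->acd', A, B, optimize=True)
print(info)
```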

Sometimes, when the arrays are very large, some amount of iteration is faster because it reduces memory-management complexity. But I don't think it is worth it when trying to exploit symmetry. Other SO answers have shown that matmul can, in some cases, detect symmetry and use a custom BLAS call, but I don't believe that is the case here (it cannot detect the symmetry in B without an expensive comparison).

optimize=True allows np.einsum to dispatch to BLAS functions at the cost of some overhead. The goal of optimize=True is to always do the right thing and use matmul/dot where appropriate. However, it does not perform looped GEMM operations (even higher logic overhead), so it can often be beaten by manually restructuring the operation.
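A minimal sketch of that manual restructuring, folding the whole contraction into a single GEMM by flattening the trailing axes of B (the dimensions here are arbitrary):

```python
import numpy as np

A = np.random.random((50, 40))
B = np.random.random((40, 60, 70))

# one BLAS GEMM: view B as a (40, 60*70) matrix, multiply, restore the shape
C_gemm = (A @ B.reshape(40, -1)).reshape(50, 60, 70)

# agrees with the einsum result
C_ref = np.einsum('ab,bcd->acd', A, B)
assert np.allclose(C_gemm, C_ref)
```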