
Python: Speeding up distance matrix computation with Numpy and Cython

Consider a numpy array A of dimension N×M. The goal is to compute the Euclidean distance matrix D, where each element D[i,j] is the Euclidean distance between rows i and j. What is the fastest way to do this? This isn't exactly the problem I need to solve, but it's a good example of what I'm trying to do (in general, other distance metrics could be used).

This is the fastest I've been able to come up with so far:

n = A.shape[0]
D = np.empty((n,n))
for i in range(n):
    D[i] = np.sqrt(np.square(A-A[i]).sum(1))

But is this the fastest way to do it? I'm mainly concerned about the for loop. Can we beat it with, say, Cython?

To avoid the loop, I tried using broadcasting and doing something like this:

D = np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))

But this turned out not to be a good idea, because there's some overhead in constructing an intermediate 3D array of dimension N×N×M, so the performance is worse.
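
As an aside (not part of the original exchange): a well-known vectorized middle ground is to expand the squared distance as ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a.b, which needs only N×N intermediates rather than N×N×M. A minimal sketch; the np.maximum call guards against tiny negative values from floating-point round-off:

sq = np.square(A).sum(axis=1)      # ||a_i||^2 for every row
G = A.dot(A.T)                     # pairwise row dot products a_i . a_j
D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * G, 0.0))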

I also tried Cython. But I'm new to Cython, so I don't know how good my attempt is:

import numpy as np
cimport numpy as np

def dist(np.ndarray[np.int32_t, ndim=2] A):
    cdef int n = A.shape[0]
    cdef np.ndarray[np.float64_t, ndim=2] dm = np.empty((n,n), dtype=np.float64)
    cdef int i = 0
    for i in range(n):
        dm[i] = np.sqrt(np.square(A-A[i]).sum(1)).astype(np.float64)
    return dm

The code above is a bit slower than the plain Python for loop. I don't know much about Cython, but I'd assume I could at least match the performance of the for loop + numpy. I'm wondering: is a significant performance improvement possible with the right approach? Or is there some other way to speed this up (not involving parallel computation)?

The key with Cython is to avoid Python objects and function calls as much as possible, including vectorized operations on numpy arrays. This usually means writing out all of the loops by hand and operating on single array elements at a time.

There's a tutorial that covers the process of converting numpy code to Cython and optimizing it.

Here's a quick stab at a more optimized Cython version of your distance function:

import numpy as np
cimport numpy as np
cimport cython

# don't use np.sqrt - the sqrt function from the C standard library is much
# faster
from libc.math cimport sqrt

# disable checks that ensure that array indices don't go out of bounds. this is
# faster, but you'll get a segfault if you mess up your indexing.
@cython.boundscheck(False)
# this disables 'wraparound' indexing from the end of the array using negative
# indices.
@cython.wraparound(False)
def dist(double [:, :] A):

    # declare C types for as many of our variables as possible. note that we
    # don't necessarily need to assign a value to them at declaration time.
    cdef:
        # Py_ssize_t is just a special platform-specific type for indices
        Py_ssize_t nrow = A.shape[0]
        Py_ssize_t ncol = A.shape[1]
        Py_ssize_t ii, jj, kk

        # this line is particularly expensive, since creating a numpy array
        # involves unavoidable Python API overhead
        np.ndarray[np.float64_t, ndim=2] D = np.zeros((nrow, nrow), np.double)

        double tmpss, diff

    # another advantage of using Cython rather than broadcasting is that we can
    # exploit the symmetry of D by only looping over its upper triangle
    for ii in range(nrow):
        for jj in range(ii + 1, nrow):
            # we use tmpss to accumulate the SSD over each pair of rows
            tmpss = 0
            for kk in range(ncol):
                diff = A[ii, kk] - A[jj, kk]
                tmpss += diff * diff
            tmpss = sqrt(tmpss)
            D[ii, jj] = tmpss
            D[jj, ii] = tmpss  # because D is symmetric

    return D

I saved this in a file named fastdist.pyx. We can use pyximport to simplify the build process:

import pyximport
pyximport.install()
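# note (an assumption, not from the original post): if the numpy cimport
# fails to find the headers at build time, you may need something like
#   pyximport.install(setup_args={'include_dirs': numpy.get_include()})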
import fastdist
import numpy as np

A = np.random.randn(100, 200)

D1 = np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
D2 = fastdist.dist(A)

print np.allclose(D1, D2)
# True

So at least it works. Let's do some benchmarking using the %timeit magic:

%timeit np.sqrt(np.square(A[np.newaxis,:,:]-A[:,np.newaxis,:]).sum(2))
# 100 loops, best of 3: 10.6 ms per loop

%timeit fastdist.dist(A)
# 100 loops, best of 3: 1.21 ms per loop

A 9x speedup is nice, but it's not really a game-changer. As you say, though, the big problem with the broadcasting approach is the memory requirement of constructing the intermediate array:

A2 = np.random.randn(1000, 2000)
%timeit fastdist.dist(A2)
# 1 loops, best of 3: 1.36 s per loop

I wouldn't recommend trying that using broadcasting: the intermediate 1000×1000×2000 float64 array alone would take roughly 16 GB of memory.

Another thing we could do is parallelize over the outermost loop, using the prange function:

from cython.parallel cimport prange

...

for ii in prange(nrow, nogil=True, schedule='guided'):
    ...
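
For completeness, this is one way the fully written-out prange version might look. This integrated listing is my sketch, not code from the original answer: it stores D in a typed memoryview (instead of np.ndarray) so it can be indexed inside the nogil block, and it accumulates with plain assignment, because an in-place += inside a prange body would turn tmpss into a reduction variable:

import numpy as np
cimport cython
from cython.parallel cimport prange
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def dist_parallel(double [:, :] A):
    cdef:
        Py_ssize_t nrow = A.shape[0]
        Py_ssize_t ncol = A.shape[1]
        Py_ssize_t ii, jj, kk
        double tmpss, diff
        # memoryview over a numpy array, writable without holding the GIL
        double [:, :] D = np.zeros((nrow, nrow), np.double)

    for ii in prange(nrow, nogil=True, schedule='guided'):
        for jj in range(ii + 1, nrow):
            tmpss = 0
            for kk in range(ncol):
                diff = A[ii, kk] - A[jj, kk]
                tmpss = tmpss + diff * diff  # plain '=' keeps tmpss thread-private
            tmpss = sqrt(tmpss)
            D[ii, jj] = tmpss
            D[jj, ii] = tmpss

    return np.asarray(D)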

In order to compile the parallel version you'll need to tell the compiler to enable OpenMP. I haven't figured out how to do this using pyximport, but if you're using gcc you could compile it manually like this:

$ cython fastdist.pyx
$ gcc -shared -pthread -fPIC -fwrapv -fopenmp -O3 \
   -Wall -fno-strict-aliasing  -I/usr/include/python2.7 -o fastdist.so fastdist.c
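
Alternatively (my sketch, not from the original answer), a small setup.py is a common way to pass the OpenMP flags through the regular Cython build instead of invoking gcc by hand:

# setup.py -- build with: python setup.py build_ext --inplace
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

ext = Extension(
    'fastdist',
    sources=['fastdist.pyx'],
    include_dirs=[numpy.get_include()],       # headers for 'cimport numpy'
    extra_compile_args=['-fopenmp', '-O3'],   # enable OpenMP for prange
    extra_link_args=['-fopenmp'],
)

setup(ext_modules=cythonize([ext]))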

Using the parallel version with 8 threads (about a 2.7x improvement over the 1.36 s single-threaded time on A2):

%timeit D2 = fastdist.dist_parallel(A2)
1 loops, best of 3: 509 ms per loop

How big are N and M? Doing the N loop in Python rather than in NumPy will slow you down, of course, but it's nowhere near as bad as doing an N×M loop. Is it really too slow, or are you just optimizing for its own sake? Also, it might be easier to write a ufunc in Cython and then call it on A, rather than putting the whole loop in Cython; if nothing else, it's harder to get wrong that way… so that could be a very fast option.

@user2357112: yes, just tried scipy, really fast, thanks. But I still need to figure out how to implement it myself, since this is only an example of a more general problem I'm facing.

As for Cython: if you use it, you'll probably want to do the math yourself rather than calling NumPy routines. NumPy vectorization doesn't help much when you're already writing code that compiles down to C.

Thanks a lot! This looks really promising!

@ojy: glad you found it useful. I just realized my initial version was quite inefficient, because it looped over every element of D rather than just the upper triangle. The updated single-threaded version is about twice as fast.

Yes, I noticed, but haven't had a chance to try it yet; can't wait until Monday :) Thanks a lot!

Finally tried it! Works great! I looked further into improving the parallelization, since it only gave about a 3x speedup, even on the 24 cores I have access to. Saullo Castro's answer to this question was very helpful: the idea is to have a separate routine that is called in parallel and is passed only pointers to the data arrays. It gave me an extra 5x speedup.

@ojy: I'm a bit surprised it made such a big performance difference for you, though I suppose it may depend on a bunch of other factors, including your compiler. Another thing that sometimes helps is declaring the distance function inline (e.g. cdef inline void mydist(…) nogil:), which gives the C compiler an extra hint to optimize that function (usually by substituting the function's body into its caller). You could also try varying the number of OpenMP threads you're using; 24 may be excessive unless A is very large.
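
For reference, the SciPy routines alluded to above ("just tried scipy, really fast") are presumably those in scipy.spatial.distance; a minimal sketch, assuming that is what was meant:

import numpy as np
from scipy.spatial.distance import pdist, squareform

A = np.random.randn(100, 200)
D = squareform(pdist(A, 'euclidean'))  # N x N Euclidean distance matrix

And a sketch of the "separate routine called in parallel, passing only pointers" idea from the last two comments; this is my illustration, not code from the thread, and it assumes A is a C-contiguous double[:, ::1] memoryview so that each row is laid out contiguously behind its first element:

from libc.math cimport sqrt

cdef inline double row_dist(double* a, double* b, Py_ssize_t n) nogil:
    # Euclidean distance between two rows, given raw pointers to their first
    # elements; 'inline' hints the C compiler to substitute the body into
    # the caller
    cdef Py_ssize_t k
    cdef double tmpss = 0.0, diff
    for k in range(n):
        diff = a[k] - b[k]
        tmpss += diff * diff
    return sqrt(tmpss)

# inside the prange loop body it would be called with pointers into the rows:
#     D[ii, jj] = row_dist(&A[ii, 0], &A[jj, 0], ncol)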