Why is this numba code 6x slower than numpy code?


Is there any reason why the following code runs in 2 s,

def euclidean_distance_square(x1, x2):
    return -2*np.dot(x1, x2.T) + np.expand_dims(np.sum(np.square(x1), axis=1), axis=1) + np.sum(np.square(x2), axis=1)
whereas the following numba code runs in 12 s?

@jit(nopython=True)
def euclidean_distance_square(x1, x2):
   return -2*np.dot(x1, x2.T) + np.expand_dims(np.sum(np.square(x1), axis=1), axis=1) + np.sum(np.square(x2), axis=1)
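My x1 is a matrix of dimension (1, 512) and x2 is a matrix of dimension (3000000, 512). It is quite strange that numba can be so much slower. Am I using it wrong?

I really need to speed this up because I need to run this function 3 million times, and 2 s is still far too slow.

I need to run this on the CPU because, as you can see, x2 is so large that it cannot be loaded onto a GPU (or at least not my GPU); there is not enough memory.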
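For a sense of scale, the footprint of x2 can be estimated without allocating it. This is only a sketch: it assumes float64 and float32 as candidate dtypes (the question does not state the dtype), and `array_nbytes` is a hypothetical helper, not a NumPy function.

```python
import numpy as np

def array_nbytes(shape, dtype):
    """Bytes needed for a dense array of this shape and dtype (nothing is allocated)."""
    n_items = 1
    for dim in shape:
        n_items *= dim
    return n_items * np.dtype(dtype).itemsize

shape = (3_000_000, 512)
for dtype in (np.float64, np.float32):
    gib = array_nbytes(shape, dtype) / 2**30
    print(f"{np.dtype(dtype).name}: {gib:.1f} GiB")
# float64: 11.4 GiB
# float32: 5.7 GiB
```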

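"It is quite strange that numba can be so much slower."

That's not too strange. When you call NumPy functions inside a numba function, you call the numba versions of these functions. These can be faster, slower, or just as fast as the NumPy versions. You might be lucky or you might be unlucky. But even inside the numba function you still create lots of temporary arrays because you use the NumPy functions (one temporary array for the dot result, one for each square and sum, one for the dot plus the first sum), so you don't take advantage of what numba makes possible.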
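To make the temporaries visible, the one-line expression from the question can be split into named intermediates. This is just an illustrative sketch on small random arrays (the shapes and the seed are arbitrary choices); it also checks the expanded form against the direct definition of the squared distance.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.random((1, 4))
x2 = rng.random((6, 4))

dot = np.dot(x1, x2.T)                          # temporary, shape (1, 6)
sq1 = np.sum(np.square(x1), axis=1)[:, None]    # temporaries for the square and the sum, shape (1, 1)
sq2 = np.sum(np.square(x2), axis=1)             # temporaries for the square and the sum, shape (6,)
expanded = -2 * dot + sq1 + sq2                 # each * and + allocates another temporary

# Direct definition: squared distance between x1 and every row of x2.
direct = np.sum((x1 - x2) ** 2, axis=1)
np.testing.assert_allclose(expanded[0], direct)
```

A hand-written loop can accumulate each output element in a scalar instead, which is what the numba versions below do.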

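"Am I using it wrong?"

Essentially: yes.

"I really need to speed this up"

Okay, I'll give it a try.

Let's start by unrolling the sum-of-squares-along-axis-1 calls: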

import numpy as np
import numba as nb

@nb.njit
def sum_squares_2d_array_along_axis1(arr):
    res = np.empty(arr.shape[0], dtype=arr.dtype)
    for o_idx in range(arr.shape[0]):
        sum_ = 0
        for i_idx in range(arr.shape[1]):
            sum_ += arr[o_idx, i_idx] * arr[o_idx, i_idx]
        res[o_idx] = sum_
    return res


@nb.njit
def euclidean_distance_square_numba_v1(x1, x2):
    return -2 * np.dot(x1, x2.T) + np.expand_dims(sum_squares_2d_array_along_axis1(x1), axis=1) + sum_squares_2d_array_along_axis1(x2)
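On my computer that's already 2 times faster than the NumPy code and almost 10 times faster than your original Numba code.

Speaking from experience, getting it 2x faster than NumPy is generally the limit (at least if the NumPy version isn't needlessly complicated or inefficient), but you can squeeze out a bit more by unrolling everything: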

import numba as nb

@nb.njit
def euclidean_distance_square_numba_v2(x1, x2):
    f1 = 0.
    for i_idx in range(x1.shape[1]):
        f1 += x1[0, i_idx] * x1[0, i_idx]

    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0
        for i_idx in range(x2.shape[1]):
            val_from_x2 = x2[o_idx, i_idx]
            val += (-2) * x1[0, i_idx] * val_from_x2 + val_from_x2 * val_from_x2
        val += f1
        res[o_idx] = val
    return res
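But that only gives a ~10-20% improvement over the previous approach.

At that point you might realize that you can simplify the code (even though it probably won't speed it up) by accumulating the squared differences directly instead of expanding them. That variant is euclidean_distance_square_numba_v3, whose code is listed further down this page together with the later variants.

Yeah, that looks pretty straightforward and it's not really slower.

However, in all the excitement I forgot to mention the obvious solution: scipy's distance.cdist, which has a sqeuclidean (squared Euclidean distance) option. It is called as distance.cdist(x1, x2, metric='sqeuclidean') in the tests below.

It's not really faster than numba, but it's available without having to write your own function.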
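A minimal usage sketch of that call, assuming SciPy is installed and reusing a shortened version of the small arrays from the correctness test:

```python
import numpy as np
from scipy.spatial import distance

x1 = np.array([[1., 2., 3.]])
x2 = np.array([[1., 2., 3.], [2., 3., 4.], [3., 4., 5.]])

# One row per row of x1, one column per row of x2.
d = distance.cdist(x1, x2, metric='sqeuclidean')
print(d.shape)  # (1, 3)
print(d)        # squared distances 0, 3 and 12
```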

Tests

Testing correctness and warming up the JIT:

import numpy as np
from scipy.spatial import distance

# euclidean_distance_square_numba_original is the @jit(nopython=True) function
# from the question, renamed so that it does not shadow the NumPy version.
x1 = np.array([[1.,2,3]])
x2 = np.array([[1.,2,3], [2,3,4], [3,4,5], [4,5,6], [5,6,7]])

res1 = euclidean_distance_square(x1, x2)
res2 = euclidean_distance_square_numba_original(x1, x2)
res3 = euclidean_distance_square_numba_v1(x1, x2)
res4 = euclidean_distance_square_numba_v2(x1, x2)
res5 = euclidean_distance_square_numba_v3(x1, x2)
np.testing.assert_array_equal(res1, res2)
np.testing.assert_array_equal(res1, res3)
np.testing.assert_array_equal(res1[0], res4)
np.testing.assert_array_equal(res1[0], res5)
np.testing.assert_almost_equal(res1, distance.cdist(x1, x2, metric='sqeuclidean'))

Timings:

x1 = np.random.random((1, 512))
x2 = np.random.random((1000000, 512))

%timeit euclidean_distance_square(x1, x2)
# 2.09 s ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_original(x1, x2)
# 10.9 s ± 158 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v1(x1, x2)
# 907 ms ± 7.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v2(x1, x2)
# 715 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v3(x1, x2)
# 731 ms ± 34.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit distance.cdist(x1, x2, metric='sqeuclidean')
# 706 ms ± 4.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
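The %timeit lines above are IPython magics. Outside IPython, a rough equivalent can be put together with the standard-library timeit module. The following is only a sketch: `best_time` is a hypothetical helper, and x2 is made much smaller than in the measurements above so that it finishes quickly.

```python
import timeit
import numpy as np

def euclidean_distance_square(x1, x2):
    return -2*np.dot(x1, x2.T) + np.expand_dims(np.sum(np.square(x1), axis=1), axis=1) + np.sum(np.square(x2), axis=1)

x1 = np.random.random((1, 512))
x2 = np.random.random((10_000, 512))  # deliberately small; the timings above use 1,000,000 rows

def best_time(func, *args, repeat=5, number=3):
    """Best average seconds per call over `repeat` rounds of `number` calls."""
    func(*args)  # warm-up call; for a numba function this triggers compilation
    times = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(times) / number

print(f"{best_time(euclidean_distance_square, x1, x2) * 1e3:.2f} ms per call")
```

The warm-up call matters for the numba variants, because the first call includes compilation time.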

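Note: If you have integer arrays you might want to change the hard-coded 0.0 in the numba functions to 0.

Although @MSeifert's answer makes this answer quite obsolete, I'm still posting it because it explains in more detail why the numba version was slower than the numpy version.

As we will see, the main culprit is the different memory access patterns of numpy and numba.

We can reproduce the behavior with a much simpler function: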

import numpy as np
import numba as nb

def just_sum(x2):
    return np.sum(x2, axis=1)

@nb.jit('double[:](double[:, :])', nopython=True)
def nb_just_sum(x2):
    return np.sum(x2, axis=1)

x = np.random.random((2048, 2048))

And now the timings:

>>> %timeit just_sum(x)
2.33 ms ± 71.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit nb_just_sum(x)
33.7 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
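That means numpy is about 15 times faster!

When compiling the numba code with annotations (e.g. numba --annotate-html sum.html numba_sum.py) we can see how numba performs the sum (the whole listing of the summation is in the appendix):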

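  • initialize the result column
  • add the whole first column to the result column
  • add the whole second column to the result column
  • and so on

What is the problem with this approach? The memory layout! The array is stored in row-major order, so reading it column-wise leads to many more cache misses than reading it row-wise (which is what numpy does). Cache effects of this kind are well documented.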
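The two access orders can be sketched in plain NumPy. This is an illustration of the access pattern only, not numba's or numpy's actual implementation; the function names are made up, and on an array this small no cache effect is measurable.

```python
import numpy as np

x = np.random.random((4, 3))   # C order (row-major) by default
print(x.strides)                # (24, 8): the next row is 24 bytes away, the next column 8

def sum_axis1_by_columns(a):
    """Column-by-column strategy described above: add one column at a time."""
    result = np.zeros(a.shape[0], dtype=a.dtype)
    for j in range(a.shape[1]):
        result += a[:, j]       # strided read: a.strides[0] bytes between elements
    return result

def sum_axis1_by_rows(a):
    """Row-wise order: each row is one contiguous block of memory."""
    result = np.empty(a.shape[0], dtype=a.dtype)
    for i in range(a.shape[0]):
        result[i] = a[i, :].sum()
    return result

np.testing.assert_allclose(sum_axis1_by_columns(x), x.sum(axis=1))
np.testing.assert_allclose(sum_axis1_by_rows(x), x.sum(axis=1))
```

Both orders give the same result; they differ only in how far apart consecutive reads are in memory, which is what matters for a 2048 x 2048 array.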

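As we can see, numba's sum implementation is not very mature yet. However, from the above consideration, the numba implementation could be competitive for column-major order (i.e. the transposed matrix).

And it really is: for x.T, nb_just_sum takes 3.58 ms against 3.09 ms for just_sum.

As @MSeifert's code has shown, the main advantage of numba is that it lets us reduce the number of temporary numpy arrays. However, some things that look easy are not easy at all, and a naive solution can be pretty bad. Building a sum is such an operation: one should not assume that a simple loop is good enough.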


Listing of the numba summation:

Function name: array_sum_impl_axis
in file: /home/ed/anaconda3/lib/python3.6/site-packages/numba/targets/arraymath.py
with signature: (array(float64, 2d, A), int64) -> array(float64, 1d, C)

def array_sum_impl_axis(arr, axis):
    ndim = arr.ndim

    if not is_axis_const:
        # Catch where axis is negative or greater than 3.
        if axis < 0 or axis > 3:
            raise ValueError("Numba does not support sum with axis"
                             "parameter outside the range 0 to 3.")

    # Catch the case where the user misspecifies the axis to be
    # more than the number of the array's dimensions.
    if axis >= ndim:
        raise ValueError("axis is out of bounds for array")

    # Convert the shape of the input array to a list.
    ashape = list(arr.shape)
    # Get the length of the axis dimension.
    axis_len = ashape[axis]
    # Remove the axis dimension from the list of dimensional lengths.
    ashape.pop(axis)
    # Convert this shape list back to a tuple using above intrinsic.
    ashape_without_axis = _create_tuple_result_shape(ashape, arr.shape)
    # Tuple needed here to create output array with correct size.
    result = np.full(ashape_without_axis, zero, type(zero))

    # Iterate through the axis dimension.
    for axis_index in range(axis_len):
        if is_axis_const:
            # constant specialized version works for any valid axis value
            index_tuple_generic = _gen_index_tuple(arr.shape, axis_index,
                                                   const_axis_val)
            result += arr[index_tuple_generic]
        else:
            # Generate a tuple used to index the input array.
            # The tuple is ":" in all dimensions except the axis
            # dimension where it is "axis_index".
            if axis == 0:
                index_tuple1 = _gen_index_tuple(arr.shape, axis_index, 0)
                result += arr[index_tuple1]
            elif axis == 1:
                index_tuple2 = _gen_index_tuple(arr.shape, axis_index, 1)
                result += arr[index_tuple2]
            elif axis == 2:
                index_tuple3 = _gen_index_tuple(arr.shape, axis_index, 2)
                result += arr[index_tuple3]
            elif axis == 3:
                index_tuple4 = _gen_index_tuple(arr.shape, axis_index, 3)
                result += arr[index_tuple4]

    return result
    
Timings for the transposed matrix (x.T):

>>> %timeit just_sum(x.T)
3.09 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit nb_just_sum(x.T)
3.58 ms ± 45.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
# Ask LLVM to print loop-vectorizer debug output while numba compiles the functions below.
import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')
    
# v3: the simplified variant by @MSeifert, accumulating the squared differences directly.
@nb.njit
def euclidean_distance_square_numba_v3(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res
    
# v4: the same loop as v3, compiled with fastmath=True.
@nb.njit(fastmath=True)
def euclidean_distance_square_numba_v4(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

# v5: v4 plus parallel=True, with nb.prange over the rows of x2.
@nb.njit(fastmath=True,parallel=True)
def euclidean_distance_square_numba_v5(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in nb.prange(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res
    
float64
x1 = np.random.random((1, 512))
x2 = np.random.random((1000000, 512))

0.42  v3 (@MSeifert)
0.25  v4
0.18  v5 (parallel version)
0.48  distance.cdist

float32
x1 = np.random.random((1, 512)).astype(np.float32)
x2 = np.random.random((1000000, 512)).astype(np.float32)

0.09  v5
    
# v6: explicit signature that requires C-contiguous inputs (::1).
@nb.njit('double[:](double[:, ::1],double[:, ::1])',fastmath=True)
def euclidean_distance_square_numba_v6(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

# v7: explicit signature that accepts arbitrarily strided inputs.
@nb.njit('double[:](double[:, :],double[:, :])',fastmath=True)
def euclidean_distance_square_numba_v7(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res
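v6 and v7 differ only in their signature strings: in numba's signature syntax, `double[:, ::1]` declares a C-contiguous 2-D array, while `double[:, :]` accepts any strided layout. The following NumPy-only sketch shows how to check which case an array falls into; the names `view` and `packed` are made up for illustration.

```python
import numpy as np

x2 = np.random.random((1000, 512))
view = x2[:, ::2]                      # every second column: a strided, non-contiguous view

print(x2.flags['C_CONTIGUOUS'])        # True
print(view.flags['C_CONTIGUOUS'])      # False
print(view.strides)                    # (4096, 16)

packed = np.ascontiguousarray(view)    # copies the data into a fresh C-contiguous buffer
print(packed.flags['C_CONTIGUOUS'])    # True
print(packed.strides)                  # (2048, 8)
```

As far as I know, a function compiled only for the `::1` signature rejects an array like `view` with a TypeError, whereas `packed` is accepted; the price is the extra copy.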