Improving the runtime of Python NumPy code


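I have code that re-bins a large `numpy` array. Basically, the elements of the large array were sampled at different frequencies, and the end goal is to resample the whole array onto the fixed bins `freq_bins`. The code is somewhat slow for the arrays I have. Is there a good way to improve its runtime? A speedup by a factor of a few would be enough for now. Perhaps some `numba` magic would do it.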

import numpy as np
import time
division = 90
freq_division = 50
cd = 3000
boost_factor = np.random.rand(division, division, cd)
freq_bins = np.linspace(1, 60, freq_division)
es = np.random.randint(1,10, size = (cd, freq_division))
final_emit = np.zeros((division, division, freq_division))
time1 = time.time()
for i in xrange(division):
    fre_boost = np.einsum('ij, k->ijk', boost_factor[i], freq_bins)
    sky_by_cap = np.einsum('ij, jk->ijk', boost_factor[i],es)
    freq_index = np.digitize(fre_boost, freq_bins)
    freq_index_reshaped = freq_index.reshape(division*cd, -1)
    freq_index = None
    sky_by_cap_reshaped = sky_by_cap.reshape(freq_index_reshaped.shape)
    to_bin_emit = np.zeros(freq_index_reshaped.shape)
    row_index = np.arange(freq_index_reshaped.shape[0]).reshape(-1, 1)
    np.add.at(to_bin_emit, (row_index, freq_index_reshaped), sky_by_cap_reshaped)
    to_bin_emit = to_bin_emit.reshape(fre_boost.shape)
    to_bin_emit = np.multiply(to_bin_emit, freq_bins, out=to_bin_emit)
    final_emit[i] = np.sum(to_bin_emit, axis=1)
print(time.time()-time1)

I think replacing `einsum` with plain broadcast multiplication can improve performance slightly.

import numpy as np
import time
division = 90
freq_division = 50
cd = 3000
boost_factor = np.random.rand(division, division, cd)
freq_bins = np.linspace(1, 60, freq_division)
es = np.random.randint(1,10, size = (cd, freq_division))
final_emit = np.zeros((division, division, freq_division))
time1 = time.time()
for i in xrange(division):
    fre_boost = boost_factor[i][:, :, None]*freq_bins[None, None, :]
    sky_by_cap = boost_factor[i][:, :, None]*es[None, :, :]
    freq_index = np.digitize(fre_boost, freq_bins)
    freq_index_reshaped = freq_index.reshape(division*cd, -1)
    freq_index = None
    sky_by_cap_reshaped = sky_by_cap.reshape(freq_index_reshaped.shape)
    to_bin_emit = np.zeros(freq_index_reshaped.shape)
    row_index = np.arange(freq_index_reshaped.shape[0]).reshape(-1, 1)
    np.add.at(to_bin_emit, (row_index, freq_index_reshaped), sky_by_cap_reshaped)
    to_bin_emit = to_bin_emit.reshape(fre_boost.shape)
    to_bin_emit = np.multiply(to_bin_emit, freq_bins, out=to_bin_emit)
    final_emit[i] = np.sum(to_bin_emit, axis=1)
print(time.time()-time1)
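As a quick sanity check on that substitution, here is a minimal sketch showing that the two `einsum` calls and the broadcast products yield the same arrays. The small shapes and the names `a`, `bins`, and `es_small` are stand-ins chosen for illustration, not the sizes from the question.

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.rand(4, 6)                          # stands in for boost_factor[i]
bins = np.linspace(1, 60, 5)                # stands in for freq_bins
es_small = rng.randint(1, 10, size=(6, 5))  # stands in for es

# Outer product over a new last axis: einsum vs. broadcasting.
outer_einsum = np.einsum('ij,k->ijk', a, bins)
outer_bcast = a[:, :, None] * bins[None, None, :]

# Product sharing the j axis: einsum vs. broadcasting.
prod_einsum = np.einsum('ij,jk->ijk', a, es_small)
prod_bcast = a[:, :, None] * es_small[None, :, :]

same_outer = np.allclose(outer_einsum, outer_bcast)
same_prod = np.allclose(prod_einsum, prod_bcast)
print(same_outer, same_prod)
```

Whether the broadcast form is actually faster depends on the array sizes and the NumPy version, so it is worth timing both on the real data.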

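Your code is quite slow in `np.add.at`. I believe using `np.bincount` could be faster, although I couldn't get it to work with the multidimensional arrays you have. Maybe someone here can add to this.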
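One common way to apply `np.bincount` row by row is to shift each row's bin indices into its own block of a flat index space, count once, and reshape. The sketch below illustrates that idea on small made-up data; the helper name `rowwise_bincount` is hypothetical, and it assumes every index lies in `[0, n_bins)`, as it does in the question where `boost_factor` is below 1.

```python
import numpy as np

def rowwise_bincount(indices, weights, n_bins):
    """Sum weights into n_bins bins independently for every row."""
    n_rows = indices.shape[0]
    # Shift row r's indices by r * n_bins so rows never share a bin.
    flat = indices + n_bins * np.arange(n_rows)[:, None]
    out = np.bincount(flat.ravel(), weights=weights.ravel(),
                      minlength=n_rows * n_bins)
    return out.reshape(n_rows, n_bins)

rng = np.random.RandomState(0)
idx = rng.randint(0, 5, size=(4, 7))
w = rng.rand(4, 7)

# Reference result using the np.add.at pattern from the question.
expected = np.zeros((4, 5))
np.add.at(expected, (np.arange(4)[:, None], idx), w)

result = rowwise_bincount(idx, w, 5)
print(np.allclose(result, expected))
```

In the question's loop, `indices` would correspond to `freq_index_reshaped` and `weights` to `sky_by_cap_reshaped`; any speedup over `np.add.at` should be measured rather than assumed.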

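This seems trivially parallelizable:

- You have an outer loop that runs 90 times.
- On each pass, you aren't mutating any shared arrays except `final_emit`, and even there you only store into a unique row.
- It looks like most of the work inside the loop is numpy array-wide operations, which release the GIL.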

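So the loop body can be moved into a function and handed to a thread pool with `ThreadPoolExecutor.map` (using the `futures` backport of `concurrent.futures`, since you seem to be on 2.7).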
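A minimal sketch of what such a `map`-based version might look like is given here. It is an assumption about the shape of the code, not the answer's original listing: the array sizes are shrunk so it runs instantly, `range` is used in place of `xrange`, and `from concurrent import futures` is used as the import.

```python
from concurrent import futures
import numpy as np

division, freq_division, cd = 6, 5, 8  # small stand-in sizes
rng = np.random.RandomState(0)
boost_factor = rng.rand(division, division, cd)
freq_bins = np.linspace(1, 60, freq_division)
es = rng.randint(1, 10, size=(cd, freq_division))
final_emit = np.zeros((division, division, freq_division))

def dostuff(i):
    """Body of the original loop for one i; returns the row instead of storing it."""
    fre_boost = boost_factor[i][:, :, None] * freq_bins[None, None, :]
    sky_by_cap = boost_factor[i][:, :, None] * es[None, :, :]
    idx = np.digitize(fre_boost, freq_bins).reshape(division * cd, -1)
    to_bin_emit = np.zeros(idx.shape)
    rows = np.arange(idx.shape[0]).reshape(-1, 1)
    np.add.at(to_bin_emit, (rows, idx), sky_by_cap.reshape(idx.shape))
    to_bin_emit = to_bin_emit.reshape(fre_boost.shape) * freq_bins
    return np.sum(to_bin_emit, axis=1)

with futures.ThreadPoolExecutor(max_workers=4) as x:
    # map yields results in submission order, so enumerate recovers i.
    for i, row in enumerate(x.map(dostuff, range(division))):
        final_emit[i] = row
```

Only the main thread writes into `final_emit` here, so no locking is needed in this variant.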

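If this works, there are two tweaks to try, either of which might be more efficient. We don't really care what order the results come back in, but `map` queues them up in order. That can waste a bit of space and time. I don't think it will make much difference (presumably the vast majority of your time is spent on the calculations, not on writing out the results), but without profiling your code it's hard to be sure. So, there are two easy ways around this problem.

Using `as_completed` lets us consume the results in whatever order they finish, rather than in the order we queued them. Something like this: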

def dostuff(i):
    fre_boost = np.einsum('ij, k->ijk', boost_factor[i], freq_bins)
    # ...
    to_bin_emit = np.multiply(to_bin_emit, freq_bins, out=to_bin_emit)
    return i, np.sum(to_bin_emit, axis=1)

with futures.ThreadPoolExecutor(max_workers=8) as x:
    fs = [x.submit(dostuff, i) for i in xrange(division)]
    # as_completed yields finished Future objects, so unpack each result.
    for f in futures.as_completed(fs):
        i, row = f.result()
        final_emit[i] = row

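Alternatively, we can have the function insert the rows directly instead of returning them. That means we are now mutating a shared object from multiple threads. So I think we need a lock here, although I'm not positive (numpy's rules are a bit complicated, and I haven't read your code all that thoroughly…). But that probably won't hurt performance significantly, and it's simple. So: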

import numpy as np
import threading
# etc.
final_emit = np.zeros((division, division, freq_division))
final_emit_lock = threading.Lock()

def dostuff(i):
    fre_boost = np.einsum('ij, k->ijk', boost_factor[i], freq_bins)
    # ...
    to_bin_emit = np.multiply(to_bin_emit, freq_bins, out=to_bin_emit)
    with final_emit_lock:
        final_emit[i] = np.sum(to_bin_emit, axis=1)

with futures.ThreadPoolExecutor(max_workers=8) as x:
    # Consume the iterator so exceptions raised in a worker are re-raised here.
    list(x.map(dostuff, xrange(division)))

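The `max_workers=8` in all of my examples should be tuned for your machine. Too many threads is bad, because they start fighting each other instead of parallelizing; too few threads is even worse, because some of your cores just sit idle.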

If you want this to run on a variety of machines, rather than tuning it for each one, the best guess (for 2.7) is usually:

import multiprocessing
# ...
with futures.ThreadPoolExecutor(max_workers=multiprocessing.cpu_count()) as x:

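But if you want to squeeze the maximum performance out of a specific machine, you should test different values. In particular, for a typical quad-core laptop, the ideal value can be anywhere from 4 to 8, depending on the exact work you're doing, and it's easier to just try all the values than to try to predict.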
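Trying the candidate values can be automated with a small timing loop. The following is a rough sketch under stated assumptions: `time_pool` and `task` are hypothetical names, and the sort-based workload is only a placeholder for the real `dostuff`.

```python
from concurrent import futures
import time
import numpy as np

def time_pool(task, n_tasks, worker_counts):
    """Return {max_workers: elapsed seconds} for running task(0), ..., task(n_tasks - 1)."""
    timings = {}
    for n in worker_counts:
        start = time.time()
        with futures.ThreadPoolExecutor(max_workers=n) as x:
            list(x.map(task, range(n_tasks)))  # wait for every result
        timings[n] = time.time() - start
    return timings

data = np.random.rand(16, 200, 200)

def task(i):
    # Placeholder workload (assumption): substitute the real per-row function.
    return np.sort(data[i], axis=1).sum()

timings = time_pool(task, data.shape[0], [1, 2, 4, 8])
best = min(timings, key=timings.get)
print(timings, best)
```

Single runs are noisy, so repeating each measurement a few times and comparing the minimum gives a more reliable picture.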

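Keep the code simple, then optimize

If you know what algorithm you want to code, write a simple reference implementation. From there, you can go two ways in Python: you can try to vectorize the code, or you can compile the code to get good performance.

Even if `np.einsum` or `np.add.at` were implemented in Numba, any compiler would have a hard time generating efficient binary code from your example.

The only thing I rewrote is a more efficient approach to digitize for scalar values.

Edit

In the Numba source code there is a more efficient implementation of digitize for scalar values.

Code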

import numba as nb
import numpy as np

#From Numba source
#Copyright (c) 2012, Anaconda, Inc.
#All rights reserved.
@nb.njit(fastmath=True)
def digitize(x, bins, right=False):
    # bins are monotonically-increasing
    n = len(bins)
    lo = 0
    hi = n

    if right:
        if np.isnan(x):
            # Find the first nan (i.e. the last from the end of bins,
            # since there shouldn't be many of them in practice)
            for i in range(n, 0, -1):
                if not np.isnan(bins[i - 1]):
                    return i
            return 0
        while hi > lo:
            mid = (lo + hi) >> 1
            if bins[mid] < x:
                # mid is too low => narrow to upper bins
                lo = mid + 1
            else:
                # mid is too high, or is a NaN => narrow to lower bins
                hi = mid
    else:
        if np.isnan(x):
            # NaNs end up in the last bin
            return n
        while hi > lo:
            mid = (lo + hi) >> 1
            if bins[mid] <= x:
                # mid is too low => narrow to upper bins
                lo = mid + 1
            else:
                # mid is too high, or is a NaN => narrow to lower bins
                hi = mid

    return lo

# Simpler linear-scan version. Keep only one of the two digitize
# definitions: whichever is defined last is the one inner_loop uses.
@nb.njit(fastmath=True)
def digitize(value, bins):
    if value<bins[0]:
        return 0

    if value>=bins[bins.shape[0]-1]:
        return bins.shape[0]

    for l in range(1,bins.shape[0]):
        if value>=bins[l-1] and value<bins[l]:
            return l

@nb.njit(fastmath=True,parallel=True)
def inner_loop(boost_factor,freq_bins,es):
    res=np.zeros((boost_factor.shape[0],freq_bins.shape[0]),dtype=np.float64)
    for i in nb.prange(boost_factor.shape[0]):
        for j in range(boost_factor.shape[1]):
            for k in range(freq_bins.shape[0]):
                ind=nb.int64(digitize(boost_factor[i,j]*freq_bins[k],freq_bins))
                res[i,ind]+=boost_factor[i,j]*es[j,k]*freq_bins[ind]
    return res

@nb.njit(fastmath=True)
def calc_nb(division,freq_division,cd,boost_factor,freq_bins,es):
    final_emit = np.empty((division, division, freq_division),np.float64)
    for i in range(division):
        final_emit[i,:,:]=inner_loop(boost_factor[i],freq_bins,es)
    return final_emit

(Quadcore i7)
original_code: 118.5s
calc_nb:       4.14s
#with digitize implementation from Numba source
calc_nb:       2.66s
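To illustrate the "simple reference implementation" advice, here is a sketch of a plain-loop version of one outer iteration, checked against the vectorized steps from the question on a tiny input. The function names are hypothetical, `np.searchsorted(..., side='right')` is used as the scalar stand-in for `np.digitize` with increasing bins, and no Numba is involved, so this only verifies correctness, not speed.

```python
import numpy as np

def rebin_reference(boost_slice, freq_bins, es):
    """Plain-loop version of one outer iteration (one boost_factor[i] slice)."""
    n_rows, n_cd = boost_slice.shape
    n_bins = freq_bins.shape[0]
    res = np.zeros((n_rows, n_bins))
    for i in range(n_rows):
        for j in range(n_cd):
            for k in range(n_bins):
                value = boost_slice[i, j] * freq_bins[k]
                # Scalar equivalent of np.digitize(value, freq_bins).
                ind = np.searchsorted(freq_bins, value, side='right')
                res[i, ind] += boost_slice[i, j] * es[j, k] * freq_bins[ind]
    return res

def rebin_vectorized(boost_slice, freq_bins, es):
    """The same slice computed with the question's digitize / np.add.at steps."""
    n_rows, n_cd = boost_slice.shape
    fre_boost = boost_slice[:, :, None] * freq_bins[None, None, :]
    sky_by_cap = boost_slice[:, :, None] * es[None, :, :]
    idx = np.digitize(fre_boost, freq_bins).reshape(n_rows * n_cd, -1)
    out = np.zeros(idx.shape)
    rows = np.arange(idx.shape[0]).reshape(-1, 1)
    np.add.at(out, (rows, idx), sky_by_cap.reshape(idx.shape))
    out = out.reshape(fre_boost.shape) * freq_bins
    return out.sum(axis=1)

rng = np.random.RandomState(0)
boost = rng.rand(3, 4)        # values below 1 keep every index in range
bins = np.linspace(1, 60, 5)
es_small = rng.randint(1, 10, size=(4, 5))

ref = rebin_reference(boost, bins, es_small)
vec = rebin_vectorized(boost, bins, es_small)
print(np.allclose(ref, vec))
```

A reference like this gives a fixed target to compare against whenever the vectorized or compiled version is changed.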

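Maybe explaining what this code does would help…

@Julien Sorry, I have added more explanation now.

On line 21 you have: `to_bin_emit = to_bin_emit.reshape(frequency_boost1.shape)`. Where is `frequency_boost1.shape` defined?

@TimothyLombard Sorry, that has been fixed.

About line 22: NameError: name 'TooBiNixEngress' is not defined. Your example appears to be standalone; maybe you should consider running your code snippet in a notebook before posting… I like using the new Google tool.

Thanks. However, I think the speedup from your suggestion is really small, since `einsum` doesn't take that much time. I was looking at `np.bincount`, but I think it's really hard to use correctly for multidimensional arrays, as you pointed out.

Wow, this sped up the whole code by about 3x. Could you elaborate on the last two tweaks you mentioned? I'm not quite sure how to implement them in the code.

@matttree OK, I've added (untested) examples for both. There are also some good examples in the linked docs (obviously they don't fit your code quite as well, but they're tested and well explained). Also, there's a note at the end.

@matttree I added the digitize implementation from the Numba source. This gives an additional speedup...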