python numba.guvectorize fails: "LV: Can't vectorize due to memory conflicts"
I'm trying to figure out how to use numba to generate numpy-style ufuncs for vectorized array operations. I noticed that performance was very slow, so I tried to debug it by enabling the loop-vectorizer diagnostics described in the numba FAQ.
Apparently my loops are not being vectorized because of memory conflicts (see the "LV:" debug output further down).
The same FAQ page says this happens "when the memory access pattern is non-trivial". It's not clear to me what that means, but the code I'm trying to vectorize looks pretty trivial to me:
from numba import guvectorize

@guvectorize(['void(f4[:,:], b1[:,:], f8, f4, f4[:,:])'],
             '(n,m), (n,m), (), () -> (n,m)', cache=True)
def enforce_cutoff(img, mask, max, nodata, out):
    # Masked pixels get the nodata value; all other pixels are capped just below max.
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            if mask[i,j]:
                out[i,j] = nodata
            else:
                if img[i,j]<max:
                    out[i,j] = img[i,j]
                else:
                    out[i,j] = max-0.1
One possible workaround is to make sure the arrays are C-contiguous. If they are not C-contiguous, they are copied.
Example
import numba as nb
import numpy as np

@nb.njit(cache=True, parallel=True)
def enforce_cutoff_2(img, mask, max, nodata, out):
    # create a contiguous copy if the array isn't C-contiguous
    img = np.ascontiguousarray(img)
    mask = np.ascontiguousarray(mask)
    for i in nb.prange(img.shape[0]):
        for j in range(img.shape[1]):
            if mask[i,j]:
                out[i,j] = nodata
            else:
                if img[i,j]<max:
                    out[i,j] = img[i,j]
                else:
                    out[i,j] = max-0.1
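To make "C-contiguous" concrete, here is a small NumPy-only sketch that inspects the memory layout of a freshly allocated array and of a strided slice (every second row and column). The array size is an arbitrary assumption chosen for illustration; only the documented `flags` and `strides` attributes and `np.ascontiguousarray` are used.

```python
import numpy as np

# A freshly allocated array is C-contiguous: the last axis has unit stride
# (4 bytes per float32 element).
full = np.zeros((2000, 2000), dtype=np.float32)
print(full.flags.c_contiguous, full.strides)    # True (8000, 4)

# Taking every second row and column gives a view with doubled strides,
# so neighbouring elements of a row are no longer adjacent in memory.
view = full[0:-1:2, 0:-1:2]
print(view.shape, view.flags.c_contiguous, view.strides)

# np.ascontiguousarray makes no copy when the input is already C-contiguous ...
same = np.ascontiguousarray(full)
# ... and returns a densely packed copy otherwise.
packed = np.ascontiguousarray(view)
print(np.shares_memory(same, full), np.shares_memory(packed, view))
print(packed.flags.c_contiguous, packed.strides)
```

With a generic 2D array type the strides are only known at run time, so the compiled loop cannot assume that consecutive `j` values touch adjacent memory. After the copy it can.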
Your access pattern is quite complicated. SIMD vectorization is possible if you explicitly declare C-contiguous arrays:
@nb.guvectorize(['void(f4[:,::1], b1[:,::1], f8, f4, f4[:,::1])'], '(n,m),(n,m),(),()->(n,m)', cache=True)
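Thanks @max9111, that allowed this code to vectorize. I appreciate the suggestion. Could you elaborate on what makes the memory access so complicated? I assume it is just that several arrays are accessed at the same time. I'm also trying to vectorize some other functions, at least one of which is much more complex than this one. What are the limits on "vectorizing" functions that coordinate several array accesses like this? Can you recommend some good resources to help me understand this better? Thanks!
If you really want SIMD-vectorized code on non-contiguous arrays, it is usually time to write explicit SIMD code using intrinsics. (It is exactly the same in C or Fortran, even for much simpler code than this.) If you cannot guarantee that the input arrays are contiguous, but they usually are, the simplest thing is usually to make them contiguous (via np.ascontiguousarray, which creates a contiguous copy if the input is not C-contiguous).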
Loop-vectorizer debug output referred to in the question:
LV: Checking a loop in "_ZN7AtmCorr18enforce_cutoff$241E5ArrayIfLi2E1A7mutable7alignedE5ArrayIbLi2E1A7mutable7alignedEdf5ArrayIfLi2E1A7mutable7alignedE" from enforce_cutoff
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B40.us
LV: Found an induction variable.
LV: Found an induction variable.
LV: Can't vectorize due to memory conflicts
LV: Not vectorizing: Cannot prove legality.
LV: Checking a loop in "__gufunc__._ZN7AtmCorr18enforce_cutoff$241E5ArrayIfLi2E1A7mutable7alignedE5ArrayIbLi2E1A7mutable7alignedEdf5ArrayIfLi2E1A7mutable7alignedE" from <numba.npyufunc.wrappers._GufuncWrapper object at 0x0000020A848A6438>
LV: Loop hints: force=? width=0 unroll=0
LV: Not vectorizing: Cannot prove legality.
LV: Checking a loop in "_ZN7AtmCorr18enforce_cutoff$241E5ArrayIfLi2E1A7mutable7alignedE5ArrayIbLi2E1A7mutable7alignedEdf5ArrayIfLi2E1A7mutable7alignedE" from <numba.npyufunc.wrappers._GufuncWrapper object at 0x0000020A848A6438>
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B40.us
LV: Found an induction variable.
LV: Found an induction variable.
LV: Found an induction variable.
LV: Found an induction variable.
LV: Did not find one integer induction var.
LV: Can't vectorize due to memory conflicts
LV: Not vectorizing: Cannot prove legality.
LV: Checking a loop in "_ZN7AtmCorr18enforce_cutoff$241E5ArrayIfLi2E1A7mutable7alignedE5ArrayIbLi2E1A7mutable7alignedEdf5ArrayIfLi2E1A7mutable7alignedE" from <numba.npyufunc.wrappers._GufuncWrapper object at 0x0000020A848A6438>
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: B20.us.us
LV: Found an induction variable.
LV: Can't vectorize due to memory conflicts
LV: Not vectorizing: Cannot prove legality.
# contiguous arrays (uses enforce_cutoff_2 from the example above)
img = np.random.rand(1000, 1000).astype(np.float32)
mask = np.random.rand(1000, 1000) > 0.5
max = 0.5
nodata = 1.
out = np.empty((img.shape[0], img.shape[1]), dtype=np.float32)
%timeit enforce_cutoff_2(img, mask, max, nodata, out)
# single-threaded
# 678 µs ± 3.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# parallel
# 143 µs ± 1.87 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# non-contiguous arrays
img = np.random.rand(2000, 2000).astype(np.float32)
mask = np.random.rand(2000, 2000) > 0.5
img = img[0:-1:2, 0:-1:2]
mask = mask[0:-1:2, 0:-1:2]
max = 0.5
nodata = 1.
out = np.empty((img.shape[0], img.shape[1]), dtype=np.float32)
%timeit enforce_cutoff_2(img, mask, max, nodata, out)
# single-threaded
# with contiguous copy
# 1.78 ms ± 9.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# without contiguous copy
# 5.76 ms ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# parallel
# with contiguous copy
# 1.42 ms ± 7.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# without contiguous copy
# 1.08 ms ± 75.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
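`%timeit` only works inside IPython or Jupyter. As a rough sketch for plain Python scripts, the hypothetical helper below (`best_time` is a name introduced here, not part of NumPy or numba) times any callable with the standard-library `timeit` module. To stay self-contained it is demonstrated on a pure-NumPy reference implementation of the same cutoff logic (`reference_cutoff`, also an assumption of this sketch), which can double as a correctness check for the compiled versions. No timings are quoted because they depend on the machine.

```python
import timeit
import numpy as np


def reference_cutoff(img, mask, cutoff, nodata):
    """Pure-NumPy version of the cutoff logic, useful for checking results."""
    capped = np.where(img < cutoff, img, np.float32(cutoff - 0.1))
    return np.where(mask, np.float32(nodata), capped).astype(np.float32)


def best_time(func, *args, repeat=5, number=10):
    """Return the best observed time per call in seconds."""
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(timings) / number


img = np.random.rand(1000, 1000).astype(np.float32)
mask = np.random.rand(1000, 1000) > 0.5

out = reference_cutoff(img, mask, 0.5, 1.0)
seconds = best_time(reference_cutoff, img, mask, 0.5, 1.0)
print(f"reference_cutoff: {seconds * 1e3:.3f} ms per call")
```

When timing a numba function this way, call it once before measuring so that compilation time is not included in the result.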