Python numpy ufuncs speed vs for loop speed
I've read a lot of "avoid for loops with numpy" advice, so I tried it. I was using this code (simplified version). Some auxiliary data:
In[1]: import numpy as np
resolution = 1000 # this parameter varies
tim = np.linspace(-np.pi, np.pi, resolution)
prec = np.arange(1, resolution + 1)
prec = 2 * prec - 1
values = np.zeros_like(tim)
My first implementation used a for loop:
In[2]: for i, ti in enumerate(tim):
    values[i] = np.sum(np.sin(prec * ti))
Then I got rid of the explicit for loop and arrived at this:
In[3]: values = np.sum(np.sin(tim[:, np.newaxis] * prec), axis=1)
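This solution was faster for small arrays, but when I scaled up I got the following time dependence:
[plot of run time against resolution not included]
What am I missing, or is this normal behaviour? If it is not, where should I dig?
EDIT: As requested in the comments, here is some additional information. The times were measured with IPython's %timeit and %%timeit, and every run was executed on a fresh kernel. My laptop is an acer aspire v7-482pg (i7, 8GB). I'm using: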
- python 3.5.2
- numpy 1.11.2+mkl
- Windows 10
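For readers who want to reproduce the comparison outside IPython, here is a minimal sketch that uses the standard timeit module instead of %timeit. The names make_inputs, loop_version and broadcast_version, as well as the sample resolutions, are illustrative assumptions and not part of the original post; absolute timings will differ from machine to machine.

```python
import timeit

import numpy as np


def make_inputs(resolution):
    # Same auxiliary data as in the question (helper name is hypothetical).
    tim = np.linspace(-np.pi, np.pi, resolution)
    prec = 2 * np.arange(1, resolution + 1) - 1
    return tim, prec


def loop_version(tim, prec):
    # Explicit Python loop over the outer dimension.
    values = np.zeros_like(tim)
    for i, ti in enumerate(tim):
        values[i] = np.sum(np.sin(prec * ti))
    return values


def broadcast_version(tim, prec):
    # Broadcasting builds a (resolution, resolution) intermediate array.
    return np.sum(np.sin(tim[:, np.newaxis] * prec), axis=1)


if __name__ == "__main__":
    runs = 3
    for resolution in (100, 500, 1000):  # arbitrary sample sizes
        tim, prec = make_inputs(resolution)
        t_loop = min(timeit.repeat(lambda: loop_version(tim, prec),
                                   number=runs, repeat=3)) / runs
        t_bc = min(timeit.repeat(lambda: broadcast_version(tim, prec),
                                 number=runs, repeat=3)) / runs
        print(f"resolution={resolution}: loop {t_loop:.6f} s, "
              f"broadcast {t_bc:.6f} s")
```

Taking the minimum over several repeats roughly mirrors the "best of 3" figure that %timeit reports.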
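This is normal and expected behaviour. It is simply too simplistic to apply the "avoid for loops with numpy" statement everywhere. If you're dealing with inner loops it is (almost) always true. But in the case of outer loops (as in your case) there are far more exceptions, especially if the alternative is to use broadcasting, because that speeds up your operation by using a lot more memory.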
Just to add a bit of background to that "avoid for loops with numpy" statement:
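NumPy arrays are stored as contiguous arrays with C types. The Python int is not the same as a C int! So whenever you iterate over each item in an array, you need to fetch the item from the array, convert it to a Python int, do whatever you want to do with it, and finally you may need to convert it back to a C integer (this is called boxing and unboxing the value). For example, suppose you want to sum the items in an array using Python: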
import numpy as np
arr = np.arange(1000)
%%timeit
acc = 0
for item in arr:
    acc += item
# 1000 loops, best of 3: 478 µs per loop
You had better use numpy:
%timeit np.sum(arr)
# 10000 loops, best of 3: 24.2 µs per loop
Even if you push the loop into Python's C code, you are still far away from the numpy performance:
%timeit sum(arr)
# 1000 loops, best of 3: 387 µs per loop
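There might be exceptions to this rule, but they are going to be really rare, at least as long as some equivalent numpy functionality exists. So if you would otherwise iterate over single elements, you should use numpy.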
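The boxing step described above can be observed directly. The following is only a small sketch (the helper name python_sum is made up for illustration): it shows that every element pulled out of the array arrives wrapped in a freshly created scalar object. Strictly speaking NumPy hands back one of its own scalar types rather than the built-in int, but the per-element wrapping cost is the same idea.

```python
import numpy as np

arr = np.arange(1000)

# Indexing or iterating does not expose the raw C value: each element is
# wrapped ("boxed") in a newly created NumPy scalar object.
first = arr[0]
print(type(first))                    # a NumPy integer scalar type
print(isinstance(first, np.generic))  # True


def python_sum(a):
    # One boxed object is created per iteration, which is what makes this slow.
    acc = 0
    for item in a:
        acc += item
    return acc


total_loop = python_sum(arr)
total_numpy = np.sum(arr)  # the loop runs in C over the raw buffer
print(int(total_loop), int(total_numpy))  # 499500 499500
```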
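Sometimes a plain Python loop is enough. It's not widely advertised, but numpy functions have a huge overhead compared to Python functions. For example, consider a 3-element array: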
arr = np.arange(3)
%timeit np.sum(arr)
%timeit sum(arr)
Which one will be faster?
Solution: the Python function performs better than the numpy solution:
# 10000 loops, best of 3: 21.9 µs per loop <- numpy
# 100000 loops, best of 3: 6.27 µs per loop <- python
Your loop solution
Using line_profiler with resolution = 100:
def fun_func(tim, prec, values):
    for i, ti in enumerate(tim):
        values[i] = np.sum(np.sin(prec * ti))
%lprun -f fun_func fun_func(tim, prec, values)
Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
     1                                             def fun_func(tim, prec, values):
     2       101          752        7.4      5.7      for i, ti in enumerate(tim):
     3       100        12449      124.5     94.3          values[i] = np.sum(np.sin(prec * ti))
95% of the time is spent inside the loop. I even split the loop body into several parts to verify this:
def fun_func(tim, prec, values):
    for i, ti in enumerate(tim):
        x = prec * ti
        x = np.sin(x)
        x = np.sum(x)
        values[i] = x
%lprun -f fun_func fun_func(tim, prec, values)
Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
     1                                             def fun_func(tim, prec, values):
     2       101          609        6.0      3.5      for i, ti in enumerate(tim):
     3       100         4521       45.2     26.3          x = prec * ti
     4       100         4646       46.5     27.0          x = np.sin(x)
     5       100         6731       67.3     39.1          x = np.sum(x)
     6       100          714        7.1      4.1          values[i] = x
The time consumers here are np.multiply, np.sin and np.sum, as you can easily check by comparing their time per call with their overhead:
arr = np.ones(1, float)
%timeit np.sum(arr)
# 10000 loops, best of 3: 22.6 µs per loop
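So as soon as the cumulative function call overhead is small compared to the calculation runtime, you'll get similar runtimes. Even with 100 items you're quite close to the overhead time. The trick is to know at which point they break even. With 1000 items the call overhead is still significant: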
%lprun -f fun_func fun_func(tim, prec, values)
Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
     1                                             def fun_func(tim, prec, values):
     2      1001         5864        5.9      2.4      for i, ti in enumerate(tim):
     3      1000        42817       42.8     17.2          x = prec * ti
     4      1000       119327      119.3     48.0          x = np.sin(x)
     5      1000        73313       73.3     29.5          x = np.sum(x)
     6      1000         7287        7.3      2.9          values[i] = x
But with resolution = 5000 the overhead is quite low compared to the runtime:
Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
     1                                             def fun_func(tim, prec, values):
     2      5001        29412        5.9      0.9      for i, ti in enumerate(tim):
     3      5000       388827       77.8     11.6          x = prec * ti
     4      5000      2442460      488.5     73.2          x = np.sin(x)
     5      5000       441337       88.3     13.2          x = np.sum(x)
     6      5000        36187        7.2      1.1          values[i] = x
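When you spend 500 µs in each np.sin call, you no longer care about the 20 µs overhead.
A word of caution: line_profiler probably adds some overhead per line, and maybe also per function call, so the point at which the function call overhead becomes negligible might be lower.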
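Because line_profiler adds its own overhead, a cross-check with plain timeit can help locate the break-even point. The sketch below times np.sin on arrays of growing size and compares each result with the time for a one-element array, which serves as a rough estimate of the fixed per-call overhead. The helper name time_call, the array sizes and the repetition counts are assumptions chosen for illustration, and the lambda wrapper adds a small constant cost of its own.

```python
import timeit

import numpy as np


def time_call(func, arg, number=2000, repeat=5):
    # Best-of-`repeat` average time per call, in seconds.
    best = min(timeit.repeat(lambda: func(arg), number=number, repeat=repeat))
    return best / number


# The time for a one-element array approximates the fixed call overhead.
overhead = time_call(np.sin, np.ones(1))

for n in (1, 100, 1000, 5000):  # sizes mirroring the profiles above
    per_call = time_call(np.sin, np.ones(n))
    share = min(overhead / per_call, 1.0)
    print(f"n={n:>5}: {per_call * 1e6:8.2f} µs per call, "
          f"overhead share ≈ {share:.0%}")
```

Once the overhead share drops to a few percent, splitting the work into more or fewer NumPy calls should make little difference to the total runtime.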
Your broadcast solution
I started by profiling the first solution, so let's do the same with the second solution:
def fun_func(tim, prec, values):
    x = tim[:, np.newaxis]
    x = x * prec
    x = np.sin(x)
    x = np.sum(x, axis=1)
    return x
Again using line_profiler with resolution=100:
%lprun -f fun_func fun_func(tim, prec, values)
Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
     1                                             def fun_func(tim, prec, values):
     2         1           27       27.0      0.5      x = tim[:, np.newaxis]
     3         1          638      638.0     12.9      x = x * prec
     4         1         3963     3963.0     79.9      x = np.sin(x)
     5         1          326      326.0      6.6      x = np.sum(x, axis=1)
     6         1            4        4.0      0.1      return x
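This already exceeds the overhead time significantly, and thus we end up a factor of 10 faster than the loop.
I also did the profiling for resolution=1000: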
Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
     1                                             def fun_func(tim, prec, values):
     2         1           28       28.0      0.0      x = tim[:, np.newaxis]
     3         1        17716    17716.0     14.6      x = x * prec
     4         1        91174    91174.0     75.3      x = np.sin(x)
     5         1        12140    12140.0     10.0      x = np.sum(x, axis=1)
     6         1           10       10.0      0.0      return x
And with resolution=5000:
Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
     1                                             def fun_func(tim, prec, values):
     2         1           34       34.0      0.0      x = tim[:, np.newaxis]
     3         1       333685   333685.0     11.1      x = x * prec
     4         1      2391812  2391812.0     79.6      x = np.sin(x)
     5         1       280832   280832.0      9.3      x = np.sum(x, axis=1)
     6         1           14       14.0      0.0      return x
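At size 1000 the broadcast version is still faster, but as we saw above, the call overhead was still not negligible in the loop solution at that size. For resolution = 5000, however, the time spent in each step is almost identical (some steps are a bit slower, others a bit faster, but overall they are quite similar).
Another effect is that the actual broadcasting during the multiplication becomes significant. Even with numpy's very smart implementation, broadcasting still involves some additional computation. For resolution=10000 you can see that the broadcast multiplication starts to take up more "% time" relative to the loop solution: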
Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
     1                                             def broadcast_solution(tim, prec, values):
     2         1           37       37.0      0.0      x = tim[:, np.newaxis]
     3         1      1783345  1783345.0     13.9      x = x * prec
     4         1      9879333  9879333.0     77.1      x = np.sin(x)
     5         1      1153789  1153789.0      9.0      x = np.sum(x, axis=1)
     6         1           11       11.0      0.0      return x
Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
     8                                             def loop_solution(tim, prec, values):
     9     10001        62502        6.2      0.5      for i, ti in enumerate(tim):
    10     10000      1287698      128.8     10.5          x = prec * ti
    11     10000      9758633      975.9     79.7          x = np.sin(x)
    12     10000      1058995      105.9      8.6          x = np.sum(x)
    13     10000        75760        7.6      0.6          values[i] = x
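But there is another thing besides the actual time spent: memory consumption. Your loop solution requires O(n) memory because you always process n elements at a time. The broadcast solution, however, requires O(n*n) memory. You will probably have to wait a while if you use resolution=20000 with your loop, but it would still only require 8 bytes/element * 20000 elements ~= 160 kB, whereas with broadcasting you'll need ~3 GB. And this neglects the constant factor (such as temporary, a.k.a. intermediate, arrays)! If you go even further, you'll run out of memory very quickly.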
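The arithmetic behind these two figures can be spelled out in a few lines. This is only a back-of-the-envelope sketch that assumes float64 elements and counts a single temporary array per approach; the variable names are illustrative.

```python
import numpy as np

resolution = 20000
itemsize = np.dtype(np.float64).itemsize  # 8 bytes per element

# Loop solution: one temporary of length n per iteration.
loop_bytes = resolution * itemsize
# Broadcast solution: one temporary of shape (n, n).
broadcast_bytes = resolution * resolution * itemsize

print(f"loop:      {loop_bytes / 1e3:.0f} kB")  # 160 kB
print(f"broadcast: {broadcast_bytes / 1e9:.1f} GB "
      f"({broadcast_bytes / 2**30:.2f} GiB)")   # 3.2 GB (2.98 GiB)
```

Every additional intermediate array, such as the separate results of the multiplication and of np.sin, multiplies the broadcast figure again.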
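Time to summarize the points again:
- If you do a Python loop over single items in a numpy array, you're doing it wrong.
- If you loop over subarrays of a numpy array, make sure the function call overhead in each iteration is negligible compared to the time spent in the function.
- If you broadcast numpy arrays, make sure you don't run out of memory.
- Only optimize the code if it's too slow! And if it is too slow, optimize only after profiling your code.
- Don't blindly trust simplified statements, and never optimize without profiling.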
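One possible middle ground between the two approaches, which the answer itself does not discuss, is to broadcast over blocks of rows so that the intermediate array never grows beyond a fixed size. The sketch below is an illustration under that assumption: the function name chunked_broadcast and the default chunk_size are hypothetical, and a sensible chunk size would have to be found by profiling.

```python
import numpy as np


def chunked_broadcast(tim, prec, chunk_size=256):
    # chunk_size is an arbitrary illustrative default; tune it by profiling.
    values = np.empty_like(tim)
    for start in range(0, tim.shape[0], chunk_size):
        stop = start + chunk_size
        # Intermediate array of shape (at most chunk_size, len(prec)).
        block = tim[start:stop, np.newaxis] * prec
        np.sin(block, out=block)  # reuse the buffer, no second temporary
        values[start:stop] = block.sum(axis=1)
    return values


resolution = 1000
tim = np.linspace(-np.pi, np.pi, resolution)
prec = 2 * np.arange(1, resolution + 1) - 1
result = chunked_broadcast(tim, prec, chunk_size=300)
```

With this layout the number of Python-level iterations is len(tim) / chunk_size rather than len(tim), while peak memory scales with chunk_size * len(prec) instead of len(tim) * len(prec).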
One final thought: