Performance: fastmath does not improve speed

performance numpy math cpu numba

I ran the following code with the `fastmath` option enabled and disabled:

import numpy as np
from numba import jit
from threading import Thread
import time
import psutil
from tqdm import tqdm


@jit(nopython=True, fastmath=True)
def compute_angle(vectors):
    return 180 + np.degrees(np.arctan2(vectors[:, :, 1], vectors[:, :, 0]))



cpu_usage = list()
times = list()

# Log cpu usage
running = False
def threaded_function():
    while not running:
        time.sleep(0.1)
    print("Start logging CPU")
    while running:
        cpu_usage.append(psutil.cpu_percent())
    print("Stop logging CPU")
thread = Thread(target=threaded_function, args=())
thread.start()


iterations = 1000

# Generate frames
vectors_list = list()
for i in tqdm(range(iterations), total=iterations):
    vectors = np.random.randint(-50, 50, (500, 1000, 2))
    vectors_list.append(vectors)

for i in tqdm(range(iterations), total=iterations):
    s = time.time()
    compute_angle(vectors_list[i])
    e = time.time()
    times.append(e - s)
    # Do not count first iteration
    running = True

running = False

thread.join()

print("Average time per iteration", np.mean(times[1:]))
print("Average CPU usage:", np.mean(cpu_usage))

Results with `fastmath=True`:

Average time per iteration 0.02076407738992044
Average CPU usage: 6.738916256157635

Results with `fastmath=False`:

Average time per iteration 0.020854528721149738
Average CPU usage: 6.676455696202531

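Since I am doing math operations, should I expect some gain? I also tried installing `icc_rt`, but I am not sure how to check whether it is enabled.

Thanks, everyone!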

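A few things are missing for SIMD vectorization to work. For best performance you also have to avoid expensive temporary arrays, which may not be optimized away if you use partially vectorized functions.

Function calls must be inlined.
The memory access pattern must be known at compile time. In the example below this is done with `assert vectors.shape[2]==2`. In general, the last dimension of the array could also be larger than two, which would be much more complicated to SIMD-vectorize.
Division-by-zero checks can also prevent SIMD vectorization, and they are slow if they are not optimized away. I avoided them manually by computing `div_pi=180/np.pi` once, outside the loop, so that only a simple multiplication remains inside the loop. If repeated divisions cannot be avoided, you can use `error_model="numpy"` to skip the division-by-zero checks.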

Example

import numpy as np
import numba as nb

@nb.njit(fastmath=True)
def your_function(vectors):
    return 180 + np.degrees(np.arctan2(vectors[:, :, 1], vectors[:, :, 0]))

@nb.njit(fastmath=True)  # set fastmath=False to compare
def optimized_function(vectors):
    assert vectors.shape[2]==2

    res=np.empty((vectors.shape[0],vectors.shape[1]),dtype=vectors.dtype)
    div_pi=180/np.pi
    for i in range(vectors.shape[0]):
        for j in range(vectors.shape[1]):
            res[i,j]=np.arctan2(vectors[i,j,1],vectors[i,j,0])*div_pi+180
    return res

Timings

vectors=np.random.rand(1000,1000,2)

%timeit your_function(vectors)
#no difference between fastmath=True or False, no SIMD-vectorization at all
#23.3 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit optimized_function(vectors)
#with fastmath=False #SIMD-vectorized, but with the slower (more accurate) SVML algorithm
#9.03 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#with fastmath=True  #SIMD-vectorized, but with the faster(less accurate) SVML algorithm
#4.45 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


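Your code is just a few simple calls to `numpy`, which are already heavily optimized. You are getting 500,000 arctan calls in 20 ms, which is 25 million per second. You need to be realistic about your expectations.

Using `np.degrees` is not efficient. Doing the conversion yourself will be faster (because of FMA and fastmath). `np.arctan2` will not be fast here, since the strided memory access will likely prevent the use of SIMD instructions.

Theoretically, would using float16 instead of 32 or 64 make any difference on a CPU? Would they at least save memory? Are they still stored as 32 or 64 bits?

@JérômeRichard At least if you set `assert vectors.shape[2]==2` you can avoid the problem of an unknown memory layout (the stride is not the real problem).

Why not premultiply `div_pi` by 180? Besides this, note that moving `180*div_pi` to the beginning of the expression should be a bit faster when fastmath is off (floating-point arithmetic is not associative, so without fastmath the compiler cannot reorder it). Finally, is there any documentation showing that Numba benefits from assertions? As far as I know, the Clang developers were working on this, but the last time I checked it was not yet in master or not fully used.

@JérômeRichard I do not know this detail from the documentation. It was discussed in a GitHub issue (about 2-3 years ago). It has been supported for quite a while and also helps with automatically unrolling small inner loops. Regarding Clang: I am not very experienced in C, but doesn't something like `…double vectors[][vec_shape_1][2]…` work in much the same way? (With a manual check?)

Thank you very much. I have two more questions for you. Where did you learn these things? And what do you mean by "function calls must be inlined"?

Also @max9111, suppose `vectors.shape[0]` and `vectors.shape[1]` are the height and width of a video: their values are constant during one execution but may differ between executions (depending on the input video). Is there a way to help SIMD vectorization as you did with `assert vectors.shape[2]==2`?

@user1315621 The last dimension is usually the important one.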
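The remark about floating-point non-associativity in the comments can be checked with plain Python floats. This small sketch is an illustration only, with values chosen for the demonstration. It shows that regrouping a sum changes the result in the last bits, which is why a compiler may only regroup `angle * 180 * (1/pi)` into a single multiplication by a precomputed constant when fastmath permits it.

```python
import math

# Regrouping an addition changes the rounded result
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right, left, right)

# Two multiplications per element versus one multiplication by a folded constant
angle = math.atan2(1.0, 2.0)
two_mults = angle * 180.0 * (1.0 / math.pi)
one_mult = angle * (180.0 / math.pi)
print(two_mults, one_mult)
```

Writing the constant as `180 / np.pi` yourself, as the answer's code does, gives the single multiplication regardless of the fastmath setting.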