Python PyOpenCL程序/FFT的优化_Python_Optimization_Opencl_Fft_Pyopencl

Python PyOpenCL程序/FFT的优化

python optimization opencl

Python PyOpenCL程序/FFT的优化,python,optimization,opencl,fft,pyopencl,Python,Optimization,Opencl,Fft,Pyopencl,程序概述：这里的大部分代码创建FrameProcessor对象。使用一些数据形状（通常为2048xN）初始化该对象，然后可以调用该对象以使用一系列内核proc_帧处理数据。对于长度为2048的每个向量，程序将：应用汉宁窗口元素相乘2048*2048 做一个线性插值，重新映射数值，从非线性分光计仓映射到波数空间中的线性，该信号来源于一个不太重要的细节，但我认为最好包括在内，以防不清楚应用FFT 问题：我想跑得更快！下面的代码执行得并不差，但对于这个项目，我需要它尽可能快。但是，我不确定如何进一

程序概述：这里的大部分代码创建FrameProcessor对象。使用一些数据形状（通常为2048xN）初始化该对象，然后可以调用该对象以使用一系列内核proc_帧处理数据。对于长度为2048的每个向量，程序将：

应用汉宁窗口元素相乘2048*2048 做一个线性插值，重新映射数值，从非线性分光计仓映射到波数空间中的线性，该信号来源于一个不太重要的细节，但我认为最好包括在内，以防不清楚应用FFT 问题：我想跑得更快！下面的代码执行得并不差，但对于这个项目，我需要它尽可能快。但是，我不确定如何进一步改进此代码。所以，我正在寻找相关阅读、我应该使用的替代库、代码结构的更改等方面的建议

当前性能：在使用GeForce RTX 2080的钻机上，我使用n=60获得的基准测试似乎提供了最好的性能：

With n = 60 
Average framerate over 1000 frames: 740Hz
Effective A-line rate over 1000 frames: 44399Hz

在这里，FFT似乎是一个很大的瓶颈。当我运行示例而不运行FFT时，我得到以下结果：

With n = 60 
Average framerate over 1000 frames: 2494Hz
Effective A-line rate over 1000 frames: 149652Hz

然而，我不知道如何改进我正在使用的计划的性能！这些文档似乎没有提到任何优化步骤，我在github repo的测试gpyfft分支中使用的代码的性能甚至更差

评测：在proc_frame函数上使用cProfile的结果如下所示：

Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(reshape)
        4    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
        1    0.000    0.000    0.000    0.000 <generated code>:101(set_args)
        2    0.000    0.000    0.000    0.000 <generated code>:4(enqueue_knl_kernel_fft)
        1    0.000    0.000    0.000    0.000 <generated code>:71(set_args)
        1    0.000    0.000    0.002    0.002 <string>:1(<module>)
        3    0.000    0.000    0.000    0.000 __init__.py:1288(result)
        1    0.000    0.000    0.000    0.000 __init__.py:1294(result)
        3    0.000    0.000    0.001    0.000 __init__.py:1522(enqueue_copy)
        1    0.000    0.000    0.000    0.000 __init__.py:222(wrap_in_tuple)
       10    0.000    0.000    0.000    0.000 __init__.py:277(name)
        3    0.000    0.000    0.000    0.000 __init__.py:281(default)
        2    0.000    0.000    0.000    0.000 __init__.py:285(annotation)
        9    0.000    0.000    0.000    0.000 __init__.py:289(kind)
        1    0.000    0.000    0.000    0.000 __init__.py:375(__init__)
        2    0.000    0.000    0.000    0.000 __init__.py:575(wrapper)
        2    0.000    0.000    0.000    0.000 __init__.py:596(parameters)
        1    0.000    0.000    0.000    0.000 __init__.py:659(_bind)
        1    0.000    0.000    0.000    0.000 __init__.py:787(bind)
        2    0.000    0.000    0.000    0.000 __init__.py:833(kernel_set_args)
        2    0.000    0.000    0.000    0.000 __init__.py:837(kernel_call)
        1    0.000    0.000    0.000    0.000 _asarray.py:16(asarray)
        1    0.000    0.000    0.000    0.000 _internal.py:830(npy_ctypes_check)
        1    0.000    0.000    0.000    0.000 abc.py:137(__instancecheck__)
        1    0.000    0.000    0.000    0.000 api.py:376(empty_like)
        1    0.000    0.000    0.000    0.000 api.py:405(to_device)
        3    0.000    0.000    0.000    0.000 api.py:466(_synchronize)
        2    0.000    0.000    0.000    0.000 api.py:678(prepared_call)
        2    0.000    0.000    0.000    0.000 api.py:688(__call__)
        2    0.000    0.000    0.000    0.000 api.py:779(__call__)
        2    0.000    0.000    0.000    0.000 array.py:1474(add_event)
        1    0.000    0.000    0.000    0.000 array.py:28(f_contiguous_strides)
        1    0.000    0.000    0.000    0.000 array.py:38(c_contiguous_strides)
        1    0.000    0.000    0.000    0.000 array.py:393(__init__)
        3    0.000    0.000    0.000    0.000 array.py:48(equal_strides)
        1    0.000    0.000    0.000    0.000 array.py:520(flags)
        1    0.000    0.000    0.000    0.000 array.py:580(set)
        1    0.000    0.000    0.000    0.000 array.py:59(is_f_contiguous_strides)
        1    0.000    0.000    0.000    0.000 array.py:61(_dtype_is_object)
        1    0.000    0.000    0.000    0.000 array.py:63(is_c_contiguous_strides)
        1    0.000    0.000    0.001    0.001 array.py:635(_get)
        1    0.000    0.000    0.000    0.000 array.py:68(__init__)
        1    0.000    0.000    0.001    0.001 array.py:689(get)
        1    0.000    0.000    0.000    0.000 computation.py:620(__call__)
        2    0.000    0.000    0.000    0.000 computation.py:641(__call__)
        1    0.000    0.000    0.000    0.000 dtypes.py:75(normalize_type)
        1    0.000    0.000    0.001    0.001 frameprocessor.py:130(FFT)
        1    0.000    0.000    0.001    0.001 frameprocessor.py:137(interp_hann)
        1    0.000    0.000    0.002    0.002 frameprocessor.py:146(proc_frame)
        1    0.000    0.000    0.000    0.000 frameprocessor.py:20(npcast)
        1    0.000    0.000    0.000    0.000 frameprocessor.py:23(rshp)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:197(_reshape_dispatcher)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:202(reshape)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:55(_wrapfunc)
        1    0.000    0.000    0.000    0.000 ocl.py:109(allocate)
        1    0.000    0.000    0.000    0.000 ocl.py:112(_copy_array)
        2    0.000    0.000    0.000    0.000 ocl.py:223(_prepared_call)
        2    0.000    0.000    0.000    0.000 ocl.py:225(<listcomp>)
        1    0.000    0.000    0.000    0.000 ocl.py:28(__init__)
        1    0.000    0.000    0.001    0.001 ocl.py:63(get)
        1    0.000    0.000    0.000    0.000 ocl.py:88(array)
        1    0.000    0.000    0.000    0.000 signature.py:308(bind_with_defaults)
        1    0.000    0.000    0.000    0.000 {built-in method _abc._abc_instancecheck}
        1    0.000    0.000    0.002    0.002 {built-in method builtins.exec}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
       12    0.000    0.000    0.000    0.000 {built-in method builtins.hasattr}
       37    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.iter}
       14    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        6    0.000    0.000    0.000    0.000 {built-in method builtins.next}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.setattr}
        1    0.000    0.000    0.000    0.000 {built-in method numpy.array}
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
        1    0.000    0.000    0.000    0.000 {built-in method numpy.empty}
        2    0.001    0.001    0.001    0.001 {built-in method pyopencl._cl._enqueue_read_buffer}
        1    0.000    0.000    0.000    0.000 {built-in method pyopencl._cl._enqueue_write_buffer}
        4    0.000    0.000    0.000    0.000 {built-in method pyopencl._cl.enqueue_nd_range_kernel}
        2    0.000    0.000    0.000    0.000 {built-in method pyopencl._cl.get_cl_header_version}
        6    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'astype' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        3    0.000    0.000    0.000    0.000 {method 'pop' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'reshape' of 'numpy.ndarray' objects}
        2    0.000    0.000    0.000    0.000 {method 'values' of 'mappingproxy' objects}

编辑：更新代码和基准编辑2：添加了cProfile结果

在中复制我的答复以供参考

从您希望它使用的任何pyopencl队列（可能是与要传递给FFT的数组相关联的队列）创建reikna线程对象基于此线程创建FFT计算将pyopencl数组传递给它，而不进行任何转换。您可以根据pyopencl数组的缓冲区创建一个reikna数组，方法是将其作为base_data关键字传递，但如果您只需要使用FFT，则不需要这样做。 Reikna线程是pyopencl上下文+队列之上的包装器，Reikna数组是pyopencl数组的子类，因此互操作应该非常简单

以一种快速而肮脏的方式应用它，请随意改进，我得到：。基本上，这些变化是：

从现有队列self.thr=self.api.Threadself.queue创建线程在FFT中使用PyOpenCL缓冲区，而不将其复制到CPU。我得到的结果是：

$ python frameprocessor.py # original version
With n = 60 
Average framerate over 1000 frames: 434Hz
Effective A-line rate over 1000 frames: 26012Hz
$ python frameprocessor2.py # modified version
With n = 60 
Average framerate over 1000 frames: 2191Hz
Effective A-line rate over 1000 frames: 131478Hz

你做过任何分析吗？我没有，但我只是尝试了一下cProfile并添加了结果。不过我不太确定该怎么做。哇，谢谢你做了改变！我只是想做同样的事情，显然在这个过程中，我把事情复杂化了。非常感谢您的快速回复！