Speeding up user-defined functions in Python

python performance

I have a simulation in which the end user can supply arbitrarily many functions, which are then called in the innermost loop. For example:

class Simulation:

    def __init__(self):
        self.rates = []
        self.amount = 1

    def add(self, rate):
        self.rates.append(rate)

    def run(self, maxtime):
        for t in range(0, maxtime):
            for rate in self.rates:
                self.amount *= rate(t)

def rate(t):
    return t**2

simulation = Simulation()

simulation.add(rate)
simulation.run(100000)

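As a Python loop this is very slow, but none of my usual ways of speeding up loops work here.

Because the functions are user-defined, I can't "numpyfy" the innermost call (rewrite it so that the innermost work is done by optimized numpy code).

I first tried numba, but numba does not allow passing functions to other functions, even if those functions are numba-compiled as well. It can use closures, but since I don't know up front how many functions there will be, I don't think I can use that. Closing over a list of functions fails: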

import numba

@numba.jit(nopython=True)
def a():
    return 1

@numba.jit(nopython=True)
def b():
    return 2

fs = [a, b]

@numba.jit(nopython=True)
def c():
    total = 0
    for f in fs:  # closes over a list of jitted functions
        total += f()
    return total

c()  # raises the error shown below

This fails with the error:

[...]
  File "/home/syrn/.local/lib/python3.6/site-packages/numba/types/containers.py", line 348, in is_precise
    return self.dtype.is_precise()
numba.errors.InternalError: 'NoneType' object has no attribute 'is_precise' 
[1] During: typing of intrinsic-call at <stdin> (4)
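
To illustrate the closure idea mentioned above for an unknown number of functions, here is a minimal pure-Python sketch that folds a list of functions into a single closure, pair by pair. The names multiply_pair and combine_all are made up for this illustration, and whether numba can compile closures nested this way is an assumption that is not verified here.

```python
from functools import reduce


def multiply_pair(first, second):
    """Return one function that computes first(t) * second(t)."""
    def combined(t):
        return first(t) * second(t)
    return combined


def combine_all(functions):
    """Fold any number of functions into a single closure.

    The initial element is the neutral factor 1, so an empty list
    yields a function that always returns 1.
    """
    return reduce(multiply_pair, functions, lambda t: 1)


# Three illustrative user functions combined into one callable.
combined = combine_all([lambda t: t + 1, lambda t: 2 * t, lambda t: 3])
print(combined(2))
```

The inner loop would then call one combined function per time step instead of iterating over the list, which matches the multiplication done in run.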

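I suppose some bytecode hacking would also be possible: compile the functions with numba, pass a list of strings with the function names into internal, do something like call(func_name) there, and then rewrite the bytecode so that it becomes func_name(t).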

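For cython, just compiling the loop and the multiplication would probably speed things up a bit, but if the user-defined functions are still Python, merely calling the Python function will probably still be slow (though I haven't profiled that yet). I haven't found much information about "dynamically compiling" functions with cython, and I suppose I would need to add some type information to the user-supplied functions, which seems... hard.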
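
The call overhead that has not been profiled above can be estimated with the standard timeit module. The following is only a sketch: loop_with_calls, loop_inlined and compare are hypothetical helper names, and the measured times depend entirely on the machine and the interpreter.

```python
import timeit


def rate(t):
    return t ** 2


def loop_with_calls(maxtime):
    """Sum rate(t) using one Python function call per step."""
    total = 0
    for t in range(maxtime):
        total += rate(t)
    return total


def loop_inlined(maxtime):
    """Same arithmetic, but without the function call."""
    total = 0
    for t in range(maxtime):
        total += t ** 2
    return total


def compare(maxtime=10**5, repeat=5):
    """Return the best wall-clock time in seconds for both variants."""
    with_calls = min(timeit.repeat(lambda: loop_with_calls(maxtime),
                                   number=1, repeat=repeat))
    inlined = min(timeit.repeat(lambda: loop_inlined(maxtime),
                                number=1, repeat=repeat))
    return with_calls, inlined


if __name__ == "__main__":
    with_calls, inlined = compare()
    print(f"with calls: {with_calls * 1e3:.2f} ms, inlined: {inlined * 1e3:.2f} ms")
```

The difference between the two numbers is a rough estimate of what the per-step Python call costs.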

Is there a good way to speed up loops with user-defined functions without having to parse them and generate code?

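I don't think you can speed up the user's functions; in the end, writing efficient code is the user's responsibility. What you can do is offer a way to interact with your program efficiently, without paying for overhead.

You can use Cython, and if the user is also willing to use Cython, you can achieve a speed-up of roughly 100 compared with the pure-Python solution.

As a baseline, I changed your example slightly: the function rate does a bit more work.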

class Simulation:

    def __init__(self, rates):
        self.rates=list(rates)
        self.amount = 1

    def run(self, maxtime):
        for t in range(0, maxtime):
            for rate in self.rates:
                self.amount += rate(t)

def rate(t):
    return t*t*t+2*t

This yields:

>>> simulation=Simulation([rate])
>>> %timeit simulation.run(10**5)
43.3 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
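
Outside IPython, the same baseline can be measured with the standard timeit module instead of the %timeit magic. This is a sketch under that assumption: best_time_ms is a hypothetical helper, and the absolute numbers will differ from the ones quoted here.

```python
import timeit


class Simulation:

    def __init__(self, rates):
        self.rates = list(rates)
        self.amount = 1

    def run(self, maxtime):
        for t in range(maxtime):
            for rate in self.rates:
                self.amount += rate(t)


def rate(t):
    return t*t*t + 2*t


def best_time_ms(maxtime=10**5, repeats=5):
    """Best wall-clock time in milliseconds for one call to run."""
    simulation = Simulation([rate])
    timings = timeit.repeat(lambda: simulation.run(maxtime),
                            number=1, repeat=repeats)
    return min(timings) * 1000.0


if __name__ == "__main__":
    print(f"{best_time_ms():.1f} ms per run")
```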

We can use Cython to speed this up, starting with the run function:
%%cython
cdef class Simulation:
    cdef int amount
    cdef list rates
    def __init__(self, rates):
        self.rates=list(rates)
        self.amount = 1

    def run(self, int maxtime):
        cdef int t
        for t in range(maxtime):
            for rate in self.rates:
                self.amount *= rate(t)

This gives us almost a factor of 2:
>>> %timeit simulation.run(10**5)
23.2 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The user can also use Cython to speed up their calculation:
%%cython
def rate(int t):
  return t*t*t+2*t

>>> %timeit simulation.run(10**5)
7.08 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

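Using Cython has already given us a speed-up of 6, so what is the bottleneck now? We are still using Python for the polymorphism/dispatch, and that is quite costly, because using it requires creating Python objects (here, Python integers). Can we do better with Cython? Yes, if we define, at compile time, an interface for the functions that are passed to run: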
%%cython   
cdef class FunInterface:
   cpdef int calc(self, int t):
      pass

cdef class Simulation:
    cdef int amount
    cdef list rates

    def __init__(self, rates):
        self.rates=list(rates)
        self.amount = 1

    def run(self, int maxtime):
        cdef int t
        cdef FunInterface f
        for t in range(maxtime):
            for f in self.rates:
                self.amount *= f.calc(t)

cdef class  Rate(FunInterface):
    cpdef int calc(self, int t):
        return t*t*t+2*t

This yields a further speed-up of almost 7:
>>> simulation=Simulation([Rate()])
>>> %timeit simulation.run(10**5)
1.03 ms ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The most important part of the code above is the line:
self.amount *= f.calc(t)

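Python is no longer needed for the dispatch; instead, a mechanism very similar to virtual functions in C++ is used. This C++-style approach has only a very small indirection/lookup overhead. It also means that neither the result of the function nor its arguments have to be converted to Python objects. For this to work, calc must be declared as a cpdef function; it is worth reading up on how inheritance works for cpdef functions for more details.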
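
For readers who want to see the shape of this interface without compiling anything, here is a pure-Python analogue. It is only a sketch: it shows the contract users program against, not the speed-up, which comes from the Cython version. CallableRate and as_interface are hypothetical helpers that are not part of the code above.

```python
from abc import ABC, abstractmethod


class FunInterface(ABC):
    """Pure-Python stand-in for the cdef class of the same name."""

    @abstractmethod
    def calc(self, t):
        """Return the rate at integer time step t."""


class Rate(FunInterface):
    def calc(self, t):
        return t*t*t + 2*t


class CallableRate(FunInterface):
    """Hypothetical adapter that wraps a plain Python callable."""

    def __init__(self, func):
        self._func = func

    def calc(self, t):
        return self._func(t)


def as_interface(obj):
    """Pass FunInterface instances through and wrap plain callables."""
    if isinstance(obj, FunInterface):
        return obj
    if callable(obj):
        return CallableRate(obj)
    raise TypeError("expected a FunInterface instance or a callable")


# Both kinds of user input end up behind the same calc(t) method.
rates = [as_interface(Rate()), as_interface(lambda t: t + 1)]
print([f.calc(2) for f in rates])
```

An adapter like this keeps plain Python functions usable, at the cost of the Python call overhead the interface is meant to avoid.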
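The bottleneck is now the line for f in self.rates, because we still have a lot of Python interaction in every step. Here is an example of what would be possible if we could improve on that: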
%%cython
.....
cdef class Simulation:
    cdef int amount
    cdef FunInterface f  #just one function, no list

    def __init__(self, fun):
        self.f=fun
        self.amount = 1

    def run(self, int maxtime):
        cdef int t
        for t in range(maxtime):
                self.amount *= self.f.calc(t)

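
>>> simulation=Simulation(Rate())
>>> %timeit simulation.run(10**5)
408 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That is another factor of 2, but it is up to you to decide whether it is worth the more complicated code needed to store a list of FunInterface objects without any Python interaction.

Please give PyPy a chance. It has some optimizations that might help.

@Scovetta I just measured with pypy; unfortunately it takes 5 times as long as plain Python. I didn't do any pypy-specific optimizations, although I did warm up the JIT.

1) Have you looked at this workaround? 2) Is it impossible to replace the list rates with a numpy array?

@max9111 1) Putting the functions into a list and closing over that list gives me an error. I think I found that in some numba documentation, but I can't find it anymore. 2) The functions can be arbitrarily complex, and there can be arbitrarily many of them. I don't think there is any (reasonable) way to encode that into a numpy array, is there?

@max9111 I added a small example that uses the closure technique with a list of functions, together with the error it raises. The full traceback is somewhat longer, but I don't think it is very informative, and it should be easy to reproduce.

This looks promising. I will need some time to implement cython in my code base and measure the gain.

Apart from the last rewrite, I implemented everything, plus a few other things (boundscheck false, calls to numpy replaced with gsl), and my speed-up reached a factor of 20. What used to run in an hour now runs in 3 minutes.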
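
The timings quoted in the answer can be turned into per-step and cumulative speed-up factors with a few lines of arithmetic. The step labels below are shorthand chosen for this sketch; the numbers are copied from the %timeit outputs above.

```python
# Timings in milliseconds, copied from the %timeit outputs in the answer.
timings_ms = [
    ("pure Python", 43.3),
    ("run compiled with Cython", 23.2),
    ("rate compiled with Cython", 7.08),
    ("FunInterface with cpdef calc", 1.03),
    ("single FunInterface, no list", 0.408),
]


def speedups(timings):
    """Return (label, factor vs. previous step, factor vs. baseline)."""
    baseline = timings[0][1]
    previous = baseline
    rows = []
    for label, value in timings:
        rows.append((label, previous / value, baseline / value))
        previous = value
    return rows


for label, step, total in speedups(timings_ms):
    print(f"{label:30s} x{step:5.2f} per step, x{total:6.1f} overall")
```

The overall factor of the last row, 43.3 ms divided by 0.408 ms, is about 106, which is where the "roughly 100" figure comes from.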