Tips for optimizing code in Cython

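I have a relatively simple problem (I think). I'm working on a piece of Cython code that calculates the radius of a strain ellipse when given a strain and a specific orientation (i.e. the radius parallel to a given direction for a certain amount of strain). This function is called several million times during each program run, and profiling shows that it is the limiting factor performance-wise. Here's the code: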

# importing math functions from a C-library (faster than numpy)
from libc.math cimport sin, cos, acos, exp, sqrt, fabs, M_PI

cdef class funcs:

    cdef inline double get_r(self, double g, double omega):
        # amount of strain: g, angle: omega
        cdef double l1, l2, A, r, g2, gs   # defining some variables
        if g == 0: return 1   # no strain means the strain ellipse is a circle
        omega = omega*M_PI/180   # converting angle omega to radians
        g2 = g*g
        gs = g*sqrt(4 + g2)
        l1 = 0.5*(2 + g2 + gs)   # l1 and l2: eigenvalues of the Cauchy strain tensor
        l2 = 0.5*(2 + g2 - gs)
        A = acos(g/sqrt(g2 + (1 - l2)**2))   # orientation of the long axis of the ellipse
        r = 1./sqrt(sqrt(l2)*(cos(omega - A)**2) + sqrt(l1)*(sin(omega - A)**2))   # the radius parallel to omega
        return r   # return of the jedi

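Running this code takes about 0.18 microseconds per call, which I think is a bit long for such a simple function. Also, math.h has a square(x) function, but I can't import it from the libc.math library. Does anyone know how? Are there any other suggestions for further improving the performance of this piece of code?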

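Update 2013/09/04:

It seems there is more at play here. When I profile one function that calls get_r 10 million times, I get different performance than when the calls go through another function. I have added an updated version of part of my code. When I profile with get_r_profile, I get 0.073 microseconds per call of get_r, whereas MC_criterion_profile gives me about 0.164 microseconds per call of get_r. This 50% difference seems to be related to the overhead cost of return r.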

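cimport cython   # needed for the @cython.profile decorator below
from libc.math cimport sin, cos, acos, exp, sqrt, fabs, M_PI

# SCALEDPI and square() are not shown in the original snippet; the two
# definitions below are assumptions based on the suggestions in the answers.
cdef double SCALEDPI = M_PI/180
cdef inline double square(double x):
    return x*x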

cdef class thesis_funcs:

    cdef inline double get_r(self, double g, double omega):
        cdef double l1, l2, A, r, g2, gs, cos_oa2, sin_oa2
        if g == 0: return 1
        omega = omega*SCALEDPI
        g2 = g*g
        gs = g*sqrt(4 + g2)
        l1 = 0.5*(2 + g2 + gs)
        l2 = l1 - gs
        A = acos(g/sqrt(g2 + square(1 - l2)))
        cos_oa2 = square(cos(omega - A))
        sin_oa2 = 1 - cos_oa2
        r = 1.0/sqrt(sqrt(l2)*cos_oa2 + sqrt(l1)*sin_oa2)
        return r

    @cython.profile(False)
    cdef inline double get_mu(self, double r, double mu0, double mu1):
        return mu0*exp(-mu1*(r - 1))

    def get_r_profile(self): # Profiling through this guy gives me 0.073 microsec/call
        cdef unsigned int i
        for i from 0 <= i < 10000000:
            self.get_r(3.0, 165)

    def MC_criterion(self, double g, double omega, double mu0, double mu1, double C = 0.0):
        cdef double r, mu, theta, res
        r = self.get_r(g, omega)
        mu = self.get_mu(r, mu0, mu1)
        theta = 45 - omega
        theta = theta*SCALEDPI
        res = fabs(g*sin(2.0*theta)) - mu*(1 + g*cos(2.0*theta)) - C
        return res

    def MC_criterion_profile(self): # Profiling through this one gives 0.164 microsec/call
        cdef double g, omega, mu0, mu1
        cdef unsigned int i
        omega = 165
        mu0 = 0.6
        mu1 = 2.0
        g = 3.0
        for i from 1 <= i < 10000000:
            self.MC_criterion(g, omega, mu0, mu1)

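This answer is not specific to Cython, but it lists a few points that may be helpful:
Defining variables before you actually know whether you need them may not be ideal. Move the "cdef double l1, l2, A, r, g2, gs" line after the "if g == 0" statement.
I would make sure that in "omega = omega*M_PI/180" the M_PI/180 part is computed only once. For example, some Python code: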
import timeit
from math import pi

def calc1( omega ):
    return omega * pi / 180

SCALEDPI = pi / 180
def calc2( omega ):
    return omega * SCALEDPI

if __name__ == '__main__':
    took = timeit.repeat( stmt = lambda:calc1( 5 ), number = 10000 )
    print( "Min. time: %.4f Max. time: %.4f" % ( min( took ), max( took ) ) )
    took = timeit.repeat( stmt = lambda:calc2( 5 ), number = 10000 )
    print( "Min. time: %.4f Max. time: %.4f" % ( min( took ), max( took ) ) )

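Calc 1: Min. time: 0.0033 Max. time: 0.0034
Calc 2: Min. time: 0.0029 Max. time: 0.0029
Try to optimize the calculations themselves. They look rather complex, and I have a feeling they could be simplified.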
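One possible simplification, which is not mentioned in the thread and is offered only as a sketch: the two eigenvalues satisfy l1*l2 = 1, because (2 + g2)**2 - gs**2 = 4, so sqrt(l1) equals 1/sqrt(l2) and one square root can be replaced by a division. The pure-Python functions below (the names get_r_reference and get_r_simplified are made up for this illustration) mirror the question's formula so the two variants can be compared. Whether this is actually faster in the compiled Cython code would have to be measured.

```python
from math import acos, cos, pi, sin, sqrt


def get_r_reference(g, omega_deg):
    """Radius of the strain ellipse, written exactly as in the question."""
    if g == 0:
        return 1.0
    omega = omega_deg * pi / 180
    g2 = g * g
    gs = g * sqrt(4 + g2)
    l1 = 0.5 * (2 + g2 + gs)
    l2 = 0.5 * (2 + g2 - gs)
    A = acos(g / sqrt(g2 + (1 - l2) ** 2))
    return 1.0 / sqrt(sqrt(l2) * cos(omega - A) ** 2
                      + sqrt(l1) * sin(omega - A) ** 2)


def get_r_simplified(g, omega_deg):
    """Same radius, using l1*l2 == 1 so that sqrt(l1) == 1/sqrt(l2)."""
    if g == 0:
        return 1.0
    omega = omega_deg * pi / 180
    g2 = g * g
    gs = g * sqrt(4 + g2)
    l2 = 0.5 * (2 + g2 - gs)
    root_l2 = sqrt(l2)
    root_l1 = 1.0 / root_l2          # replaces the second sqrt call
    A = acos(g / sqrt(g2 + (1 - l2) ** 2))
    cos_oa2 = cos(omega - A) ** 2
    sin_oa2 = 1.0 - cos_oa2          # Pythagorean identity, saves the sin call
    return 1.0 / sqrt(root_l2 * cos_oa2 + root_l1 * sin_oa2)
```

For moderate strains the two functions agree to within floating-point rounding. For very large g the subtraction in l2 loses precision in both variants, so any comparison should use a relative tolerance.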
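
The output of cython -a shows that a test for division by zero is performed. If you are 200% sure that this will never happen, you may want to remove that check.
To use C division, you can add the following directive at the top of the file: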
# cython: cdivision=True

I wanted to link the official documentation, but I cannot access it right now. There is some information here (page 15):
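
Based on your comments, the line that computes r is the most expensive. If that is the case, I suspect it is the trig function calls that kill performance.
By Pythagoras, cos(x)**2 + sin(x)**2 == 1, so you can skip one of those calls by computing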
cos_oa2 = cos(omega - A)**2
sin_oa2 = 1 - cos_oa2
r = 1. / sqrt(sqrt(l2) * cos_oa2 + sqrt(l1) * sin_oa2)

(Or flip them: on my machine, sin seems to be faster than cos. That may be a glitch, though.)
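
Comments:
The M_PI/180 part should be taken care of by the compiler.
Thanks for your reply. The first two suggestions did not give any performance gain. As for the third: I don't think these equations can be simplified any further :(
cdefs don't actually perform any computation, and they have to be at the beginning of the function.
Your math.h has a square function? That is not a standard function; which platform are you on? In any case, you can define your own cdef inline square(x): return x*x to get rid of all the calls to pow.
If you decouple get_r from the class, the function call might be faster. There is no need to pass self if you don't need it.
@larsmans I tried both suggestions, but without significant gains. The last line, r=1./…, is the most expensive one, so I need to bring down the cost of that one first.
To compare speeds, I compiled your get_r and get_r_profile functions in plain C. The plain C get_r_profile seems to be 2.5-3 times faster than Cython, so it might be an option to compile get_r separately and link to it from Cython? Interestingly, depending on the optimization flags, the compiler seems to skip the calls to get_r entirely! I don't know how this compares with the C code that Cython generates, but something similar might cause the timing differences you are seeing.
Thanks for your reply. Using that directive does not give any performance gain. Also, I am not sure that a division by zero will never happen.
There are other tips in the PDF I linked, such as using more aggressive compiler optimization. If the input values are likely to repeat over the millions of calls in a program run, you could try caching the results in a dictionary. Another approach is to use lookup tables for cos() and sin(), but that depends on the numerical precision you need.
Thanks, this improves performance by about 10%. I also just noticed that return r consumes about 50% of the computation time! If I replace return r with return 0.5, I get a huge speed-up, so I suspect some C-to-Python overhead is involved.
@MPA: strange, that shouldn't happen in a cdef function with a declared return type, I think.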
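
The caching idea from the comments can be sketched in plain Python with functools.lru_cache, which keeps results in a dictionary keyed by the arguments. The function name get_r_cached is made up for this illustration, and the body is a pure-Python stand-in for the question's get_r rather than the author's compiled code. A cache only helps when exactly the same (g, omega) pairs recur, and for a function this cheap the hashing and lookup may cost more than the arithmetic they save, so this would need to be measured.

```python
from functools import lru_cache
from math import acos, cos, pi, sqrt


@lru_cache(maxsize=None)
def get_r_cached(g, omega_deg):
    """Pure-Python stand-in for get_r; results are memoized per argument pair."""
    if g == 0:
        return 1.0
    omega = omega_deg * pi / 180
    g2 = g * g
    gs = g * sqrt(4 + g2)
    l1 = 0.5 * (2 + g2 + gs)
    l2 = l1 - gs
    A = acos(g / sqrt(g2 + (1 - l2) ** 2))
    cos_oa2 = cos(omega - A) ** 2
    sin_oa2 = 1 - cos_oa2
    return 1.0 / sqrt(sqrt(l2) * cos_oa2 + sqrt(l1) * sin_oa2)
```

After a run, get_r_cached.cache_info() reports the number of hits and misses, which shows whether the inputs repeat often enough for the cache to be worthwhile.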