Performance comparison of Fortran, Numpy, Cython and Numexpr


I have the following function:

import numpy as np

def get_denom(n_comp,qs,x,cp,cs):
    '''
    len(n_comp) = 1 # number of proteins
    len(cp) = n_comp # protein concentration
    len(qp) = n_comp # protein capacity
    len(x) = 3*n_comp + 1 # fit parameters
    len(cs) = 1

    '''
    k = x[0:n_comp]
    sigma = x[n_comp:2*n_comp]
    z = x[2*n_comp:3*n_comp]

    a = (sigma + z)*( k*(qs/cs)**(z-1) )*cp
    denom = np.sum(a) + cs
    return denom

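I compare it against a Fortran implementation (my first Fortran function ever):

I compile the .f95 file with the following commands:

1)

f2py -c -m get_denom get_denom.f95 --fcompiler=gfortran

2)

f2py -c -m get_denom_vec get_denom.f95 --fcompiler=gfortran --f90flags='-msse2'

(the last option should turn on auto-vectorization)

I test the functions in the following way: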

import numpy as np
import get_denom as fort_denom
import get_denom_vec as fort_denom_vec
from matplotlib import pyplot as plt
%matplotlib inline

def get_denom(n_comp,qs,x,cp,cs):
    k = x[0:n_comp]
    sigma = x[n_comp:2*n_comp]
    z = x[2*n_comp:3*n_comp]
    # calculates the denominator in Equ 14a - 14c (Brooks & Cramer 1992)
    a = (sigma + z)*( k*(qs/cs)**(z-1) )*cp
    denom = np.sum(a) + cs
    return denom

n_comp = 100
cp = np.tile(1.243,n_comp)
cs = 100.
qs = np.tile(1100.,n_comp)
x= np.random.rand(3*n_comp+1)
denom = np.empty(1)
%timeit get_denom(n_comp,qs,x,cp,cs)
%timeit fort_denom.get_denom(qs,x,cp,cs,n_comp)
%timeit fort_denom_vec.get_denom(qs,x,cp,cs,n_comp)

I added the following Cython code:

import cython
# import both numpy and the Cython declarations for numpy
import numpy as np
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def get_denom(int n_comp,np.ndarray[double, ndim=1, mode='c'] qs, np.ndarray[double, ndim=1, mode='c'] x,np.ndarray[double, ndim=1, mode='c'] cp, double cs):

    cdef int i
    cdef double a
    cdef double denom   
    cdef double[:] k = x[0:n_comp]
    cdef double[:] sigma = x[n_comp:2*n_comp]
    cdef double[:] z = x[2*n_comp:3*n_comp]
    # calculates the denominator in Equ 14a - 14c (Brooks & Cramer 1992)
    a = 0.
    for i in range(n_comp):
        #a += (sigma[i] + z[i])*( pow( k[i]*(qs[i]/cs), (z[i]-1) ) )*cp[i]
        a += (sigma[i] + z[i])*( k[i]*(qs[i]/cs)**(z[i]-1) )*cp[i]

    denom = a + cs

    return denom

Edit:

Added Numexpr, using a single thread:

import numexpr as ne

def get_denom_numexp(n_comp,qs,x,cp,cs):
    k = x[0:n_comp]
    sigma = x[n_comp:2*n_comp]
    z = x[2*n_comp:3*n_comp]
    # calculates the denominator in Equ 14a - 14c (Brooks & Cramer 1992)
    a = ne.evaluate('(sigma + z)*( k*(qs/cs)**(z-1) )*cp' )
    return cs + np.sum(a)

ne.set_num_threads(1)  # using just 1 thread
%timeit get_denom_numexp(n_comp,qs,x,cp,cs)

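The results are (smaller is better):

Why does the speed of Fortran get closer to Numpy as the array size increases? And how could I speed up Cython? By using pointers?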

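There is not enough information in the comments, but some of the following may help:

1) Fortran has optimised intrinsic functions such as Sum() and Dot_Product(), which you may wish to use instead of Do loops for summation and the like.

In some cases (not necessarily here) it may be "better" to use ForAll or another construct to build a "meta" array of the terms to be summed, and then apply Sum() to that "meta" array.

However, Fortran permits array sections, so there is no need to create the automatic/intermediate arrays sigma, k and z, with their associated overhead. Instead, one could have something like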

n_compP1 = n_comp+1
n_compT2 = n_comp*2
a = Sum( x(1:n_comp)+2*x(n_compP1:n_compT2) )   ! ... just for example

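2) Sometimes (depending on compiler, machine, etc.) there can be "memory collisions" if the array sizes do not fall on particular "binary intervals" (e.g. 1024 vs. 1000).

You may wish to repeat your experiment at a few more points in your chart (i.e. at various other values of n_comp), especially near such "boundaries".

3) It is not possible to tell whether you are compiling the Fortran code with full compiler optimisation (flags). You may wish to look at the various "-O" flags etc.

4) You may wish to include OpenMP directives (or at least include openmp in your flags). This can sometimes improve certain overhead issues, even when the loops do not rely explicitly on OpenMP directives.

5) General: this may apply to each of your methods that uses a loop

a) "Constant operations" in the summation formula can be performed outside the loop, e.g. create something like qsDcs = qs/cs and use qsDcs inside the loop.

b) Similarly, it is sometimes useful to create something like zM1(:) = z(:) - 1 and use zM1(:) inside the loop.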
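Points a) and b) can be illustrated with a minimal pure-Python sketch. The function names `denom_naive` and `denom_hoisted` and the sample values are made up for illustration, and `qs` is assumed to be a scalar; this is not the original poster's code, only one way the hoisting might look.

```python
import math


def denom_naive(k, sigma, z, qs, cp, cs):
    """Recompute qs/cs and z[i]-1 on every iteration."""
    total = 0.0
    for i in range(len(k)):
        total += (sigma[i] + z[i]) * (k[i] * (qs / cs) ** (z[i] - 1)) * cp[i]
    return total + cs


def denom_hoisted(k, sigma, z, qs, cp, cs):
    """Same sum, with the loop-invariant work moved out of the loop."""
    qs_d_cs = qs / cs               # a) constant ratio, computed once
    z_m1 = [zi - 1 for zi in z]     # b) exponents, computed once
    total = 0.0
    for i in range(len(k)):
        total += (sigma[i] + z[i]) * (k[i] * qs_d_cs ** z_m1[i]) * cp[i]
    return total + cs


# Illustrative values only (assumed, not taken from the question)
k = [0.5, 0.2]
sigma = [0.1, 0.3]
z = [2.0, 3.0]
cp = [1.243, 1.243]
qs, cs = 1100.0, 100.0

naive = denom_naive(k, sigma, z, qs, cp, cs)
hoisted = denom_hoisted(k, sigma, z, qs, cp, cs)
print(math.isclose(naive, hoisted))
```

An optimising compiler may already hoist such invariants on its own, so whether this helps in Fortran or Cython is something to measure rather than assume.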

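Following on from my earlier answer and Vladimir's weak conjectures, I set up two s/r's: one as originally given, and another using array sections and Sum(). I also wanted to demonstrate that Vladimir's remarks about Do-loop optimisation are weak.

One further point I usually make is that the size of n_comp in the examples shown above is much too small. The results below put each of the "original" and the "better" SumArraySection (SAS) variations in a loop repeated 1,000 times within the timing calls, so each result is for 1,000 calcs of the respective s/r. If your timings are fractions of a second, they are quite likely unreliable.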
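The idea of repeating each calculation many times inside the timing call can be sketched in Python with the standard `timeit` module. The helper name `time_per_call`, the repeat counts and the chosen sizes (including 1000 and 1024, to sit on either side of a power of two) are assumptions for illustration; `get_denom` is the Numpy function from the question.

```python
import timeit

import numpy as np


def get_denom(n_comp, qs, x, cp, cs):
    # Numpy version from the question
    k = x[0:n_comp]
    sigma = x[n_comp:2 * n_comp]
    z = x[2 * n_comp:3 * n_comp]
    a = (sigma + z) * (k * (qs / cs) ** (z - 1)) * cp
    return np.sum(a) + cs


def time_per_call(n_comp, number=1000, repeat=5):
    """Best observed time per call in seconds (hypothetical helper)."""
    rng = np.random.default_rng(0)
    cp = np.full(n_comp, 1.243)
    qs = np.full(n_comp, 1100.0)
    x = rng.random(3 * n_comp + 1)
    cs = 100.0
    timer = timeit.Timer(lambda: get_denom(n_comp, qs, x, cp, cs))
    # Each measurement covers `number` calls; keep the fastest of `repeat` runs
    totals = timer.repeat(repeat=repeat, number=number)
    return min(totals) / number


sizes = [10, 100, 1000, 1024]
best = {n: time_per_call(n, number=200, repeat=3) for n in sizes}
for n in sizes:
    print(n, best[n])
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; the absolute numbers will of course depend on the machine.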

There are a number of variations worth considering, none of them with explicit pointers. One variation used for this illustration is

subroutine get_denomSAS (qs,x,cp,cs,n_comp,denom)

! Calculates the denominator in the SMA model (Brooks and Cramer 1992)
! The function is called at a specific salt concentration and isotherm point
! I loops over the number of components

implicit none

! declaration of input variables
integer, intent(in) :: n_comp ! number of components
double precision, intent(in) :: cs,qs ! salt concentration, free ligand concentration
double precision, Intent(In)            :: cp(:)
double precision, Intent(In)            :: x(:)

! declaration of local variables
integer :: i

! declaration of output variables
double precision, intent(out) :: denom
!
!
double precision                        :: qsDcs
!
!
qsDcs = qs/cs
!
denom = Sum( (x(n_comp+1:2*n_comp) + x(2*n_comp+1:3*n_comp))*(x(1:n_comp)*(qsDcs) &
                                            **(x(2*n_comp+1:3*n_comp)-1))*cp(1:n_comp) ) + cs
!
!
end subroutine get_denomSAS

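The key differences are:

a) The arrays are passed as (:).
b) There are no array assignments in the s/r; array sections are used instead (equivalent to "effective" pointers).
c) Sum() is used instead of Do.

Then also try two different compiler optimisations to demonstrate the implications.

As the two charts show, with low optimisation the orig code (blue diamonds) is much slower than SAS (red squares). SAS is still better with high optimisation, but the two are getting close. This is partly explained by Sum() being "better optimised" when low compiler optimisation is used.

The yellow lines show the ratio between the two s/r timings. Ignore the yellow-line value at "0" in the top image (n_comp being too small made one of the timings unstable).

Since I do not have the user's original data for the ratio to Numpy, I can only state that the SAS curve on his chart should lie below his current Fortran results, and possibly be flatter or even trending downward.

Put differently, the divergence seen in the original post may not actually exist, or at least not to that extent.

... though more experiments may help to demonstrate the other comments/answers already provided.

Dear Moritz: oops, I forgot to address your question about pointers. As stated earlier, a key reason for the improvement with the SAS variation is that it makes better use of "effective pointers", in that it removes the need to re-assign the array x() into three new local arrays (i.e. since x is passed by ref, using array sections is a kind of pointer approach built into Fortran, so no explicit pointers are needed), but it then requires Sum() or Dot_Product() or whatever.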
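On the Python side of the comparison, the slices taken in `get_denom` behave much like the "effective pointers" described here: basic Numpy slicing returns views onto `x`, not copies. The short check below is only a sketch with a made-up array; the temporaries that Numpy does allocate come from the arithmetic (for example `sigma + z`), not from the slicing itself.

```python
import numpy as np

n_comp = 4
x = np.arange(3 * n_comp + 1, dtype=float)

# Same slicing as in get_denom from the question
k = x[0:n_comp]
sigma = x[n_comp:2 * n_comp]
z = x[2 * n_comp:3 * n_comp]

# Basic slices share memory with x (they are views)
views_share = all(np.shares_memory(x, part) for part in (k, sigma, z))

# Writing through a view is visible in x
sigma[0] = -1.0
changed = x[n_comp] == -1.0

# Indexing with an integer array, in contrast, makes a copy
copied = x[np.arange(n_comp)]
copy_shares = np.shares_memory(x, copied)

print(views_share, changed, copy_shares)
```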

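Alternatively, you can keep the Do and achieve something similar by changing x into an n_comp x 3 2D array, or by passing three explicit 1D arrays of order n_comp directly. That decision will likely be driven by the size and complexity of your code, since it requires rewriting the calling/sr statements etc., and every other place where x() is used. Some of our projects have more than 300,000 lines of code, so in such cases it is much less expensive to change the code locally, as with SAS.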
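For the Numpy caller, the 2D-array alternative might look like the sketch below. It assumes the layout from the question (k, sigma and z stored back to back in `x`, followed by one extra parameter) and uses shape (3, n_comp) because Numpy arrays are row-major by default, whereas the Fortran equivalent would be n_comp x 3 with contiguous columns. The variable names are illustrative.

```python
import numpy as np

n_comp = 5
rng = np.random.default_rng(0)
x = rng.random(3 * n_comp + 1)
cp = np.full(n_comp, 1.243)
qs, cs = 1100.0, 100.0

# View the first 3*n_comp entries as three rows: k, sigma, z
params = x[:3 * n_comp].reshape(3, n_comp)
k, sigma, z = params

denom_2d = np.sum((sigma + z) * (k * (qs / cs) ** (z - 1)) * cp) + cs

# Reference: the slicing used in the question
k0 = x[0:n_comp]
sigma0 = x[n_comp:2 * n_comp]
z0 = x[2 * n_comp:3 * n_comp]
denom_ref = np.sum((sigma0 + z0) * (k0 * (qs / cs) ** (z0 - 1)) * cp) + cs

print(np.isclose(denom_2d, denom_ref), np.shares_memory(params, x))
```

Because the reshape of a contiguous slice is itself a view, this changes how the parameters are addressed without copying them.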

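I am still waiting for permission to install Numpy on one of our boxes. As noted earlier, it would be interesting to know why your relative timings imply that Numpy improves as n_comp increases.

Of course, the comments about "proper" benchmarking etc., as well as the question of which compiler switches your use of f2py implies, still apply, since those may greatly alter the character of the results.

I would be interested to see your results if they were updated for these permutations.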


For comparison, the same calculation written with an explicit Do loop:

qsDcs = qs/cs

denom = 0
do j = 1, n_comp
  denom = denom + (x(n_comp+j) + x(2*n_comp+j)) * (x(j)*(qsDcs)**(x(2*n_comp+j)-1))*cp(j)
end do

denom = denom + cs

f2py -c -m sas  sas.f90 --opt='-Ofast'
f2py -c -m dos  dos.f90 --opt='-Ofast'


In [24]: %timeit test_sas(10000)
1000 loops, best of 3: 796 µs per loop

In [25]: %timeit test_sas(10000)
1000 loops, best of 3: 793 µs per loop

In [26]: %timeit test_dos(10000)
1000 loops, best of 3: 795 µs per loop

In [27]: %timeit test_dos(10000)
1000 loops, best of 3: 797 µs per loop

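A GCC intermediate-representation (GIMPLE) dump of the summation loop; note the call to __builtin_pow on every iteration:

  <bb 8>: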
  # val.8_59 = PHI <val.8_49(9), 0.0(7)>
  # ivtmp.18_123 = PHI <ivtmp.18_122(9), 0(7)>
  # ivtmp.25_121 = PHI <ivtmp.25_120(9), ivtmp.25_117(7)>
  # ivtmp.28_116 = PHI <ivtmp.28_115(9), ivtmp.28_112(7)>
  _111 = (void *) ivtmp.25_121;
  _32 = MEM[base: _111, index: _106, step: 8, offset: 0B];
  _36 = MEM[base: _111, index: _99, step: 8, offset: 0B];
  _37 = _36 + _32;
  _40 = MEM[base: _111, offset: 0B];
  _41 = _36 - 1.0e+0;
  _42 = __builtin_pow (qsdcs_18, _41);
  _97 = (void *) ivtmp.28_116;
  _47 = MEM[base: _97, offset: 0B];
  _43 = _40 * _47;
  _44 = _43 * _42;
  _48 = _44 * _37;
  val.8_49 = val.8_59 + _48;
  ivtmp.18_122 = ivtmp.18_123 + 1;
  ivtmp.25_120 = ivtmp.25_121 + _118;
  ivtmp.28_115 = ivtmp.28_116 + _113;
  if (ivtmp.18_122 == _96)
    goto <bb 10>;
  else
    goto <bb 9>;

  <bb 9>:
  goto <bb 8>;

  <bb 10>:
  # val.8_13 = PHI <val.8_49(8), 0.0(6)>
  _51 = val.8_13 + _17;
  *denom_52(D) = _51;

import numpy as np
import dos as dos
import sas as sas
from matplotlib import pyplot as plt
import timeit
import numexpr as ne

#%matplotlib inline



ne.set_num_threads(1)

def test_n(n_comp):

    cp = np.tile(1.243,n_comp)
    cs = 100.
    qs = np.tile(1100.,n_comp)
    x= np.random.rand(3*n_comp+1)

    def test_dos():
        denom = np.empty(1)
        dos.get_denomsas(qs,x,cp,cs,n_comp)


    def test_sas():
        denom = np.empty(1)
        sas.get_denomsas(qs,x,cp,cs,n_comp)

    def get_denom():
        k = x[0:n_comp]
        sigma = x[n_comp:2*n_comp]
        z = x[2*n_comp:3*n_comp]
        # calculates the denominator in Equ 14a - 14c (Brooks & Cramer 1992)
        a = (sigma + z)*( k*(qs/cs)**(z-1) )*cp
        denom = np.sum(a) + cs
        return denom

    def get_denom_numexp():
        k = x[0:n_comp]
        sigma = x[n_comp:2*n_comp]
        z = x[2*n_comp:3*n_comp]
        loc_cp = cp
        loc_cs = cs
        loc_qs = qs
        # calculates the denominator in Equ 14a - 14c (Brooks & Cramer 1992)
        a = ne.evaluate('(sigma + z)*( k*(loc_qs/loc_cs)**(z-1) )*loc_cp' )
        return cs + np.sum(a)

    # timeit needs an integer repeat count, hence floor division
    print('py', timeit.Timer(get_denom).timeit(1000000//n_comp))
    print('dos', timeit.Timer(test_dos).timeit(1000000//n_comp))
    print('sas', timeit.Timer(test_sas).timeit(1000000//n_comp))
    print('ne', timeit.Timer(get_denom_numexp).timeit(1000000//n_comp))


def test():
    for n in [10,100,1000,10000,100000,1000000]:
        print("-----")
        print(n)
        test_n(n)

            py              dos             sas             numexpr
10          11.2188110352   1.8704519272    1.8659651279    28.6881871223
100         1.6688809395    0.6675260067    0.667083025     3.4943861961
1000        0.7014708519    0.5406000614    0.5441288948    0.9069931507
10000       0.5825948715    0.5269498825    0.5309231281    0.6178650856
100000      0.5736029148    0.526198864     0.5304090977    0.5886831284
1000000     0.6355218887    0.5294830799    0.5366530418    0.5983200073
10000000    0.7903120518    0.5301260948    0.5367569923    0.6030929089
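To make the trend in the table easier to read, the timings can be expressed relative to the Do-loop Fortran version. The sketch below copies the numbers from the table above verbatim; the helper name `ratios_to_dos` is made up for illustration.

```python
# (n_comp, py, dos, sas, numexpr), copied from the table above
rows = [
    (10,       11.2188110352, 1.8704519272, 1.8659651279, 28.6881871223),
    (100,      1.6688809395,  0.6675260067, 0.667083025,  3.4943861961),
    (1000,     0.7014708519,  0.5406000614, 0.5441288948, 0.9069931507),
    (10000,    0.5825948715,  0.5269498825, 0.5309231281, 0.6178650856),
    (100000,   0.5736029148,  0.526198864,  0.5304090977, 0.5886831284),
    (1000000,  0.6355218887,  0.5294830799, 0.5366530418, 0.5983200073),
    (10000000, 0.7903120518,  0.5301260948, 0.5367569923, 0.6030929089),
]


def ratios_to_dos(table):
    """Divide every timing by the 'dos' timing of the same row."""
    out = {}
    for n, py, dos, sas, ne in table:
        out[n] = {"py": py / dos, "sas": sas / dos, "numexpr": ne / dos}
    return out


ratios = ratios_to_dos(rows)
for n in sorted(ratios):
    r = ratios[n]
    print(f"{n:>9d}  py/dos={r['py']:.3f}  sas/dos={r['sas']:.3f}  "
          f"numexpr/dos={r['numexpr']:.3f}")
```

Read this way, the two Fortran variants stay within about two percent of each other at every size, while the Numpy and Numexpr ratios shrink considerably between n_comp = 10 and n_comp = 10000.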