Why is Python so slow for a simple for loop?

python performance

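We are implementing some kNN and SVD code in Python. Others picked Java. Our execution times are very different. I used cProfile to see where I was making mistakes, but everything actually looks fine. Yes, I am using numpy as well. But I would like to ask a simple question.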

total = 0.0
for i in range(9999):  # xrange is slower according to my test,
    for j in range(1, 9999):  # but more memory-friendly.
        total += (i / j)
print total

This snippet takes 31.40 seconds on my computer.

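On the same computer, the Java version of this code takes 1 second or less. I guess type checking is the main problem for this code. But I have to do lots of operations like this in my project, and I do not think 9999*9999 is such a big number.

I think I am making mistakes, because I know Python is used by lots of scientific projects. But why is this code so slow, and how can I handle problems bigger than this?

Should I use a JIT compiler such as Psyco?

EDIT: I should also say that this loop is only an example. The real code is not as simple as this, and it may be hard to put your improvements or code samples into practice.

Another question: can I implement lots of data mining and machine learning algorithms with numpy and scipy if I use them correctly?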

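This is a known phenomenon: Python code is dynamic and interpreted, Java code is statically typed and compiled. No surprises there.

The reasons people give for preferring Python are often:

A smaller code base
Less redundancy (more DRY)
Cleaner code

However, if you use a library written in C (from Python), the performance may be much better (compare pickle to cPickle).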

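Because you mention scientific code, have a look at numpy. What you are doing has probably already been done (or rather, numpy uses LAPACK for things like SVD). When you hear about Python being used for scientific code, people probably are not referring to using it the way you do in your example.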

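As a quick example:

(If you are using Python 3, your example would use float division. My example assumes you are using Python 2.x, and therefore integer division. If not, specify i = np.arange(9999, dtype=np.float64), etc.)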
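The integer-division variant described in the parenthetical can be sketched like this. It is only a minimal sketch, assuming Python 3 and NumPy, where Python 2's integer `/` becomes `//` and `np.floor_divide`; the helper names `loop_total_int` and `numpy_total_int` are made up for illustration, and a small `n` keeps the comparison quick.

```python
import numpy as np

def loop_total_int(n):
    """Pure-Python double loop with integer (floor) division, as in Python 2."""
    total = 0
    for i in range(n):
        for j in range(1, n):
            total += i // j
    return total

def numpy_total_int(n):
    """Same sum, with the whole table of quotients built in compiled NumPy code."""
    i = np.arange(n, dtype=np.int64)
    j = np.arange(1, n, dtype=np.int64)
    # ufunc.outer applies floor division to every (i, j) pair at once.
    return int(np.floor_divide.outer(i, j).sum())

n = 200
print(loop_total_int(n), numpy_total_int(n))
```

The outer product needs memory for all n*(n-1) quotients at once, which is the price paid for moving the loop out of the interpreter.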

To give some idea of timing (I will use floating-point division here, instead of integer division as in your example):

import numpy as np

def f1(num):
    total = 0.0
    for i in range(num): 
        for j in range(1, num):
            total += (float(i) / j)
    return total

def f2(num):
    i = np.arange(num, dtype=np.float64)
    j = np.arange(1, num, dtype=np.float64)
    return np.divide.outer(i, j).sum()

def f3(num):
    """Less memory-hungry (and faster) version of f2."""
    total = 0.0
    j = np.arange(1, num, dtype=np.float64)
    for i in xrange(num):  # xrange is Python 2; use range on Python 3
        total += (i / j).sum()
    return total

If we compare timings:

In [30]: %timeit f1(9999)
1 loops, best of 3: 27.2 s per loop

In [31]: %timeit f2(9999)
1 loops, best of 3: 1.46 s per loop

In [32]: %timeit f3(9999)
1 loops, best of 3: 915 ms per loop

You will find that list comprehensions or generator expressions are significantly faster. For example:

total = sum(i / j for j in xrange(1, 9999) for i in xrange(9999))

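This executes in about 11 seconds on my machine, versus about 26 seconds for your original code. That is still an order of magnitude slower than Java, but it is more in line with what you would expect.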
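A rough way to repeat this comparison on your own machine is sketched below. It assumes Python 3 (so `//` stands in for Python 2's integer `/`), and the function names `loop_sum` and `genexp_sum` are made up for illustration. The ratio between the two timings depends heavily on the interpreter version, so it may not match the numbers quoted above.

```python
import timeit

def loop_sum(n):
    """Explicit nested for loops."""
    total = 0
    for i in range(n):
        for j in range(1, n):
            total += i // j
    return total

def genexp_sum(n):
    """Same sum, with the iteration driven by sum() over a generator expression."""
    return sum(i // j for j in range(1, n) for i in range(n))

n = 300  # small on purpose so the script finishes quickly
assert loop_sum(n) == genexp_sum(n)
print("loop  :", timeit.timeit(lambda: loop_sum(n), number=5))
print("genexp:", timeit.timeit(lambda: genexp_sum(n), number=5))
```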

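By the way, the original code can be sped up slightly by initializing total to 0 rather than 0.0, so that it uses integer rather than floating-point addition. Your divisions all have integer results, so there is no point in summing them into a float.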

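On my machine, Psyco actually slows the generator expression down to about the same speed as your original loop (and it does not accelerate the original loop at all).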

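Why is Java faster than Python on this example loop?

Novice explanation: think of a program as a freight train that lays its own track as it moves forward. Track has to be laid before the train can move. The Java freight train can send thousands of track layers ahead of the train, all working in parallel and laying track miles in advance, whereas Python can only send one worker at a time, who can only lay track 10 feet in front of the train.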

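Java has static types, which lets its JIT compiler turn the loop into native machine code that the CPU can pipeline, fetching memory and overlapping instructions before their results are needed. Python variables have no fixed type, so the nature of the work has to be decided for every single operation at run time: the interpreter checks the operand types and dispatches to the matching routine on each iteration. Both nested loops perform the same O(n^2) number of iterations; what differs is the constant cost of each iteration, which is far higher in the interpreter.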

"I think I am making mistakes, because I know Python is used by lots of scientific projects."

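They are heavily using SciPy (NumPy being the most prominent component, although I have heard that the ecosystem that developed around NumPy's API is even more important), which vastly speeds up all kinds of operations these projects need. That is what you are doing wrong: you are not writing your critical code in C. Python is great for development in general, but well-placed extension modules are a vital optimization in their own right (at least when you are crunching numbers). Python is a really poor language for implementing tight inner loops.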

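The default implementation (for the time being the most popular and most widely supported one) is a simple bytecode interpreter. Even the simplest operations, such as an integer division, can take hundreds of CPU cycles, multiple memory accesses (type checks being a popular example), several C function calls and so on, instead of a few instructions (or even a single one, in the case of integer division). Moreover, the language is designed with many abstractions that add overhead. Your loop allocates 9999 objects on the heap if you use xrange, and far more if you use range (9999*9999 integers, minus around 256*256 for the small integers that are cached). Also, the xrange version calls a method on each iteration to advance; the range version would too if iteration over sequences had not been optimized specifically. It still takes a whole bytecode dispatch though, which is itself vastly complex (compared to an integer division, of course).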
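To make the dispatch overhead a little more concrete, the standard `dis` module can list the bytecode instructions behind one `total += i / j` step. This is a small sketch assuming CPython 3; `inner_step` is a made-up helper, and the exact opcode names and counts differ between interpreter versions.

```python
import dis

def inner_step(total, i, j):
    """One iteration's worth of work from the inner loop."""
    total += i / j
    return total

# Each printed line is one trip through the interpreter's dispatch loop.
for instr in dis.get_instructions(inner_step):
    print(instr.opname, instr.argrepr)
```

Every one of these instructions is decoded and dispatched by the interpreter, and the division and addition additionally look up the operand types at run time, whereas a compiled loop can do the same step in a handful of machine instructions.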

It would be interesting to see what a JIT would do (I would recommend PyPy over Psyco, the latter is not ...

if __name__ == '__main__':
    total = 0.0
    i = 1
    while i <= 9999:
        j = 1
        while j <= 9999:
            total += 1  # accumulate, matching total+=1 in the Java version
            j += 1
        i += 1
    print(total)

public class Main{
    public static void main(String args[]){
        double total = 0.0;  // double: a float stops growing at 2^24 when incremented by 1

        long start_time = System.nanoTime();
        int i=1;

        while (i<=9999){
            int j=1;
            while(j<=9999){
                total+=1;
                j+=1;
            }
            i+=1;
        }
        long end_time = System.nanoTime();

        System.out.println("total: " + total);
        System.out.println("total milliseconds: " + 
           (end_time - start_time)/1000000);
    }
}

from __future__ import division

cdef double total = 0.00
cdef int i, j
for i in range(9999):
    for j in range(1, 10000+i):
        total += (i / j)

from time import time
# Note: the timer starts after the loops, so it measures only the print call below.
t = time()
print("total = %d" % total)
print("time = %f[s]" % (time() - t))

$ cython loops.pyx
$ gcc -I/usr/include/python2.7 -shared -pthread -fPIC -fwrapv -Wall -fno-strict-aliasing -O3 -o loops.so loops.c
$ python -c "import loops"

total = 514219068
time = 0.000047[s]

%timeit f3(9999)
704 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def f4(num):
    x=np.ones(num-1)
    y=np.arange(1,num)
    return np.sum(np.true_divide(x,y))*np.sum(y)

155 µs ± 284 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Problem = (1 + 2 + ... + (num-1)) * (1/1 + 1/2 + ... + 1/(num-1))
1 + 2 + ... + (num-1) = np.sum(np.arange(1, num)) = num*(num-1)/2
1/1 + 1/2 + ... + 1/(num-1) = np.sum(np.true_divide(1, y)) = np.sum(np.reciprocal(y.astype(np.float64))), where y = np.arange(1, num)
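The factorization works because i and j range independently, so the double sum of i/j splits into (sum of i) times (sum of 1/j). A quick sanity check in plain Python might look like this; the function names `double_loop` and `factored` are made up for illustration.

```python
import math

def double_loop(num):
    """Direct O(num^2) evaluation of the original nested loops."""
    total = 0.0
    for i in range(num):
        for j in range(1, num):
            total += i / j
    return total

def factored(num):
    """O(num) evaluation using the product of the two independent sums."""
    sum_i = num * (num - 1) / 2                      # 0 + 1 + ... + (num-1)
    harmonic = sum(1.0 / j for j in range(1, num))   # 1/1 + ... + 1/(num-1)
    return sum_i * harmonic

num = 300
print(double_loop(num), factored(num))
print(math.isclose(double_loop(num), factored(num), rel_tol=1e-9))
```

The two results agree only up to floating-point rounding, because the additions happen in a different order.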

def f5(num):
    return np.sum(np.reciprocal(np.arange(1, num).astype(np.float64))) * num*(num-1)/2
%timeit f5(9999)
106 µs ± 615 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

1/1 + 1/2 + ... + 1/(num-1) ≈ np.log(num-1) + 1/(2*num-2) + np.euler_gamma   (an approximation, for num > 2)
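This is the standard asymptotic expansion of the harmonic number, H_n ≈ ln(n) + γ + 1/(2n), applied with n = num-1; the first neglected term is about 1/(12 n^2). The sketch below compares it against the exact sum in plain Python. `EULER_GAMMA` is hard-coded here to avoid a NumPy dependency (it is the same value as `np.euler_gamma`), and the function names are made up for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def harmonic_exact(n):
    """1/1 + 1/2 + ... + 1/n, summed term by term."""
    return sum(1.0 / k for k in range(1, n + 1))

def harmonic_approx(n):
    """Closed-form approximation ln(n) + gamma + 1/(2n)."""
    return math.log(n) + EULER_GAMMA + 1.0 / (2 * n)

for n in (10, 100, 10_000):
    exact = harmonic_exact(n)
    approx = harmonic_approx(n)
    print(n, exact, approx, abs(exact - approx))
```

The printed differences shrink roughly like 1/n^2, which is why the closed form is accurate enough for large num while costing constant time.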

def f6(num):
    return (np.log(num-1)+1/(2*num-2)+np.euler_gamma)* num*(num-1)/2
%timeit f6(9999)
4.82 µs ± 29.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit f3(99999)
56.7 s ± 590 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f5(99999)
534 µs ± 86.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit f5(99999999)
1.42 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
9.498947911958416e+16
%timeit f6(99999999)
4.88 µs ± 26.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.498947911958506e+16
%timeit f6(9999999999999999999)
17.9 µs ± 921 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

from numba import jit
@jit
def f7(num):
    return (np.log(num-1)+1/(2*num-2)+np.euler_gamma)* num*(num-1)/2
# same code with f6(num)

%timeit f6(999999999999999)
5.63 µs ± 29.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
f7(123) # compile f7(num)
%timeit f7(999999999999999)
331 ns ± 1.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit f7(9999)
286 ns ± 3.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


import time


N=100
i=0
j=0

StartTime=time.time()
while j<N:
    j=j+1
    i=0  # reset the inner counter so every outer pass performs one million divisions
    while i<1000000:
        a=float(i)/float(j)
        i=i+1
EndTime=time.time()

DeltaTime=(EndTime-StartTime) # time in seconds


MIPS=(1/DeltaTime)*N



print("This program estimates the MIPS that your computational unit can perform")
print("------------------------------------------")
print("Execution Time in Seconds=",DeltaTime)
print("MIPS=",MIPS) 
print("------------------------------------------")

#include <stdio.h>
#include <time.h>


int main(){

int i,j;
int N=100;
float a, DeltaTime, MIPS;
clock_t StartTime, EndTime;

StartTime=clock();

// This calculates N times one million divisions

for (j=1;j<=N; j++)
 {
    for(i=0;i<1000000;i=i+1)
     {
      a=(float)(i)/(float)(j);
     }
 }


EndTime=clock(); // clock() returns processor time in clock ticks

DeltaTime=(float)(EndTime - StartTime)/CLOCKS_PER_SEC; // convert ticks to seconds

MIPS=(1/DeltaTime)*N;

printf("------------------------------------------\n");
printf("Execution Time in Seconds=%f \n", DeltaTime);
printf("MIPS=%f \n", MIPS);
printf("------------------------------------------\n");

return 0;

}