Using pstats and cProfile in Python. How to make the array work faster?


python optimization


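This is my first time optimizing code and I am excited about it. I have read some articles, but I still have some questions.

1) First of all, what takes so much time in my code below? I think it is the array here: array.append(len(set(line.split()))). I read online that lists in Python work faster, but I don't see how to use a list here. Does anybody know how this can be improved?

2) Are there any other improvements I am missing?

3) Also, I read online that for loops slow code down a lot. Can that be improved here? (I guess writing the code in C would be best, but :D)

4) Why do people always suggest looking at "ncalls" and "tottime"? To me, "percall" makes more sense. It tells you how fast a function or call is.

5) In the correct answer here, class B, he applied lists. Did he? To me, I still see an array and a for loop, which are supposed to slow things down.

Thank you.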

New cProfile results:

 618384 function calls in 9.966 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    19686    3.927    0.000    4.897    0.000 <ipython-input-120-d8351bb3dd17>:14(f)
    78744    3.797    0.000    3.797    0.000 {numpy.core.multiarray.array}
    19686    0.948    0.000    0.948    0.000 {range}
    19686    0.252    0.000    0.252    0.000 {method 'partition' of 'numpy.ndarray' objects}
    19686    0.134    0.000    0.930    0.000 function_base.py:2896(_median)
        1    0.126    0.126    9.965    9.965 <ipython-input-120-d8351bb3dd17>:22(<module>)
    19686    0.125    0.000    0.351    0.000 _methods.py:53(_mean)
    19686    0.120    0.000    0.120    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    19686    0.094    0.000    4.793    0.000 function_base.py:2747(_ureduce)
    19686    0.071    0.000    0.071    0.000 {method 'flatten' of 'numpy.ndarray' objects}
    19686    0.065    0.000    0.065    0.000 {method 'format' of 'str' objects}
    78744    0.055    0.000    3.852    0.000 numeric.py:464(asanyarray)

The median computation Numpy uses is O(n log n). You call numpy.median once per line, so your algorithm ends up being O(n^2 log n).

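There are a few ways to improve on this. One is to keep the array sorted (that is, insert each element at the position that maintains sorted order). Each insertion takes O(n) (inserting into an array is a linear-time operation), and getting the median of a sorted array is O(1), so the result is O(n^2).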
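The sorted-insertion idea above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the asker's code: the helper name running_medians is hypothetical, and bisect.insort from the standard library is used here as one possible way to do the insertion.

```python
import bisect


def running_medians(values):
    """Yield the median of all values seen so far, after each new value."""
    ordered = []  # kept sorted at all times
    for value in values:
        # Binary search finds the slot in O(log n); the insert itself is O(n).
        bisect.insort(ordered, value)
        n = len(ordered)
        mid = n // 2
        if n % 2:
            # Odd length: the single middle element is the median.
            yield float(ordered[mid])
        else:
            # Even length: average the two middle elements.
            yield (ordered[mid - 1] + ordered[mid]) / 2.0


print(list(running_medians([5, 1, 4, 2])))  # [5.0, 3.0, 4.0, 3.0]
```

Each step costs one linear-time insert plus a constant-time lookup, which matches the O(n^2) total described above.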

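For profiling, the main thing you want to look at is tottime, because it tells you how much time your program spent inside a function in total. percall is sometimes not very useful, as in your example: if you have a slow function (high percall) that is called only a few times (low ncalls), its tottime ends up insignificant compared with other functions.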
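A small sketch of that point, assuming Python 3. The functions cheap, expensive and workload are hypothetical names made up for the illustration; only cProfile and pstats calls from the standard library are used. In the printed report, expensive should show the higher percall while cheap, called far more often, accumulates the larger tottime (exact numbers depend on the machine).

```python
import cProfile
import io
import pstats


def cheap(x):
    # Fast per call, but called very many times.
    return x * x


def expensive():
    # Slow per call, but called only once.
    total = 0
    for i in range(20000):
        total += i * i
    return total


def workload():
    expensive()
    for i in range(200000):
        cheap(i)


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
# 'tottime' sorts by time spent inside each function itself, excluding callees.
stats.sort_stats('tottime').print_stats(5)
print(buffer.getvalue())
```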

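Did you reach this conclusion from the cProfile output? If so, how do you see it? The only hint about why it takes so long is the line "numpy.core.multiarray.array". But it does not mention the median function, which is part of the numpy library.

{numpy.core.multiarray.array} is the time spent on everything in the numpy library. I reached this conclusion because there is only one call to a numpy library function in the loop, the call to numpy.median. You can check this by wrapping the call to median in a new function and looking at the tottime spent in that function in cProfile.

Hm!! Very interesting. I implemented sorted insertion into the array. The total running time dropped from 11.873 to 9.966 seconds. I will post my new solution at the top. It looks like my code now spends 3.797 seconds in {numpy.core.multiarray.array}, versus 9.309 seconds before, but the function I created takes another 3.927 seconds. So 3.797 + 3.927 = 7.724 seconds. It comes out about the same. Am I doing something wrong?

First, you should try using a built-in for the sorted insertion, because Python's implementation should be more optimized. Second, it looks like numpy is still doing a lot of work to find the median of the sorted array. numpy does not know that the array is sorted, so it will run its internal sorting algorithm (I believe it is quicksort) on the already-sorted array, which is faster because your array is sorted. However, you can skip numpy entirely, because the median of a sorted array can be found just by taking the middle index or indices. You can get the len of the array (which is constant time), and if len is odd, just return array[len/2] (integer division). If len is even, you need to take the average of the middle two elements.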
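The check suggested in the comments (wrapping the median call in its own function so it gets a separate row in the profile) could look roughly like this, assuming Python 3. The names median_wrapper and main are hypothetical, and statistics.median stands in for numpy.median so the sketch runs without third-party packages. Note that the wrapper's tottime excludes time spent in the functions it calls, so the cumtime column is the one that reflects the median work.

```python
import cProfile
import io
import pstats
import statistics


def median_wrapper(values):
    # Hypothetical wrapper: gives the median call its own row in the profile.
    return statistics.median(values)


def main():
    values = []
    for i in range(500):
        values.append(i % 37)
        median_wrapper(values)


profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
# Restrict the report to rows whose name matches 'median_wrapper'.
stats.sort_stats('cumulative').print_stats('median_wrapper')
print(buffer.getvalue())
```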

New solution with sorted insertion:

import numpy
import cProfile

pr = cProfile.Profile()
pr.enable()

#paths to files
read_path = '../tweet_input/tweets.txt'
write_path = "../tweet_output/ft2.txt"


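def f(a):
    # Insert a into the module-level sorted list `array`, keeping it sorted.
    for i in range(0, len(array)):
        if a <= array[i]:
            array.insert(i, a)
            break
    else:
        # No element >= a was found (or the list is empty): a goes at the end.
        array.append(a)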

try:
    with open(read_path) as inf, open(write_path, 'a') as outf:
        array = []
        #for every line (tweet) in the file
        for line in inf:                                            ###Loop is bad. Builtin function is good
            #append amount of unique words to the array
            wordCount = len(set(line.split()))
            #print wordCount, array
            f(wordCount)
            #write current median of the array to the file
            result = "{:.2f}\n".format(numpy.median(array))
            outf.write(result)
except IOError as e:
    print('Operation failed: %s' % e.strerror)


###Service
pr.disable()
pr.print_stats(sort = 'time')

Original version of the loop:

    with open(read_path) as inf, open(write_path, 'a') as outf:
        array = []
        #for every line in the file
        for line in inf:                            
            #append amount of unique words to the array
            array.append(len(set(line.split())))
            #write current median of the array to the file
            result = "{:.2f}\n".format(numpy.median(array))
            outf.write(result)