Python: How can I make this list run faster?


python algorithm list optimization dictionary


How can I make this function faster? The current code is:

def removeDuplicatesFromList(seq): 
    # Not order preserving 
    keys = {}
    for e in seq:
        keys[e] = 1
    return keys.keys()

def countWordDistances(li):
    '''
    If li = ['that','sank','into','the','ocean']    
    This function would return: { that:1, sank:2, into:3, the:4, ocean:5 }
    However, if there is a duplicate term, take the average of their positions
    '''
    wordmap = {}
    unique_words = removeDuplicatesFromList(li)
    for w in unique_words:
        distances = [i+1 for i,x in enumerate(li) if x == w]
        wordmap[w] = float(sum(distances)) / float(len(distances)) #take average
    return wordmap

The first thing that comes to mind is to use a set to remove the duplicate words:

def countWordDistances(li):
    '''
    If li = ['that','sank','into','the','ocean']    
    This function would return: { that:1, sank:2, into:3, the:4, ocean:5 }
    However, if there is a duplicate term, take the average of their positions
    '''
    wordmap = {}
    unique_words = set(li)
    for w in unique_words:
        distances = [i+1 for i,x in enumerate(li) if x == w]
        wordmap[w] = float(sum(distances)) / float(len(distances)) #take average
    return wordmap

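In general, though, if you are worried about speed, you need to profile the function to see where the bottleneck is, and then try to reduce that bottleneck.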
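
As a rough illustration of that advice, the sketch below runs the set-based version under the standard-library cProfile and pstats modules. The sample word list and the helper name profile_word_distances are made up for this example; the report will look different on real data.

```python
import cProfile
import io
import pstats


def countWordDistances(li):
    # Set-based version: one scan of li per unique word.
    wordmap = {}
    for w in set(li):
        distances = [i + 1 for i, x in enumerate(li) if x == w]
        wordmap[w] = float(sum(distances)) / len(distances)
    return wordmap


def profile_word_distances(words):
    # Hypothetical helper: run the function under cProfile and return
    # the result together with a text report sorted by cumulative time.
    profiler = cProfile.Profile()
    profiler.enable()
    result = countWordDistances(words)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


# Made-up sample data: 2,000 words drawn from 200 distinct values.
sample = ["w%d" % (i % 200) for i in range(2000)]
result, report = profile_word_distances(sample)
print(report)
```

The rows with the largest cumulative time point at the code worth optimizing first.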

Use a set instead of a dict, since you are not doing anything with the values:

unique_words = set(li)

As a one-liner:

def removeDuplicatesFromList(seq):
    return frozenset(seq)
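
A frozenset removes duplicates just as a set does, but it cannot be modified afterwards and, being hashable, can be used as a dictionary key. A small sketch with a made-up word list:

```python
def removeDuplicatesFromList(seq):
    return frozenset(seq)


words = ['that', 'sank', 'into', 'the', 'ocean', 'that']
unique_words = removeDuplicatesFromList(words)
print(len(unique_words))

# A frozenset has no mutating methods such as add().
try:
    unique_words.add('new')
    was_modified = True
except AttributeError:
    was_modified = False

# Unlike a set, a frozenset is hashable, so it can serve as a dict key.
lookup = {unique_words: 'seen'}
```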

What I did on the last line is a dictionary comprehension, which is similar to a list comprehension.

Using a list comprehension:

from __future__ import division   # no need for this if using py3k

def countWordDistances(li):
    '''
    If li = ['that','sank','into','the','ocean']    
    This function would return: { that:1, sank:2, into:3, the:4, ocean:5 }
    However, if there is a duplicate term, take the average of their positions
    '''
    unique_words = set(li)
    # For each unique word, collect its 1-based positions and average them.
    return {w: sum(dist) / len(dist)
            for w, dist in ((w, [i + 1 for i, x in enumerate(li) if x == w])
                            for w in unique_words)}

I'm not sure whether this will be faster than using a set, but it only needs to go through the list once.

def countWordDistances(l):
    unique_words = set(l)
    idx = [[i for i,x in enumerate(l) if x==item]
            for item in unique_words]
    return {item:1.*sum(idx[i])/len(idx[i]) + 1.
            for i,item in enumerate(unique_words)}

li = ['that','sank','into','the','ocean']
countWordDistances(li)
# {'into': 3.0, 'ocean': 5.0, 'sank': 2.0, 'that': 1.0, 'the': 4.0}

li2 = ['that','sank','into','the','ocean', 'that']
countWordDistances(li2)
# {'into': 3.0, 'ocean': 5.0, 'sank': 2.0, 'that': 3.5, 'the': 4.0}

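This returns a modified version of wordmap in which the value associated with each key is a tuple of the average position and the number of occurrences. You can obviously convert it to the form of the original output easily, but that would take some time.

The code essentially maintains a running average while iterating over the list, recomputing it each time as a weighted average.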

def countWordDistances(li):
    wordmap = {}
    # Positions are 1-based, matching the original function.
    for i, w in enumerate(li, 1):
        if w in wordmap:
            avg, num = wordmap[w]
            # Weighted update of the running average with the new position.
            new_avg = avg*(num/(num+1.0)) + (1.0/(num+1.0))*i
            wordmap[w] = new_avg, num+1
        else:
            wordmap[w] = (i, 1)

    return wordmap
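
If the plain word-to-average mapping is needed, the tuple values can be unpacked in one extra pass over the dictionary. A minimal sketch, assuming a wordmap of (average, count) tuples like the one described here; the sample values and the helper name to_plain_averages are illustrative:

```python
def to_plain_averages(wordmap):
    # Keep only the average from each (average, count) tuple.
    return {word: float(avg) for word, (avg, count) in wordmap.items()}


# Made-up sample in the (average position, occurrences) form.
sample = {'that': (3.5, 2), 'sank': (2, 1), 'into': (3, 1)}
plain = to_plain_averages(sample)
print(plain)
```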

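This makes only one pass over the list and keeps the operations to a minimum. I timed it on a word list with 1.1 million entries and 29k unique words, and it was almost twice as fast as Patrick's answer. On a list of 10k words with 2k unique words, it was more than 300 times faster than the OP's code.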

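To make Python code run faster, there are two rules to remember: use the best algorithm, and avoid Python.

On the algorithm side, iterating over the list once instead of N+1 times (N = the number of unique words) is the main thing that speeds this up.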
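
To make the difference concrete, the following sketch times a version that rescans the list for every unique word against a version that visits each element once. The function names, the synthetic word list, and the repeat count are assumptions made for this illustration; actual timings depend on the machine and the data.

```python
import timeit
from collections import defaultdict


def average_positions_rescan(words):
    # N + 1 passes: one to build the set, then one scan per unique word.
    result = {}
    for w in set(words):
        positions = [i for i, x in enumerate(words, 1) if x == w]
        result[w] = sum(positions) / float(len(positions))
    return result


def average_positions_single_pass(words):
    # One pass over the list, then one pass over the unique words.
    totals = defaultdict(float)
    counts = defaultdict(int)
    for i, w in enumerate(words, 1):
        totals[w] += i
        counts[w] += 1
    return {w: totals[w] / counts[w] for w in totals}


# Synthetic data: 5,000 words drawn from 500 distinct values.
words = ["w%d" % (i % 500) for i in range(5000)]

slow = timeit.timeit(lambda: average_positions_rescan(words), number=3)
fast = timeit.timeit(lambda: average_positions_single_pass(words), number=3)
print("rescan: %.4fs, single pass: %.4fs" % (slow, fast))
```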

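By "avoid Python" I mean that you want as much of your code as possible to execute in C. So using defaultdict is better than a dict where you explicitly check whether the key is present: defaultdict does that check for you, but in the Python implementation it is done in C. enumerate is better than for i in range(len(li)), again because it involves fewer Python steps. And enumerate(li, 1) makes the count start at 1 instead of needing a Python +1 somewhere in the loop.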
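
A small sketch of the two styles on a made-up list. Both build the same mapping from each word to its 1-based positions, but the second leaves the key check, the indexing, and the +1 to built-in machinery:

```python
from collections import defaultdict

li = ['that', 'sank', 'into', 'the', 'that']

# Explicit membership check and index-based loop, all in Python code.
positions_checked = {}
for i in range(len(li)):
    if li[i] not in positions_checked:
        positions_checked[li[i]] = []
    positions_checked[li[i]].append(i + 1)

# defaultdict creates the missing list itself, and enumerate(li, 1)
# yields 1-based positions directly.
positions_default = defaultdict(list)
for i, w in enumerate(li, 1):
    positions_default[w].append(i)

print(dict(positions_default))
```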

Edit: a third rule: use PyPy. My code runs twice as fast on PyPy as on 2.7.

Based on @Ned Batcheld's solution, but without creating dummy lists:

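import collections

def countWordDistances(li):
    # Collect the 1-based positions of every word in a single pass.
    wordmap = collections.defaultdict(list)
    for i, w in enumerate(li, 1):
        wordmap[w].append(i)
    # Replace each list of positions with its average.
    # items() works on both Python 2 and Python 3.
    for k, v in wordmap.items():
        wordmap[k] = sum(v)/float(len(v))

    return wordmap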

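Everyone else suggested using set. What is the advantage of using frozenset?

@user849364: The main difference is that set is mutable, while frozenset is immutable. I don't believe there is a performance advantage, but it tells readers of the code that the set will not be modified. See the Python documentation for more information.

Dictionary comprehensions are also available in Python 2.7. Before that, the same idea could be achieved by calling dict with a generator expression, e.g. dict((i, 2*i) for i in range(4)) produces {0: 0, 1: 2, 2: 4, 3: 6}.

Does this work for you? I get "global name 'w' is not defined", because "x==w" is inside the loop that defines w.

Going through the list only once is the key.

+1 for the cool optimization. I had a decent idea of the best algorithm, but not the Python knowledge you have clearly mastered.

Why not accumulate the sufficient statistics, the total and the count? I'll add an answer.

@Neil G: Nice work, yours is about 10% faster than mine, and it suggests another rule: avoid memory allocation.

+1 for "avoid Python [by using Python cleverly]". From your tweet, I expected "avoid Python" to be about native extensions and the like.

"Use PyPy" and "avoid Python" don't go well together. Perhaps "use PyPy" and "use Python" would be better?

OK, how is this for a gross hack: if you only need a list of two reals, use a complex number! It cut the running time of your solution by 25%, but yuck!

@Ned: Ha, yes! Did you try lambda: numpy.zeros(2)? You probably know better than I do, but some day I hope someone writes a great Python optimizer so that we can just focus on the algorithms (which is why I fell in love with Python).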

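import collections

def countWordDistances(li):
    # Each word maps to [sum of 1-based positions, number of occurrences].
    wordmap = collections.defaultdict(lambda:[0.0, 0.0])
    for i, w in enumerate(li, 1):
        wordmap[w][0] += i
        wordmap[w][1] += 1.0
    # Replace each [total, count] pair with the average position.
    # items() works on both Python 2 and Python 3.
    for k, (t, n) in wordmap.items():
        wordmap[k] = t / n
    return wordmap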
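
The complex-number hack mentioned in the comments is not shown in the thread. One way it might look is sketched below for Python 3, with the position total kept in the real part and the count in the imaginary part. The function name is made up, and no particular speedup is claimed here.

```python
import collections


def count_word_distances_complex(li):
    # defaultdict(complex) starts every word at 0j.
    acc = collections.defaultdict(complex)
    for i, w in enumerate(li, 1):
        # Real part accumulates positions, imaginary part counts occurrences.
        acc[w] += complex(i, 1)
    return {w: z.real / z.imag for w, z in acc.items()}


li2 = ['that', 'sank', 'into', 'the', 'ocean', 'that']
averages = count_word_distances_complex(li2)
print(averages)
```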