Python: is there anything faster than a dict?

python performance dictionary nlp


I am building a simple sentiment mining system using a Naive Bayes classifier.

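To train my classifier, I have a text file in which each line contains a list of tokens (generated from a tweet) and the associated sentiment (0 for negative, 4 for positive).

For example: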

0 @ switchfoot http : //twitpic.com/2y1zl - Awww , that 's a bummer . You shoulda got David Carr of Third Day to do it . ; D
0 spring break in plain city ... it 's snowing
0 @ alydesigns i was out most of the day so did n't get much done
0 some1 hacked my account on aim now i have to make a new one
0 really do n't feel like getting up today ... but got to study to for tomorrows practical exam ...

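Now, what I want to do is count, for each token, how many times it appears in positive tweets and how many times it appears in negative tweets. I then plan to use these counts to compute probabilities. I am using the built-in dictionary to store the counts. The keys are tokens and the values are integer arrays of size 2.

The problem is that this code starts out very fast but keeps getting slower. By the time it has processed about 200,000 tweets it is very slow, about 1 tweet per second. Since my training set has 1.6 million tweets, this is too slow. My code is: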

def compute_counts(infile):
    f = open(infile)
    counts = {}
    i = 0
    for line in f:
        i = i + 1
        print(i)
        words = line.split(' ')
        for word in words[1:]:
            word = word.replace('\n', '').replace('\r', '')
            if words[0] == '0':
                if word in counts.keys():
                    counts[word][0] += 1
                else:
                    counts[word] = [1, 0]
            else:
                if word in counts.keys():
                    counts[word][1] += 1
                else:
                    counts[word] = [0, 1]
    return counts

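What can I do to speed this up? Is there a better data structure?

Edit: This is not a duplicate. The question is not about something faster than dict in general, but about this specific use case.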

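Don't use

if word in counts.keys()

If you do that, you end up scanning the keys sequentially, which is exactly what a dict is supposed to avoid.
Just write

if word in counts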
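A minimal sketch of that change applied to the question's loop, assuming the same input format (label first, then space-separated tokens). The name compute_counts_fast, and taking any iterable of lines instead of a filename, are choices made here for illustration and are not part of the original code.

```python
def compute_counts_fast(lines):
    """Count tokens per class; `lines` is any iterable of 'label tok tok ...' strings."""
    counts = {}
    for line in lines:
        words = line.rstrip('\r\n').split(' ')
        # Column 0 holds negative counts, column 1 holds positive counts.
        column = 0 if words[0] == '0' else 1
        for word in words[1:]:
            if word in counts:  # hash lookup, no intermediate list of keys
                counts[word][column] += 1
            else:
                counts[word] = [0, 0]
                counts[word][column] = 1
    return counts
```

Because an open file is itself an iterable of lines, this can be called as compute_counts_fast(f) inside a with open(infile) as f: block.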

Or use a defaultdict.
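The defaultdict suggestion could look roughly like this. It is a sketch under the same input-format assumption as the question; the function name compute_counts_defaultdict is hypothetical.

```python
from collections import defaultdict


def compute_counts_defaultdict(lines):
    """Count tokens per class using a defaultdict of [negative, positive] pairs."""
    # A token seen for the first time automatically starts at [0, 0].
    counts = defaultdict(lambda: [0, 0])
    for line in lines:
        words = line.rstrip('\r\n').split(' ')
        column = 0 if words[0] == '0' else 1
        for word in words[1:]:
            counts[word][column] += 1  # no explicit existence check needed
    return counts
```

One thing to keep in mind: reading a missing key from a defaultdict inserts it, so when querying the result later it is safer to use word in counts or counts.get(word) than counts[word].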
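In Python 2, dict.keys() creates a list, and building it can be as expensive as the search itself.

This isn't because dictionaries are slow.

defaultdict works great. Earlier it took me about 4 hours to process 200,000 lines, but now all 1.6 million lines are done in under a minute. Thanks!

Use counts[key] = counts.get(key, default) instead of checking whether the key exists. If the key does not exist, get returns the default you supply, and the assignment creates the entry with that value. Note that dict.get takes the default as a positional argument, so writing default=None as a keyword raises a TypeError.

You can use two collections.Counter objects instead of a dictionary of lists.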
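The last two suggestions can be sketched as follows, again assuming the question's input format. The function names compute_counts_get and compute_counts_counters are made up for this illustration.

```python
from collections import Counter


def compute_counts_get(lines):
    """Plain dict version that uses dict.get instead of an existence check."""
    counts = {}
    for line in lines:
        words = line.rstrip('\r\n').split(' ')
        column = 0 if words[0] == '0' else 1
        for word in words[1:]:
            # get returns the stored pair, or a fresh [0, 0] for a new token.
            pair = counts.get(word, [0, 0])
            pair[column] += 1
            counts[word] = pair
    return counts


def compute_counts_counters(lines):
    """Two Counters, one per class, instead of a dict of two-element lists."""
    negative = Counter()
    positive = Counter()
    for line in lines:
        words = line.rstrip('\r\n').split(' ')
        target = negative if words[0] == '0' else positive
        target.update(words[1:])  # adds 1 for each token in the tweet
    return negative, positive
```

With the Counter version, looking up a token that never appeared in a class returns 0 without adding an entry, which is convenient when the counts are later turned into probabilities.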