Efficiency of counting word frequency in Python using a dictionary

python python-2.7 dictionary data-structures

My task is to find the frequency of each word in a list. There are two ways to do it.

Method 1

def f(words):
    wdict = {}
    for word in words:
        if word not in wdict:
            wdict[word] = 0
        wdict[word] += 1
    return wdict

Method 2

def g(words):
    wdict = {}
    for word in words:
        try:
            wdict[word] += 1
        except KeyError:
            wdict[word] = 1
    return wdict

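Why is Method 2 said to be more efficient? Isn't the number of hash function calls the same in both cases, which would contradict that claim?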

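Let's simulate a case.

For example: "a bird is flying"

In the first method: for each word, it searches the dictionary roughly 3 times, so it makes a total of 3 * len(words) accesses, or 3 * 4 = 12.

In the second method: it searches 2 times if the word is not found, otherwise 1 time, so at most 2 * 4 = 8.

In theory, both have the same time complexity.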
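The tallies above are rough estimates. One way to check them is to count how often `__hash__` is actually called. The sketch below assumes a hypothetical wrapper class, `CountingKey`, because built-in `str` objects cache their hash and would hide repeated lookups. The names `count_if`, `count_try`, and `hash_calls` are also made up for illustration.

```python
class CountingKey(object):
    """Hypothetical helper: wraps a string and counts calls to __hash__."""
    calls = 0

    def __init__(self, text):
        self.text = text

    def __hash__(self):
        CountingKey.calls += 1
        return hash(self.text)

    def __eq__(self, other):
        return isinstance(other, CountingKey) and self.text == other.text


def count_if(words):
    # Method 1: membership test, then update
    wdict = {}
    for word in words:
        if word not in wdict:
            wdict[word] = 0
        wdict[word] += 1
    return wdict


def count_try(words):
    # Method 2: update first, fall back on KeyError
    wdict = {}
    for word in words:
        try:
            wdict[word] += 1
        except KeyError:
            wdict[word] = 1
    return wdict


def hash_calls(func, texts):
    """Run func on fresh CountingKey objects and return the number of hash calls."""
    keys = [CountingKey(t) for t in texts]
    CountingKey.calls = 0
    func(keys)
    return CountingKey.calls


sentence = "a bird is flying".split()
print(hash_calls(count_if, sentence))
print(hash_calls(count_try, sentence))
```

On CPython, each `in` test, item read, and item write hashes the key once, and `wdict[word] += 1` is a read followed by a write. With four new words this should report 16 hash calls for the `if` version and 8 for the `try` version, so the figures above undercount both methods, although the ratio between them is similar.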

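Update:

Thanks for the pointers. In practice, Method 1 should be more efficient than Method 2. Python dictionaries are hash maps, so accessing a key is O(n) in the worst case but O(1) on average, and CPython's implementation is quite efficient. Raising and catching an exception with try/except, on the other hand, is slow.

You can write Method 1 with cleaner code.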
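The snippet that originally followed this remark is not preserved here. One possibility, offered as an assumption rather than the answerer's actual code, is `dict.get` with a default value. The name `f_get` is made up for illustration.

```python
def f_get(words):
    """Count words with dict.get (a hypothetical cleaner variant of Method 1)."""
    wdict = {}
    for word in words:
        # get() returns 0 for an unseen word, so no separate membership test is needed
        wdict[word] = wdict.get(word, 0) + 1
    return wdict


print(f_get("a bird is flying a kite".split()))
```

This keeps the no-exceptions behaviour of Method 1 while doing one read and one write per word.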

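It depends on the input. If, on average, most words are already in the dict, you won't hit many exceptions. If most words are unique, the exception overhead will make the second method slower.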
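A quick way to see this input dependence is to time both functions on the two extremes. This is a minimal sketch using the standard `timeit` module. The sample data (`repeated`, `unique`) and the repetition count are arbitrary assumptions, and the timings will vary with the interpreter version and machine.

```python
import timeit


def f(words):
    # Method 1: membership test first
    wdict = {}
    for word in words:
        if word not in wdict:
            wdict[word] = 0
        wdict[word] += 1
    return wdict


def g(words):
    # Method 2: try the update, handle KeyError for new words
    wdict = {}
    for word in words:
        try:
            wdict[word] += 1
        except KeyError:
            wdict[word] = 1
    return wdict


# Assumed test data: one word repeated many times vs. all-unique words
repeated = ["spam"] * 10000
unique = [str(i) for i in range(10000)]

for label, data in (("mostly repeated", repeated), ("all unique", unique)):
    t_f = timeit.timeit(lambda: f(data), number=20)
    t_g = timeit.timeit(lambda: g(data), number=20)
    print("%-16s Method 1: %.4fs  Method 2: %.4fs" % (label, t_f, t_g))
```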

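There are several ways to solve this. You could use a loop and still get the expected answer. I'll focus on two approaches:

Approach 1: list comprehension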

wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'
wordlist = wordstring.split()
# Count each word
wordfreq = [wordlist.count(w) for w in wordlist] # a list comprehension
# Convert to set to remove repetitions
frequencies=set(zip(wordlist, wordfreq))
print(frequencies)

Output:

{('of', 4), ('best', 1), ('the', 4), ('worst', 1), ('age', 2), ('wisdom', 1), ('it', 4), ('was', 4), ('times', 2), ('foolishness', 1)}

Approach 2: standard library

import collections
wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'
wordlist = wordstring.split()
# Count frequency
freq=collections.Counter(wordlist)
print(freq)

Output:

Counter({'it': 4, 'was': 4, 'the': 4, 'of': 4, 'times': 2, 'age': 2, 'best': 1, 'worst': 1, 'wisdom': 1, 'foolishness': 1})

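Which one to choose depends on the size of the text being processed. Both work for small texts. Note that the list comprehension calls wordlist.count once per word, and each call scans the whole list, so its cost grows quadratically with the number of words, whereas Counter makes a single pass.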
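As a small check that the two approaches agree, the sketch below builds both results from the same word list and compares them. The shortened sample sentence is an assumption made to keep the example compact.

```python
from collections import Counter

wordlist = "it was the best of times it was the worst of times".split()

# The list comprehension calls wordlist.count(w) for every word in the list
pairs = set(zip(wordlist, [wordlist.count(w) for w in wordlist]))

# Counter walks the list once
freq = Counter(wordlist)

# Both describe the same word -> count mapping
print(dict(pairs) == dict(freq))  # True
```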

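There are two main differences:

Method 1 performs an `in` operation for every word, whereas Method 2 updates directly whenever it can.
Whenever Method 1 inserts a new word, it sets the count to 0 and then updates it afterwards. Method 2 starts the count at 1.

It ultimately depends on the input, but with enough repeated words Method 2 performs fewer operations.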

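Example:
Let's read through the code to get the general idea (these are not the actual low-level operations).

['a', 'a']

Method 1
1 - 'a' not in wdict - True
2 - assign 'a'
3 - update 'a'
4 - 'a' not in wdict - False
5 - update 'a'

Method 2
1 - access 'a'
2 - error
3 - assign 'a' directly to 1
4 - update 'a' (the second 'a')

Although these steps do not correspond exactly to the number of operations performed at run time, they show that Method 2 is leaner and goes through fewer "steps".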
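The walkthrough above can be made concrete by logging the dictionary operations each method performs. The sketch below assumes a hypothetical `dict` subclass, `LoggingDict`, that records every membership test, read, and write. The names `trace_if` and `trace_try` are made up as well.

```python
class LoggingDict(dict):
    """Hypothetical helper: a dict that records each operation performed on it."""

    def __init__(self):
        super(LoggingDict, self).__init__()
        self.log = []

    def __contains__(self, key):
        self.log.append("contains")
        return super(LoggingDict, self).__contains__(key)

    def __getitem__(self, key):
        self.log.append("get")  # logged even when the lookup raises KeyError
        return super(LoggingDict, self).__getitem__(key)

    def __setitem__(self, key, value):
        self.log.append("set")
        super(LoggingDict, self).__setitem__(key, value)


def trace_if(words):
    # Method 1 on a logging dict
    wdict = LoggingDict()
    for word in words:
        if word not in wdict:
            wdict[word] = 0
        wdict[word] += 1
    return wdict.log


def trace_try(words):
    # Method 2 on a logging dict
    wdict = LoggingDict()
    for word in words:
        try:
            wdict[word] += 1
        except KeyError:
            wdict[word] = 1
    return wdict.log


print(trace_if(["a", "a"]))
print(trace_try(["a", "a"]))
```

Because `+=` on a dictionary item is a read followed by a write, the logs are longer than the step lists above: seven operations for Method 1 and four for Method 2 on this input. The log does not capture the cost of raising and handling the `KeyError`.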

The most efficient option is probably Counter from the standard library: from collections import Counter; c = Counter(words).

There are many ways to do this. Some of them are more efficient than Method 2. See collections.defaultdict or, better, collections.Counter.

@Ev.Kounis My question is: in terms of the number of hash function calls, why is Method 2 more efficient than Method 1?

You can simply test it yourself.
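For reference, the two standard-library tools mentioned in the comments can be used roughly as follows. This is a minimal sketch, and the sample sentence is an assumption.

```python
from collections import Counter, defaultdict

words = "a bird is flying a kite".split()  # assumed sample input

# defaultdict(int) supplies 0 for a missing key, so the loop body is a single update
counts = defaultdict(int)
for word in words:
    counts[word] += 1

# Counter performs the whole counting loop in one call
c = Counter(words)

print(dict(counts) == dict(c))  # True
```

Neither version needs an explicit membership test or a try/except block.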