Python 2.7: Create a table showing relative frequencies with Python NLTK, iterating over the 18 texts in the Gutenberg corpus


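I need to create a table showing the relative frequency of the modal verbs (can, could, may, might, will, would and should) in the 18 texts that NLTK provides in its Gutenberg corpus selection.

Here is my code: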

for fileid in gutenberg.fileids():
    fdist = nltk.FreqDist(for w in gutenberg.words(fileid))
modals = ['can', 'could', 'may', 'might', 'must', 'will','would','should']

I need to tabulate fdist with the fileids as rows and the modals as columns.

TL;DR

Most likely this is what you need:
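A minimal sketch of code that yields per-file counts of this kind, assuming NLTK and its Gutenberg data are installed. The names count_modals and gutenberg_modal_counts are illustrative and not part of NLTK, and matching here is exact and case-sensitive, so the numbers it gives may not match the listing below exactly:

```python
from collections import Counter

MODALS = ['can', 'could', 'may', 'might', 'must', 'will', 'would', 'should']


def count_modals(words, modals=MODALS):
    """Count only the tokens that appear in `modals` (exact, case-sensitive match)."""
    wanted = set(modals)
    return Counter(w for w in words if w in wanted)


def gutenberg_modal_counts():
    """Yield (fileid, Counter) for each Gutenberg text.

    Assumes NLTK and its 'gutenberg' corpus data are available; the import is
    done lazily so that count_modals can be used without NLTK.
    """
    from nltk.corpus import gutenberg
    for fileid in gutenberg.fileids():
        yield fileid, count_modals(gutenberg.words(fileid))


# Toy check that does not need the corpus:
print(count_modals(['I', 'can', 'and', 'I', 'will', ',', 'can', 'you', '?']))

# With the corpus installed, one line per text can be printed with:
#     for fileid, fdist in gutenberg_modal_counts():
#         print('%s %s' % (fileid, fdist))
```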

[out]：

austen-emma.txt Counter({u'could': 825, u'would': 815, u'must': 564, u'will': 559, u'should': 366, u'might': 322, u'can': 270, u'may': 213})
austen-persuasion.txt Counter({u'could': 444, u'would': 351, u'must': 228, u'should': 185, u'might': 166, u'will': 162, u'can': 100, u'may': 87})
austen-sense.txt Counter({u'could': 568, u'would': 507, u'will': 354, u'must': 279, u'should': 228, u'might': 215, u'can': 206, u'may': 169})
bible-kjv.txt Counter({u'will': 3807, u'may': 1024, u'should': 768, u'might': 475, u'would': 443, u'can': 213, u'could': 165, u'must': 131})
blake-poems.txt Counter({u'can': 20, u'should': 6, u'may': 5, u'would': 3, u'could': 3, u'will': 3, u'might': 2, u'must': 2})
bryant-stories.txt Counter({u'could': 154, u'will': 144, u'would': 110, u'can': 75, u'must': 39, u'should': 38, u'might': 23, u'may': 18})
burgess-busterbrown.txt Counter({u'could': 56, u'would': 46, u'can': 23, u'will': 19, u'might': 17, u'must': 14, u'should': 13, u'may': 3})
carroll-alice.txt Counter({u'could': 73, u'would': 70, u'can': 57, u'must': 41, u'might': 28, u'should': 27, u'will': 24, u'may': 11})
chesterton-ball.txt Counter({u'will': 198, u'would': 139, u'can': 131, u'could': 117, u'may': 90, u'must': 81, u'should': 75, u'might': 69})
chesterton-brown.txt Counter({u'could': 170, u'would': 132, u'can': 126, u'will': 111, u'might': 71, u'must': 70, u'should': 56, u'may': 47})
chesterton-thursday.txt Counter({u'could': 148, u'can': 117, u'would': 116, u'will': 109, u'might': 71, u'may': 56, u'should': 54, u'must': 48})
edgeworth-parents.txt Counter({u'will': 517, u'would': 503, u'could': 420, u'can': 340, u'should': 271, u'must': 250, u'may': 160, u'might': 127})
melville-moby_dick.txt Counter({u'would': 421, u'will': 379, u'must': 282, u'may': 230, u'can': 220, u'could': 215, u'might': 183, u'should': 181})
milton-paradise.txt Counter({u'will': 161, u'may': 116, u'can': 107, u'might': 98, u'must': 66, u'could': 62, u'should': 55, u'would': 49})
shakespeare-caesar.txt Counter({u'will': 129, u'would': 40, u'should': 38, u'may': 35, u'must': 30, u'could': 18, u'can': 16, u'might': 12})
shakespeare-hamlet.txt Counter({u'will': 131, u'would': 60, u'may': 56, u'must': 53, u'should': 52, u'can': 33, u'might': 28, u'could': 26})
shakespeare-macbeth.txt Counter({u'will': 62, u'would': 42, u'should': 41, u'must': 33, u'may': 30, u'can': 21, u'could': 15, u'might': 5})
whitman-leaves.txt Counter({u'will': 261, u'can': 88, u'would': 85, u'may': 85, u'must': 63, u'could': 49, u'should': 42, u'might': 26})

Putting them in a table:

fileids                  would  may   could  should  will  can  might  must
austen-emma.txt          815    213   825    366     559   270  322    564
austen-persuasion.txt    351    87    444    185     162   100  166    228
austen-sense.txt         507    169   568    228     354   206  215    279
bible-kjv.txt            443    1024  165    768     3807  213  475    131
blake-poems.txt          3      5     3      6       3     20   2      2
bryant-stories.txt       110    18    154    38      144   75   23     39
burgess-busterbrown.txt  46     3     56     13      19    23   17     14
carroll-alice.txt        70     11    73     27      24    57   28     41
chesterton-ball.txt      139    90    117    75      198   131  69     81
chesterton-brown.txt     132    47    170    56      111   126  71     70
chesterton-thursday.txt  116    56    148    54      109   117  71     48
edgeworth-parents.txt    503    160   420    271     517   340  127    250
melville-moby_dick.txt   421    230   215    181     379   220  183    282
milton-paradise.txt      49     116   62     55      161   107  98     66
shakespeare-caesar.txt   40     35    18     38      129   16   12     30
shakespeare-hamlet.txt   60     56    26     52      131   33   28     53
shakespeare-macbeth.txt  42     30    15     41      62    21   5      33
whitman-leaves.txt       85     85    49     42      261   88   26     63
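The table holds raw counts. If relative frequencies are wanted, as the question asks, one option is to divide each count by the total number of tokens in the text. A FreqDist built from all the words of a file can also do this division itself through fd.freq(word), which is the count over fd.N(). The sketch below uses a plain Counter and a hypothetical helper named relative_frequencies, with a made-up token list:

```python
from collections import Counter


def relative_frequencies(words, targets):
    """Return {target: count / total tokens} for each word in `targets`."""
    words = list(words)
    total = len(words)
    counts = Counter(words)
    # float() keeps the division exact under Python 2 as well as Python 3
    return {t: (counts[t] / float(total) if total else 0.0) for t in targets}


# Made-up tokens, just to show the arithmetic
tokens = ['we', 'can', 'go', 'or', 'we', 'may', 'stay', 'if', 'we', 'can']
rel = relative_frequencies(tokens, ['can', 'may', 'must'])
print(rel)
```

Note that the denominator has to be the token count of the whole text, so the counting must start from all words rather than from an already filtered list.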

The long version:

First, let's look at how FreqDist works. FreqDist is essentially a collections.Counter object, so we can feed it a list and it will count the items in the list:

>>> from collections import Counter
>>> from nltk import FreqDist

>>> alist = [1,2,1,2,3,4,5,6,7,2,4,5,6,9]

>>> Counter(alist)
Counter({2: 3, 1: 2, 4: 2, 5: 2, 6: 2, 3: 1, 7: 1, 9: 1})

>>> FreqDist(alist)
FreqDist({2: 3, 1: 2, 4: 2, 5: 2, 6: 2, 3: 1, 7: 1, 9: 1})

Now on to the Gutenberg corpus in nltk. The .words() function returns the list of words found in the corpus for the given file name, e.g.:

>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     print fileid
...     print gutenberg.words(fileid)
...     break
... 
austen-emma.txt
[u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', ...]

So we can initialize a FreqDist with that word list to count the words in austen-emma.txt.

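Now, to filter the words in the FreqDist, there are two strategies:

Count all the words in the file, then pull out the counts for the modals you are interested in

Count only the modals when initializing the Counter object, ignoring every other word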

For example, suppose our words are numbers and we are only interested in 1, 2, 8:

>>> words = [1,1,2,3,2,3,4,5,6,7,8,2,5]
>>> Counter(words)
Counter({2: 3, 1: 2, 3: 2, 5: 2, 4: 1, 6: 1, 7: 1, 8: 1})
>>> interested_words = [1,2,8]
>>> counted = Counter(words)
>>> counted[1]
2
>>> counted[2]
3
>>> counted[8]
1

The other way is to count only those words; we can filter them with a list comprehension, e.g.:

>>> filtered_words = [word for word in words if word in interested_words]
>>> Counter(filtered_words)
Counter({2: 3, 1: 2, 8: 1})
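The same two strategies, sketched on word tokens instead of numbers. The sentence is made up for illustration; with NLTK the tokens would come from gutenberg.words(fileid):

```python
from collections import Counter

modals = ['can', 'could', 'may', 'might', 'must', 'will', 'would', 'should']
# Made-up tokens standing in for one corpus file
tokens = 'She could go , but she would not ; she must stay and could rest'.split()

# Strategy 1: count every token, then look up only the modals.
# A Counter returns 0 for words it never saw, so no KeyError here.
all_counts = Counter(tokens)
by_lookup = {m: all_counts[m] for m in modals}

# Strategy 2: filter while counting, so only modals end up in the Counter.
modal_set = set(modals)
by_filter = Counter(t for t in tokens if t in modal_set)

# Both strategies agree on every modal
print(all(by_lookup[m] == by_filter[m] for m in modals))
```

Strategy 1 keeps the full counts around, which is useful if the total number of tokens is needed later; strategy 2 keeps only the small Counter.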


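As for the tabulation part of the question, now we will see why FreqDist is a fancy but useful object. The .tabulate() function puts the keys of the FreqDist in the first row and the values (i.e. the counts) in the second row, e.g.: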

>>> FreqDist(filtered_words)
FreqDist({2: 3, 1: 2, 8: 1})
>>> FreqDist(filtered_words).tabulate()
2 1 8 
3 2 1

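Unfortunately, .tabulate() gives you little control over how the table is printed. So if you need the fileid as the first column, for instance, you have to write that yourself.

Say you have one row from a FreqDist object and you want to print it out; you can convert it into a tab-separated string like this: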

>>> fd = FreqDist(filtered_words)
>>> print '\t'.join(map(str, [fd[word] for word in interested_words]))
2   3   1

Suppose you need to add the rowid as the first column; you can do this:

>>> row_name = 'blahblah'
>>> print '\t'.join(map(str, [row_name] + [fd[word] for word in interested_words]))
blahblah    2   3   1

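So if you have multiple rows:

>>> def print_row(rowid, fd):
...     # helper used by the loop below: the rowid followed by the counts, tab-separated
...     return '\t'.join(map(str, [rowid] + [fd[word] for word in interested_words]))
... 
>>> rowid_values = [('row1', FreqDist({2: 3, 1: 2, 8: 1})) , ('row2', FreqDist({2: 10, 1: 20, 8: 10})) ]
>>> for rowid, _fd in rowid_values:
...     print print_row(rowid, _fd)
... 
row1    2   3   1
row2    20  10  10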

If you need a header row, you can print that too:

>>> map(str, interested_words)
['1', '2', '8']
>>> ['rowids'] + map(str, interested_words)
['rowids', '1', '2', '8']
>>> '\t'.join(['rowids'] + map(str, interested_words))
'rowids\t1\t2\t8'
>>> print '\t'.join(['rowids'] + map(str, interested_words))
rowids  1   2   8

To put them together:

>>> print '\t'.join(['rowids'] + map(str, interested_words)); print '\n'.join([print_row(rowid, _fd) for rowid, _fd in rowid_values])
rowids  1   2   8
row1    2   3   1
row2    20  10  10
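The header and row steps can be folded into one function. This is only a sketch: format_table is a hypothetical name, and it assumes each row's counts object returns 0 for missing keys, as Counter and FreqDist do. It is written so that it runs under both Python 2.7 and Python 3:

```python
from collections import Counter


def format_table(rows, columns, corner='fileids'):
    """Build a tab-separated table as one string.

    rows    -- list of (row_id, counts) pairs, where counts is Counter-like
    columns -- keys to show, in column order
    corner  -- label for the top-left cell
    """
    lines = ['\t'.join([corner] + [str(c) for c in columns])]
    for row_id, counts in rows:
        lines.append('\t'.join([str(row_id)] + [str(counts[c]) for c in columns]))
    return '\n'.join(lines)


rows = [('row1', Counter({2: 3, 1: 2, 8: 1})),
        ('row2', Counter({2: 10, 1: 20, 8: 10}))]
table = format_table(rows, [1, 2, 8], corner='rowids')
print(table)
```

For the question itself, rows would be the (fileid, counts) pairs for the Gutenberg fileids and columns would be the modals list.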

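From the information you have given in the question, it is clear that your code has a bug. You should fix it. Seriously though, welcome to stackoverflow. Please see the Help section for guidance on how to write a good question. (In short, you must clearly explain your goal and show the (relevant!) code you have managed so far.) As it stands, your question does not give anyone enough information to help you.

I agree with Alexis; your code is not even valid Python (even after I fixed the indentation). If this is really the best you can do, you should start by reading a few chapters of the NLTK book (and/or your textbook, if that is a different one).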