Python 2.7: Create a table showing relative frequencies with Python NLTK, iterating over the 18 texts in the Gutenberg corpus
I need to create a table showing the relative frequency of the modal verbs (can, could, may, might, will, would and should) in the 18 texts that NLTK provides in its Gutenberg corpus selection. This is my code:
for fileid in gutenberg.fileids():
    fdist = nltk.FreqDist(for w in gutenberg.words(fileid))
modals = ['can', 'could', 'may', 'might', 'must', 'will','would','should']
I need to tabulate fdist, with the fileids as rows and the modals as columns.
TL;DR
This is most likely what you need:
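A minimal sketch of the counting step, assuming the texts are read through nltk.corpus.gutenberg and that only exact lowercase matches should be counted. The helper names count_modals and count_gutenberg_modals are illustrative and not part of NLTK, and collections.Counter is used in place of FreqDist. It is written to run on both Python 2.7 and Python 3.

```python
from collections import Counter

MODALS = ['can', 'could', 'may', 'might', 'must', 'will', 'would', 'should']


def count_modals(words, modals=MODALS):
    """Count only the tokens that appear in `modals` (case-sensitive)."""
    wanted = set(modals)
    return Counter(w for w in words if w in wanted)


def count_gutenberg_modals():
    """Yield (fileid, Counter) pairs for every Gutenberg text.

    Needs NLTK and its gutenberg corpus, so the import is done lazily.
    """
    from nltk.corpus import gutenberg
    for fileid in gutenberg.fileids():
        yield fileid, count_modals(gutenberg.words(fileid))


# Toy usage on a hand-written token list (not real corpus data):
sample = ['She', 'could', 'not', 'say', 'she', 'would', 'go', ',',
          'but', 'she', 'could', '.', 'Could', 'she', '?']
print(count_modals(sample))

# With NLTK and the corpus installed:
# for fileid, fdist in count_gutenberg_modals():
#     print("%s %s" % (fileid, fdist))
```

Because the match is case-sensitive, a sentence-initial 'Could' is not counted; lowercase the tokens first if that matters for your table.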
[out]:
austen-emma.txt Counter({u'could': 825, u'would': 815, u'must': 564, u'will': 559, u'should': 366, u'might': 322, u'can': 270, u'may': 213})
austen-persuasion.txt Counter({u'could': 444, u'would': 351, u'must': 228, u'should': 185, u'might': 166, u'will': 162, u'can': 100, u'may': 87})
austen-sense.txt Counter({u'could': 568, u'would': 507, u'will': 354, u'must': 279, u'should': 228, u'might': 215, u'can': 206, u'may': 169})
bible-kjv.txt Counter({u'will': 3807, u'may': 1024, u'should': 768, u'might': 475, u'would': 443, u'can': 213, u'could': 165, u'must': 131})
blake-poems.txt Counter({u'can': 20, u'should': 6, u'may': 5, u'would': 3, u'could': 3, u'will': 3, u'might': 2, u'must': 2})
bryant-stories.txt Counter({u'could': 154, u'will': 144, u'would': 110, u'can': 75, u'must': 39, u'should': 38, u'might': 23, u'may': 18})
burgess-busterbrown.txt Counter({u'could': 56, u'would': 46, u'can': 23, u'will': 19, u'might': 17, u'must': 14, u'should': 13, u'may': 3})
carroll-alice.txt Counter({u'could': 73, u'would': 70, u'can': 57, u'must': 41, u'might': 28, u'should': 27, u'will': 24, u'may': 11})
chesterton-ball.txt Counter({u'will': 198, u'would': 139, u'can': 131, u'could': 117, u'may': 90, u'must': 81, u'should': 75, u'might': 69})
chesterton-brown.txt Counter({u'could': 170, u'would': 132, u'can': 126, u'will': 111, u'might': 71, u'must': 70, u'should': 56, u'may': 47})
chesterton-thursday.txt Counter({u'could': 148, u'can': 117, u'would': 116, u'will': 109, u'might': 71, u'may': 56, u'should': 54, u'must': 48})
edgeworth-parents.txt Counter({u'will': 517, u'would': 503, u'could': 420, u'can': 340, u'should': 271, u'must': 250, u'may': 160, u'might': 127})
melville-moby_dick.txt Counter({u'would': 421, u'will': 379, u'must': 282, u'may': 230, u'can': 220, u'could': 215, u'might': 183, u'should': 181})
milton-paradise.txt Counter({u'will': 161, u'may': 116, u'can': 107, u'might': 98, u'must': 66, u'could': 62, u'should': 55, u'would': 49})
shakespeare-caesar.txt Counter({u'will': 129, u'would': 40, u'should': 38, u'may': 35, u'must': 30, u'could': 18, u'can': 16, u'might': 12})
shakespeare-hamlet.txt Counter({u'will': 131, u'would': 60, u'may': 56, u'must': 53, u'should': 52, u'can': 33, u'might': 28, u'could': 26})
shakespeare-macbeth.txt Counter({u'will': 62, u'would': 42, u'should': 41, u'must': 33, u'may': 30, u'can': 21, u'could': 15, u'might': 5})
whitman-leaves.txt Counter({u'will': 261, u'can': 88, u'would': 85, u'may': 85, u'must': 63, u'could': 49, u'should': 42, u'might': 26})
Putting them into a table:
fileids would may could should will can might must
austen-emma.txt 815 213 825 366 559 270 322 564
austen-persuasion.txt 351 87 444 185 162 100 166 228
austen-sense.txt 507 169 568 228 354 206 215 279
bible-kjv.txt 443 1024 165 768 3807 213 475 131
blake-poems.txt 3 5 3 6 3 20 2 2
bryant-stories.txt 110 18 154 38 144 75 23 39
burgess-busterbrown.txt 46 3 56 13 19 23 17 14
carroll-alice.txt 70 11 73 27 24 57 28 41
chesterton-ball.txt 139 90 117 75 198 131 69 81
chesterton-brown.txt 132 47 170 56 111 126 71 70
chesterton-thursday.txt 116 56 148 54 109 117 71 48
edgeworth-parents.txt 503 160 420 271 517 340 127 250
melville-moby_dick.txt 421 230 215 181 379 220 183 282
milton-paradise.txt 49 116 62 55 161 107 98 66
shakespeare-caesar.txt 40 35 18 38 129 16 12 30
shakespeare-hamlet.txt 60 56 26 52 131 33 28 53
shakespeare-macbeth.txt 42 30 15 41 62 21 5 33
whitman-leaves.txt 85 85 49 42 261 88 26 63
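A table of this shape can be sketched as follows, assuming each row is a (fileid, Counter) pair and the columns are given in the order they should be printed. The function name format_table is hypothetical, and the two sample rows reuse counts from the output listed earlier.

```python
from collections import Counter


def format_table(rows, columns, first_header='fileids'):
    """Return a tab-separated table: a header line, then one line per row.

    rows    -- list of (row_id, Counter) pairs
    columns -- keys to show, in print order
    """
    lines = ['\t'.join([first_header] + list(columns))]
    for row_id, counts in rows:
        # A Counter returns 0 for missing keys, so absent words print as 0.
        lines.append('\t'.join([row_id] + [str(counts[c]) for c in columns]))
    return '\n'.join(lines)


columns = ['would', 'may', 'could', 'should', 'will', 'can', 'might', 'must']
rows = [
    ('austen-emma.txt', Counter({'could': 825, 'would': 815, 'must': 564,
                                 'will': 559, 'should': 366, 'might': 322,
                                 'can': 270, 'may': 213})),
    ('blake-poems.txt', Counter({'can': 20, 'should': 6, 'may': 5, 'would': 3,
                                 'could': 3, 'will': 3, 'might': 2, 'must': 2})),
]
print(format_table(rows, columns))
```

Returning a string instead of printing inside the function keeps the sketch usable under both Python 2.7 and Python 3 and makes it easy to write the table to a file.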
In long:
First, let's look at how FreqDist works. A FreqDist is essentially a collections.Counter object, so we can feed it a list and it will count the occurrences of each item in the list:
>>> from collections import Counter
>>> from nltk import FreqDist
>>> alist = [1,2,1,2,3,4,5,6,7,2,4,5,6,9]
>>> Counter(alist)
Counter({2: 3, 1: 2, 4: 2, 5: 2, 6: 2, 3: 1, 7: 1, 9: 1})
>>> FreqDist(alist)
FreqDist({2: 3, 1: 2, 4: 2, 5: 2, 6: 2, 3: 1, 7: 1, 9: 1})
Now let's move on to the Gutenberg corpus in nltk. The .words() function returns the list of words found in the corpus for the given file name, e.g.:
>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     print fileid
...     print gutenberg.words(fileid)
...     break
...
austen-emma.txt
[u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', ...]
So we can use the FreqDist initializer to count the words in austen-emma.txt.
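As a small sketch of that counting step, here collections.Counter stands in for FreqDist and a short hand-written token list stands in for gutenberg.words('austen-emma.txt'); both substitutions are assumptions made so that the example runs without NLTK or the corpus.

```python
from collections import Counter

# Hand-written stand-in for the token list of one text; not real corpus data.
tokens = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']',
          'Emma', 'could', 'not', 'but', 'Emma', 'would']

counts = Counter(tokens)

print(counts['Emma'])   # number of times 'Emma' occurs in the list
print(counts['could'])
print(counts['might'])  # a token that never occurs is reported as 0
```

Lookups on a missing key return 0 instead of raising KeyError, which is convenient when a text never uses one of the modals.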
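Now, to filter the words in the FreqDist, there are two strategies: count every word and then look up only the keys you are interested in, or count only the modals when creating the Counter object and ignore all other words.
For the first strategy, say we are interested in 1, 2 and 8: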
>>> words = [1,1,2,3,2,3,4,5,6,7,8,2,5]
>>> Counter(words)
Counter({2: 3, 1: 2, 3: 2, 5: 2, 4: 1, 6: 1, 7: 1, 8: 1})
>>> interested_words = [1,2,8]
>>> counted = Counter(words)
>>> counted[1]
2
>>> counted[2]
3
>>> counted[8]
1
The other way is to count only those words in the first place. We can use a list comprehension to filter them, e.g.:
>>> filtered_words = [word for word in words if word in interested_words]
>>> Counter(filtered_words)
Counter({2: 3, 1: 2, 8: 1})
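The question asks for relative rather than raw frequencies. The sketch below assumes that "relative frequency" means a word's count divided by the total number of tokens in the text, which is one reason to prefer the first strategy: a counter built from already filtered words no longer knows the total. The function name relative_frequencies is hypothetical.

```python
from collections import Counter


def relative_frequencies(words, targets):
    """Return {target: count / total_tokens} for each target word."""
    counts = Counter(words)        # strategy 1: count everything first
    total = sum(counts.values())   # total number of tokens in the text
    if total == 0:
        return dict((t, 0.0) for t in targets)
    # float() keeps the division exact under Python 2.7 as well.
    return dict((t, counts[t] / float(total)) for t in targets)


words = [1, 1, 2, 3, 2, 3, 4, 5, 6, 7, 8, 2, 5]
interested_words = [1, 2, 8]
print(relative_frequencies(words, interested_words))
```

With an NLTK FreqDist the same quantities are available through fd.N() (total count) and fd.freq(word) (count divided by total).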
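For the tabulation part of the question, we can now see why FreqDist is a peculiar but useful object. Its .tabulate() function puts the keys of the FreqDist in the first row and the values (i.e. the counts) in the second row, e.g.: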
>>> FreqDist(filtered_words)
FreqDist({2: 3, 1: 2, 8: 1})
>>> FreqDist(filtered_words).tabulate()
2 1 8
3 2 1
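Unfortunately, .tabulate() does not let you customize the layout of the printed table. So if you need the fileid in the first column and so on, you have to write that yourself.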
For example, if you have one row from a FreqDist object and want to print it, you can convert the counts into a tab-separated string:
>>> fd = FreqDist(filtered_words)
>>> print '\t'.join(map(str, [fd[word] for word in interested_words]))
2 3 1
If you also need a row id in the first column, you can do this:
>>> row_name = 'blahblah'
>>> print '\t'.join(map(str, [row_name] + [fd[word] for word in interested_words]))
blahblah 2 3 1
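So if you have multiple rows:
>>> # Assumed helper: joins a row id and the counts of interested_words with tabs
>>> def print_row(rowid, fd):
...     return '\t'.join(map(str, [rowid] + [fd[word] for word in interested_words]))
...
>>> rowid_values = [('row1', FreqDist({2: 3, 1: 2, 8: 1})) , ('row2', FreqDist({2: 10, 1: 20, 8: 10})) ]
>>> for rowid, _fd in rowid_values:
...     print print_row(rowid, _fd)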
...
row1 2 3 1
row2 20 10 10
If you need a header row, you can print that as well:
>>> map(str, interested_words)
['1', '2', '8']
>>> ['rowids'] + map(str, interested_words)
['rowids', '1', '2', '8']
>>> '\t'.join(['rowids'] + map(str, interested_words))
'rowids\t1\t2\t8'
>>> print '\t'.join(['rowids'] + map(str, interested_words))
rowids 1 2 8
To put them together:
>>> print '\t'.join(['rowids'] + map(str, interested_words)); print '\n'.join([print_row(rowid, _fd) for rowid, _fd in rowid_values])
rowids 1 2 8
row1 2 3 1
row2 20 10 10
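From the information you have provided in the question, it is pretty clear that your code has a bug. You should fix it. Seriously though, welcome to stackoverflow. Please see the help section for guidance on how to write a good question. (In short, you must clearly explain your goal and show the (relevant!) code you have managed so far.) As it stands, your question does not give anyone enough information to help you.
Apologies to Alexis; your code is not even valid Python (even after I fixed the indentation). If this is really the best you can do, you should start by reading a few chapters of the nltk book (and/or your textbook, if it is a different one).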