Python 2.7: Create a table showing relative frequencies with Python NLTK, iterating over the 18 texts in the Gutenberg corpus

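I need to create a table showing the relative frequency of the modal verbs (can, could, may, might, will, would and should) in the 18 texts that NLTK provides in its excerpt of the Gutenberg corpus.

This is my code: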

for fileid in gutenberg.fileids():
    fdist = nltk.FreqDist(for w in gutenberg.words(fileid))
modals = ['can', 'could', 'may', 'might', 'must', 'will','would','should']
I need to tabulate fdist with the fileids as rows and the modals as columns.

TL;DR

Most likely this is what you need:

[out]:

austen-emma.txt Counter({u'could': 825, u'would': 815, u'must': 564, u'will': 559, u'should': 366, u'might': 322, u'can': 270, u'may': 213})
austen-persuasion.txt Counter({u'could': 444, u'would': 351, u'must': 228, u'should': 185, u'might': 166, u'will': 162, u'can': 100, u'may': 87})
austen-sense.txt Counter({u'could': 568, u'would': 507, u'will': 354, u'must': 279, u'should': 228, u'might': 215, u'can': 206, u'may': 169})
bible-kjv.txt Counter({u'will': 3807, u'may': 1024, u'should': 768, u'might': 475, u'would': 443, u'can': 213, u'could': 165, u'must': 131})
blake-poems.txt Counter({u'can': 20, u'should': 6, u'may': 5, u'would': 3, u'could': 3, u'will': 3, u'might': 2, u'must': 2})
bryant-stories.txt Counter({u'could': 154, u'will': 144, u'would': 110, u'can': 75, u'must': 39, u'should': 38, u'might': 23, u'may': 18})
burgess-busterbrown.txt Counter({u'could': 56, u'would': 46, u'can': 23, u'will': 19, u'might': 17, u'must': 14, u'should': 13, u'may': 3})
carroll-alice.txt Counter({u'could': 73, u'would': 70, u'can': 57, u'must': 41, u'might': 28, u'should': 27, u'will': 24, u'may': 11})
chesterton-ball.txt Counter({u'will': 198, u'would': 139, u'can': 131, u'could': 117, u'may': 90, u'must': 81, u'should': 75, u'might': 69})
chesterton-brown.txt Counter({u'could': 170, u'would': 132, u'can': 126, u'will': 111, u'might': 71, u'must': 70, u'should': 56, u'may': 47})
chesterton-thursday.txt Counter({u'could': 148, u'can': 117, u'would': 116, u'will': 109, u'might': 71, u'may': 56, u'should': 54, u'must': 48})
edgeworth-parents.txt Counter({u'will': 517, u'would': 503, u'could': 420, u'can': 340, u'should': 271, u'must': 250, u'may': 160, u'might': 127})
melville-moby_dick.txt Counter({u'would': 421, u'will': 379, u'must': 282, u'may': 230, u'can': 220, u'could': 215, u'might': 183, u'should': 181})
milton-paradise.txt Counter({u'will': 161, u'may': 116, u'can': 107, u'might': 98, u'must': 66, u'could': 62, u'should': 55, u'would': 49})
shakespeare-caesar.txt Counter({u'will': 129, u'would': 40, u'should': 38, u'may': 35, u'must': 30, u'could': 18, u'can': 16, u'might': 12})
shakespeare-hamlet.txt Counter({u'will': 131, u'would': 60, u'may': 56, u'must': 53, u'should': 52, u'can': 33, u'might': 28, u'could': 26})
shakespeare-macbeth.txt Counter({u'will': 62, u'would': 42, u'should': 41, u'must': 33, u'may': 30, u'can': 21, u'could': 15, u'might': 5})
whitman-leaves.txt Counter({u'will': 261, u'can': 88, u'would': 85, u'may': 85, u'must': 63, u'could': 49, u'should': 42, u'might': 26})
Putting them into a table:

fileids                   would   may     could   should  will    can     might   must
austen-emma.txt           815     213     825     366     559     270     322     564
austen-persuasion.txt     351     87      444     185     162     100     166     228
austen-sense.txt          507     169     568     228     354     206     215     279
bible-kjv.txt             443     1024    165     768     3807    213     475     131
blake-poems.txt           3       5       3       6       3       20      2       2
bryant-stories.txt        110     18      154     38      144     75      23      39
burgess-busterbrown.txt   46      3       56      13      19      23      17      14
carroll-alice.txt         70      11      73      27      24      57      28      41
chesterton-ball.txt       139     90      117     75      198     131     69      81
chesterton-brown.txt      132     47      170     56      111     126     71      70
chesterton-thursday.txt   116     56      148     54      109     117     71      48
edgeworth-parents.txt     503     160     420     271     517     340     127     250
melville-moby_dick.txt    421     230     215     181     379     220     183     282
milton-paradise.txt       49      116     62      55      161     107     98      66
shakespeare-caesar.txt    40      35      18      38      129     16      12      30
shakespeare-hamlet.txt    60      56      26      52      131     33      28      53
shakespeare-macbeth.txt   42      30      15      41      62      21      5       33
whitman-leaves.txt        85      85      49      42      261     88      26      63
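
The table above holds raw counts, while the question asks for relative frequencies. One possible reading is "count of each modal divided by the total number of tokens in that text"; the sketch below follows that reading. The helper name relative_modal_freqs, the lowercasing step and the toy token list are assumptions made for illustration. With NLTK, the tokens would come from gutenberg.words(fileid) instead.

```python
from __future__ import division, print_function  # so the same code runs on Python 2.7 and 3
from collections import Counter

MODALS = ['can', 'could', 'may', 'might', 'must', 'will', 'would', 'should']

def relative_modal_freqs(tokens, modals=MODALS):
    """Return {modal: count / total number of tokens} for one text.

    Assumption: tokens are lowercased first, so 'Can' and 'can' are merged.
    """
    tokens = [t.lower() for t in tokens]
    total = len(tokens)
    counts = Counter(t for t in tokens if t in modals)
    # Counter returns 0 for modals that never occur in the text.
    return {m: (counts[m] / total if total else 0.0) for m in modals}

# Toy stand-in for gutenberg.words(fileid); this is not real corpus data.
toy_tokens = ['I', 'can', 'go', 'but', 'you', 'must', 'stay', 'and', 'Can', 'wait']
freqs = relative_modal_freqs(toy_tokens)
print(freqs['can'], freqs['must'], freqs['will'])
```

Whether punctuation tokens should count towards the total is a design choice; the sketch counts every token it is given.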

In long:

First, let's look at how FreqDist works.

FreqDist is essentially a collections.Counter object, so we can feed it a list and it will count the instances in that list:

>>> from collections import Counter
>>> from nltk import FreqDist

>>> alist = [1,2,1,2,3,4,5,6,7,2,4,5,6,9]

>>> Counter(alist)
Counter({2: 3, 1: 2, 4: 2, 5: 2, 6: 2, 3: 1, 7: 1, 9: 1})

>>> FreqDist(alist)
FreqDist({2: 3, 1: 2, 4: 2, 5: 2, 6: 2, 3: 1, 7: 1, 9: 1})
Now on to the Gutenberg corpus in nltk. Given a filename, the .words() function returns the list of words found in that file of the corpus, e.g.:

>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     print fileid
...     print gutenberg.words(fileid)
...     break
... 
austen-emma.txt
[u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', ...]
So we can count the words in austen-emma.txt by initializing a FreqDist with them, i.e. FreqDist(gutenberg.words('austen-emma.txt')).

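Now, to filter the words in the FreqDist, there are two strategies:

  • Count all the words in the file, then pull out the counts of the modals you are interested in

  • Count only the modals when initializing the Counter object, ignoring all other words

    For example, suppose our words are numbers and we are only interested in 1, 2 and 8. With the first strategy: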

    >>> words = [1,1,2,3,2,3,4,5,6,7,8,2,5]
    >>> Counter(words)
    Counter({2: 3, 1: 2, 3: 2, 5: 2, 4: 1, 6: 1, 7: 1, 8: 1})
    >>> interested_words = [1,2,8]
    >>> counted = Counter(words)
    >>> counted[1]
    2
    >>> counted[2]
    3
    >>> counted[8]
    1
    
    The other way is to count only those words. We can use a list comprehension to filter them, e.g.:

    >>> filtered_words = [word for word in words if word in interested_words]
    >>> Counter(filtered_words)
    Counter({2: 3, 1: 2, 8: 1})
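
The two strategies can also be sketched with word tokens instead of numbers. The token list below is made up for illustration and is not taken from the corpus; the point is only that both strategies give the same counts for the modals.

```python
from __future__ import print_function
from collections import Counter

modals = ['can', 'could', 'may', 'might', 'must', 'will', 'would', 'should']
# Hypothetical tokens standing in for gutenberg.words(fileid).
tokens = ['she', 'could', 'not', 'say', 'what', 'she', 'would', 'do', 'if', 'she', 'could']

# Strategy 1: count everything, then look up only the modals.
all_counts = Counter(tokens)
by_lookup = {m: all_counts[m] for m in modals}

# Strategy 2: filter first, then count only the modals.
modal_set = set(modals)  # set membership tests are cheaper than list scans
filtered_counts = Counter(t for t in tokens if t in modal_set)
by_filter = {m: filtered_counts[m] for m in modals}

print(by_lookup == by_filter)
print(by_lookup['could'], by_lookup['would'], by_lookup['can'])
```

The second strategy keeps the counter small, which matters little here but may help on long texts.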
    


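    For the tabulation part of the question, we now get to see why FreqDist is a fancy but useful object.

    The .tabulate() function puts the keys of the FreqDist in the first row and the values (i.e. the counts) in the second row, e.g.: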

    >>> FreqDist(filtered_words)
    FreqDist({2: 3, 1: 2, 8: 1})
    >>> FreqDist(filtered_words).tabulate()
    2 1 8 
    3 2 1 
    
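    Unfortunately, .tabulate() gives little control over how the table is printed. So if you need the fileid as the first column, or anything like that, you have to write it yourself.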

    Say you have one row, i.e. one FreqDist object, and you want to print it out. You can convert it into a tab-separated string like this:

    >>> fd = FreqDist(filtered_words)
    >>> print '\t'.join(map(str, [fd[word] for word in interested_words]))
    2   3   1
    
    And if you need to add a row id as the first column, you can do this:

    >>> row_name = 'blahblah'
    >>> print '\t'.join(map(str, [row_name] + [fd[word] for word in interested_words]))
    blahblah    2   3   1
    
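    So if you have multiple rows, assuming a small helper print_row that wraps the expression above:

    >>> def print_row(rowid, fd):
    ...     # join the row id and the counts of the interested words with tabs
    ...     return '\t'.join(map(str, [rowid] + [fd[word] for word in interested_words]))
    ... 
    >>> rowid_values = [('row1', FreqDist({2: 3, 1: 2, 8: 1})) , ('row2', FreqDist({2: 10, 1: 20, 8: 10})) ]
    >>> for rowid, _fd in rowid_values:
    ...     print print_row(rowid, _fd)
    ... 
    row1    2   3   1
    row2    20  10  10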
    
    If you need a header row, you can print that too:

    >>> map(str, interested_words)
    ['1', '2', '8']
    >>> ['rowids'] + map(str, interested_words)
    ['rowids', '1', '2', '8']
    >>> '\t'.join(['rowids'] + map(str, interested_words))
    'rowids\t1\t2\t8'
    >>> print '\t'.join(['rowids'] + map(str, interested_words))
    rowids  1   2   8
    
    To put them together:

    >>> print '\t'.join(['rowids'] + map(str, interested_words)); print '\n'.join([print_row(rowid, _fd) for rowid, _fd in rowid_values])
    rowids  1   2   8
    row1    2   3   1
    row2    20  10  10
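
The header and row steps above can be folded into one function. The sketch below is one way to do that with only the standard library; format_table and its corner parameter are names invented here, and a plain Counter stands in for FreqDist. It uses list comprehensions rather than map so that it also runs on Python 3, where map no longer returns a list.

```python
from __future__ import print_function
from collections import Counter

def format_table(rows, columns, corner='fileids'):
    """Build a tab-separated table as a single string.

    rows:    list of (row_id, Counter) pairs, one per text
    columns: keys to show, in the order they should appear
    corner:  label for the top-left cell
    """
    header = '\t'.join([corner] + [str(c) for c in columns])
    body = ['\t'.join([str(row_id)] + [str(counts[c]) for c in columns])
            for row_id, counts in rows]
    return '\n'.join([header] + body)

# Same toy rows as in the interactive session above.
rows = [('row1', Counter({2: 3, 1: 2, 8: 1})),
        ('row2', Counter({2: 10, 1: 20, 8: 10}))]
table = format_table(rows, [1, 2, 8], corner='rowids')
print(table)
```

For the actual question, rows would be built from one counter per fileid and columns would be the list of modals.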
    

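    From the information you provide in your question, it's obvious that your code has a bug. You should fix it. Seriously though, welcome to stackoverflow. Please see the help section for guidance on how to write a good question. (In short, you must clearly explain your goal and show the (relevant!) code you have managed so far.) As it stands, your question does not give anyone enough information to help you.

    With apologies to Alexis: your code is not even valid Python (even after I fixed the indentation). If this is really the best you can do, you should start by reading a few chapters of the NLTK book (and/or your textbook, if that is a different one).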