Python: can't import bigrams from the NLTK library

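A quick question that is puzzling me. I have NLTK installed and it has been working fine. However, I am trying to get the bigrams of a corpus and basically want to call bigrams(corpus). But when I do "from nltk import bigrams", it says bigrams is not defined.

The same goes for trigrams. Am I missing something? Also, how could I get the bigrams from a corpus manually?

I am also looking to compute the frequencies of bigrams, trigrams and quadgrams, but I am not sure exactly how to go about it.

I have marked each sentence of the corpus with "<s>" at the beginning and "</s>" at the end, as appropriate. The program so far is below: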

#!/usr/bin/env python
from nltk.corpus import brown


def alter_list(row):
    """Return a copy of the sentence wrapped in <s> ... </s> markers."""
    row = list(row)  # copy, so the corpus data itself is not modified
    if row[-1] == '.':
        row[-1] = '</s>'  # a final period is replaced by the end marker
    else:
        row.append('</s>')
    return ['<s>'] + row


# Requires the Brown corpus data, e.g. via nltk.download('brown')
news = brown.sents(categories='editorial')
print(len(news), '\n')

for row in news:
    print(alter_list(row))

I tested this in a virtualenv and it works:

In [20]: from nltk import bigrams

In [21]: list(bigrams('This is a test'))  # list() because recent NLTK versions return a generator
Out[21]: 
[('T', 'h'),
 ('h', 'i'),
 ('i', 's'),
 ('s', ' '),
 (' ', 'i'),
 ('i', 's'),
 ('s', ' '),
 (' ', 'a'),
 ('a', ' '),
 (' ', 't'),
 ('t', 'e'),
 ('e', 's'),
 ('s', 't')]

Is that the only error you are getting?

By the way, regarding your second question:

In [43]: from collections import Counter

In [44]: b = bigrams('This is a test')

In [45]: Counter(b)
Out[45]: Counter({('i', 's'): 2, ('s', ' '): 2, ('a', ' '): 1, (' ', 't'): 1, ('e', 's'): 1, ('h', 'i'): 1, ('t', 'e'): 1, ('T', 'h'): 1, (' ', 'i'): 1, (' ', 'a'): 1, ('s', 't'): 1})

With words:

In [49]: b = list(bigrams("This is a test".split(' ')))

In [50]: b
Out[50]: [('This', 'is'), ('is', 'a'), ('a', 'test')]

In [51]: Counter(b)
Out[51]: Counter({('is', 'a'): 1, ('a', 'test'): 1, ('This', 'is'): 1})

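This is obviously very superficial, but depending on your application it may be enough. You could of course use NLTK's tokenization instead, which is far more sophisticated than splitting on spaces.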
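To illustrate the difference between splitting on spaces and tokenizing, here is a minimal sketch that needs only the standard library. The regular expression is assumed to mirror the pattern behind NLTK's wordpunct_tokenize (runs of word characters, or runs of punctuation); simple_tokenize and word_bigrams are hypothetical helper names, not part of NLTK.

```python
import re
from collections import Counter

# Assumption: this pattern mirrors the one behind nltk's wordpunct_tokenize
# (runs of word characters, or runs of non-space punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)


def word_bigrams(tokens):
    """Pair every token with its successor."""
    return list(zip(tokens, tokens[1:]))


text = "This is a test. Is this a test?"
tokens = simple_tokenize(text)
bigram_counts = Counter(word_bigrams(tokens))

print(tokens)
print(bigram_counts.most_common(3))
```

Unlike split(' '), this keeps "test" and "." as separate tokens, so "test." and "test" are not counted as two different words.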

To get to your end goal, you could do something like this:

In [56]: d = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."

In [56]: from nltk import trigrams
In [57]: tri = trigrams(d.split(' '))

In [60]: counter = Counter(tri)

In [61]: import random

In [62]: random.sample(list(counter), 5)  # Python 3 needs a sequence here, not a Counter
Out[62]: 
[('Ipsum', 'has', 'been'),
 ('industry.', 'Lorem', 'Ipsum'),
 ('Ipsum', 'passages,', 'and'),
 ('was', 'popularised', 'in'),
 ('galley', 'of', 'type')]

I trimmed the output because it was unnecessarily large, but you get the idea.
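For the part of the question about getting n-grams manually and counting bigrams, trigrams and quadgrams, a rough sketch without NLTK could look like this. It assumes the input is a list of already tokenized sentences, which is what brown.sents() yields; the toy sentences below stand in for the corpus, and mark_sentence, ngrams_of and ngram_counts are hypothetical names chosen for illustration.

```python
from collections import Counter


def mark_sentence(tokens):
    """Wrap a token list in <s> ... </s>, replacing a final period."""
    tokens = list(tokens)  # copy so the input is left untouched
    if tokens and tokens[-1] == '.':
        tokens[-1] = '</s>'
    else:
        tokens.append('</s>')
    return ['<s>'] + tokens


def ngrams_of(tokens, n):
    """Return all n-grams of a token list as tuples."""
    # zip over n shifted views of the list; stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))


def ngram_counts(sentences, n):
    """Count n-grams sentence by sentence, never across sentence borders."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams_of(mark_sentence(sentence), n))
    return counts


# Toy stand-in for brown.sents(categories='editorial')
sentences = [
    ['the', 'cat', 'sat', '.'],
    ['the', 'cat', 'ran', '.'],
    ['a', 'dog', 'sat'],
]

for n in (2, 3, 4):
    print(n, ngram_counts(sentences, n).most_common(2))
```

Counting per sentence means no n-gram spans the boundary between "</s>" and the next "<s>", which is usually what you want when the markers are there to model sentence starts and ends.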

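Thanks for the reply. I don't know what I did, but it is importing now. Hmm... the problem now is that I need the bigrams per word rather than per letter, so that I can do the calculations on words. How can I do that? I also still need to figure out the n-gram calculations based on bigrams (and trigrams and quadgrams); in the end I need to generate random text, similar to the original corpus, based on the n-grams.

I see now. Unfortunately, the "news" used in the program above is not a type I can call .split on. I get an error: AttributeError: 'ConcatenatedCorpusView' object has no attribute 'split'. How can I take my modified version of news, with the markers, and separate it into bigrams, trigrams, etc.? Edit: OK, let me look at it for a second, but I guess tokenization is allowed.

In any case, you don't want to use split. Tokenization is too complex to go down that path.

Sorry, I have never known about or used a tokenization library. Which one should I use? wordpunct_tokenize seems like a good choice?

See: . How do I tokenize the corpus and add the special characters I need?

Just use: from nltk.tokenize import wordpunct_tokenize; tri_tokenized = trigrams(wordpunct_tokenize(d)), where d is the original string. For the special characters you could use RegexpTokenizer, but there may be better options. Once you have done the initial separation, the rest is easy.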
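For the random-text goal mentioned in the comments, one possible approach (not something the answer above spells out) is to record which words follow each word and then walk from "<s>" until "</s>" comes up. This is only a sketch under that assumption, using toy sentences in place of the Brown corpus; build_bigram_model and generate_sentence are hypothetical names.

```python
import random
from collections import defaultdict


def build_bigram_model(sentences):
    """Map each token to the list of tokens observed right after it."""
    followers = defaultdict(list)
    for sentence in sentences:
        marked = ['<s>'] + list(sentence) + ['</s>']
        for current, nxt in zip(marked, marked[1:]):
            # Repeated entries make frequent followers more likely to be drawn
            followers[current].append(nxt)
    return followers


def generate_sentence(followers, rng, max_len=20):
    """Walk the bigram model from <s> until </s> or max_len words."""
    token, words = '<s>', []
    while len(words) < max_len:
        token = rng.choice(followers[token])
        if token == '</s>':
            break
        words.append(token)
    return words


# Toy stand-in for a tokenized corpus such as brown.sents(...)
toy_sentences = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'sat', 'on', 'the', 'cat'],
    ['a', 'dog', 'ran'],
]

model = build_bigram_model(toy_sentences)
rng = random.Random(0)  # seeded only so that runs are repeatable
for _ in range(3):
    print(' '.join(generate_sentence(model, rng)))
```

The same idea extends to trigrams or quadgrams by keying the dictionary on the previous two or three tokens instead of one.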