Saving n-gram Python generator output to a CSV file


I am extracting n-grams from text data in Python.

I used the NLTK package for this. Here is the code:

from nltk.util import ngrams
bigrams=ngrams(cleaned_docs,2)
trigrams=ngrams(cleaned_docs,3)
quadgrams=ngrams(cleaned_docs,4)
pentagrams=ngrams(cleaned_docs,5)

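Here cleaned_docs is the list of tokenized words in the text. Each call returns a generator whose values are n-gram tuples. For bigrams, this is what it looks like: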

for x in bigrams:
    print(x)

("mom's", 'hi')
('this', 'in')
('in', 'house')

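I want to get the frequency distribution of each of the n-grams defined above and save them to a CSV file in descending order of frequency. The CSV will have two columns: one with the n-gram and one with its count in the text.

I also want to plot the n-gram frequencies and save the plot as a .jpeg file. This is the code I use to plot unigram (word) frequencies, but I am not sure how to save it as a JPEG using the nltk fd object: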

fd = nltk.FreqDist(cleaned_docs)
fig = plt.figure(figsize=(20,15))
plt.ylabel("frequency",fontsize=25)
plt.xlabel("Words",fontsize=25)
plt.rc('xtick', labelsize=15) 
plt.rc('ytick', labelsize=15)
plt.title("Word Frequency Distribution",fontsize=25)
fd.plot(80,cumulative=False)

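Please note that I am looking for a fast solution, because my text is huge. I have 550K observations, each with more than 500 characters of text on average, so the number of bigrams and higher-order n-grams will be large as well.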

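The FreqDist class is a subclass of Python's collections.Counter, so there is nothing special about it. It counts the number of occurrences of each element in the iterable passed to it: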

n2_freq = nltk.FreqDist(bigrams)
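One thing worth keeping in mind, assuming the generators are reused the way the question shows: a generator can only be consumed once. If bigrams has already been iterated (for example by the printing loop), a later count over it sees nothing. The sketch below illustrates this with the standard library only; it uses collections.Counter and a zip over shifted slices as stand-ins for nltk.FreqDist and nltk.util.ngrams, and the token list is made up.

```python
from collections import Counter

# Made-up tokens standing in for cleaned_docs.
tokens = ["this", "in", "house", "this", "in"]

# zip over shifted slices yields bigram tuples lazily, one at a time.
bigrams = zip(tokens, tokens[1:])

first_pass = Counter(bigrams)   # consumes the iterator completely
second_pass = Counter(bigrams)  # the iterator is exhausted, nothing is counted

print(first_pass.most_common(1))
print(len(second_pass))
```

If the same n-grams are needed more than once, recreating the generator or counting first and then working from the counts avoids the problem.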

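To get the elements in descending order of frequency, you can use the most_common method:

for bigram, freq in n2_freq.most_common():
    # Print them...
    print(bigram, freq)

To save the figure, you need to use the fig object returned by plt.figure; it should have a savefig method. As you can see, plot itself does not return anything: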

fig = plt.figure(figsize=(20,15))
[...]
n2_freq.plot()

fig.savefig('bigram_freq_dist.jpg')
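The most_common() loop in this answer only prints the pairs, while the question asks for a two-column CSV. A minimal sketch of that step follows. It is not the answerer's code: it counts with collections.Counter (the class FreqDist derives from) so that it runs without NLTK, and the helper names iter_ngrams and save_ngram_counts, the column headers, and the file name are all hypothetical.

```python
import csv
from collections import Counter


def iter_ngrams(tokens, n):
    """Yield n-gram tuples from a token list (stand-in for nltk.util.ngrams).

    Note: each slice copies the list, so this keeps n copies in memory.
    """
    return zip(*(tokens[i:] for i in range(n)))


def save_ngram_counts(tokens, n, path):
    """Count n-grams and write them to a two-column CSV, most frequent first."""
    counts = Counter(iter_ngrams(tokens, n))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["ngram", "count"])
        for gram, freq in counts.most_common():
            # Join the tuple so the first column holds one readable string.
            writer.writerow([" ".join(gram), freq])
    return counts


if __name__ == "__main__":
    sample = "this in house this in house too".split()
    save_ngram_counts(sample, 2, "bigram_counts.csv")
```

With NLTK installed, passing nltk.FreqDist(ngrams(cleaned_docs, 2)) through the same writer loop should work in the same way, since most_common() comes from Counter.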