Python: Generating Ngrams (Unigrams, Bigrams, etc.) from a Large Number of .txt Files, with Their Frequencies


I need to write a program in NLTK that breaks a corpus (a large collection of txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. I have already written the code that feeds the files into the program.

The input is 300 .txt files written in English, and I want the output in the form of ngrams, specifically with frequency counts.

I know that NLTK has bigram and trigram modules:

but I am not advanced enough to incorporate them into my program.

Input: the txt files are not single sentences.

Example output:

Bigram [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')] 

Trigram: [('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i', 'am'), ('i', 'am', 'fine'), ('am', 'fine', 'and'), ('fine', 'and', 'you')]
My code so far is:

from nltk.corpus import PlaintextCorpusReader
corpus = 'C:/Users/jack3/My folder'
files = PlaintextCorpusReader(corpus, '.*')
ngrams=2

def generate(file, ngrams):
    for gram in range(0, ngrams):
        print((file[0:-4] + "_" + str(ngrams) + "_grams.txt").replace("/", "_"))


for file in files.fileids():
    generate(file, ngrams)

Any help? What should I do next?

Just use nltk.ngrams.

import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter

text = "I need to write a program in NLTK that breaks a corpus (a large collection of \
txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams.\ 
I need to write a program in NLTK that breaks a corpus"
token = nltk.word_tokenize(text)
bigrams = ngrams(token,2)
trigrams = ngrams(token,3)
fourgrams = ngrams(token,4)
fivegrams = ngrams(token,5)

print(Counter(bigrams))

Counter({('program', 'in'): 2, ('NLTK', 'that'): 2, ('that', 'breaks'): 2,
 ('write', 'a'): 2, ('breaks', 'a'): 2, ('to', 'write'): 2, ('I', 'need'): 2,
 ('a', 'corpus'): 2, ('need', 'to'): 2, ('a', 'program'): 2, ('in', 'NLTK'): 2,
 ('and', 'fivegrams'): 1, ('corpus', '('): 1, ('txt', 'files'): 1, ('unigrams', 
','): 1, (',', 'trigrams'): 1, ('into', 'unigrams'): 1, ('trigrams', ','): 1,
 (',', 'bigrams'): 1, ('large', 'collection'): 1, ('bigrams', ','): 1, ('of',
 'txt'): 1, (')', 'into'): 1, ('fourgrams', 'and'): 1, ('fivegrams', '.'): 1,
 ('(', 'a'): 1, (',', 'fourgrams'): 1, ('a', 'large'): 1, ('.', 'I'): 1, 
('collection', 'of'): 1, ('files', ')'): 1})
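To apply the same idea to the folder of .txt files from the question instead of a hard-coded string, one possible sketch (reusing the PlaintextCorpusReader setup from the question; the output-file naming here is only an illustration) would be:

from collections import Counter
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.util import ngrams

corpus_root = 'C:/Users/jack3/My folder'
files = PlaintextCorpusReader(corpus_root, r'.*\.txt')

for fileid in files.fileids():
    tokens = nltk.word_tokenize(files.raw(fileid))
    for n in range(1, 6):                       # unigrams up to fivegrams
        counts = Counter(ngrams(tokens, n))
        out_name = (fileid[:-4] + "_" + str(n) + "_grams.txt").replace("/", "_")
        with open(out_name, "w", encoding="utf-8") as out:
            # one "<ngram>\t<count>" line per n-gram, most frequent first
            for gram, freq in counts.most_common():
                out.write(" ".join(gram) + "\t" + str(freq) + "\n")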
Update (using pure Python):


Ok, so since you asked for an NLTK solution this might not be exactly what you were looking for, but: have you considered TextBlob? It has an NLTK backend, but the syntax is simpler. It looks like this:

from textblob import TextBlob

text = "Paste your text or text-containing variable here" 
blob = TextBlob(text)
ngram_var = blob.ngrams(n=3)
print(ngram_var)

Output:
[WordList(['Paste', 'your', 'text']), WordList(['your', 'text', 'or']), WordList(['text', 'or', 'text-containing']), WordList(['or', 'text-containing', 'variable']), WordList(['text-containing', 'variable', 'here'])]
Of course, you will still need to use a Counter or some other method to add a count for each ngram.
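For example, a small sketch that counts the ngrams produced by the blob above (the WordList objects are list-like, so they are converted to tuples before being used as Counter keys):

from collections import Counter

ngram_counts = Counter(tuple(wl) for wl in blob.ngrams(n=3))
print(ngram_counts.most_common(10))   # ten most frequent trigrams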


By far the fastest way I have found so far, however, to create any ngram you like and also count them in a single function comes from a 2012 post and uses itertools. It's great.

Here is a simple example, using pure Python, to generate any ngram:

>>> def ngrams(s, n=2, i=0):
...     while len(s[i:i+n]) == n:
...         yield s[i:i+n]
...         i += 1
...
>>> txt = 'Python is one of the awesomest languages'

>>> unigram = ngrams(txt.split(), n=1)
>>> list(unigram)
[['Python'], ['is'], ['one'], ['of'], ['the'], ['awesomest'], ['languages']]

>>> bigram = ngrams(txt.split(), n=2)
>>> list(bigram)
[['Python', 'is'], ['is', 'one'], ['one', 'of'], ['of', 'the'], ['the', 'awesomest'], ['awesomest', 'languages']]

>>> trigram = ngrams(txt.split(), n=3)
>>> list(trigram)
[['Python', 'is', 'one'], ['is', 'one', 'of'], ['one', 'of', 'the'], ['of', 'the', 'awesomest'], ['the', 'awesomest',
'languages']]

If efficiency is an issue and you have to build several different n-grams, but you want to use pure Python, I would do:

from itertools import chain

def n_grams(seq, n=1):
    """Returns an iterator over the n-grams given a list_tokens"""
    shift_token = lambda i: (el for j,el in enumerate(seq) if j>=i)
    shifted_tokens = (shift_token(i) for i in range(n))
    tuple_ngrams = zip(*shifted_tokens)
    return tuple_ngrams # if join in generator : (" ".join(i) for i in tuple_ngrams)

def range_ngrams(list_tokens, ngram_range=(1,2)):
    """Returns an itirator over all n-grams for n in range(ngram_range) given a list_tokens."""
    return chain(*(n_grams(list_tokens, i) for i in range(*ngram_range)))
Usage:

>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngram_range=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]
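Since the original question asks for frequency counts, the output of range_ngrams can be fed straight into a Counter, for example:

from collections import Counter

ngram_counts = Counter(range_ngrams(input_list, ngram_range=(1, 3)))
print(ngram_counts.most_common())   # every n-gram occurs exactly once in this toy input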
~Same speed as NLTK:

import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngram_range=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Reposted from an earlier answer of mine.

This might be helpful. See:


@hellpander's answer above is correct, but not efficient for a very large corpus (I faced difficulties with ~650K documents). The code slows down considerably every time the frequencies are updated, because dictionary lookups become expensive as the dictionary grows. So you need an additional buffer variable to cache the frequency Counter from @hellpander's answer. Instead of doing a key lookup against the very large frequencies dictionary every time a new document is iterated, you add the counts to a temporary, smaller Counter dict; then, after some number of iterations, it is added into the global frequencies. This is much faster, because the lookups against the huge dictionary happen far less often.

import os
import nltk
from nltk.util import ngrams
from collections import Counter

corpus = []
path = '.'
for i in next(os.walk(path))[2]:           # os.walk(path).next() is Python 2; use next(...) in Python 3
    if i.endswith('.txt'):
        with open(os.path.join(path, i)) as fin:
            corpus.append(fin.read())

frequencies = Counter([])                  # global counter, updated only occasionally
buffer = Counter([])                       # small temporary counter used as a cache

for i in range(0, len(corpus)):
    token = nltk.word_tokenize(corpus[i])
    bigrams = ngrams(token, 2)
    buffer += Counter(bigrams)
    if i % 10000 == 0:
        # flush the buffer into the global frequencies counter every 10000 docs
        frequencies += buffer
        buffer = Counter([])

frequencies += buffer                      # flush whatever is left after the last batch

Hi Hellpanderrr, thanks. The part I still need is something that lets me plug my whole dataset (a folder of txt files) into the program, so that it runs through my txt files and gives me an output for each. So instead of reading text = "xxxxxx", I need it to refer to my folder of txt files. When I look up the bigrams it gives this message! That is because the ngrams function returns a generator, and you need to call list on it to actually extract the contents. What kind of list? Can you show me where it should go in the code above? If you want to see what the ngrams function returns, you need to pass it to the list function, e.g. list(ngrams(token, 2)). But the code above should work fine, just plug in the path to your files. Very helpful!
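To illustrate that generator point with a self-contained example (using the sample sentence from the question):

import nltk
from nltk.util import ngrams

token = nltk.word_tokenize("Hi How are you? i am fine and you")
bigrams = ngrams(token, 2)
print(bigrams)        # a generator object, not the n-grams themselves
print(list(bigrams))  # materialises the bigram tuples
# a generator can only be consumed once, so re-create it (or keep the list) before counting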
import spacy

nlp_en = spacy.load("en_core_web_sm")              # small English pipeline
doc = nlp_en("Hi How are you? i am fine and you")
tokens = [x.text for x in doc]                     # spaCy tokenization instead of nltk.word_tokenize
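Assuming the tokens list produced by the spaCy snippet above, those tokens can be dropped into the same ngrams/Counter machinery used in the earlier answers:

from collections import Counter
from nltk.util import ngrams

print(Counter(ngrams(tokens, 2)).most_common(10))   # ten most frequent bigrams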