Python: Generating Ngrams (Unigrams, Bigrams, etc.) from a Large Number of .txt Files, with Their Frequencies


I need to write a program in NLTK that breaks a corpus (a large collection of txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. I have already written the code that feeds the files into the program.

The input is 300 .txt files written in English, and I want the output in the form of ngrams, specifically with frequency counts.

I know that NLTK has bigram and trigram modules:

but I am not advanced enough to incorporate them into my program.

Input: the txt files are not single sentences.

Example output:

Bigram [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')] 

Trigram: [('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i', 'am'), ('i', 'am', 'fine'), ('am', 'fine', 'and'), ('fine', 'and', 'you')]
My code so far is:

from nltk.corpus import PlaintextCorpusReader
corpus = 'C:/Users/jack3/My folder'
files = PlaintextCorpusReader(corpus, '.*')
ngrams=2

def generate(file, ngrams):
    for gram in range(0, ngrams):
        print((file[0:-4] + "_" + str(ngrams) + "_grams.txt").replace("/", "_"))


for file in files.fileids():
    generate(file, ngrams)

Any help? What should I do next?

Just use nltk.ngrams.

import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter

text = "I need to write a program in NLTK that breaks a corpus (a large collection of \
txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams.\ 
I need to write a program in NLTK that breaks a corpus"
token = nltk.word_tokenize(text)
bigrams = ngrams(token,2)
trigrams = ngrams(token,3)
fourgrams = ngrams(token,4)
fivegrams = ngrams(token,5)

print(Counter(bigrams))

Counter({('program', 'in'): 2, ('NLTK', 'that'): 2, ('that', 'breaks'): 2,
 ('write', 'a'): 2, ('breaks', 'a'): 2, ('to', 'write'): 2, ('I', 'need'): 2,
 ('a', 'corpus'): 2, ('need', 'to'): 2, ('a', 'program'): 2, ('in', 'NLTK'): 2,
 ('and', 'fivegrams'): 1, ('corpus', '('): 1, ('txt', 'files'): 1, ('unigrams', 
','): 1, (',', 'trigrams'): 1, ('into', 'unigrams'): 1, ('trigrams', ','): 1,
 (',', 'bigrams'): 1, ('large', 'collection'): 1, ('bigrams', ','): 1, ('of',
 'txt'): 1, (')', 'into'): 1, ('fourgrams', 'and'): 1, ('fivegrams', '.'): 1,
 ('(', 'a'): 1, (',', 'fourgrams'): 1, ('a', 'large'): 1, ('.', 'I'): 1, 
('collection', 'of'): 1, ('files', ')'): 1})
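To apply the same idea to the folder of .txt files from the question instead of a hard-coded string, one possible sketch (reusing the PlaintextCorpusReader setup from the question; the output-file naming here is only an illustration) would be:

from collections import Counter
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.util import ngrams

corpus_root = 'C:/Users/jack3/My folder'
files = PlaintextCorpusReader(corpus_root, r'.*\.txt')

for fileid in files.fileids():
    tokens = nltk.word_tokenize(files.raw(fileid))
    for n in range(1, 6):                       # unigrams up to fivegrams
        counts = Counter(ngrams(tokens, n))
        out_name = (fileid[:-4] + "_" + str(n) + "_grams.txt").replace("/", "_")
        with open(out_name, "w", encoding="utf-8") as out:
            # one "<ngram>\t<count>" line per n-gram, most frequent first
            for gram, freq in counts.most_common():
                out.write(" ".join(gram) + "\t" + str(freq) + "\n")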
Update (using pure Python):


Ok, so since you asked for an NLTK solution this might not be exactly what you were looking for, but: have you considered TextBlob? It has an NLTK backend, but the syntax is simpler. It looks like this:

from textblob import TextBlob

text = "Paste your text or text-containing variable here" 
blob = TextBlob(text)
ngram_var = blob.ngrams(n=3)
print(ngram_var)

Output:
[WordList(['Paste', 'your', 'text']), WordList(['your', 'text', 'or']), WordList(['text', 'or', 'text-containing']), WordList(['or', 'text-containing', 'variable']), WordList(['text-containing', 'variable', 'here'])]
Of course, you will still need to use a Counter or some other method to add a count for each ngram.
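For example, a small sketch that counts the ngrams produced by the blob above (the WordList objects are list-like, so they are converted to tuples before being used as Counter keys):

from collections import Counter

ngram_counts = Counter(tuple(wl) for wl in blob.ngrams(n=3))
print(ngram_counts.most_common(10))   # ten most frequent trigrams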


By far the fastest way I have found so far, however, to create any ngram you like and also count them in a single function comes from a 2012 post and uses itertools. It's great.

Here is a simple example, using pure Python, to generate any ngram:

>>> def ngrams(s, n=2, i=0):
...     while len(s[i:i+n]) == n:
...         yield s[i:i+n]
...         i += 1
...
>>> txt = 'Python is one of the awesomest languages'

>>> unigram = ngrams(txt.split(), n=1)
>>> list(unigram)
[['Python'], ['is'], ['one'], ['of'], ['the'], ['awesomest'], ['languages']]

>>> bigram = ngrams(txt.split(), n=2)
>>> list(bigram)
[['Python', 'is'], ['is', 'one'], ['one', 'of'], ['of', 'the'], ['the', 'awesomest'], ['awesomest', 'languages']]

>>> trigram = ngrams(txt.split(), n=3)
>>> list(trigram)
[['Python', 'is', 'one'], ['is', 'one', 'of'], ['one', 'of', 'the'], ['of', 'the', 'awesomest'], ['the', 'awesomest',
'languages']]

If efficiency is an issue and you have to build several different n-grams, but you want to use pure Python, I would do:

from itertools import chain

def n_grams(seq, n=1):
    """Returns an iterator over the n-grams given a list_tokens"""
    shift_token = lambda i: (el for j,el in enumerate(seq) if j>=i)
    shifted_tokens = (shift_token(i) for i in range(n))
    tuple_ngrams = zip(*shifted_tokens)
    return tuple_ngrams # if join in generator : (" ".join(i) for i in tuple_ngrams)

def range_ngrams(list_tokens, ngram_range=(1,2)):
    """Returns an itirator over all n-grams for n in range(ngram_range) given a list_tokens."""
    return chain(*(n_grams(list_tokens, i) for i in range(*ngram_range)))
Usage:

>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngram_range=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]
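Since the original question asks for frequency counts, the output of range_ngrams can be fed straight into a Counter, for example:

from collections import Counter

ngram_counts = Counter(range_ngrams(input_list, ngram_range=(1, 3)))
print(ngram_counts.most_common())   # every n-gram occurs exactly once in this toy input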
~Same speed as NLTK:

import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngram_range=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Reposted from an earlier answer of mine.

This might be helpful. See:


@hellpander's answer above is correct, but not efficient for a very large corpus (I faced difficulties with ~650K documents). The code slows down considerably every time the frequencies are updated, because dictionary lookups become expensive as the dictionary grows. So you need an additional buffer variable to cache the frequency Counter from @hellpander's answer. Instead of doing a key lookup against the very large frequencies dictionary every time a new document is iterated, you add the counts to a temporary, smaller Counter dict; then, after some number of iterations, it is added into the global frequencies. This is much faster, because the lookups against the huge dictionary happen far less often.

import os
import nltk
from nltk.util import ngrams
from collections import Counter

corpus = []
path = '.'
for i in next(os.walk(path))[2]:           # os.walk(path).next() is Python 2; use next(...) in Python 3
    if i.endswith('.txt'):
        with open(os.path.join(path, i)) as fin:
            corpus.append(fin.read())

frequencies = Counter([])                  # global counter, updated only occasionally
buffer = Counter([])                       # small temporary counter used as a cache

for i in range(0, len(corpus)):
    token = nltk.word_tokenize(corpus[i])
    bigrams = ngrams(token, 2)
    buffer += Counter(bigrams)
    if i % 10000 == 0:
        # flush the buffer into the global frequencies counter every 10000 docs
        frequencies += buffer
        buffer = Counter([])

frequencies += buffer                      # flush whatever is left after the last batch

Hi Hellpanderrr, thanks. The part I still need is something that lets me plug my whole dataset (a folder of txt files) into the program, so that it runs through my txt files and gives me an output for each. So instead of reading text = "xxxxxx", I need it to refer to my folder of txt files. When I look up the bigrams it gives this message! That is because the ngrams function returns a generator, and you need to call list on it to actually extract the contents. What kind of list? Can you show me where it should go in the code above? If you want to see what the ngrams function returns, you need to pass it to the list function, e.g. list(ngrams(token, 2)). But the code above should work fine, just plug in the path to your files. Very helpful!
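To illustrate that generator point with a self-contained example (using the sample sentence from the question):

import nltk
from nltk.util import ngrams

token = nltk.word_tokenize("Hi How are you? i am fine and you")
bigrams = ngrams(token, 2)
print(bigrams)        # a generator object, not the n-grams themselves
print(list(bigrams))  # materialises the bigram tuples
# a generator can only be consumed once, so re-create it (or keep the list) before counting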
import spacy

nlp_en = spacy.load("en_core_web_sm")              # small English pipeline
doc = nlp_en("Hi How are you? i am fine and you")
tokens = [x.text for x in doc]                     # spaCy tokenization instead of nltk.word_tokenize
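Assuming the tokens list produced by the spaCy snippet above, those tokens can be dropped into the same ngrams/Counter machinery used in the earlier answers:

from collections import Counter
from nltk.util import ngrams

print(Counter(ngrams(tokens, 2)).most_common(10))   # ten most frequent bigrams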