Python NLTK - computing statistics is extremely slow on a large corpus

I want to look at some basic statistics of my corpus, such as word/sentence counts, distributions, and so on. I have a tokens_corpus_reader_ready.txt that contains 137,000 lines of tagged example sentences in this format:

Zur/APPRART Zeit/NN kostenlos/ADJD aber/KON auch/ADV nur/ADV 11/CARD kW/NN Zur/APPRART Zeit/NN anscheinend/ADJD kostenlos/ADJD ./$.

I also have a TaggedCorpusReader with a describe method:
class CSCorpusReader(TaggedCorpusReader):
    def __init__(self):
        TaggedCorpusReader.__init__(self, raw_corpus_path, 'tokens_corpus_reader_ready.txt')

    def describe(self):
        """
        Performs a single pass of the corpus and
        returns a dictionary with a variety of metrics
        concerning the state of the corpus.
        modified method from https://github.com/foxbook/atap/blob/master/snippets/ch03/reader.py
        """
        started = time.time()

        # Structures to perform counting.
        counts = nltk.FreqDist()
        tokens = nltk.FreqDist()

        # Perform single pass over paragraphs, tokenize and count
        for sent in self.sents():
            print(time.time())
            counts['sents'] += 1
            for word in self.words():
                counts['words'] += 1
                tokens[word] += 1

        return {
            'sents': counts['sents'],
            'words': counts['words'],
            'vocab': len(tokens),
            'lexdiv': float(counts['words']) / float(len(tokens)),
            'secs': time.time() - started,
        }
If I run the describe method in IPython like this:
>> corpus = CSCorpusReader()
>> print(corpus.describe())
there is a delay of about 7 seconds between each sentence:
1543770777.502544
1543770784.383989
1543770792.2057862
1543770798.992075
1543770805.819034
1543770812.599932
If I run the same thing with only a few sentences in tokens_corpus_reader_ready.txt, the output times are perfectly reasonable:
1543771884.739753
1543771884.74035
1543771884.7408729
1543771884.7413561
{'sents': 4, 'words': 212, 'vocab': 42, 'lexdiv': 5.0476190476190474, 'secs': 0.002869129180908203}
Where does this behavior come from, and how can I fix it?

Edit 1

Instead of accessing the corpus itself on every iteration, I operated on lists; this reduced the time per sentence to about 3 seconds, which is still far too long:
sents = list(self.sents())
words = list(self.words())

# Perform single pass over paragraphs, tokenize and count
for sent in sents:
    print(time.time())
    counts['sents'] += 1
    for word in words:
        counts['words'] += 1
        tokens[word] += 1
That right there is your problem: for every sentence, you read the entire corpus again with words(). No wonder it takes so long.
for sent in self.sents():
    print(time.time())
    counts['sents'] += 1
    for word in self.words():
        counts['words'] += 1
        tokens[word] += 1
In fact, a sentence is already tokenized into words, so this is what you meant:
for sent in self.sents():
    print(time.time())
    counts['sents'] += 1
    for word in sent:
        counts['words'] += 1
        tokens[word] += 1
Ah yes, it was walking through the entire word list on every sentence. Thank you!
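For reference, here is a minimal, self-contained sketch of the corrected single-pass counting, with the corpus reader replaced by a hypothetical list of pre-tokenized sentences. It uses collections.Counter instead of nltk.FreqDist (FreqDist is a Counter subclass, so the counting logic is identical) to keep the example dependency-free:

```python
from collections import Counter

def describe(sents):
    """Single pass over pre-tokenized sentences; mirrors the fixed loop above."""
    counts = Counter()
    tokens = Counter()
    for sent in sents:
        counts['sents'] += 1
        for word in sent:          # iterate the sentence itself, not the whole corpus
            counts['words'] += 1
            tokens[word] += 1
    return {
        'sents': counts['sents'],
        'words': counts['words'],
        'vocab': len(tokens),
        'lexdiv': counts['words'] / len(tokens),
    }

# Hypothetical sample data standing in for corpus.sents()
sents = [['Zur', 'Zeit', 'kostenlos'],
         ['Zur', 'Zeit', 'anscheinend', 'kostenlos']]
print(describe(sents))
# → {'sents': 2, 'words': 7, 'vocab': 4, 'lexdiv': 1.75}
```

Because each word is now visited exactly once, the whole pass is linear in the corpus size, instead of reading all 137,000 lines again for every sentence.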