
Python NLTK - computing corpus statistics is very slow with a large corpus


I want to look at some basic statistics of my corpus, such as word/sentence counts, distributions, etc. I have a tokens_corpus_reader_ready.txt which contains 137,000 lines of tagged example sentences in this format:

Zur/APPRART Zeit/NN kostenlos/ADJD aber/KON auch/ADV nur/ADV 11/CARD kW/NN Zur/APPRART Zeit/NN anscheinend/ADJD kostenlos/ADJD ./$.
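
(A side note on how such a line is read: NLTK's TaggedCorpusReader splits each word/TAG token on its separator, "/" by default, using nltk.tag.str2tuple; a minimal sketch:)

from nltk.tag import str2tuple

# Split one "word/TAG" token into a (word, tag) pair,
# the same parsing TaggedCorpusReader applies with its default "/" separator.
print(str2tuple('Zeit/NN'))   # -> ('Zeit', 'NN')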

I also have a TaggedCorpusReader with a describe method:

import time
import nltk
from nltk.corpus.reader import TaggedCorpusReader

class CSCorpusReader(TaggedCorpusReader):
    def __init__(self):
        # raw_corpus_path is assumed to be defined elsewhere
        TaggedCorpusReader.__init__(self, raw_corpus_path, 'tokens_corpus_reader_ready.txt')

    def describe(self):
        """
        Performs a single pass of the corpus and
        returns a dictionary with a variety of metrics
        concerning the state of the corpus.

        modified method from https://github.com/foxbook/atap/blob/master/snippets/ch03/reader.py
        """
        started = time.time()

        # Structures to perform counting.
        counts = nltk.FreqDist()
        tokens = nltk.FreqDist()

        # Perform single pass over paragraphs, tokenize and count
        for sent in self.sents():
            print(time.time())
            counts['sents'] += 1

            for word in self.words():
                counts['words'] += 1
                tokens[word] += 1

        return {
            'sents':  counts['sents'],
            'words':  counts['words'],
            'vocab':  len(tokens),
            'lexdiv': float(counts['words']) / float(len(tokens)),
            'secs':   time.time() - started,
        }
If I run the describe method like this in IPython:

>> corpus = CSCorpusReader()
>> print(corpus.describe())
there is a delay of roughly 7 seconds between each sentence:

1543770777.502544
1543770784.383989
1543770792.2057862
1543770798.992075
1543770805.819034
1543770812.599932

If I run the same thing with just a few sentences in tokens_corpus_reader_ready.txt, the output times are completely reasonable:

1543771884.739753
1543771884.74035
1543771884.7408729
1543771884.7413561
{'sents': 4, 'words': 212, 'vocab': 42, 'lexdiv': 5.0476190476190474, 'secs': 0.002869129180908203}
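
(For reference, lexdiv is the lexical diversity that describe computes as words / vocab, i.e. 212 / 42 ≈ 5.048 here.)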

Where does this behavior come from, and how can I fix it?

Edit 1: By not accessing the corpus itself on each iteration but operating on lists instead, the time per sentence drops to around 3 seconds, which is still far too long:

    sents = list(self.sents())
    words = list(self.words())

    # Perform single pass over paragraphs, tokenize and count
    for sent in sents:
        print(time.time())
        counts['sents'] += 1

        for word in words:
            counts['words'] += 1
            tokens[word] += 1

That's your problem right there: for every sentence, you read the entire corpus with the words() method. No wonder it takes forever.

for sent in self.sents():
    print(time.time())
    counts['sents'] += 1

    for word in self.words():
        counts['words'] += 1
        tokens[word] += 1
In fact, a sentence is already tokenized into words, so this is what you meant:

for sent in self.sents():
    print(time.time())
    counts['sents'] += 1

    for word in sent:
        counts['words'] += 1
        tokens[word] += 1
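
As a side note (just a sketch, not required for the fix above): since each sent is already a list of tokens and FreqDist is a Counter subclass, the inner loop can also be collapsed into a single update per sentence:

for sent in self.sents():
    counts['sents'] += 1
    counts['words'] += len(sent)   # number of tokens in this sentence
    tokens.update(sent)            # count every token in one call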

Oh yes, it was going through the entire word list every time. Thanks!