How to loop through the files in a corpus (Python, NLTK)

I have other methods that need to process every txt file in the corpus. How do I loop through them?

import nltk
from nltk.corpus import PlaintextCorpusReader as pcr

def main():
    cor = corpus()
    # for every text file in the corpus:
        #Do this method

def corpus():
    corpus_root='corpus/'
    corp = pcr(corpus_root,'.*\.txt')
    corp = corp.raw()
    return corp

main()

You can use glob:

import glob
glob.glob("corpus/*")
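As a rough sketch of how the glob approach could be turned into a loop, assuming the corpus is a flat directory of UTF-8 .txt files. The helper name iter_text_files and the "*.txt" pattern are illustrative assumptions, not part of the original answer:

```python
import glob
import os


def iter_text_files(corpus_root):
    """Yield (path, text) for every .txt file directly under corpus_root."""
    pattern = os.path.join(corpus_root, "*.txt")
    # glob returns paths in arbitrary order, so sort for reproducibility.
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            yield path, handle.read()


if __name__ == "__main__":
    # "corpus" is the directory name used in the question.
    for path, text in iter_text_files("corpus"):
        print(path, len(text))
```

This bypasses NLTK entirely, so it only gives you raw text; tokenization would have to be done separately.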

Unless I'm mistaken, I think this has a very simple answer:

# for every text file in the corpus
for text_file in cor:
    # Do this method
    my_method(text_file)

NLTK corpus readers have a method, fileids(), which is what you should use:

mycorpus = pcr(corpus_root, r'.*\.txt')

for fname in mycorpus.fileids():
    text = mycorpus.raw(fname)
    sents = mycorpus.sents(fname)
    # or whatever
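Note that corpus() in the question returns corp.raw(), which is a single string, so looping over that value would step through characters rather than files. A minimal sketch of the question's skeleton reworked around fileids() might look like the following. The names load_corpus, process_file and process_corpus are hypothetical, and process_file merely stands in for whatever per-file method you actually need:

```python
from nltk.corpus import PlaintextCorpusReader


def load_corpus(corpus_root):
    """Return the reader itself, not corp.raw(), so files stay separate."""
    return PlaintextCorpusReader(corpus_root, r".*\.txt")


def process_file(fileid, text):
    """Hypothetical stand-in for the per-file method; counts characters."""
    return fileid, len(text)


def process_corpus(corpus_root):
    corpus = load_corpus(corpus_root)
    # fileids() lists every file matched by the pattern, relative to the root.
    return [process_file(fid, corpus.raw(fid)) for fid in corpus.fileids()]
```

Calling process_corpus('corpus/') would then run process_file once per matching file.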

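When you call raw(), sents(), words(), tagged_words(), etc. with a filename, you get only the contents of the specified file. If you need a multi-file subset of the corpus, you can also pass a list of filenames.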
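To illustrate the three scopes (one file, a list of files, the whole corpus), here is a small sketch using words(). The function name compare_scopes is an assumption made for this example, and the counts depend on the reader's default word tokenizer:

```python
from nltk.corpus import PlaintextCorpusReader


def compare_scopes(corpus_root, subset):
    """Return token counts per file in subset, for the subset, and overall."""
    corpus = PlaintextCorpusReader(corpus_root, r".*\.txt")
    # A single fileid: only that file's tokens.
    per_file = {fid: len(corpus.words(fid)) for fid in subset}
    # A list of fileids: the tokens of those files, concatenated.
    subset_total = len(corpus.words(subset))
    # No argument: every file in the corpus.
    corpus_total = len(corpus.words())
    return per_file, subset_total, corpus_total
```

words() is used here rather than sents() because sentence splitting may require additional NLTK tokenizer data to be installed.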

Note: it makes no difference here, but you should use a raw string for the regexp (see above).

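Can you post the file structure of your corpus? Also, what do you plan to do with these files?

This is an nltk question; the structure is clear from the arguments to pcr.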