Can NLTK's XMLCorpusReader be used on a multi-file corpus?


I'm trying to use NLTK to do some work on a corpus that contains one XML file per article, in the News Industry Text Format (NITF).

I can parse an individual document without any problem, like this:

from nltk.corpus.reader import XMLCorpusReader
reader = XMLCorpusReader('nltk_data/corpora/nytimes/1987/01/01', r'0000000.xml')

However, I need to work on the entire corpus. I tried this:

reader = XMLCorpusReader('corpora/nytimes', r'.*')

But this does not create a usable reader object. For example:

len(reader.words())

raise TypeError('Expected a single file identifier string')
TypeError: Expected a single file identifier string

How do I read this corpus into NLTK?

I'm new to NLTK, so any help is greatly appreciated.

Yes, you can specify multiple files (per the NLTK corpus reader documentation).

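The problem here, I suspect, is that all your files sit in a directory structure along the lines of corpora/nytimes/year/month/date. XMLCorpusReader does not recursively traverse directories for you. That is, with your code above, XMLCorpusReader('corpora/nytimes', r'.*'), XMLCorpusReader only looks for XML files directly in corpora/nytimes/ (where there are none, because it only holds folders), not in any subfolders that corpora/nytimes may contain. Also, you probably meant to restrict the second argument to XML files. Since that argument is a regular expression, the pattern would be r'.*\.xml' rather than the shell glob *.xml.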
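To illustrate the difference between a shell glob and a regular expression for the second argument, here is a minimal sketch using only the standard re module. The helper name matches_fileid is hypothetical, and using a full match is an assumption made for the illustration; it is not a statement about how NLTK anchors the pattern internally.

```python
import re

def matches_fileid(pattern, fileid):
    """Return True if the regex pattern matches the whole file id."""
    return re.fullmatch(pattern, fileid) is not None

# A shell glob such as '*.xml' is not a valid regular expression:
# the leading '*' has nothing to repeat.
try:
    re.compile('*.xml')
    glob_is_valid_regex = True
except re.error:
    glob_is_valid_regex = False

# The regex equivalent of the glob escapes the dot and uses '.*'.
xml_pattern = r'.*\.xml'
```

With this pattern, matches_fileid(xml_pattern, '1987/01/01/0000000.xml') is true, while a non-XML name such as 'notes.txt' is rejected.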

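I suggest walking the folders yourself to build the paths (the documentation above states that explicit paths for the fileid argument will work). Alternatively, if you have a list of year/month/date combinations, you can use that to your advantage.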
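Walking the folders yourself could look roughly like the sketch below. It uses os.walk from the standard library; the helper name collect_xml_fileids is hypothetical, and the sketch assumes the file ids should be expressed relative to the corpus root.

```python
import os

def collect_xml_fileids(root):
    """Return sorted paths of all .xml files under root, relative to root."""
    fileids = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.xml'):
                full_path = os.path.join(dirpath, name)
                relative = os.path.relpath(full_path, root)
                # Use forward slashes so the ids look the same on every platform
                fileids.append(relative.replace(os.sep, '/'))
    return sorted(fileids)
```

The resulting list could then be passed as the second argument, for example XMLCorpusReader(root, collect_xml_fileids(root)).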

I'm no NLTK expert, so there may be an easier way to do this, but naively I'd suggest using glob. It supports Unix-style pathname pattern expansion.

from glob import glob
texts = glob('nltk_data/corpora/nytimes/*')

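This gives you the names of the files matching the specified expression, as a list. Then, depending on how many of them you want or need to have open at once, you can do the following: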

import os
from nltk.corpus.reader import XMLCorpusReader
root = 'nltk_data/corpora/nytimes'
for item_path in texts:
    # File ids are resolved relative to the corpus root
    reader = XMLCorpusReader(root, os.path.relpath(item_path, root))

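As @waffle paradox suggested, you can also trim the texts list down according to your specific needs.

Here is a solution based on the comments from machine yearning and waffle paradox. Use glob to build the list of articles and pass it to XMLCorpusReader as a list: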
from glob import glob
import os
from nltk.corpus.reader import XMLCorpusReader

root = 'nltk_data/corpora/nytimes_test'
# Directory layout: root/year/month/day/*.xml
years = glob(root + '/*')
year_months = []
for year in years:
    year_months += glob(year + '/*')
print(year_months)  # inspect the month directories that were found
days = []
for year_month in year_months:
    days += glob(year_month + '/*')
articles = []
for day in days:
    articles += glob(day + '/*.xml')
# File ids must be relative to the corpus root
file_ids = [os.path.relpath(article, root) for article in articles]
reader = XMLCorpusReader(root, file_ids)
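Once the reader holds many files, the TypeError from the question suggests that words() still wants a single file identifier per call. A minimal sketch of counting words across the whole corpus, one file at a time, might look like this. The helper name count_words is hypothetical; it relies only on the reader's fileids() and words(fileid) methods.

```python
def count_words(reader):
    """Sum the word counts of every file in the reader, one file at a time."""
    total = 0
    for fileid in reader.fileids():
        # Pass one file id per call instead of calling words() with none
        total += len(reader.words(fileid))
    return total
```

For the reader built above, count_words(reader) would take the place of len(reader.words()).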

Thanks, waffle paradox. That was very helpful.