Python: How do I find the XML file that is missing an element?

python xml deep-learning nlp

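I'm new to Python and just getting into deep learning.

I have 10,000 XML files containing information about patent documents (obtained from WIPO). I want to extract the title, abstract, and classification of each document. I managed to do this with ElementTree and stored the results in three lists, but I realized that one document is missing the classification element. How can I find out which one it is?

Here is the code I have so far: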

import os
import xml.etree.ElementTree as ET

abstracts=[]
titles=[]
tags=[]

# path is the directory that holds the XML files
for filename in os.listdir(path):
    if not filename.endswith('.xml'): continue
    file = os.path.join(path, filename)
    tree = ET.parse(file)
    root = tree.getroot()

    for title in root.iter('invention-title'):
        titles.append(title.text)

    for abstract in root.iter('abstract'):
        abstracts.append(abstract.text)

    for tag in root.findall('ipc-postreform'):
        tags.append(tag.find('classification-ipc').text)

Thanks.

If I understand your code and use case correctly, you can check the title, abstract, and classification as you go.

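import os
import xml.etree.ElementTree as ET

abstracts = []
titles = []
tags = []

for filename in os.listdir(path):
    if not filename.endswith('.xml'): continue
    file = os.path.join(path, filename)
    tree = ET.parse(file)
    root = tree.getroot()

    # Each file holds one title and one abstract, so take the first match
    title = next(root.iter('invention-title')).text
    # title = root.find('.//invention-title').text
    abstract = next(root.iter('abstract')).text
    # abstract = root.find('.//abstract').text

    # find() returns None when the element is absent
    tag = root.find('ipc-postreform')
    if tag is None:
        raise Exception("{} ({}): tag not found".format(filename, title))
    classification = tag.findtext('classification-ipc')
    if not classification:
        raise Exception("{} ({}): classification not found".format(filename, title))
    titles.append(title)
    tags.append(classification)
    abstracts.append(abstract)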

As @Mihail-Burduja mentioned, the for loops are not needed, so I replaced them with a call to next(). You could probably use find() instead.
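If stopping at the first exception is inconvenient, a variation is to collect the names of all files that lack a classification. The following is only a sketch: the function name find_files_missing_classification is made up for illustration, and it assumes, as the question's code does, that ipc-postreform is a direct child of the root element.

```python
import os
import xml.etree.ElementTree as ET


def find_files_missing_classification(path):
    """Return names of .xml files in `path` with no usable classification text."""
    missing = []
    for filename in sorted(os.listdir(path)):
        if not filename.endswith('.xml'):
            continue
        root = ET.parse(os.path.join(path, filename)).getroot()
        # Same lookup as the question: a direct child of the root element.
        tag = root.find('ipc-postreform')
        # findtext() returns None when the child element is absent.
        text = tag.findtext('classification-ipc') if tag is not None else None
        if text is None or not text.strip():
            missing.append(filename)
    return missing
```

Calling find_files_missing_classification(path) on the directory from the question should then list the one file whose classification is absent, along with any file where the element exists but is empty.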

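So you have 10,000 XML files in your directory, each with one title, one abstract, and (usually) one "ipc-postreform" element?

Yes, exactly.

Just compare the lengths of abstracts, titles, and tags at the end of each iteration and you will find the missing one.

Each XML file contains exactly one of each, so why use for?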
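The length comparison suggested in the comments could look roughly like the sketch below. The helper name first_length_mismatch and the (name, root) input format are assumptions chosen to keep the example small. It keeps the question's loops and returns as soon as the three lists stop growing in step.

```python
import xml.etree.ElementTree as ET


def first_length_mismatch(documents):
    """Return the name of the first document that leaves the lists uneven.

    `documents` is an iterable of (name, root_element) pairs. None is
    returned when every document adds the same number of items to each list.
    """
    titles, abstracts, tags = [], [], []
    for name, root in documents:
        for title in root.iter('invention-title'):
            titles.append(title.text)
        for abstract in root.iter('abstract'):
            abstracts.append(abstract.text)
        for tag in root.findall('ipc-postreform'):
            text = tag.findtext('classification-ipc')
            if text is not None:
                tags.append(text)
        # The lists grow in step until a document lacks one of the elements.
        if not (len(titles) == len(abstracts) == len(tags)):
            return name
    return None
```

To run it over a directory, one could pass a generator that yields (filename, ET.parse(file).getroot()) for each XML file.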

>>> len(abstracts)
10000
>>> len(titles)
10000
>>> len(tags)
9999