Python: parsing XML with lxml, should I skip or remove a comment that blocks parsing?


python xml


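I have an XML file whose first line starts with a comment:

<!--

The file works when I parse it with objectify and dump the root; that runs perfectly. What fails is etree.

The file is fairly large, so here is the comment and a trimmed excerpt of the first element of the XML: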

<!-- Copyright Notice: © 2010 Racing NSW (and other parties working with it). NSW racing information,including fields, form and results, is subject to copyright which is owned by Racing NSW and other parties working with it. -->
<meeting id="42977" barriertrial="0" venue="Rosehill Gardens" date="2016-05-21T00:00:00" gearchanges="-1" stewardsreport="-1" gearlist="-1" racebook="0" postracestewards="0" meetingtype="TAB" rail="Timing - Electronic : Rail - +6m" weather="Fine      " trackcondition="Good 3    " nomsdeadline="2016-05-16T11:00:00" weightsdeadline="2016-05-17T16:00:00" acceptdeadline="2016-05-18T09:00:00" jockeydeadline="2016-05-18T12:00:00">
</meeting>
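
For reference, here is a minimal sketch that feeds a shortened copy of this excerpt to both lxml.etree and lxml.objectify. The SAMPLE bytes are an assumption made for illustration (most attributes are trimmed and the © sign is dropped), not the real file. With this input both parsers return the meeting element, and the leading comment stays reachable as a sibling that precedes the root.

```python
from lxml import etree, objectify

# Shortened, hypothetical copy of the excerpt above; most attributes are omitted.
SAMPLE = (
    b"<!-- Copyright Notice: 2010 Racing NSW -->\n"
    b'<meeting id="42977" venue="Rosehill Gardens" meetingtype="TAB">\n'
    b"</meeting>"
)

# Both calls return the document element, skipping over the leading comment.
etree_root = etree.fromstring(SAMPLE)
objectify_root = objectify.fromstring(SAMPLE)

print(etree_root.tag)
print(objectify_root.tag)

# The comment is not lost: it is kept as a sibling that precedes the root.
leading = etree_root.getprevious()
print(leading.text)
```

If the real file still fails under etree, the difference is probably in how the file is loaded rather than in the comment itself.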

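The comment should not cause any problem for etree, and it does not on my machine with either Python 2 or Python 3. If you do want to remove comments, though, you can pass

parser=et.HTMLParser(remove_comments=True)

or

parser=et.XMLParser(remove_comments=True)

depending on what you need: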

import lxml.etree as et

# XMLParser because test.xml is XML; et.HTMLParser takes the same option for HTML input.
x = et.parse("test.xml", parser=et.XMLParser(remove_comments=True))
print(et.tostring(x))
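
To see what remove_comments actually changes, here is a small sketch that parses the same bytes with and without the option. The DOC string and the serialize helper are hypothetical stand-ins for test.xml, used only so the example runs without a file on disk.

```python
from lxml import etree

# Hypothetical in-memory document standing in for test.xml.
DOC = b"<!-- leading comment --><meeting id='1'><!-- inner --><race/></meeting>"


def serialize(data, remove_comments):
    """Parse data with an XMLParser and return the serialized tree as bytes."""
    parser = etree.XMLParser(remove_comments=remove_comments)
    root = etree.fromstring(data, parser=parser)
    # Serializing the ElementTree (not just the root) includes top-level comments.
    return etree.tostring(root.getroottree())


with_comments = serialize(DOC, remove_comments=False)
without_comments = serialize(DOC, remove_comments=True)

print(with_comments)
print(without_comments)
```

Both the comment before the root element and the one nested inside it are dropped when remove_comments=True; the elements themselves are unaffected.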

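The XML parser should correctly recognise a comment node that precedes the document element tag... this may be a bug in lxml... Could you include the code you use to load the file, and a few more lines of the XML file itself?

Are you sure your XML is valid? A properly closed comment should not be a problem.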
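
One way to check the point about validity is to try parsing and catch the syntax error. The sketch below is only an illustration: is_well_formed is a hypothetical helper, and the two byte strings are made-up inputs contrasting a properly closed comment with an unterminated one.

```python
from lxml import etree


def is_well_formed(data):
    """Return (True, None) if data parses as XML, else (False, error message)."""
    try:
        etree.fromstring(data)
    except etree.XMLSyntaxError as exc:
        return False, str(exc)
    return True, None


closed = b"<!-- notice --><meeting/>"   # comment is terminated with -->
unclosed = b"<!-- notice <meeting/>"    # comment is never terminated

print(is_well_formed(closed))
print(is_well_formed(unclosed))
```

Running the same helper on the bytes of the real file should show whether the parser objects to the comment or to something later in the document.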

from lxml import etree
from lxml import objectify
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("path", type=str, nargs="+")
parser.add_argument('-e',
                    '--extension',
                    default='',
                    help='File extension to filter by.')

args = parser.parse_args()
my_dir = args.path[0]


def getsMeet(top_dir, extension):
    """Yield the full path of every file under top_dir that ends with extension."""
    for dir_path, subdir_list, file_list in os.walk(top_dir):
        for filename in sorted(file_list):
            if filename.endswith(extension):
                # os.path.join adds the separator that my_dir + filename lacked
                yield os.path.join(dir_path, filename)


def parseXML():
    """
    from mouse parsing a file with objectify
    http://www.blog.pythonlibrary.org/2012/06/06/parsing-xml-with-python-using-lxml-objectify/
    """
    for file in getsMeet(my_dir, args.extension):
        # Read bytes: lxml rejects a str that carries an XML encoding declaration
        with open(file, "rb") as f:
            xml = f.read()

        root = objectify.fromstring(xml)
        print(root.tag)
        # print(objectify.dump(root))
        # Note: this builds a new, empty <race> element; it does not look one up in root
        race = objectify.Element("race")
        print(objectify.dump(race))


parseXML()
