How to correctly parse XML comments in Python


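I have been working with Python recently, and I want to extract information from a given XML file. The problem is that the information is stored very badly, in the following format: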

<Content>
   <tags>
   ....
   </tags>
<![CDATA["string1"; "string2"; ....
]]>
</Content>

However, my output is empty. How can I retrieve the comment data? (Again, I am using Python.)

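The problem is that your comments do not appear to be standard. A standard XML comment is delimited by <!-- and -->. Comments of that form can be parsed with BeautifulSoup, for example: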

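from bs4 import BeautifulSoup, Comment

xml = """<Content>
   <tags>
   ...
   </tags>
<!--[CDATA["string1"; "string2"; ....]]-->
</Content>"""

# Name the parser explicitly; html.parser ships with Python
soup = BeautifulSoup(xml, "html.parser")
# Keep only the text nodes that are comments
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
print(comments)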

This returns:

['[CDATA["string1"; "string2"; ....]]']

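You need to build a SAX-based parser rather than a DOM-based one, especially for a file as large as yours.

A SAX-based parser requires you to write your own control logic describing how the data is stored. That is more work than simply loading the file into a DOM, but it is much faster, because the document is read incrementally instead of being loaded all at once. It also has the advantage that it can handle awkward cases like yours, where the data sits in comments.

When building your handler, you will probably want to use the hook the parser provides for comments to extract them.

I would give you a working example of how to build one, but it has been a long time since I did this myself. There are plenty of guides online on building SAX-based parsers, so I will leave that discussion to another thread.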
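
Since the answer above stops short of an example, here is a minimal sketch of the idea using the standard library's xml.sax. It assumes the hook meant is the SAX lexical-handler property, which the default expat-based reader supports. The class name CommentAndCDATACollector and the function collect are hypothetical names chosen for illustration. The same object serves as content handler and lexical handler, so it can record both real comments and the text of CDATA sections like the one in the question.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, property_lexical_handler


class CommentAndCDATACollector(ContentHandler):
    """Hypothetical helper: records comments and CDATA section text."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.cdata_sections = []
        self._in_cdata = False
        self._buffer = []

    # ContentHandler callback; text may arrive in several chunks
    def characters(self, content):
        if self._in_cdata:
            self._buffer.append(content)

    # Lexical-handler callbacks (the reader looks up all five of these)
    def comment(self, content):
        self.comments.append(content)

    def startCDATA(self):
        self._in_cdata = True
        self._buffer = []

    def endCDATA(self):
        self._in_cdata = False
        self.cdata_sections.append("".join(self._buffer))

    def startDTD(self, name, public_id, system_id):
        pass

    def endDTD(self):
        pass


def collect(xml_text):
    """Return (comments, cdata_sections) found in xml_text."""
    handler = CommentAndCDATACollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    # Assumption: comments and CDATA boundaries are reported via this property
    parser.setProperty(property_lexical_handler, handler)
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return handler.comments, handler.cdata_sections
```

Calling collect() on a document returns two lists, one with the comment texts and one with the raw contents of each CDATA section; splitting the latter on ";" would be a separate step.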

With Python 3.8, you can have comments inserted into the element tree. Sample code for reading attributes, values, tags, and comments from XML:

import sys
import xml.etree.ElementTree as ET

# Path of the XML file; assumed here to be the first command-line argument
infile_path = sys.argv[1]

# insert_comments=True keeps comment nodes in the tree (requires Python 3.8+)
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
tree = ET.parse(infile_path, parser)

comment = ""  # most recent comment seen so far

for node in tree.iter():
    if node.tag is ET.Comment:
        # Comment nodes use the ET.Comment factory function as their tag
        comment = node.text
    else:
        tag = node.tag                  # element name
        name = node.attrib.get("name")  # value of the "name" attribute, or None
        value = node.text               # text directly inside the element
        print(tag, name, value, comment)
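
For the layout in the question, where the data sits in a CDATA section rather than in a comment, ElementTree already exposes the text: character data that follows a closing tag, CDATA included, is stored in that element's tail attribute. The sketch below assumes the CDATA section comes directly after </tags> and that the strings are separated by semicolons; the function name cdata_after_tags is hypothetical.

```python
import xml.etree.ElementTree as ET


def cdata_after_tags(xml_text):
    """Return the quoted strings that follow </tags> as a list."""
    root = ET.fromstring(xml_text)
    tags = root.find("tags")
    # Text after </tags>, including CDATA content, ends up in .tail
    raw = (tags.tail or "").strip()
    # Split on ';' and drop surrounding whitespace and double quotes
    return [part.strip().strip('"') for part in raw.split(";") if part.strip()]


sample = """<Content>
   <tags>
   asd
   </tags>
<![CDATA["string1"; "string2"]]>
</Content>"""

print(cdata_after_tags(sample))
```

This simple split would break if a string itself contained a semicolon; in that case a regular expression that matches whole quoted strings would be a safer choice.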

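If you only need a single line, consider opening the file and searching for that line with string or regex functions.

It is not a single line; as I said, I have 20,000 lines as list elements.

Even so, for a huge XML document, consider a strategy of opening and reading the file directly, because creating, parsing, and walking a DOM can be very time-consuming.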

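import re

xml = """<Content>
   <tags>
   asd
   </tags>
<![CDATA["string1"; "string2"; ....]]>
</Content>"""

# Match each CDATA section; re.DOTALL lets a section span several lines
for block in re.findall(r"<!\[CDATA\[(.*?)\]\]>", xml, re.DOTALL):
    # Non-greedy match, so each quoted string is printed separately
    for item in re.findall(r'".*?"', block):
        print(item)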
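
As a middle ground between walking a finished tree and raw string searching, the standard library's ET.iterparse can report standard comments while the file is still being read. This is a different technique from the regex approach above, shown only as a sketch; it requires Python 3.8+ for the "comment" event, and the function name iter_comments is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_comments(source):
    """Yield the text of each comment as the document is read."""
    # source may be a file name or a binary file object
    for _event, node in ET.iterparse(source, events=("comment",)):
        yield node.text


sample = io.BytesIO(b"<Content><tags>asd</tags><!-- first --><!-- second --></Content>")
print(list(iter_comments(sample)))
```

Note that iterparse still builds the element tree in the background, so for very large files you would also want to listen for "end" events and clear elements you no longer need.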
