How to efficiently extract <![CDATA[]> content from XML with Python?

python xml python-2.7 pandas

I have the following XML:

<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
    <document><![CDATA["@username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING       ]]></document>
    <document><![CDATA[Ugh      ]]></document>
    <document><![CDATA[YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt       ]]></document>
    <document><![CDATA[@username Shout out to me????        ]]></document>
</author>

This is what I tried:

from bs4 import BeautifulSoup
x='/Users/user/PycharmProjects/TratandoDeMejorarPAN/test.xml'
y = BeautifulSoup(open(x), 'xml')
out = [y.author.document]
print out

This is the output:

[<document>"@username: That boner came at the wrong time ???? http://t.co/5XgDyCaCjR" HELP I'M DYING        </document>]

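The problem with this output is that I should not get the <document></document> tags. How can I remove the tags and get all the elements of this XML in a list?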

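There are several mistakes here. (Asking about which library to choose is against the rules here, so I am ignoring that part of the question.)

You need to pass in a file handle, not a file name.

That is: y = BeautifulSoup(open(x))

You need to tell BeautifulSoup that it is dealing with XML.

That is: y = BeautifulSoup(open(x), 'xml')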

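CDATA sections do not create elements. You cannot search for them in the DOM, because they do not exist in the DOM; they are just syntactic sugar. Simply look at the text directly under the document element, and do not try to search for anything named CDATA.

To state it again, a bit differently: <doc><![CDATA[foo]]></doc> is exactly the same as <doc>foo</doc>. What is different about a CDATA section is that everything inside it is automatically escaped, meaning that <![CDATA[<hello>]]> is interpreted as &lt;hello&gt;. However, you cannot tell from the parsed object tree whether the document contained a CDATA section with literal < and > or a raw text section with &lt; and &gt;. This is by design, and it is true of any compliant XML DOM implementation.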
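The equivalence described above can be checked directly. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree rather than BeautifulSoup (an assumption made so that it runs without third-party packages). The element name doc and the sample strings are illustrative only.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same document: one uses a CDATA section,
# the other uses entity-escaped text.
with_cdata = "<doc><![CDATA[<hello> & goodbye]]></doc>"
with_escapes = "<doc>&lt;hello&gt; &amp; goodbye</doc>"

cdata_text = ET.fromstring(with_cdata).text
escaped_text = ET.fromstring(with_escapes).text

# After parsing, both yield the same plain string; the tree keeps no
# record of which spelling was used in the source.
print(cdata_text == escaped_text)
print(cdata_text)
```

Both variables hold the string <hello> & goodbye, so there is nothing CDATA-specific left to search for once the document has been parsed.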

Now, how about some code that actually works:

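import bs4

# Sample document. The string starts right after the opening quotes so
# that the XML declaration sits at the very beginning of the input.
doc = """<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
    <document><![CDATA["@username: That came at the wrong time ????" HELP I'M DYING       ]]></document>
    <document><![CDATA[Ugh      ]]></document>
    <document><![CDATA[YES !!!! WE GO FOR IT.       ]]></document>
    <document><![CDATA[@username Shout out to me????        ]]></document>
</author>
"""

# The 'xml' feature selects BeautifulSoup's XML parser (it requires lxml).
doc_el = bs4.BeautifulSoup(doc, 'xml')
# CDATA content is ordinary element text, so .text returns it directly.
print([el.text for el in doc_el.find_all('document')])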

If you want to read from a file, replace doc with open(filename, 'r').
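For the file-based case, here is a hedged sketch of the same extraction that uses the standard library's xml.etree.ElementTree instead of BeautifulSoup. The function name extract_documents and the temporary file are hypothetical, added only so that the example is self-contained. Stripping the trailing whitespace is an extra step that the code above does not perform.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
    <document><![CDATA[Ugh      ]]></document>
    <document><![CDATA[@username Shout out to me????        ]]></document>
</author>
"""


def extract_documents(path):
    """Return the stripped text of every <document> element in the file."""
    root = ET.parse(path).getroot()  # root is the <author> element
    # iter() walks all descendants named 'document'; CDATA content is
    # exposed as ordinary element text.
    return [(el.text or "").strip() for el in root.iter("document")]


# Write the sample to a temporary file so the sketch can run anywhere.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(SAMPLE)
    tmp_path = tmp.name

try:
    texts = extract_documents(tmp_path)
finally:
    os.remove(tmp_path)

print(texts)
```

With a real file, only the call extract_documents(path) is needed; the temporary-file handling exists purely for the demonstration.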

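Modern BeautifulSoup already uses lxml under the hood. Anyhow, can you usefully quantify what you mean by "efficiently"? That said, CDATA does not create separate elements at all; it is just syntactic sugar for escaped text. So you want to read that text directly out of the document.

@charlesduff Thanks for the feedback. Since I have a lot of large XML files like this, by "efficiently" I mean fast.

So, only CPU performance matters, not memory usage?

You need to use BeautifulSoup(x, 'xml').

I don't understand what is happening here: user_content = [el.text for el in doc_el.findAll('document')]. Could you provide some explanation? Why use .text, and how is .findAll being used?

From help(el), under the data descriptors inherited from bs4.element.Tag, see: text - Get all child strings, concatenated using the given separator.

So the trick is to tell BeautifulSoup to parse the data as XML; then it works and the CDATA content can be extracted. If you use LXML, it will not work, and most tutorials have you use LXML. I already had some working code, so I created a new object using XML and was able to get it to show the comments: comment_content = bs(content, 'xml'), then for el in comment_content.findAll('txt'): print 'comment->', el.text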
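On the speed and memory questions raised in the comments, one common approach for large files is incremental parsing. The sketch below uses xml.etree.ElementTree.iterparse from the standard library, not BeautifulSoup. The helper name iter_document_texts is hypothetical, and no performance figures are claimed here, so it would need to be benchmarked against real data.

```python
import io
import xml.etree.ElementTree as ET


def iter_document_texts(source):
    """Yield the stripped text of each <document> element, one at a time.

    source may be a filename or a binary file object.
    """
    # With the default "end" events, each element is delivered once it
    # has been fully parsed, so its text is complete.
    for _event, elem in ET.iterparse(source):
        if elem.tag == "document":
            yield (elem.text or "").strip()
            elem.clear()  # drop the element's content to limit memory use


sample = io.BytesIO(
    b'<author id="user23">'
    b"<document><![CDATA[Ugh      ]]></document>"
    b"<document><![CDATA[YES !!!! WE GO FOR IT.       ]]></document>"
    b"</author>"
)
texts = list(iter_document_texts(sample))
print(texts)
```

Because the function is a generator, callers can process each text as it arrives instead of building the whole list first, which may matter when the files are large.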
