How to parse XML with Python

python xml parsing python-3.x

I want to parse XML from a website. Can anyone help me?

This is the XML; I only want to get the information from it:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
<loc>
http://www.habergazete.com/haber-detay/1/69364/cAYKUR-3-bin-500-personel-alimi-yapacagini-duyurdu-cAYKUR-3-bin-500-personel-alim-sarlari--2015-01-29.html
</loc>
<news:news>
<news:publication>
<news:name>Haber Gazete</news:name>
<news:language>tr</news:language>
</news:publication>
<news:publication_date>2015-01-29T15:04:01+02:00</news:publication_date>
<news:title>
ÇAYKUR 3 bin 500 personel alımı yapacağını duyurdu! (ÇAYKUR 3 bin 500 personel alım şarları)
</news:title>
</news:news>
<image:image>
<image:loc>
http://www.habergazete.com/resimler/haber/haber_detay/611x395-alim-54c8f335b176e-1422536816.jpg
</image:loc>
</image:image>
</url>

Any suggestions?

Thanks :)

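First, the XML you show is not well-formed, so parsing it should raise an exception: it is missing the final closing </urlset> tag. I suspect you simply did not show us the actual XML you are trying to parse.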
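To see that failure concretely, here is a minimal sketch. The name truncated, the helper is_well_formed and the shortened sample are stand-ins for illustration, not the asker's actual data; the only library call used is fromstring from the standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as et

# Shortened stand-in for the XML in the question: <urlset> is opened
# but never closed, just like the sample as posted.
truncated = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>http://www.habergazete.com/</loc></url>'
)

def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        et.fromstring(text)
    except et.ParseError:
        return False
    return True

print(is_well_formed(truncated))                # False
print(is_well_formed(truncated + '</urlset>'))  # True
```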

Once you fix that (for example, if the XML data really is truncated somehow, by parsing xmlData + '</urlset>'), you run into a namespace problem, which is easy to show:

>>> import xml.etree.ElementTree as et
>>> root = et.fromstring(xmlData + '</urlset>')
>>> et.tostring(root)
b'<ns0:urlset xmlns:ns0="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:ns1="http://www.google.com/schemas/sitemap-news/0.9" xmlns:ns2="http://www.google.com/schemas/sitemap-image/1.1">\n<ns0:url>\n<ns0:loc>\nhttp://www.habergazete.com/haber-detay/1/69364/cAYKUR-3-bin-500-personel-alimi-yapacagini-duyurdu-cAYKUR-3-bin-500-personel-alim-sarlari--2015-01-29.html\n</ns0:loc>\n<ns1:news>\n<ns1:publication>\n<ns1:name>Haber Gazete</ns1:name>\n<ns1:language>tr</ns1:language>\n</ns1:publication>\n<ns1:publication_date>2015-01-29T15:04:01+02:00</ns1:publication_date>\n<ns1:title>\n&#199;AYKUR 3 bin 500 personel al&#305;m&#305; yapaca&#287;&#305;n&#305; duyurdu! (&#199;AYKUR 3 bin 500 personel al&#305;m &#351;arlar&#305;)\n</ns1:title>\n</ns1:news>\n<ns2:image>\n<ns2:loc>\nhttp://www.habergazete.com/resimler/haber/haber_detay/611x395-alim-54c8f335b176e-1422536816.jpg\n</ns2:loc>\n</ns2:image>\n</ns0:url></ns0:urlset>'
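The ns0, ns1 and ns2 prefixes in that output are only how ElementTree serializes namespaced tags. Internally each tag is stored as {namespace-uri}localname, which is why a search for the bare name comes back empty. A small sketch, assuming a shortened stand-in document (sample and bare_matches are illustrative names):

```python
import xml.etree.ElementTree as et

# Shortened, well-formed stand-in for the sitemap in the question.
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>http://www.habergazete.com/</loc></url>'
    '</urlset>'
)
root = et.fromstring(sample)

# The tag carries the namespace URI in braces in front of the local name.
print(root.tag)
# {http://www.sitemaps.org/schemas/sitemap/0.9}urlset

# Searching for the bare name matches nothing: no element is called just "loc".
bare_matches = root.findall('.//loc')
print(bare_matches)  # []
```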

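Search the tree recursively for the namespace-qualified tag instead, and you finally find the element you were looking for.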

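By the way, some of us find BeautifulSoup easier to use for XML parsing tasks when we do not need the extra speed of etree or lxml.

Try print(xmlData) first? Make sure you are actually getting the data.

I am sure, I can get all the data :) The XML looks like this; on the next line there is another news item and it goes on like that, which is why I did not paste it. If I put "</urlset>" at the end of this XML it would be the same, you can imagine it that way. This is an RSS XML that shows me a lot of news, so I did not want to put all of the XML here.

@ufuk.dogan, OK, but you should mention that little detail in the text of your Q. Trimming an example to the minimum needed to reproduce the problem is fine, indeed wise, but doing so without saying so, and in a way that introduces errors (e.g. malformed XML), is not, because it adds to the answerers' burden of noticing, diagnosing and fixing the problem. Anyway, I went ahead and showed your namespace problem, and that you need some xpath syntax to search the tree recursively, and fixed both of those as well as the faulty truncation, showing a working solution.

Thanks :) This code works well, this is very helpful :D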

The search for <ns0:loc>, written with its full namespace URI:

>>> root.findall('.//{http://www.sitemaps.org/schemas/sitemap/0.9}loc')
[<Element '{http://www.sitemaps.org/schemas/sitemap/0.9}loc' at 0x1022a50e8>]
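Putting the pieces of the answer together, here is a hedged sketch of pulling the URL text out of every loc element. It swaps the brace notation for the namespaces argument of findall, which is equivalent but easier to read. The function name extract_locs, the sm prefix, and the page.html and picture.jpg URLs are made up for illustration; only the namespace URIs come from the question.

```python
import xml.etree.ElementTree as et

# Prefix-to-URI mapping. The prefix "sm" is our own choice; only the URI
# has to match the default namespace declared in the document.
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def extract_locs(xml_text):
    """Return the stripped text of every sitemap-namespace <loc> element."""
    root = et.fromstring(xml_text)
    return [
        (loc.text or '').strip()  # the sample wraps URLs in newlines
        for loc in root.findall('.//sm:loc', namespaces=NS)
    ]

# Hypothetical, shortened document in the same shape as the question's XML.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
<loc>
http://www.habergazete.com/page.html
</loc>
<image:image>
<image:loc>
http://www.habergazete.com/picture.jpg
</image:loc>
</image:image>
</url>
</urlset>"""

print(extract_locs(sample))
# ['http://www.habergazete.com/page.html']
```

The image:loc element is not returned even though the search is recursive, because it lives in the image namespace rather than the sitemap one.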