Parsing text in XML nodes with Python

python xml python-3.x

I am trying to extract the URLs from a sitemap.

I have unzipped the .xml.gz file and saved it as an .xml file. The structure looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xhtml="http://www.w3.org/1999/xhtml" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
    <loc>https://www.bestbuy.com/</loc>
    <priority>0.0</priority>
</url>
<url>
    <loc>https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008</loc>
    <priority>0.0</priority>
</url>
<url>
    <loc>https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647</loc>
    <priority>0.0</priority>
</url>

import xml.etree.ElementTree as ET
tree = ET.parse('my_local_filepath')
root = tree.getroot()

value = root.findall(".//loc")

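However, nothing gets loaded into value. My goal is to extract all the URLs between the loc nodes and print them to a new flat file. Where am I going wrong?

We can iterate over the URLs, collect them in a list, and write them to a file like this: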

from xml.etree import ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

# Every element in the sitemap belongs to this default namespace,
# so each tag name must be qualified with it.
name_space = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

# Collect the text of every <loc> that sits directly inside a <url>.
urls = []
for block in root.iter('{}url'.format(name_space)):
    for url in block.findall('{}loc'.format(name_space)):
        urls.append('{}\n'.format(url.text))

# Write one URL per line.
with open('sample_urls.txt', 'w') as f:
    f.writelines(urls)

Note that we need to prepend the namespace from the opening urlset definition to each tag name in order to parse the XML correctly.
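To see why the unqualified `.//loc` search in the question comes back empty, it can help to look at the tag names ElementTree actually stores. This is only a minimal sketch: the `SAMPLE` string is a shortened stand-in for the sitemap above, not the real file.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the sitemap in the question (illustrative assumption).
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
    <loc>https://www.bestbuy.com/</loc>
    <priority>0.0</priority>
</url>
</urlset>"""

root = ET.fromstring(SAMPLE)

# The default namespace is folded into every tag as "{uri}name".
tags = [elem.tag for elem in root.iter()]
print(tags)

# An unqualified name matches nothing...
plain = root.findall(".//loc")
# ...while the fully qualified name does match.
qualified = root.findall(".//{http://www.sitemaps.org/schemas/sitemap/0.9}loc")
print(len(plain), len(qualified))
```

Printing `root.tag` on the real file is a quick way to confirm which namespace URI, if any, needs to be supplied.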

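Your attempt was close, but as mzjn said in the comments, you did not account for the default namespace (xmlns="http://www.sitemaps.org/schemas/sitemap/0.9").

Here is an example of how to account for the namespace: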

import xml.etree.ElementTree as ET
tree = ET.parse('my_local_filepath')
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for elem in tree.findall(".//sm:loc", ns):
    print(elem.text)

Output:

https://www.bestbuy.com/
https://www.bestbuy.com/site/3d-printers/3d-printer-filament/pcmcat335400050008.c?id=pcmcat335400050008
https://www.bestbuy.com/site/3d-printers/3d-printing-accessories/pcmcat748300527647.c?id=pcmcat748300527647

Note that I used the namespace prefix sm, but you could use any prefix.

See the ElementTree documentation for more information on parsing XML with namespaces.
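The prefix is only a local alias for the namespace URI, so any name in the mapping resolves to the same elements. The sketch below checks that with two different prefixes and also tries the `{*}` wildcard, which ElementTree accepts from Python 3.8 onward. The `SAMPLE` string and the prefix names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the sitemap in the question (illustrative assumption).
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://www.bestbuy.com/</loc><priority>0.0</priority></url>
</urlset>"""

SITEMAP_URI = "http://www.sitemaps.org/schemas/sitemap/0.9"
root = ET.fromstring(SAMPLE)

# Two different prefixes bound to the same URI select the same elements.
with_sm = [e.text for e in root.findall(".//sm:loc", {"sm": SITEMAP_URI})]
with_other = [e.text for e in root.findall(".//anything:loc", {"anything": SITEMAP_URI})]

# Python 3.8+: "{*}" matches the tag name in any namespace.
with_wildcard = [e.text for e in root.findall(".//{*}loc")]

print(with_sm == with_other == with_wildcard)
```

The wildcard form is convenient for quick scripts, while the explicit mapping is safer when a document mixes several namespaces that reuse the same tag name.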

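I know this is a bit of a zombie reply, but I actually just published a tool on GitHub that does exactly what you need. And it's in Python! So feel free to take whatever you need from the source (or use it as is). I figured I would comment here so that others who come across this thread will see it.

Here it is: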

It doesn't work; my URLs array is still empty. Could there be a formatting problem with the actual XML file I am trying to open? I am working with an .xml.gz file like the one I linked and decompressing it with GzipFile.

Right, I think I cut out some important information when making my test file; adding the namespace when parsing should help. I have updated my answer.

You are not taking the namespace into account.
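On the GzipFile step mentioned in the comments: `ET.parse` accepts an open file object, so one option is to read the compressed sitemap directly and write the URLs out in a single pass. This is a rough sketch only. The function name `extract_sitemap_urls` and the file paths are hypothetical and do not come from the original posts.

```python
import gzip
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(gz_path, out_path):
    """Read <loc> values from a gzipped sitemap and write one URL per line."""
    # gzip.open returns a binary file object that ET.parse can read directly.
    with gzip.open(gz_path, "rb") as source:
        tree = ET.parse(source)

    # Skip empty <loc> elements and trim surrounding whitespace.
    urls = [loc.text.strip() for loc in tree.findall(".//sm:loc", NS) if loc.text]

    with open(out_path, "w", encoding="utf-8") as out:
        out.writelines(url + "\n" for url in urls)
    return urls
```

A call such as `extract_sitemap_urls("sitemap.xml.gz", "sample_urls.txt")` would then produce the flat file the question asks for, assuming the sitemap uses the namespace shown above.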