Parsing XML with unusual tag names (atom:link) in Python

python xml

I am trying to parse the href out of the XML below. There are multiple workspace tags; I only show one here.

<workspaces>
  <workspace>
    <name>practice</name>
    <atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="alternate" href="https://www.my-geoserver.com/geoserver/rest/workspaces/practice.xml" type="application/xml"/>
  </workspace>
</workspaces>

If I search for workspace, objects are returned:

lst = tree.findall('workspace')
print(lst)

This results in:

[<Element 'workspace' at 0x039E70F0>, <Element 'workspace' at 0x039E71B0>, <Element 'workspace' at 0x039E7240>]

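However, none of my attempts managed to isolate the atom:link tag; in fact, the last one raised an error:

SyntaxError: prefix 'atom' not found in prefix map

How can I get all instances of the href from tags with these names?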

A simple solution I found:

>>> y=BeautifulSoup(x)
>>> y
<workspaces>
<workspace>
<name>practice</name>
<atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="alternate" href="https://www.my-geoserver.com/geoserver/rest/workspaces/practice.xml" type="application/xml">
</atom:link></workspace>
</workspaces>
>>> c = y.workspaces.workspace.findAll("atom:link")
>>> c
[<atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="alternate" href="https://www.my-geoserver.com/geoserver/rest/workspaces/practice.xml" type="application/xml">
</atom:link>]
>>>

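For anyone else who finds this question: the part before the colon (atom in this case) is a namespace prefix, and it is what causes the problem here. The solution is quite simple: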

import requests
from bs4 import BeautifulSoup

myUrl = 'https://www.my-geoserver.com/geoserver/rest/workspaces'
headers = {'Accept': 'text/xml'}
resp = requests.get(myUrl, auth=('admin', 'my_password'), headers=headers)
stuff = resp.text
# Passing "xml" selects the XML parser, which keeps the atom: prefix on tag names
to_parse = BeautifulSoup(stuff, "xml")

for item in to_parse.find_all("atom:link"):
    print(item)

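Thanks to Saket Mittal for pointing me to the BeautifulSoup library. The key is to pass "xml" as the parser argument to the BeautifulSoup function. Using "lxml" does not parse the namespaces correctly and ignores them.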
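For comparison, the error from the question can also be avoided without BeautifulSoup by giving ElementTree a prefix-to-URI map. The following is only a minimal sketch using the sample document from the question; the names XML_TEXT, NAMESPACES and extract_hrefs are made up for illustration and are not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, trimmed to a single workspace.
XML_TEXT = """<workspaces>
  <workspace>
    <name>practice</name>
    <atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="alternate" href="https://www.my-geoserver.com/geoserver/rest/workspaces/practice.xml" type="application/xml"/>
  </workspace>
</workspaces>"""

# Prefix-to-URI map. The prefix only has to match the one used in the
# search path below; the URI must match the xmlns:atom declaration.
NAMESPACES = {"atom": "http://www.w3.org/2005/Atom"}


def extract_hrefs(xml_text):
    """Return the href attribute of every atom:link inside a workspace."""
    root = ET.fromstring(xml_text)  # root is the <workspaces> element
    links = root.findall("workspace/atom:link", NAMESPACES)
    return [link.get("href") for link in links]


hrefs = extract_hrefs(XML_TEXT)
print(hrefs)
```

Without the second argument to findall, ElementTree has no way to resolve the atom prefix, which is what the "prefix 'atom' not found in prefix map" error reports.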

My output is simply []. It must have something to do with the format of resp.text, which as far as I know is just text. If I use y.workspaces.findAll("workspace") it works, but that is not what I want.
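When a lookup like this returns [], one way to narrow it down is to list the tag names the parser actually produced. The sketch below does that with the standard library's ElementTree rather than BeautifulSoup, so it is an assumption about how one might debug this, not a diagnosis of the comment above; SAMPLE, tags and hrefs are illustrative names, and SAMPLE stands in for resp.text.

```python
import xml.etree.ElementTree as ET

# Stand-in for resp.text, copied from the question's sample.
SAMPLE = """<workspaces>
  <workspace>
    <name>practice</name>
    <atom:link xmlns:atom="http://www.w3.org/2005/Atom" rel="alternate" href="https://www.my-geoserver.com/geoserver/rest/workspaces/practice.xml" type="application/xml"/>
  </workspace>
</workspaces>"""

root = ET.fromstring(SAMPLE)

# ElementTree stores namespaced tags as "{uri}localname", so printing
# every tag shows exactly what name a search has to match.
tags = [element.tag for element in root.iter()]
print(tags)

# Namespace-agnostic match: compare only the part after the closing brace.
hrefs = [
    element.get("href")
    for element in root.iter()
    if element.tag.rsplit("}", 1)[-1] == "link"
]
print(hrefs)
```

If the real response lists no tag ending in link at all, the problem lies in the response body rather than in the search expression.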