Python: extracting nested XML elements of different sizes into a DataFrame
Suppose we have an arbitrary XML document like the one below:
<?xml version="1.0" encoding="UTF-8"?>
<programs xmlns="http://something.org/schema/s/program">
<program xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://something.org/schema/s/program http://something.org/schema/s/program.xsd">
<orgUnitId>Organization 1</orgUnitId>
<requiredLevel>academic bachelor</requiredLevel>
<requiredLevel>academic master</requiredLevel>
<programDescriptionText xml:lang="nl">Here is some text; blablabla</programDescriptionText>
<searchword xml:lang="nl">Scrum master</searchword>
</program>
<program xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://something.org/schema/s/program http://something.org/schema/s/program.xsd">
<requiredLevel>bachelor</requiredLevel>
<requiredLevel>academic master</requiredLevel>
<requiredLevel>academic bachelor</requiredLevel>
<orgUnitId>Organization 2</orgUnitId>
<programDescriptionText xml:lang="nl">Text from another organization about some stuff.</programDescriptionText>
<searchword xml:lang="nl">Excutives</searchword>
</program>
<program xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<orgUnitId>Organization 3</orgUnitId>
<programDescriptionText xml:lang="nl">Also another huge text description from another organization.</programDescriptionText>
<searchword xml:lang="nl">Negotiating</searchword>
<searchword xml:lang="nl">Effective leadership</searchword>
<searchword xml:lang="nl">negotiating techniques</searchword>
<searchword xml:lang="nl">leadership</searchword>
<searchword xml:lang="nl">strategic planning</searchword>
</program>
</programs>
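The goal, of course, is to create a DataFrame. However, because each node in the XML file contains one or more of certain elements, such as requiredLevel or searchword, I am currently losing data when I convert the data to a DataFrame like this: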
# organization, description, level and searchword are flat lists built from the XML
df = pd.DataFrame(list(itertools.zip_longest(organization, description, level, searchword,
                                             fillvalue=np.nan)),
                  columns=dfcols)
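I also tried pd.Series, as suggested elsewhere, and other solutions, but could not get them to fit my case.
My best option seems to be not to use lists at all, since they do not appear to index the data correctly. That is, I lose the data from the second through the X-th child node. But now I am stuck and cannot see any other option.
My final result should look like this: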
organization    description  level               keyword
Organization 1  ....         academic bachelor,  Scrum master
                             academic master
Organization 2  ....         bachelor,           Executives
                             academic master,
                             academic bachelor
Organization 3  ....                             Negotiating,
                                                 Effective leadership,
                                                 negotiating techniques,
                                                 ....
A lightweight xml-to-dict converter is available. It can be improved to handle namespaces:
import io
from xml.etree import ElementTree


def xml_to_dict(xml='', remove_namespace=True):
    """Converts an XML string into a dict

    Args:
        xml: The XML as string
        remove_namespace: True (default) if namespaces are to be removed

    Returns:
        The XML string as dict

    Examples:
        >>> xml_to_dict('<text><para>hello world</para></text>')
        {'text': {'para': 'hello world'}}
    """
    def _xml_remove_namespace(buf):
        # Reference: https://stackoverflow.com/a/25920989/1498199
        it = ElementTree.iterparse(buf)
        for _, el in it:
            if '}' in el.tag:
                el.tag = el.tag.split('}', 1)[1]
        return it.root

    def _xml_to_dict(t):
        # Reference: https://stackoverflow.com/a/10077069/1498199
        from collections import defaultdict
        d = {t.tag: {} if t.attrib else None}
        children = list(t)
        if children:
            # Group children by tag; repeated tags become lists
            dd = defaultdict(list)
            for dc in map(_xml_to_dict, children):
                for k, v in dc.items():
                    dd[k].append(v)
            d = {t.tag: {k: v[0] if len(v) == 1 else v for k, v in dd.items()}}
        if t.attrib:
            d[t.tag].update(('@' + k, v) for k, v in t.attrib.items())
        if t.text:
            text = t.text.strip()
            if children or t.attrib:
                if text:
                    d[t.tag]['#text'] = text
            else:
                d[t.tag] = text
        return d

    buffer = io.StringIO(xml.strip())
    if remove_namespace:
        root = _xml_remove_namespace(buffer)
    else:
        root = ElementTree.parse(buffer).getroot()
    return _xml_to_dict(root)
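As a rough illustration of how the namespace-stripping idea carries over to the question's document, the sketch below removes the namespace prefix from every tag and then groups repeated child elements per program. The shortened sample XML and the name `program_rows` are assumptions made for this example, and the namespace is stripped with `Element.iter()` after parsing rather than with `iterparse` as in the function above.

```python
from collections import defaultdict
from xml.etree import ElementTree

# Shortened, made-up sample in the shape of the question's document.
SAMPLE_XML = """
<programs xmlns="http://something.org/schema/s/program">
  <program>
    <orgUnitId>Organization 1</orgUnitId>
    <requiredLevel>academic bachelor</requiredLevel>
    <requiredLevel>academic master</requiredLevel>
    <searchword>Scrum master</searchword>
  </program>
  <program>
    <orgUnitId>Organization 3</orgUnitId>
    <searchword>Negotiating</searchword>
    <searchword>leadership</searchword>
  </program>
</programs>
"""


def program_rows(xml_text):
    """Return one dict per <program>, joining repeated children with ', '."""
    root = ElementTree.fromstring(xml_text.strip())
    # Strip the '{namespace}' prefix from every tag.
    for el in root.iter():
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]
    rows = []
    for program in root:
        # Collect the text of each child under its tag name.
        grouped = defaultdict(list)
        for child in program:
            grouped[child.tag].append((child.text or '').strip())
        rows.append({tag: ', '.join(values) for tag, values in grouped.items()})
    return rows


rows = program_rows(SAMPLE_XML)
print(rows[0]['requiredLevel'])  # academic bachelor, academic master
print(rows[1]['searchword'])     # Negotiating, leadership
```

Each dict only contains the tags that actually occur in that program, so a list like `rows` could be handed to `pandas.DataFrame(rows)`, which fills the missing entries with NaN.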
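Consider building a list of dictionaries whose text values are collapsed into comma-separated strings, then pass the list to the pandas.DataFrame constructor: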
import xml.etree.ElementTree as ET
import pandas as pd

# 'programs.xml' is a placeholder for wherever the XML above is stored
root = ET.parse('programs.xml').getroot()

dicts = []
for node in root:
    orgs = ", ".join([org.text for org in node.findall('.//{http://something.org/schema/s/program}orgUnitId')])
    desc = ", ".join([desc.text for desc in node.findall('.//{http://something.org/schema/s/program}programDescriptionText')])
    lvls = ", ".join([lvl.text for lvl in node.findall('.//{http://something.org/schema/s/program}requiredLevel')])
    wrds = ", ".join([wrd.text for wrd in node.findall('.//{http://something.org/schema/s/program}searchword')])
    dicts.append({'organization': orgs, 'description': desc, 'level': lvls, 'keyword': wrds})
final_df = pd.DataFrame(dicts, columns=['organization','description','level','keyword'])
Output
print(final_df)
#    organization    description                                         level                                         keyword
# 0  Organization 1  Here is some text; blablabla                        academic bachelor, academic master            Scrum master
# 1  Organization 2  Text from another organization about some stuff.    bachelor, academic master, academic bachelor  Excutives
# 2  Organization 3  Also another huge text description from anothe...                                                 Negotiating, Effective leadership, negotiating...
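The fully qualified paths above repeat the namespace URI in every call. As a possible shortening, `findall` in `xml.etree.ElementTree` also accepts a prefix-to-URI mapping as its second argument. The sketch below assumes a made-up prefix `p`, a shortened sample document, and a hypothetical helper `joined`; it is not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Prefix 'p' is an arbitrary choice for this sketch; only the URI must match.
NS = {'p': 'http://something.org/schema/s/program'}

# Shortened, made-up sample in the shape of the question's document.
SAMPLE_XML = """
<programs xmlns="http://something.org/schema/s/program">
  <program>
    <orgUnitId>Organization 1</orgUnitId>
    <requiredLevel>academic bachelor</requiredLevel>
    <requiredLevel>academic master</requiredLevel>
    <searchword>Scrum master</searchword>
  </program>
</programs>
"""


def joined(node, tag):
    """Join the text of all direct children named `tag` with ', '."""
    return ", ".join(el.text for el in node.findall('p:' + tag, NS))


root = ET.fromstring(SAMPLE_XML.strip())
dicts = [
    {'organization': joined(node, 'orgUnitId'),
     'level': joined(node, 'requiredLevel'),
     'keyword': joined(node, 'searchword')}
    for node in root
]
print(dicts)
```

The resulting list has the same shape as `dicts` in the answer, so it could be passed to `pandas.DataFrame` in the same way.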
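After looking at the code for a while, I can't seem to get a grip on it. So I am wondering what kind of output Python gives you for df. In my case, I ran into several name and attribute errors. Also, following the first link you provided, I was able to convert the XML into a dictionary. However, running the provided for loop does not work.
Thanks for your comment, I'll have a look at it. I didn't realize that another module is imported. You don't have to follow the links; I have posted the implementation of xml_to_dict as well. I just want to make clear that this is not "my" code.
While both snippets offer a possible solution to my question, I have to admit that the last one is the one that works in my case. With the first answer, I kept running into several errors inside the function itself. The last answer, which I accepted, works well as a solution. There is one small issue that is easily fixed, though: if the data contains a NoneType value and an error is thrown, the line can be changed to desc = ", ".join([str(desc.text) for desc in node.findall('./{xml_path}Element')]).
Thanks for your support.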
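The `NoneType` problem mentioned in the last comment arises because `text` is `None` for an empty element, while `str.join` only accepts strings. The sketch below contrasts the `str(...)` workaround from the comment with an alternative that skips empty elements. The sample XML and both helper names are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

NS_URI = '{http://something.org/schema/s/program}'

# Made-up sample: the second <searchword> is empty, so its .text is None.
SAMPLE_XML = """
<programs xmlns="http://something.org/schema/s/program">
  <program>
    <searchword>Negotiating</searchword>
    <searchword/>
    <searchword>leadership</searchword>
  </program>
</programs>
"""


def join_with_str(node, tag):
    """Workaround from the comment: str() turns None into the text 'None'."""
    return ", ".join(str(el.text) for el in node.findall('.//' + NS_URI + tag))


def join_skipping_empty(node, tag):
    """Alternative: leave out elements that have no text at all."""
    return ", ".join(el.text for el in node.findall('.//' + NS_URI + tag) if el.text)


program = ET.fromstring(SAMPLE_XML.strip())[0]
print(join_with_str(program, 'searchword'))        # Negotiating, None, leadership
print(join_skipping_empty(program, 'searchword'))  # Negotiating, leadership
```

Which variant is preferable depends on whether a literal 'None' in the resulting DataFrame is acceptable.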