Python: Efficiently parsing inconsistent XML into a DataFrame
This question is about parsing inconsistent XML with the following structure:
<items>
    <item>
        <propertyA>1</propertyA>
        <propertyB>B</propertyB>
        <propertyC>2017</propertyC>
    </item>
    <item>
        <propertyB>BB</propertyB>
        <propertyD>D-2017</propertyD>
    </item>
    <item>
        <propertyE>E</propertyE>
        <propertyF>11:25</propertyF>
    </item>
</items>
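As you can surely see, I do this by appending a new pd.Series to the DataFrame for each item. This approach seems bulletproof (at least to me :D), and my data comes out consistent.
The problem is that with 100,000 items it is very inefficient and takes a long time.
What would you recommend?
Thank you for taking the time to look at my question. I am new to Python, so I would appreciate your patience.

Consider appending DataFrames rather than Series, using pd.concat (a fast row/column binding method), which fills in NaN wherever the columns of the DataFrames in the list do not align. In addition, the code below uses a different parse: each item is read into a dictionary, converted to a DataFrame inside the loop, and the resulting frames are concatenated at the end: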
import xml.etree.ElementTree as ET
import pandas as pd

xml_str = '''
<items>
    <item>
        <propertyA>1</propertyA>
        <propertyB>B</propertyB>
        <propertyC>2017</propertyC>
    </item>
    <item>
        <propertyB>BB</propertyB>
        <propertyD>D-2017</propertyD>
    </item>
    <item>
        <propertyE>E</propertyE>
        <propertyF>11:25</propertyF>
    </item>
</items>'''

def load_inconsistent_xml(xml):
    dfs = []
    root = ET.fromstring(xml)
    for child in root.iterfind('item'):
        # Map each child tag of this <item> to its text; absent tags are simply not keys
        inner = {}
        for grandchild in child.iterfind('./*'):
            inner[grandchild.tag] = grandchild.text
        # One single-row DataFrame per <item>
        dfs.append(pd.DataFrame([inner]))
    # pd.concat aligns the columns across frames and fills the gaps with NaN
    return pd.concat(dfs).reset_index(drop=True)

finaldf = load_inconsistent_xml(xml_str)
print(finaldf)
#   propertyA propertyB propertyC propertyD propertyE propertyF
# 0         1         B      2017       NaN       NaN       NaN
# 1       NaN        BB       NaN    D-2017       NaN       NaN
# 2       NaN       NaN       NaN       NaN         E     11:25
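
The approach above still constructs one DataFrame per item. A possible variation, sketched here as an assumption rather than as part of the original answer, is to collect every item into a single list of dictionaries and call pd.DataFrame once; pandas fills the keys missing from a row with NaN in the same way. The function name items_to_frame, its item_tag parameter, and the sample variable are hypothetical names chosen for illustration. This is likely to scale better for 100,000 items because the per-frame construction cost is paid only once, but it has not been timed here.

```python
import xml.etree.ElementTree as ET
import pandas as pd


def items_to_frame(xml, item_tag="item"):
    """Collect one dict per item element, then build the DataFrame in one call."""
    root = ET.fromstring(xml)
    rows = [
        # Iterating an Element yields its direct children
        {field.tag: field.text for field in item}
        for item in root.iterfind(item_tag)
    ]
    # Keys that a row lacks become NaN in the resulting frame
    return pd.DataFrame(rows)


sample = """<items>
  <item><propertyA>1</propertyA><propertyB>B</propertyB><propertyC>2017</propertyC></item>
  <item><propertyB>BB</propertyB><propertyD>D-2017</propertyD></item>
  <item><propertyE>E</propertyE><propertyF>11:25</propertyF></item>
</items>"""

df = items_to_frame(sample)
print(df)
```

All values are read as strings, so numeric or date columns would still need converting afterwards, for example with pd.to_numeric.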