Parsing XML with Python and building a data frame

python xml

I have an XML file structured as follows. I want to extract the attributes of pa, pb, and pc and save them to separate data frames.

<root0>
…
<root1>
…
<root2>
…
<root3>
    <class>
        <pa>
            <attributes>
            <a1>70</a1>
            <a2>1</a2>
            </attributes>
        </pa>
    </class>
    
    <class>
        <pb>
            <attributes>
            <b1>xx</b1>
            <b2>xx</b2>
            </attributes>
        </pb>
    </class>
    
    <class>
        <pc>
            <attributes>
            <c1>yy</c1>
            <c2>yy</c2>
            </attributes>
        </pc>
    </class>
    
    …..

But it only returns "None" instead of "a1" or "a2".

Can anyone help explain this?

Thanks

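Here is my solution, though it does not look very elegant:

# Map a prefix to the namespace URI (the URI is written without braces)
ns = {'n': 'http://www.xxx.yyy'}

# root[3] is the fourth child of the parsed root element
for item in root[3].findall('n:class', ns):
    for i in item.findall('n:pa', ns):
        j = i.find('n:attributes/n:a1', ns)  # search inside the current <pa>
        print(j.text)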
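
To see why an unqualified lookup comes back empty, here is a minimal, self-contained sketch using the standard library's xml.etree.ElementTree. The sample document and the variable names are assumptions made for illustration; only the namespace URI is taken from the post.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a default namespace is declared on the root element.
xml = '''<root3 xmlns="http://www.xxx.yyy">
    <class>
        <pa>
            <attributes>
                <a1>70</a1>
                <a2>1</a2>
            </attributes>
        </pa>
    </class>
</root3>'''

root = ET.fromstring(xml)
ns = {'n': 'http://www.xxx.yyy'}  # prefix -> namespace URI, without braces

# Without the namespace the tag names do not match, so find() returns None.
plain = root.find('class/pa/attributes/a1')

# With the prefix map, every step of the path is namespace-qualified.
qualified = root.find('n:class/n:pa/n:attributes/n:a1', ns)

print(plain)           # None
print(qualified.text)  # 70
```

The prefix 'n' is arbitrary; what matters is that it maps to the same URI the document declares.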

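It really depends on the structure of the XML document and on the structure of the data frame you want to build.

When all elements under "class" are mandatory and none are missing, you can use an XPath approach: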

# here we define xml
xml = '''<root3 xmlms="http://www.xxx.yyy">
    <class>
        <pa>
            <attributes>
            <a1>70</a1>
            <a2>1</a2>
            </attributes>
        </pa>
    </class>
    
    <class>
        <pb>
            <attributes>
            <b1>xx</b1>
            <b2>xx</b2>
            </attributes>
        </pb>
    </class>
    
    <class>
        <pc>
            <attributes>
            <c1>yy</c1>
            <c2>yy</c2>
            </attributes>
        </pc>
    </class>
    </root3>'''

The notation *[local-name()="class"] makes it possible to ignore the namespace.

The result in this case will be:

{'a1': ['70'], 'a2': ['1'], 'b1': ['xx'], 'b2': ['xx'], 'c1': ['yy'], 'c2': ['yy']}
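
As a rough illustration of namespace-agnostic matching, the sketch below builds the same kind of dictionary with the standard library instead of an XPath local-name() query. It swaps in the '{*}' tag wildcard of xml.etree.ElementTree, which assumes Python 3.8 or later; the sample XML (with a real xmlns declaration) and the hand-written list of names are assumptions, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with a default namespace on the root element.
xml = '''<root3 xmlns="http://www.xxx.yyy">
    <class><pa><attributes><a1>70</a1><a2>1</a2></attributes></pa></class>
    <class><pb><attributes><b1>xx</b1><b2>xx</b2></attributes></pb></class>
    <class><pc><attributes><c1>yy</c1><c2>yy</c2></attributes></pc></class>
</root3>'''

root = ET.fromstring(xml)

# Column names are listed by hand here (an assumption for this sketch).
names = ['a1', 'a2', 'b1', 'b2', 'c1', 'c2']

# '{*}name' matches <name> in any namespace, or in no namespace at all.
d = {
    name: [el.text for el in root.findall('.//{*}' + name)]
    for name in names
}

print(d)
```

Like the XPath variant, this only lines up into a rectangular table when every name occurs the same number of times.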

A data frame can easily be built from this dictionary:

import pandas as pd

df = pd.DataFrame(d)

df.head()

Output: a data frame with a single row and the columns a1, a2, b1, b2, c1, c2.

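If some of the elements a1, a2, b1, b2, c1, c2, etc. may be missing under "attributes", this approach may not work well: it retrieves lists of different lengths, and a data frame cannot be built from such lists. In that case, iterating over the elements is the preferred approach: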

from lxml import etree as et
from collections import defaultdict

root = et.fromstring(xml)

d = defaultdict(list)

for _class in root.findall('class', root.nsmap):
    print(_class)
    a1 = _class.find('.//a1',root.nsmap)
    d['a1'].append(None if a1 is None else a1.text)
    a2 = _class.find('.//a2',root.nsmap)
    d['a2'].append(None if a2 is None else a2.text)
    b1 = _class.find('.//b1',root.nsmap)
    d['b1'].append(None if b1 is None else b1.text)
    b2 = _class.find('.//b2',root.nsmap)
    d['b2'].append(None if b2 is None else b2.text)
    c1 = _class.find('.//c1',root.nsmap)
    d['c1'].append(None if c1 is None else c1.text)
    c2 = _class.find('.//c2',root.nsmap)
    d['c2'].append(None if c2 is None else c2.text)
        
print(d)

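This gives the following contents for d:

{'a1': ['70', None, None], 'a2': ['1', None, None], 'b1': [None, 'xx', None], 'b2': [None, 'xx', None], 'c1': [None, None, 'yy'], 'c2': [None, None, 'yy']}

Every list now has one entry per "class" element, so pd.DataFrame(d) yields three rows, with None wherever an element is absent.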
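
The repeated find/append pairs can be folded into a loop over the expected names. The following is a minimal sketch of that idea using the standard library; the un-namespaced sample XML and the names list are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample without a namespace, to keep the paths short.
xml = '''<root3>
    <class><pa><attributes><a1>70</a1><a2>1</a2></attributes></pa></class>
    <class><pb><attributes><b1>xx</b1><b2>xx</b2></attributes></pb></class>
    <class><pc><attributes><c1>yy</c1><c2>yy</c2></attributes></pc></class>
</root3>'''

root = ET.fromstring(xml)
names = ['a1', 'a2', 'b1', 'b2', 'c1', 'c2']  # expected columns (assumed)

d = defaultdict(list)
for _class in root.findall('class'):
    for name in names:
        el = _class.find('.//' + name)
        # Append a placeholder when the element is absent, so that
        # every list ends up with exactly one entry per <class>.
        d[name].append(None if el is None else el.text)

# All lists share the same length, which a data frame requires.
lengths = {len(values) for values in d.values()}
print(dict(d))
print(lengths)
```

Keeping the column names in one list also means a new element only has to be added in one place.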

The dictionary keys can also be filled in dynamically from the attribute names:

from lxml import etree as et
from collections import defaultdict

root = et.fromstring(xml)

d = defaultdict(list)

for _class in root.findall('class', root.nsmap):
    for attribute in _class.findall('.//attributes/*',root.nsmap):
        d[attribute.tag].append(attribute.text)
        
print(d)

This produces the following output:

{'a1': ['70'], 'a2': ['1'], 'b1': ['xx'], 'b2': ['xx'], 'c1': ['yy'], 'c2': ['yy']}
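
Since the question asks for separate data frames for pa, pb, and pc, the dynamic idea can be extended to group rows by their parent element. This is only a sketch under stated assumptions: it uses the standard library (the '{*}' wildcard needs Python 3.8 or later), the sample XML is made up, and local_name is a hypothetical helper, not a library function.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample with a default namespace on the root element.
xml = '''<root3 xmlns="http://www.xxx.yyy">
    <class><pa><attributes><a1>70</a1><a2>1</a2></attributes></pa></class>
    <class><pb><attributes><b1>xx</b1><b2>xx</b2></attributes></pb></class>
    <class><pc><attributes><c1>yy</c1><c2>yy</c2></attributes></pc></class>
</root3>'''


def local_name(tag):
    """Strip a leading '{namespace-uri}' from an element tag, if present."""
    return tag.rsplit('}', 1)[-1]


root = ET.fromstring(xml)

# One list of row dictionaries per group name ('pa', 'pb', 'pc', ...).
groups = defaultdict(list)
for _class in root.findall('{*}class'):
    for group in _class:  # direct children of <class>: <pa>, <pb>, <pc>
        attrs = group.find('{*}attributes')
        if attrs is None:
            continue  # skip groups that carry no <attributes> block
        row = {local_name(child.tag): child.text for child in attrs}
        groups[local_name(group.tag)].append(row)

print(dict(groups))
```

Each list of rows could then be passed to pandas, for example pd.DataFrame(groups['pa']), to get one data frame per group.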

Choose the approach that suits your needs.

Another approach:

from simplified_scrapy import SimplifiedDoc, utils
xml = '''
<root0>
…
<root1>
…
<root2>
…
<root3>
    <class>
        <pa>
            <attributes>
            <a1>70</a1>
            <a2>1</a2>
            </attributes>
        </pa>
    </class>
    
    <class>
        <pb>
            <attributes>
            <b1>xx</b1>
            <b2>xx</b2>
            </attributes>
        </pb>
    </class>
    
    <class>
        <pc>
            <attributes>
            <c1>yy</c1>
            <c2>yy</c2>
            </attributes>
        </pc>
    </class>
</root3>
…..
'''

doc = SimplifiedDoc(xml)
classes = doc.root3.selects('class').select('attributes').children
# Or
# classes = doc.root3.selects('class').child.select('attributes').children
print (classes)
# Or
print ('-'*50)
classes = doc.root3.selects('class').child
for c in classes:
  print (c.tag, *c.select('attributes').children)

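A namespace is involved (http://www.xxx.yyy). Also note that Pa (in your code) is not the same as pa (in the XML).

@mzjn, yes, it was the namespace. Thanks for the hint. By the way, Pa was just a typo in my post; thanks for spotting it :)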

Output of the simplified_scrapy example above:

[[{'tag': 'a1', 'html': '70'}, {'tag': 'a2', 'html': '1'}], [{'tag': 'b1', 'html': 'xx'}, {'tag': 'b2', 'html': 'xx'}], [{'tag': 'c1', 'html': 'yy'}, {'tag': 'c2', 'html': 'yy'}]]
--------------------------------------------------
pa {'tag': 'a1', 'html': '70'} {'tag': 'a2', 'html': '1'}
pb {'tag': 'b1', 'html': 'xx'} {'tag': 'b2', 'html': 'xx'}
pc {'tag': 'c1', 'html': 'yy'} {'tag': 'c2', 'html': 'yy'}