Python: list the paths and data in an XML file to store them in a DataFrame


Here is an XML file:

<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header />
  <SOAP-ENV:Body>
    <ADD_LandIndex_001>
      <CNTROLAREA>
        <BSR>
          <status>ADD</status>
          <NOUN>LandIndex</NOUN>
          <REVISION>001</REVISION>
        </BSR>
      </CNTROLAREA>
      <DATAAREA>
        <LandIndex>
          <reportId>AMI100031</reportId>
          <requestKey>R3278458</requestKey>
          <SubmittedBy>EN4871</SubmittedBy>
          <submittedOn>2015/01/06 4:20:11 PM</submittedOn>
          <LandIndex>
            <agreementdetail>
              <agreementid>001       4860</agreementid>
              <agreementtype>NATURAL GAS</agreementtype>
              <currentstatus>
                <status>ACTIVE</status>
                <statuseffectivedate>1965/02/18</statuseffectivedate>
                <termdate>1965/02/18</termdate>
              </currentstatus>
              <designatedrepresentative></designatedrepresentative>
            </agreementdetail>
          </LandIndex>
        </LandIndex>
      </DATAAREA>
    </ADD_LandIndex_001>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

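I want to list the paths and the data in this file, so that I only need to call df = pd.DataFrame() to create a dataframe that I can export to an Excel sheet. I have written the part that builds the list of paths, but I cannot get the text that belongs to those paths. I don't understand how the lxml library works. I tried .text() and text_content(), but I got an error.

Here is my code: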

from lxml import etree
import xml.etree.ElementTree as et
from bs4 import BeautifulSoup
import pandas as pd

filename = 'file_try.xml'

with open(filename, 'r') as f: 
    soap = f.read() 

root = etree.XML(soap.encode())    
tree = etree.ElementTree(root)


mylist_path = []
mylist_data = []
mydico = {}
mylist = []

for target in root.xpath('//text()'):

    if len(target.strip())>0:       
        path = tree.getpath(target.getparent()).replace('SOAP-ENV:','')
        mydico[path] = target.text()

        mylist_path.append(path)
        mylist_data.append(target.text())
        mylist.append(mydico)

df=pd.DataFrame(mylist)
df.to_excel("data_xml.xlsx") 

print(mylist_path)
print(mylist_data)

Thanks for your help.

Here is an example that traverses the XML tree. This requires a recursive function. Fortunately, lxml provides everything needed for it.

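from lxml import etree as et
from collections import defaultdict
import pandas as pd

# `xml` is assumed to hold the XML document shown above
# (bytes, or a str without an encoding declaration).
d = defaultdict(list)
root = et.fromstring(xml)
tree = et.ElementTree(root)

def traverse(el, d):
    # Recurse into elements that have children; record text only for leaves.
    if len(el) > 0:
        for child in el:
            traverse(child, d)
    elif el.text is not None:
        # Key: path of the element relative to the root; value: list of texts.
        d[tree.getelementpath(el)].append(el.text)

traverse(root, d)

df = pd.DataFrame(d)

df.head()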

Output (contents of d):

{
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/status': ['ADD'],
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN': ['LandIndex'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/REVISION': ['001'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/reportId': ['AMI100031'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/requestKey': ['R3278458'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/SubmittedBy': ['EN4871'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/submittedOn': ['2015/01/06 4:20:11 PM'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementid': ['001       4860'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementtype': ['NATURAL GAS'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/status': ['ACTIVE'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/statuseffectivedate': ['1965/02/18'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/termdate': ['1965/02/18']
}
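If a two-column layout (one row per path) is easier to export than one column per path, the dictionary can be flattened first. The following is only a sketch: the small dictionary stands in for d, and the column names "path" and "value" are assumptions chosen for illustration. It also sidesteps the error pd.DataFrame(d) raises when the lists in d have different lengths.

```python
import pandas as pd

# Stand-in for the dictionary built by traverse(); note the unequal list lengths.
d = {"a/b": ["1"], "a/c": ["2", "3"]}

# One (path, value) row per collected text value.
rows = [(path, value) for path, values in d.items() for value in values]
df_long = pd.DataFrame(rows, columns=["path", "value"])

print(df_long)

# To write the result to Excel (requires an Excel engine such as openpyxl):
# df_long.to_excel("data_xml.xlsx", index=False)
```

Each path ends up as an ordinary string in the "path" column, so it can be filtered or edited like any other value.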

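Note that the dictionary d contains lists as values. This is because elements can be repeated in the XML; without lists, the last value would overwrite the previous ones. If that is not the case for your particular XML, use a regular dict instead of defaultdict,

d = {}

and use assignment instead of append:

d[tree.getelementpath(el)] = el.text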
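The plain-dict variant could look like the sketch below. The tiny XML string and the name traverse_unique are made up for illustration. One detail to watch: when every value in the dict is a scalar, pd.DataFrame(d) raises a ValueError asking for an index, so the dict is wrapped in a list here to produce a single row.

```python
from lxml import etree as et
import pandas as pd

# Hypothetical minimal document in which no leaf path is repeated.
xml = b"<root><a><b>1</b><c>2</c></a></root>"
root = et.fromstring(xml)
tree = et.ElementTree(root)

def traverse_unique(el, d):
    # Same traversal as before, but each path maps to a single string.
    if len(el) > 0:
        for child in el:
            traverse_unique(child, d)
    elif el.text is not None:
        d[tree.getelementpath(el)] = el.text

d = {}
traverse_unique(root, d)

# Wrap the dict in a list: one dict becomes one row, keys become columns.
df = pd.DataFrame([d])
print(df)
```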

The same approach when reading from a file:

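from collections import defaultdict
from lxml import etree as et
import pandas as pd

d = defaultdict(list)

# Open the file in binary mode and let lxml handle the decoding.
with open('output.xml', 'rb') as file:
    root = et.parse(file).getroot()

tree = et.ElementTree(root)

def traverse(el, d):
    # Recurse into elements that have children; record text only for leaves.
    if len(el) > 0:
        for child in el:
            traverse(child, d)
    elif el.text is not None:
        d[tree.getelementpath(el)].append(el.text)

traverse(root, d)

df = pd.DataFrame(d)

print(d)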
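Binary mode matters when the document starts with an XML declaration that names an encoding: lxml refuses such a document when it is passed as a str, but accepts the same content as bytes. A minimal sketch, using a made-up one-element document:

```python
from lxml import etree as et

# Hypothetical document with an encoding declaration, held as a str.
xml_text = '<?xml version="1.0" encoding="UTF-8"?><root><item>1</item></root>'

try:
    et.fromstring(xml_text)  # str plus encoding declaration is rejected
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# Passing bytes lets the parser apply the declared encoding itself.
root = et.fromstring(xml_text.encode("utf-8"))
print(root[0].text)
```

Reading the file with open(filename, 'rb') and passing the bytes to et.fromstring(), or calling et.parse() on the file as above, avoids the error in the same way.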

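Can you edit your question and show what you want the dataframe to look like?

@Jack Fleeting, did you understand my question? I am not sure whether it is clear. Thanks.

Why do you need a dataframe for a single row of values? Why not keep a dictionary?

No, I'm not sure I understand the question. That is why I asked to see your expected output/dataframe.

Thanks for your attention! I need to export all of this data to an Excel sheet, which is why I want a dataframe.

Thank you Alexandra for the code and the explanation! Looking at your output, I think it is what I need. But I can't get it, because I don't understand how et.fromstring() works. Do I need to pass a string? I tried the following lines: filename = 'file_try.xml' and with open(filename, 'r') as f: xml = f.read(), but I get an error: "ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration." What do I need to pass as input? Thank you Alexandra, have a nice day!

Added an example of reading the XML from a file. In general, it also works for XML with an encoding in the declaration.

Thanks! I really appreciate your help! I will study your code to understand it, along with the lxml functions I have never used. Then I will add a line or a loop to remove the namespace from the paths. I tried replace('SOAP-ENV:', ''), but it did not change anything. Have a nice weekend, Maicol

Hi @Alexandra Dudkina, I wrote a new post because I need a different dataframe, in which the paths are not headers and I can work with them as strings. I tried to adapt your code. What do you think? Could you check it if you have time? I would appreciate it! Best regards,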
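On the namespace question: the paths returned by getelementpath() write namespaced tags as {uri}tag, as the output above shows, so the prefix SOAP-ENV: never appears in them and replacing it has no effect. One possible way to drop the namespace part is a regular expression over the path strings. This is a sketch; the helper name strip_namespaces is an assumption, not part of lxml.

```python
import re

def strip_namespaces(path):
    # Remove every "{namespace-uri}" segment from an element path.
    return re.sub(r"\{[^}]*\}", "", path)

path = ("{http://schemas.xmlsoap.org/soap/envelope/}Body/"
        "ADD_LandIndex_001/CNTROLAREA/BSR/status")
print(strip_namespaces(path))

# Applied to a whole dictionary of path -> values:
d = {path: ["ADD"]}
clean = {strip_namespaces(key): values for key, values in d.items()}
print(clean)
```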
