Python: convert XML output to columns and save it to a DataFrame

python xml python-3.x pandas

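I am trying to read an XML file and convert it to a CSV file.

As part of a for loop, I have managed to extract the contents of the XML file. I am trying to store the values in columns rather than in rows, which is how they are stored at the moment. My data looks like this: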

Date - 2019-01-01T08:00:00
ID - 5601986
Description - Product A
Product Type - 

ProductCode - ABC
ProductName - Computer

RefID - X-123
Comments -

Expected output:

Date,ID,Description,ProductCode,ProductName,RefID,Comments
2019-01-01T08:00:00,5601986, Product A,ABC,Computer,X-123,

The code I have built so far:

import xml.etree.ElementTree as ET

tree = ET.parse('/users/desktop/file.xml')
root = tree.getroot()
for elem in root:
    print(elem.tag, '-', elem.text)
    for subelem in elem:
        print(subelem.tag, '-', subelem.text)

I am trying to convert this into a DataFrame for further analysis.

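Update:

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse(filename)
root = tree.getroot()
final = {}
for elem in root:
    if len(elem):
        # Nested element: iterate over its children directly
        # (Element.getchildren() was removed in Python 3.9).
        for c in elem:
            final[c.tag] = c.text
    else:
        # Leaf element: store its own tag and text.
        final[elem.tag] = elem.text

# One dict becomes one row; the tags become the column names.
df = pd.DataFrame([final])
print(df)

Output:

  Comments               Date Description       ID ProductCode ProductName  \
0           LoadStopConfirmed   Product A  5601986         ABC    Computer   

   RefID  
0  X-123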
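Since the stated goal is a CSV file with a fixed column order, the one-row DataFrame can be reordered and written out with `DataFrame.to_csv`. The following is only a sketch: the XML string is reconstructed from the sample values in the question, and the root tag `Root`, the constant `COLUMNS`, and the helper name `xml_to_frame` are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical sample, rebuilt from the values shown in the question.
SAMPLE_XML = """\
<Root>
    <Date>2019-01-01T08:00:00</Date>
    <ID>5601986</ID>
    <Description>Product A</Description>
    <ProductType>
        <ProductCode>ABC</ProductCode>
        <ProductName>Computer</ProductName>
    </ProductType>
    <RefID>X-123</RefID>
    <Comments></Comments>
</Root>
"""

# Column order taken from the expected output in the question.
COLUMNS = ["Date", "ID", "Description", "ProductCode",
           "ProductName", "RefID", "Comments"]


def xml_to_frame(source):
    """Flatten one level of nesting under the root into a one-row DataFrame."""
    root = ET.parse(source).getroot()
    row = {}
    for elem in root:
        if len(elem):
            # Nested element: promote each child to its own column.
            for child in elem:
                row[child.tag] = child.text
        else:
            row[elem.tag] = elem.text
    # reindex fixes the column order regardless of the order in the file.
    return pd.DataFrame([row]).reindex(columns=COLUMNS)


df = xml_to_frame(io.StringIO(SAMPLE_XML))

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

To write a file instead, pass a path as the first argument, for example `df.to_csv("out.csv", index=False)` (the file name is a placeholder).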

Including the new XML file:

Current output:

Customerpin,CustomerName
XYZ,Hello

Another approach (getchildren is deprecated):

For large XML files, I use yield

import pandas as pd
import xml.etree.ElementTree as ET
import io

def iter_docs(cis):
    """Yield one flat dict per record element under the root."""
    for docall in cis:
        doc_dict = {}
        for doc in docall:
            if len(doc):
                # Nested element (e.g. ProductType): its children become columns.
                doc_dict.update({elem.tag: elem.text for elem in doc})
            else:
                # Leaf element: use its own tag and text.
                doc_dict[doc.tag] = doc.text
        yield doc_dict

#sample with 2 records
xml_data = io.StringIO(u'''\
<CISDocument>
    <REC>
        <Date>LoadStopConfirmed</Date>
        <ID>5601986</ID>
        <Description>Product A</Description>
        <ProductType>
            <ProductCode>ABC</ProductCode>
            <ProductName>Computer</ProductName>
        </ProductType>
        <RefID>X-123</RefID> 
        <Comments>Product A</Comments>  
    </REC>
    <REC>
        <Date>other</Date>
        <ID>5601987</ID>
        <Description>Product B</Description>
        <ProductType>
            <ProductCode>DEF</ProductCode>
            <ProductName>Computer</ProductName>
        </ProductType>
        <RefID>X-124</RefID>
        <Comments>Product B</Comments>
    </REC>
</CISDocument>
''')


etree = ET.parse(xml_data)

df = pd.DataFrame(list(iter_docs(etree.getroot())))
print(df)

If you want to apply this to several XML files, just put the file names in a list and do this:

xml_data = "E:/test.xml"

# Collect one DataFrame per file, then concatenate them once
# (DataFrame.append was removed in pandas 2.0).
frames = []

# here I use a list containing the same file three times
xmllist = [xml_data, xml_data, xml_data]
for xmlfile in xmllist:
    root = ET.parse(xmlfile).getroot()
    frames.append(pd.DataFrame(list(iter_docs(root))))

df = pd.concat(frames)

print(df)
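`ET.parse` takes a single file name or file object, so a list of paths has to be looped over rather than passed in directly. As a sketch, assuming all files sit in one folder and share the record layout used above, the list can be built with `pathlib`. The function name `load_folder`, the `*.xml` pattern, and the throwaway test files are illustrative choices, not part of the original answer; `iter_docs` is repeated here only so the block runs on its own.

```python
from pathlib import Path
import tempfile
import xml.etree.ElementTree as ET

import pandas as pd


def iter_docs(root):
    """Yield one flat dict per record (same logic as iter_docs above)."""
    for record in root:
        row = {}
        for field in record:
            if len(field):
                row.update({child.tag: child.text for child in field})
            else:
                row[field.tag] = field.text
        yield row


def load_folder(folder, pattern="*.xml"):
    """Parse every matching file in `folder` and stack the rows in one DataFrame."""
    frames = []
    for path in sorted(Path(folder).glob(pattern)):
        root = ET.parse(path).getroot()  # one path per call, never a list
        frames.append(pd.DataFrame(list(iter_docs(root))))
    if not frames:
        return pd.DataFrame()
    # ignore_index gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


# Small self-check with two throwaway files (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    for i in (1, 2):
        Path(tmp, f"file{i}.xml").write_text(
            f"<CISDocument><REC><ID>{i}</ID>"
            f"<ProductType><ProductCode>P{i}</ProductCode></ProductType>"
            f"</REC></CISDocument>"
        )
    combined = load_folder(tmp)

print(combined)
```

Sorting the paths keeps the row order reproducible between runs, since `glob` makes no ordering promise.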

Can you show your XML file (a few elements)?

@Frenchy I have included a view of the XML file in my original post; I hope that helps.

Hmm, I get an error during parsing. Does Product Type really contain a space?

@Frenchy My mistake, that was a typo from when I pasted the XML.

Thank you, this is very helpful. Could I ask one more favor? Is there a way to make the code above generic? Instead of hard-coding "ProductType", could it be passed as a variable so that all child nodes are picked up as part of the loop?

Thanks again for the edit. I tried it with a new XML file and found that it returns the output for the child elements, but the leaf-level values are not returned. I have added the file I ran the script on, and the output it returned, to my original post. This seems to happen with all XML files, so I wanted to check with you. Thanks.

I am trying to find a way to combine several XML files into one DataFrame. Could you help me edit the code above so that I can do this? Thanks.

If I understand you correctly, you want to use the same program to process more XML files. Taking my example, say I have xml_data1 and xml_data2 (same structure) and want to merge them into the same df?

Say there are 10 XML files in a folder and I am trying to merge them into one df.

OK, I have just adjusted the loop. It is not complicated: I put the names of the XML files in a list.

I tried to modify the path defined in the line

etree = ET.parse(xml_data)

but it threw the error

TypeError: expected str, bytes or os.PathLike object, not list

so I wonder whether this is a mistake on my side.
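On the request in the comments to avoid hard-coding `ProductType`: one possible approach is a recursive walk that records every leaf element, however deeply it is nested. This is only a sketch, not the answerer's code; `flatten` is a hypothetical helper, the extra `Details` level in the sample is invented to show deeper nesting, and the sketch assumes leaf tag names are unique within a record (a repeated tag would overwrite the earlier value).

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd


def flatten(elem, row=None):
    """Collect the text of every leaf element below `elem` into a dict."""
    if row is None:
        row = {}
    for child in elem:
        if len(child):
            flatten(child, row)  # descend into nested elements
        else:
            row[child.tag] = child.text
    return row


# Hypothetical sample with two levels of nesting inside the record.
xml_data = io.StringIO("""\
<CISDocument>
    <REC>
        <ID>5601986</ID>
        <ProductType>
            <ProductCode>ABC</ProductCode>
            <Details>
                <ProductName>Computer</ProductName>
            </Details>
        </ProductType>
    </REC>
</CISDocument>
""")

root = ET.parse(xml_data).getroot()

# One row per record element directly under the root.
df = pd.DataFrame([flatten(rec) for rec in root])
print(df)
```

If the same tag can occur under different parents, prefixing each key with the parent tag would be one way to keep the columns apart.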