Parsing XML into a DataFrame with Python


python xml python-3.x pandas


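I am trying to read an XML file and convert it into a pandas DataFrame. However, the result comes back empty.

Here is a sample of the XML structure: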

<Instance ID="1">
<MetaInfo StudentID ="DTSU040" TaskID="LP03_PR09.bLK.sh"  DataSource="DeepTutorSummer2014"/>
<ProblemDescription>A car windshield collides with a mosquito, squashing it.</ProblemDescription>
<Question>How does this work tion?</Question>
<Answer>tthis is my best  </Answer>
<Annotation Label="correct(0)|correct_but_incomplete(1)|contradictory(0)|incorrect(0)">
<AdditionalAnnotation ContextRequired="0" ExtraInfoInAnswer="0"/>
<Comments Watch="1"> The student forgot to tell the opposite force. Opposite means opposite direction, which is important here. However, one can argue that the opposite is implied. See the reference answers.</Comments>
</Annotation>
<ReferenceAnswers>
1:  Since the windshield exerts a force on the mosquito, which we can call action, the mosquito exerts an equal and opposite force on the windshield, called the reaction.

</ReferenceAnswers>
</Instance>

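The problem in your solution is that the element data extraction was not done properly. The XML in your question is nested several layers deep, which is why the data has to be read and extracted recursively. The solution below should do what you need in this case, although I encourage you to read up on the topic and learn more.

Method 1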

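import pandas as pd
import xml.etree.ElementTree as ET

def xml2df(xml_source, df_cols, source_is_file=False, show_progress=True):
    """Parse the input XML source and store the result in a pandas
    DataFrame with the given columns.

    For xml_source = xml_file, set source_is_file=True.
    For xml_source = xml_string, set source_is_file=False.

    <element attribute_key="attribute_value">
        <child1>Child 1 text</child1>
        <child2>Child 2 text</child2>
        <child3>Child 3 text</child3>
    </element>

    Note that for an XML structure like the one above, the child elements
    can be accessed with list(element) and the attributes with
    element.items(). Any text associated with the tag can be accessed as
    element.text, and the name of the tag itself as element.tag.
    """
    if source_is_file:
        xtree = ET.parse(xml_source)  # xml_source = xml_file
        xroot = xtree.getroot()
    else:
        xroot = ET.fromstring(xml_source)  # xml_source = xml_string
    consolidator_dict = dict()
    default_instance_dict = {label: None for label in df_cols}

    def get_children_info(children, instance_dict):
        # We avoid element.getchildren(): it was deprecated and then removed
        # in Python 3.9. list(element) returns the child elements instead.
        for child in children:
            if len(list(child)) > 0:
                # The child has children of its own, so recurse into them.
                instance_dict = get_children_info(list(child), instance_dict)
            if len(list(child.keys())) > 0:
                # Copy the child's attributes into the row.
                instance_dict.update({key: value for (key, value) in child.items()})
            instance_dict.update({child.tag: child.text})
        return instance_dict

    # Loop over all instances
    for instance in list(xroot):
        instance_dict = default_instance_dict.copy()
        ikey, ivalue = instance.items()[0]  # the first attribute is "ID"
        instance_dict.update({ikey: ivalue})
        if show_progress:
            print('{}: {}={}'.format(instance.tag, ikey, ivalue))
        # Loop inside every instance
        instance_dict = get_children_info(list(instance), instance_dict)
        consolidator_dict[ivalue] = instance_dict.copy()
    df = pd.DataFrame(consolidator_dict).T
    df = df[df_cols]
    return df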

Run the following to produce the desired output.

xml_source = r'grade_data.xml'
df_cols = ["ID", "TaskID", "DataSource", "ProblemDescription", "Question", "Answer",
           "ContextRequired", "ExtraInfoInAnswer", "Comments", "Watch", "ReferenceAnswers"]
df = xml2df(xml_source, df_cols, source_is_file=True)
df
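
For a quick check without a file on disk, the same flattening idea can be sketched with only the standard library. This is not the xml2df function above, just an illustration of the approach: the helper name flatten_instance, the wrapping <Instances> root, and the shortened sample are assumptions made for the example. It relies on Element.iter(), which visits an element and all of its descendants, so no explicit recursion is needed.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the <Instances> root element is an assumption.
xml_string = """
<Instances>
  <Instance ID="1">
    <MetaInfo StudentID="DTSU040" TaskID="LP03_PR09.bLK.sh" DataSource="DeepTutorSummer2014"/>
    <Question>How does this work?</Question>
    <Annotation Label="correct(0)">
      <AdditionalAnnotation ContextRequired="0" ExtraInfoInAnswer="0"/>
      <Comments Watch="1">The student forgot the opposite force.</Comments>
    </Annotation>
  </Instance>
</Instances>
"""

def flatten_instance(instance):
    """Collect the attributes and text of an element and all its descendants."""
    row = {}
    for elem in instance.iter():      # the element itself, then every descendant
        row.update(elem.attrib)       # attributes become columns
        text = (elem.text or "").strip()
        if text:                      # skip whitespace-only text of container elements
            row[elem.tag] = text
    return row

rows = [flatten_instance(inst) for inst in ET.fromstring(xml_string.strip())]
print(rows[0]["TaskID"], rows[0]["Watch"], rows[0]["Comments"])
```

Passing rows to pd.DataFrame(rows, columns=df_cols) would then give one row per Instance. Note that this flat approach overwrites values if two nested elements share a tag or attribute name.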

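Method 2: If you have an xml_string, you can convert xml >> dict >> dataframe. Run the following to get the desired output.

Note: you need to install xmltodict to use Method 2. The method was inspired by a solution suggested by @martin-blech; credit goes to him.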

pip install -U xmltodict

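Solution

def read_recursive(x, instance_dict):
    # Flattens the nested dict x into instance_dict and returns
    # (instance_dict, text of x). Relies on the global list df_cols.
    txt = ''
    for key in x.keys():
        k = key.replace("@", "")
        value = x.get(key)
        # Text of a nested element; reset for every key so that it cannot
        # leak into the keys that follow.
        child_txt = ''
        if isinstance(value, dict):
            # Dig deeper if the value is another dict.
            instance_dict, child_txt = read_recursive(value, instance_dict)
        elif k == '#text':
            # Simple text associated with the element itself.
            txt = value
        elif k in df_cols:
            instance_dict.update({k: value})
        # Store a nested element's text under that element's own name.
        if child_txt != '':
            instance_dict.update({k: child_txt})
    return (instance_dict, txt)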

You will need the read_recursive() function above. Now run the following.

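import pandas as pd
import xmltodict

o = xmltodict.parse(xml_string)  # input: xml_string
# import json; print(json.dumps(o))  # uncomment to see the XML converted to a JSON string

consolidated_dict = dict()
oi = o['Instances']['Instance']
if isinstance(oi, dict):
    oi = [oi]  # a single Instance is parsed as a dict, not a list

for x in oi:
    instance_dict = dict()
    instance_dict, _ = read_recursive(x, instance_dict)
    consolidated_dict.update({x.get("@ID"): instance_dict.copy()})
df = pd.DataFrame(consolidated_dict).T
df = df[df_cols]
df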
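
To make it easier to see what read_recursive() has to walk, the sketch below builds the same kind of nested dict by hand using only the standard library. It mimics the convention xmltodict uses, where attributes are stored under '@name' keys and element text under '#text'. The helper element_to_dict is a hypothetical stand-in written for illustration, not xmltodict's implementation, and it ignores repeated sibling tags.

```python
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Hand-rolled approximation of the '@attr' / '#text' dict convention."""
    node = {"@" + name: value for name, value in elem.attrib.items()}
    for child in elem:
        node[child.tag] = element_to_dict(child)  # repeated tags would overwrite each other
    text = (elem.text or "").strip()
    if text:
        if node:
            node["#text"] = text   # text sitting next to attributes or children
        else:
            return text            # a plain text element collapses to a string
    return node or None            # an empty element becomes None

comments = ET.fromstring('<Comments Watch="1">See the reference answers.</Comments>')
print(element_to_dict(comments))
# {'@Watch': '1', '#text': 'See the reference answers.'}

nested = ET.fromstring(
    '<Instance ID="1"><Question>How does this work?</Question></Instance>'
)
print(element_to_dict(nested))
# {'@ID': '1', 'Question': 'How does this work?'}
```

This is why read_recursive() strips the "@" prefix from keys and treats '#text' specially: an element such as Comments, which carries both an attribute and text, arrives as a small dict and not as a plain string.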

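Several issues:

- Calling .find on the loop variable, node, expects a child node to exist: current_node.find('child_of_current_node'). However, since all the nodes are children of the root and have no children of their own, no loop is required.
- Not checking for NoneType, which results from find() on a missing node and prevents retrieving .tag, .text, or other attributes.
- Not using .text to retrieve the node content; otherwise the Element object itself is returned.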
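
The last two points are easy to reproduce on a tiny document. The snippet below is only an illustration with a made-up sample: find() returns None for a tag that is not a direct child, and it returns an Element object, not a string, for one that is.

```python
import xml.etree.ElementTree as ET

xroot = ET.fromstring(
    '<Instance ID="1"><Question>How does this work?</Question><Answer/></Instance>'
)

missing = xroot.find("Comments")   # no such child element
print(missing)                     # None, so missing.text would raise AttributeError

question = xroot.find("Question")
print(type(question).__name__)     # Element: the node object, not its content
print(question.text)               # How does this work?

# Guarded lookup: only touch .text when the node exists.
comments = missing.text if missing is not None else None
print(comments)                    # None
```

An element that exists but is empty, such as Answer here, is a different case: find() returns the element and its .text is None.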

An adjusted version that guards every lookup with a conditional expression (a if b else c):

rows = []

# xroot is the Instance element. find() matches child elements only, so
# values stored as attributes (e.g. StudentID on MetaInfo) come back as None.
s_name = xroot.attrib.get("ID")
s_student = xroot.find("StudentID").text if xroot.find("StudentID") is not None else None
s_task = xroot.find("TaskID").text if xroot.find("TaskID") is not None else None
s_source = xroot.find("DataSource").text if xroot.find("DataSource") is not None else None
s_desc = xroot.find("ProblemDescription").text if xroot.find("ProblemDescription") is not None else None
s_question = xroot.find("Question").text if xroot.find("Question") is not None else None
s_ans = xroot.find("Answer").text if xroot.find("Answer") is not None else None
s_label = xroot.find("Label").text if xroot.find("Label") is not None else None
s_contextrequired = xroot.find("ContextRequired").text if xroot.find("ContextRequired") is not None else None
s_extraInfoinAnswer = xroot.find("ExtraInfoInAnswer").text if xroot.find("ExtraInfoInAnswer") is not None else None
s_comments = xroot.find("Comments").text if xroot.find("Comments") is not None else None
s_watch = xroot.find("Watch").text if xroot.find("Watch") is not None else None
s_referenceAnswers = xroot.find("ReferenceAnswers").text if xroot.find("ReferenceAnswers") is not None else None

rows.append({"ID": s_name, "StudentID": s_student, "TaskID": s_task,
             "DataSource": s_source, "ProblemDescription": s_desc,
             "Question": s_question, "Answer": s_ans, "Label": s_label,
             "ContextRequired": s_contextrequired, "ExtraInfoInAnswer": s_extraInfoinAnswer,
             "Comments": s_comments, "Watch": s_watch, "ReferenceAnswers": s_referenceAnswers
            })

out_df = pd.DataFrame(rows, columns = df_cols)

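Alternatively, a more dynamic version collects every child's tag and text into one dictionary, which becomes a single row:

rows = []
inner = {}
for node in xroot:
    # One key per child element of the instance.
    inner[node.tag] = node.text

rows.append(inner)

out_df = pd.DataFrame(rows, columns = df_cols)

Or as a dict comprehension:

rows = [{node.tag: node.text for node in xroot}]
out_df = pd.DataFrame(rows, columns = df_cols)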