
Python 3.x: Convert all nested XML infoTables to Python dictionaries

Tags: python-3.x, xml, parsing

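I have this XML (at the bottom of this post). I first fetch the XML from a URL.

Then I want to parse the information in each infoTable into a Python dictionary.

Right now I am doing this: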


import traceback
import urllib3
import xmltodict

url = "https://www.sec.gov/Archives/edgar/data/1067983/000095012320012127/0000950123-20-012127-1653.xml"

http = urllib3.PoolManager()

response = http.request('GET', url)

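try:
    data = xmltodict.parse(response.data)
    unordered_dict = dict(data['informationTable'])
except Exception:
    # report the parsing failure instead of hiding it behind a bare except
    print("Failed to parse xml from response (%s)" % traceback.format_exc())
else:
    # only print when parsing succeeded, so the name is always defined here
    print(unordered_dict)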

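However, this makes the parent informationTable the dictionary, with all the child infoTables nested inside it as ordered dictionaries. I am not sure what the best way to handle this is.

XML for reference: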

<informationTable xmlns="http://www.sec.gov/edgar/document/thirteenf/informationtable">
    <infoTable>
        <nameOfIssuer>COMPANY NAME</nameOfIssuer>
        <titleOfClass>COM</titleOfClass>
        <cusip>000034324</cusip>
        <value>100</value>
        <shrsOrPrnAmt>
            <sshPrnamt>9000</sshPrnamt>
            <sshPrnamtType>SH</sshPrnamtType>
        </shrsOrPrnAmt>
        <investmentDiscretion>DFND</investmentDiscretion>
        <otherManager>1,2</otherManager>
        <votingAuthority>
            <Sole>10000</Sole>
            <Shared>0</Shared>
            <None>0</None>
        </votingAuthority>
    </infoTable>
    <infoTable>
        <nameOfIssuer>COMPANY NAME 2</nameOfIssuer>
        <titleOfClass>COM</titleOfClass>
        <cusip>020002101</cusip>
        <value>86663</value>
        <shrsOrPrnAmt>
            <sshPrnamt>50000</sshPrnamt>
            <sshPrnamtType>SH</sshPrnamtType>
        </shrsOrPrnAmt>
        <investmentDiscretion>DFND</investmentDiscretion>
        <otherManager>1,2</otherManager>
        <votingAuthority>
            <Sole>10000</Sole>
            <Shared>0</Shared>
            <None>0</None>
        </votingAuthority>
    </infoTable>
</informationTable>
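
The nesting described in the question can be undone with a small helper. This is a minimal sketch, not a definitive solution. It assumes the parsed result has the shape informationTable -> infoTable, and that xmltodict returns a single mapping instead of a list when only one infoTable is present. The helper name info_tables_as_dicts and the hand-built sample are hypothetical; the sample stands in for the xmltodict.parse() result so that the sketch runs without network access.

```python
import json
from collections import OrderedDict


def info_tables_as_dicts(parsed):
    """Return one plain dict per infoTable from an xmltodict-style mapping."""
    tables = parsed["informationTable"]["infoTable"]
    # Assumption: a lone infoTable arrives as a mapping rather than a list.
    if not isinstance(tables, list):
        tables = [tables]
    # A JSON round trip turns every nested OrderedDict into a plain dict.
    return json.loads(json.dumps(tables))


# Hand-built stand-in for the result of xmltodict.parse() on the sample XML.
sample = OrderedDict(informationTable=OrderedDict(infoTable=[
    OrderedDict(nameOfIssuer="COMPANY NAME", cusip="000034324",
                shrsOrPrnAmt=OrderedDict(sshPrnamt="9000", sshPrnamtType="SH")),
    OrderedDict(nameOfIssuer="COMPANY NAME 2", cusip="020002101",
                shrsOrPrnAmt=OrderedDict(sshPrnamt="50000", sshPrnamtType="SH")),
]))

records = info_tables_as_dicts(sample)
print(records[0]["shrsOrPrnAmt"]["sshPrnamt"])
```

With the real response, the same helper would be called as info_tables_as_dicts(xmltodict.parse(response.data)). The JSON round trip only works because every value in this document is a string or a mapping.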

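Much as I like XML (a lot!) and dictionaries, for easier reading you sometimes have to resort to tables (dataframes) when working in Python. The XML in the EDGAR file from the question is a good example: it has only 145 entries but 2,467 lines of XML! A string like titleOfClass appears 290 times in the XML, when it should appear once, as a column header.

So yes, a table is the way to go. Normally you would use pandas' read_html(), but it does not work with XML. You could also load the data into lxml, extract the data by hand, and convert it to a dataframe. In this particular case, though, I would use pandas_read_xml, which simplifies the process considerably: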

import pandas as pd
import pandas_read_xml as pdx

url = 'https://www.sec.gov/Archives/edgar/data/1067983/000095012320012127/0000950123-20-012127-1653.xml'
df = pdx.read_xml(url, ['informationTable', 'infoTable'])

# the next two entries expand these two nested columns into their components
df2 = pd.json_normalize(df['shrsOrPrnAmt'])
df3 = pd.json_normalize(df['votingAuthority'])

df_list = [df, df2, df3]
to_drop = ['shrsOrPrnAmt', 'votingAuthority']  # now that they are expanded, we don't need these 2 columns anymore
final_df = pd.concat(df_list, axis=1).drop(to_drop, axis=1)
final_df
The output is your table.
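
The manual route mentioned above (extract the data by hand, then build a table) can be sketched with the standard library alone. This is an illustrative sketch, not the answer's method. It swaps lxml for xml.etree.ElementTree and a dataframe for CSV text, and it uses a shortened inline sample instead of the EDGAR download. The helper names local and flatten and the dotted column names (for example shrsOrPrnAmt.sshPrnamt) are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = "{http://www.sec.gov/edgar/document/thirteenf/informationtable}"

SAMPLE = """<informationTable xmlns="http://www.sec.gov/edgar/document/thirteenf/informationtable">
  <infoTable>
    <nameOfIssuer>COMPANY NAME</nameOfIssuer>
    <value>100</value>
    <shrsOrPrnAmt>
      <sshPrnamt>9000</sshPrnamt>
      <sshPrnamtType>SH</sshPrnamtType>
    </shrsOrPrnAmt>
  </infoTable>
  <infoTable>
    <nameOfIssuer>COMPANY NAME 2</nameOfIssuer>
    <value>86663</value>
    <shrsOrPrnAmt>
      <sshPrnamt>50000</sshPrnamt>
      <sshPrnamtType>SH</sshPrnamtType>
    </shrsOrPrnAmt>
  </infoTable>
</informationTable>"""


def local(tag):
    """Drop the '{namespace}' part that ElementTree prepends to tag names."""
    return tag.split("}", 1)[-1]


def flatten(element, prefix=""):
    """Turn one element into a flat row; nested tags become dotted columns."""
    row = {}
    for child in element:
        name = prefix + local(child.tag)
        if len(child):  # the child has children of its own, so recurse
            row.update(flatten(child, name + "."))
        else:
            row[name] = (child.text or "").strip()
    return row


root = ET.fromstring(SAMPLE)
rows = [flatten(el) for el in root.iter(NS + "infoTable")]

# Ordered union of all column names, so rows with missing fields still fit.
fieldnames = list(dict.fromkeys(key for row in rows for key in row))

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Each column header appears once in the result, which is the readability gain the table approach is after. A list of flat rows like this could also be passed to pandas.DataFrame if a dataframe is preferred.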


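Consider parsing the XML response with a compliant DOM library such as the built-in etree. Then run a list/dict comprehension to build a list of dictionaries from all infoTable nodes (their children and grandchildren). One major issue is that the XML uses a default namespace, so a prefix has to be defined in order to parse by node name.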

import traceback
import urllib3
import xml.etree.ElementTree as et   # PYTHON STANDARD LIBRARY
from pprint import pprint            # PYTHON STANDARD LIBRARY

url = "https://www.sec.gov/Archives/edgar/data/1067983/000095012320012127/0000950123-20-012127-1653.xml"

http = urllib3.PoolManager()
response = http.request('GET', url)

nmsp = {"doc": "http://www.sec.gov/edgar/document/thirteenf/informationtable"}

try:
    doc = et.fromstring(response.data)

    # MERGE DICTS APPROACH FOR CHILDREN AND GRANDCHILDREN (Python 3.5+)
    # USE split() TO REMOVE NAMESPACE IN TAG NAMES
    data_dicts = [{**{ch.tag.split('}')[1]: (ch.text or '').strip() for ch in el.findall("./*/*")},
                   **{t.tag.split('}')[1]: (t.text or '').strip() for t in el.findall("*")}
                  } for el in doc.findall(".//doc:infoTable", namespaces=nmsp)]
except Exception:
    print(f"Failed to parse xml from response ({traceback.format_exc()})")
else:
    pprint(len(data_dicts))
    # 145
    pprint(data_dicts)
Output

[{'None': '0',
  'Shared': '0',
  'Sole': '21264316',
  'cusip': '00287Y109',
  'investmentDiscretion': 'DFND',
  'nameOfIssuer': 'ABBVIE INC',
  'otherManager': '4,11',
  'shrsOrPrnAmt': '',
  'sshPrnamt': '21264316',
  'sshPrnamtType': 'SH',
  'titleOfClass': 'COM',
  'value': '1862541',
  'votingAuthority': ''},
 {'None': '0',
  'Shared': '0',
  'Sole': '419500',
  'cusip': '023135106',
  'investmentDiscretion': 'DFND',
  'nameOfIssuer': 'AMAZON COM INC',
  'otherManager': '4',
  'shrsOrPrnAmt': '',
  'sshPrnamt': '419500',
  'sshPrnamtType': 'SH',
  'titleOfClass': 'COM',
  'value': '1320892',
  'votingAuthority': ''},
 {'None': '0',
  'Shared': '0',
  'Sole': '113800',
  'cusip': '023135106',
  'investmentDiscretion': 'DFND',
  'nameOfIssuer': 'AMAZON COM INC',
  'otherManager': '4,8,11',
  'shrsOrPrnAmt': '',
  'sshPrnamt': '113800',
  'sshPrnamtType': 'SH',
  'titleOfClass': 'COM',
  'value': '358325',
  'votingAuthority': ''},
 ...
 {'None': '0',
  'Shared': '0',
  'Sole': '1625185',
  'cusip': 'G9001E102',
  'investmentDiscretion': 'DFND',
  'nameOfIssuer': 'LIBERTY LATIN AMERICA LTD',
  'otherManager': '4,8,11',
  'shrsOrPrnAmt': '',
  'sshPrnamt': '1625185',
  'sshPrnamtType': 'SH',
  'titleOfClass': 'COM CL A',
  'value': '13408',
  'votingAuthority': ''},
 {'None': '0',
  'Shared': '0',
  'Sole': '146177',
  'cusip': 'G9001E128',
  'investmentDiscretion': 'DFND',
  'nameOfIssuer': 'LIBERTY LATIN AMERICA LTD',
  'otherManager': '4',
  'shrsOrPrnAmt': '',
  'sshPrnamt': '146177',
  'sshPrnamtType': 'SH',
  'titleOfClass': 'COM CL C',
  'value': '1190',
  'votingAuthority': ''},
 {'None': '0',
  'Shared': '0',
  'Sole': '1284020',
  'cusip': 'G9001E128',
  'investmentDiscretion': 'DFND',
  'nameOfIssuer': 'LIBERTY LATIN AMERICA LTD',
  'otherManager': '4,8,11',
  'shrsOrPrnAmt': '',
  'sshPrnamt': '1284020',
  'sshPrnamtType': 'SH',
  'titleOfClass': 'COM CL C',
  'value': '10452',
  'votingAuthority': ''}]
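
The merged result leaves the parent keys shrsOrPrnAmt and votingAuthority as empty strings. If the nesting should be kept instead, a recursive variant is one option. This is a hedged sketch that runs on a shortened inline sample rather than the live EDGAR response. The helper names strip_ns and element_to_dict are hypothetical, and the sketch assumes no tag repeats inside a single infoTable (a repeated tag would overwrite the earlier value). It also shows the namespace issue: an unprefixed search finds nothing.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<informationTable xmlns="http://www.sec.gov/edgar/document/thirteenf/informationtable">
  <infoTable>
    <nameOfIssuer>COMPANY NAME</nameOfIssuer>
    <cusip>000034324</cusip>
    <shrsOrPrnAmt>
      <sshPrnamt>9000</sshPrnamt>
      <sshPrnamtType>SH</sshPrnamtType>
    </shrsOrPrnAmt>
    <votingAuthority>
      <Sole>10000</Sole>
      <Shared>0</Shared>
      <None>0</None>
    </votingAuthority>
  </infoTable>
</informationTable>"""


def strip_ns(tag):
    """Remove the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def element_to_dict(element):
    """Recursively convert an element's children into a nested dict."""
    result = {}
    for child in element:
        key = strip_ns(child.tag)
        if len(child):  # element with children becomes a nested dict
            result[key] = element_to_dict(child)
        else:           # leaf element keeps its stripped text
            result[key] = (child.text or "").strip()
    return result


ns = {"doc": "http://www.sec.gov/edgar/document/thirteenf/informationtable"}
root = ET.fromstring(SAMPLE)

# Without a prefix the default namespace hides the nodes from the search.
unprefixed_hits = root.findall("infoTable")
print(len(unprefixed_hits))

nested = [element_to_dict(el) for el in root.findall("doc:infoTable", ns)]
print(nested[0]["votingAuthority"])
```

Whether the flat or the nested shape is preferable depends on how the dictionaries are consumed afterwards. The flat form maps more directly onto table columns.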
