Python: XML is not read correctly when a tag occurs only once
I use the pandas_read_xml package to read XML files and process them into pandas DataFrames. In the vast majority of cases the package works perfectly for me. However, when I read a URL whose XML contains a certain tag only once, the DataFrame output is a bit off. Let me illustrate this with the following two examples.
# Import package
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten
# Example 1
url_1 = 'https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml'
df_1 = pdx.read_xml(url_1,['edgarSubmission', 'formData','invstOrSecs', 'invstOrSec'])
df_1 = pdx.fully_flatten(df_1)
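The resulting df_1 contains 163 rows and 31 columns, where each row corresponds to a unique security. This matches the result I expect. However, when I try to read an XML in which the tag 'invstOrSec' occurs only once, the output is a bit odd.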
# Example 2
url_2 = 'https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml'
df_2 = pdx.read_xml(url_2,['edgarSubmission', 'formData','invstOrSecs', 'invstOrSec'])
df_2 = pdx.fully_flatten(df_2)
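The resulting df_2 contains 6 rows and 19 columns. I don't really understand why it contains 6 rows when it should in fact be a single row. I have observed that this behavior only occurs when the tag 'invstOrSec' appears exactly once. Any help with this would be much appreciated. Please let me know if my question is unclear.
First of all, thanks for the feedback! I wrote pandas_read_xml because pandas did not have a pd.read_xml() implementation. You (and the rest of us) will be glad to know that a pandas read_xml is in the development version and will be released soon!
As for your current conundrum, this is a consequence of the XML structure (and one of the many things I dislike about it). Unlike JSON, which can return a single element inside a list, an XML structure with only one occurrence of a tag is interpreted as a single value rather than a list.
Essentially, if there is only one "row" tag, then the "column" tags are now treated as the row tags... I'm not making sense, am I? Let me explain using your example.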
Here is how I would suggest you use it:
# Import package
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten
# Example 1
url_1 = 'https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml'
df_1 = pdx.read_xml(url_1, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec']).pipe(fully_flatten)
# Example 2
url_2 = 'https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml'
df_2 = pdx.read_xml(url_2, ['edgarSubmission', 'formData', 'invstOrSecs'], transpose=True).pipe(fully_flatten)
df_2
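What's the difference?
In Example 1, you already expect multiple invstOrSec tags inside the invstOrSecs tag.
So passing root_tag_list=['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'] returns a list under the hood. The fully_flatten step first explodes that list into rows.
In Example 2, if you use the same root tag list, pandas does not read in a list. Instead, it reads in a dictionary corresponding to a single row, so in effect the tags that should become columns are treated as rows. Here I would instead pass the tag one level above as the root tag, then transpose, then fully flatten.
Yes... I know... it's a workaround. But then again, I didn't create pandas_read_xml to solve every problem. It was always meant as a stopgap until pandas natively supports reading XML (which looks like it will happen soon).
Let me know how it goes.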
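To make the list-versus-single-value point concrete, here is a minimal sketch. The two dictionaries are hand-written stand-ins for what a dict-based XML parser hands back (they are not taken from the SEC filings), and as_rows is a hypothetical helper name, not part of pandas_read_xml.

```python
# Hand-written stand-ins for what a dict-based XML parser returns
# (hypothetical data, not taken from the SEC filings).
many = {"invstOrSecs": {"invstOrSec": [{"name": "A", "cusip": "1"},
                                       {"name": "B", "cusip": "2"}]}}
single = {"invstOrSecs": {"invstOrSec": {"name": "C", "cusip": "3"}}}


def as_rows(value):
    """Return a list of row dicts whether the tag occurred once or many times."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]  # a lone dict becomes a one-element list


rows_many = as_rows(many["invstOrSecs"]["invstOrSec"])
rows_single = as_rows(single["invstOrSecs"]["invstOrSec"])
print(len(rows_many), len(rows_single))  # 2 1
```

Once both cases are lists of flat dicts, building a DataFrame from either should give one row per security, which is the shape the question was after.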
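Edit:
As for making the XML-to-DataFrame conversion switch depending on whether the XML has a single "row" tag or several, I have the following two options.
In the multi-row case, the result is a DataFrame with an integer index (row numbers), whereas in the single-row case the DataFrame index consists of the strings that should have been the column names. So one strategy is to detect this and redo the read accordingly. (You can probably avoid downloading twice with something smarter.)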
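One possible way to avoid the second download is to decide before building the DataFrame, by counting the row tags in the XML text you already have. This is a different technique from inspecting the index, and only a rough sketch: count_rows and choose_read_args are hypothetical helper names, the sample document is hand-written, and the namespace URI is the one used in the pandas examples further down this thread.

```python
import xml.etree.ElementTree as ET

# Namespace URI as used in the pandas examples later in this thread.
NS = {"edgar": "http://www.sec.gov/edgar/nport"}


def count_rows(xml_text, row_tag="invstOrSec"):
    """Count how many times the row tag occurs in an already-downloaded XML string."""
    root = ET.fromstring(xml_text)
    return len(root.findall(f".//edgar:{row_tag}", NS))


def choose_read_args(xml_text):
    """Hypothetical helper: pick the root tag list and transpose flag per document.

    A document with zero row tags falls into the single-row branch here
    and would need its own handling.
    """
    base = ["edgarSubmission", "formData", "invstOrSecs"]
    if count_rows(xml_text) > 1:
        return base + ["invstOrSec"], False
    return base, True


# Hand-written sample with a single row tag (not a real filing).
sample = """<edgarSubmission xmlns="http://www.sec.gov/edgar/nport">
  <formData><invstOrSecs>
    <invstOrSec><name>Only One</name></invstOrSec>
  </invstOrSecs></formData>
</edgarSubmission>"""

print(choose_read_args(sample))
```

The returned pair is meant to be fed into the root tag list and transpose arguments shown in Examples 1 and 2, so a loop over thousands of URLs would not need to know in advance which kind of filing it is looking at.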
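Another option is to use the underlying tools. There is no magic behind pandas_read_xml; it uses a package called xmltodict. Read the XML, convert it to dicts, then to pandas, then flatten. The only downside is that, because the tag name 'invstOrSec' is preserved, it becomes a prefix of the column names. You should be able to remove those easily.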
# Import package
import pandas as pd
import pandas_read_xml as pdx
import xmltodict
from pandas_read_xml import fully_flatten
# Example 4
url_components = ['1279392/000114554921008161', '1279394/000114554921008162']
xmldicts = []
for url_component in url_components:
    url = f'https://www.sec.gov/Archives/edgar/data/{url_component}/primary_doc.xml'
    xml = pdx.read_xml_from_url(url)
    xmldicts.append(xmltodict.parse(xml)['edgarSubmission']['formData']['invstOrSecs'])
df = pd.DataFrame.from_dict(xmldicts).pipe(fully_flatten)
df
Hope that helps.
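For removing the prefix from the column names, something along these lines might do. It is only a sketch: the "|" separator and the sample column names are assumptions for illustration, so check df.columns to see what the flattening step actually produced. strip_prefix is a hypothetical helper, not part of any package.

```python
def strip_prefix(columns, prefix="invstOrSec", sep="|"):
    """Drop a leading 'prefix + sep' from each column name, leaving others untouched."""
    lead = prefix + sep
    return [c[len(lead):] if c.startswith(lead) else c for c in columns]


# Hypothetical column names; the real ones depend on the flattening step.
columns = ["invstOrSec|name", "invstOrSec|cusip", "invstOrSec|identifiers|isin", "other"]
print(strip_prefix(columns))
# ['name', 'cusip', 'identifiers|isin', 'other']
```

With a real DataFrame this would be applied as df.columns = strip_prefix(df.columns).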
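Edit:
So I have updated the package (now version 0.2.0). pandas_read_xml should now treat the root tag as the rows of the resulting pandas DataFrame by default, so there is no need to distinguish between XML that sometimes has a single "row" and sometimes multiple rows.
If this is a problem in other cases, there is a new parameter, root_is_rows, which defaults to True but can be set to False.
Actually, in the upcoming pandas 1.3, read_xml will let you migrate parsed nodes into data frames. However, since XML can have many dimensions beyond the two dimensions of rows and columns, as the documentation notes:
This method is best designed to import shallow XML documents
Therefore, nested elements are not picked up immediately, as shown with about 20 columns. Note the use of namespaces, which is required because of the default namespace in the document.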
Pandas 1.3+
import pandas as pd
url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec",
                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})
print(df)
# name lei title cusip ... fairValLevel securityLending assetCat debtSec
# 0 Tastemade Inc. NaN Tastemade Inc. 999999999 ... 3.0 NaN None NaN
# 1 Regatta XV Funding Ltd., Subordinated Note, Pr... NaN Regatta XV Funding Ltd., Subordinated Note, Pr... 75888PAC7 ... 2.0 NaN ABS-CBDO NaN
# 2 Hired, Inc., Series C Preferred Stock NaN Hired, Inc., Series C Preferred Stock NaN ... 3.0 NaN EP NaN
# 3 WESTVIEW CAPITAL PARTNERS II LP NaN WESTVIEW CAPITAL PARTNERS II LP 999999999 ... NaN NaN None NaN
# 4 VOYAGER CAPITAL FUND III, L.P. NaN VOYAGER CAPITAL FUND III, L.P. 999999999 ... NaN NaN None NaN
# .. ... ... ... ... ... ... ... ... ...
# 158 ARCLIGHT ENERGY PARTNERS FUND V, L.P. NaN ARCLIGHT ENERGY PARTNERS FUND V, L.P. 999999999 ... NaN NaN None NaN
# 159 ALLOY MERCHANT PARTNERS L.P. NaN ALLOY MERCHANT PARTNERS L.P. 999999999 ... NaN NaN None NaN
# 160 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... 999999999 ... NaN NaN None NaN
# 161 ABRY ADVANCED SECURITIES FUND LP NaN ABRY ADVANCED SECURITIES FUND LP 999999999 ... NaN NaN None NaN
# 162 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... 999999999 ... NaN NaN None NaN
# [163 rows x 20 columns]
url = "https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec",
                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})
print(df)
# name lei title cusip ... invCountry isRestrictedSec fairValLevel securityLending
# 0 Salient Private Access Master Fund, L.P. NaN Salient Private Access Master Fund, L.P. 999999999 ... US Y NaN NaN
# [1 rows x 18 columns]
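As a side note on why the namespaces argument is needed above: with a default namespace declared on the root, every element name is silently qualified by that URI, so an unprefixed path matches nothing. A small standard-library sketch on a hand-written sample (not a real filing) shows the effect; the prefix "edgar" is an arbitrary label chosen by the caller.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that declares the same default namespace as the filings.
sample = """<edgarSubmission xmlns="http://www.sec.gov/edgar/nport">
  <formData><invstOrSecs>
    <invstOrSec><name>Fund A</name></invstOrSec>
    <invstOrSec><name>Fund B</name></invstOrSec>
  </invstOrSecs></formData>
</edgarSubmission>"""

root = ET.fromstring(sample)
ns = {"edgar": "http://www.sec.gov/edgar/nport"}

without_ns = root.findall(".//invstOrSec")          # unqualified name: no match
with_ns = root.findall(".//edgar:invstOrSec", ns)   # prefix mapped to the URI

print(len(without_ns), len(with_ns))  # 0 2
print(root.tag)                       # {http://www.sec.gov/edgar/nport}edgarSubmission
```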
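Fortunately, read_xml supports XSLT (a special-purpose language designed to transform XML documents) with the default lxml parser. With XSLT, you can flatten the nodes you need for migration and retrieve the 32 columns.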
xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:edgar="http://www.sec.gov/edgar/nport">
<xsl:output method="xml" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="edgar:invstOrSec">
<xsl:copy>
<xsl:apply-templates select="*|*/*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
"""
url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec",
                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"},
                 stylesheet=xsl)
print(df)
# name lei title cusip ... annualizedRt isDefault areIntrstPmntsInArrs isPaidKind
# 0 Tastemade Inc. NaN Tastemade Inc. 999999999 ... NaN None None None
# 1 Regatta XV Funding Ltd., Subordinated Note, Pr... NaN Regatta XV Funding Ltd., Subordinated Note, Pr... 75888PAC7 ... 0.0624 N N N
# 2 Hired, Inc., Series C Preferred Stock NaN Hired, Inc., Series C Preferred Stock NaN ... NaN None None None
# 3 WESTVIEW CAPITAL PARTNERS II LP NaN WESTVIEW CAPITAL PARTNERS II LP 999999999 ... NaN None None None
# 4 VOYAGER CAPITAL FUND III, L.P. NaN VOYAGER CAPITAL FUND III, L.P. 999999999 ... NaN None None None
# .. ... ... ... ... ... ... ... ... ...
# 158 ARCLIGHT ENERGY PARTNERS FUND V, L.P. NaN ARCLIGHT ENERGY PARTNERS FUND V, L.P. 999999999 ... NaN None None None
# 159 ALLOY MERCHANT PARTNERS L.P. NaN ALLOY MERCHANT PARTNERS L.P. 999999999 ... NaN None None None
# 160 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... 999999999 ... NaN None None None
# 161 ABRY ADVANCED SECURITIES FUND LP NaN ABRY ADVANCED SECURITIES FUND LP 999999999 ... NaN None None None
# 162 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... NaN ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... 999999999 ... NaN None None None
# [163 rows x 32 columns]
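If neither pandas 1.3 nor lxml is available, a rough approximation of the same flattening can be written with the standard library alone. To be clear, this is not XSLT: it is a plain ElementTree loop that lifts grandchildren up one level, it ignores attributes and anything nested deeper, and a repeated tag name overwrites the earlier value. local_name and flatten_rows are hypothetical helper names, and the sample is hand-written.

```python
import xml.etree.ElementTree as ET

NS = {"edgar": "http://www.sec.gov/edgar/nport"}


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def flatten_rows(xml_text, row_tag="invstOrSec"):
    """One dict per row element, built from its children and grandchildren."""
    rows = []
    for row in ET.fromstring(xml_text).findall(f".//edgar:{row_tag}", NS):
        record = {}
        for child in row:
            if len(child) == 0:          # leaf child: keep its text
                record[local_name(child.tag)] = child.text
            for grandchild in child:     # nested one level: lift it up
                record[local_name(grandchild.tag)] = grandchild.text
        rows.append(record)
    return rows


# Hand-written sample (not a real filing) with one nested element.
sample = """<edgarSubmission xmlns="http://www.sec.gov/edgar/nport">
  <formData><invstOrSecs>
    <invstOrSec>
      <name>Fund A</name>
      <debtSec><annualizedRt>0.0624</annualizedRt><isDefault>N</isDefault></debtSec>
    </invstOrSec>
    <invstOrSec><name>Fund B</name></invstOrSec>
  </invstOrSecs></formData>
</edgarSubmission>"""

rows = flatten_rows(sample)
print(rows)
```

The resulting list of dicts can be handed to pd.DataFrame; rows that lack a nested element simply get missing values in those columns.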
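Just wanted to add... there doesn't seem to be a "standard" way of structuring XML. For every new data source you come across, you may need to tinker with these things.
Thank you for the detailed explanation. The approach you suggest works. However, it requires me to first identify the URLs with a single or multiple instances of 'invstOrSecs' and then use one of the two approaches to convert them to a DataFrame. I have several thousand URLs to parse, and at the moment I am parsing them in a for loop. Do you know whether I can define a parameter that lets me filter out the cases where single or multiple 'invstOrSecs' occur, so that I can still parse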
import lxml.etree as lx
import pandas as pd
import urllib.request as rq
url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:edgar="http://www.sec.gov/edgar/nport">
<xsl:output method="xml" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="edgar:invstOrSec">
<xsl:copy>
<xsl:apply-templates select="*|*/*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
"""
content = rq.urlopen(url)
# LOAD XML AND XSL
doc = lx.fromstring(content.read())
style = lx.fromstring(xsl)
# INITIALIZE AND TRANSFORM ORIGINAL DOC
transformer = lx.XSLT(style)
result = transformer(doc)
# RUN XPATH PARSING ON FLATTER XML
data = [
    {node.tag.split('}')[1]: node.text for node in inv.xpath("*")}
    for inv in result.xpath("//edgar:invstOrSec",
                            namespaces={"edgar": "http://www.sec.gov/edgar/nport"})
]
# BIND DATA FOR DATA FRAME
df = pd.DataFrame(data)
print(df)
# name lei title ... isDefault areIntrstPmntsInArrs isPaidKind
# 0 Tastemade Inc. N/A Tastemade Inc. ... NaN NaN NaN
# 1 Regatta XV Funding Ltd., Subordinated Note, Pr... N/A Regatta XV Funding Ltd., Subordinated Note, Pr... ... N N N
# 2 Hired, Inc., Series C Preferred Stock N/A Hired, Inc., Series C Preferred Stock ... NaN NaN NaN
# 3 WESTVIEW CAPITAL PARTNERS II LP N/A WESTVIEW CAPITAL PARTNERS II LP ... NaN NaN NaN
# 4 VOYAGER CAPITAL FUND III, L.P. N/A VOYAGER CAPITAL FUND III, L.P. ... NaN NaN NaN
# .. ... ... ... ... ... ... ...
# 158 ARCLIGHT ENERGY PARTNERS FUND V, L.P. N/A ARCLIGHT ENERGY PARTNERS FUND V, L.P. ... NaN NaN NaN
# 159 ALLOY MERCHANT PARTNERS L.P. N/A ALLOY MERCHANT PARTNERS L.P. ... NaN NaN NaN
# 160 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... N/A ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ... ... NaN NaN NaN
# 161 ABRY ADVANCED SECURITIES FUND LP N/A ABRY ADVANCED SECURITIES FUND LP ... NaN NaN NaN
# 162 ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... N/A ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F... ... NaN NaN NaN
# [163 rows x 32 columns]