Python: XML is not read correctly when a tag occurs only once


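I use the pandas_read_xml package to read XML files and process them into pandas DataFrames. In the vast majority of cases the package works perfectly for me. However, when reading a URL in which a tag occurs only once, the DataFrame output is a bit off. Let me illustrate this with the following two examples.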

# Import package
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten

# Example 1
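url_1 = 'https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml'
df_1 = pdx.read_xml(url_1, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
df_1 = pdx.fully_flatten(df_1)

The resulting df_1 contains 163 rows and 31 columns, where each row corresponds to a unique security. That matches what I expect. However, when I try to read an XML in which the tag "invstOrSec" occurs only once, the output is a bit strange.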

# Example 2
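url_2 = 'https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml'
df_2 = pdx.read_xml(url_2, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'])
df_2 = pdx.fully_flatten(df_2)

The resulting df_2 contains 6 rows and 19 columns. I really don't understand why it contains 6 rows when it should in fact be a single row. I have observed that this behaviour only occurs when the tag "invstOrSec" appears just once. Any help on this would be much appreciated. Please let me know if my question is unclear.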

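First of all, thanks for the feedback! I wrote pandas_read_xml because pandas had no pd.read_xml() implementation. You (and the rest of us) will be happy to know that pandas read_xml is in the development version and will be released soon!

As for your current conundrum, this is a consequence of how XML is structured (one of the many things I dislike about it). Unlike JSON, which can return a single element inside a list, an XML structure with only one tag gets interpreted as a single value rather than a list.

Essentially, if there is only one "row" tag, then the "column" tags are now seen as the row tags... I'm not making much sense, am I? Let me explain using your example.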

Here is how I suggest you use it:

# Import package
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten

# Example 1
url_1 = 'https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml'
df_1 = pdx.read_xml(url_1, ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec']).pipe(fully_flatten)

# Example 2
url_2 = 'https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml'
df_2 = pdx.read_xml(url_2, ['edgarSubmission', 'formData', 'invstOrSecs'], transpose=True).pipe(fully_flatten)
df_2

What's the difference?

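In Example 1 you already expect multiple invstOrSec tags inside the invstOrSecs tag. So passing ['edgarSubmission', 'formData', 'invstOrSecs', 'invstOrSec'] as the list of root tags returns a list under the hood. The fully_flatten step first explodes that list into rows.

In Example 2, if you use the same list of root tags, pandas does not read in a list. Instead it reads in a dictionary that corresponds to a single row. In effect, it treats the tags you want as columns as rows. Instead, I would pass the tag one level up as the root tag, then transpose, then fully flatten.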
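
The list-versus-dictionary difference described above can be reproduced without downloading anything. The sketch below uses only the standard library; `to_dict` is a hypothetical, heavily simplified imitation of an xmltodict-style conversion (it is not the library's actual code), and the two small XML snippets are made up for illustration. `as_list` is likewise a hypothetical helper showing one way to treat both shapes alike.

```python
import xml.etree.ElementTree as ET


def to_dict(element):
    """Convert an element to nested dicts (simplified, xmltodict-like).

    A child tag that occurs once becomes a single value; a child tag that
    repeats becomes a list of values.
    """
    children = list(element)
    if not children:
        return element.text
    result = {}
    for child in children:
        value = to_dict(child)
        if child.tag in result:
            # Second or later occurrence: promote the entry to a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


def as_list(value):
    """Wrap a lone value in a list so both shapes can be handled alike."""
    return value if isinstance(value, list) else [value]


many = ET.fromstring(
    "<invstOrSecs>"
    "<invstOrSec><name>A</name><cusip>1</cusip></invstOrSec>"
    "<invstOrSec><name>B</name><cusip>2</cusip></invstOrSec>"
    "</invstOrSecs>"
)
one = ET.fromstring(
    "<invstOrSecs>"
    "<invstOrSec><name>A</name><cusip>1</cusip></invstOrSec>"
    "</invstOrSecs>"
)

rows_many = to_dict(many)["invstOrSec"]  # a list of two dicts
rows_one = to_dict(one)["invstOrSec"]    # a single dict, not a list

print(type(rows_many).__name__, type(rows_one).__name__)  # list dict
print(len(as_list(rows_many)), len(as_list(rows_one)))    # 2 1
```

With a lone dictionary, its keys (name, cusip) end up where the row positions would normally be, which is why the single-tag file needs different handling.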

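Yes... I know... it's a workaround. But then again, I didn't create pandas_read_xml to solve every problem. It was always meant as a stopgap until pandas natively supports reading XML (which looks like it is coming soon).

Let me know how it goes.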

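Edit:

As for making the XML-to-DataFrame conversion switch depending on whether the XML has a single "row" tag or several, I have the following two options.

In the multi-row case the resulting DataFrame has an integer index (row numbers), whereas in the single-row case the index consists of the strings that were meant to be the columns. So one strategy is to detect this and redo the read accordingly. (You could avoid the repeated download with something smarter.)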
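
A minimal sketch of that detection step, assuming the parsed result really does arrive with either an integer index or a string index as described. The helper names `looks_like_single_row` and `one_row_per_record` are hypothetical, the two frames are hand-built stand-ins for parsed results, and a plain transpose is used here in place of re-reading the file, which may not match exactly what pandas_read_xml does with transpose=True.

```python
import pandas as pd


def looks_like_single_row(df):
    """Return True when the index holds labels rather than row numbers."""
    return not pd.api.types.is_integer_dtype(df.index)


def one_row_per_record(df):
    """Transpose frames whose index holds the field names."""
    if looks_like_single_row(df):
        return df.T.reset_index(drop=True)
    return df


# Hand-built stand-ins: several records versus one record stored "sideways".
multi = pd.DataFrame([{"name": "A", "cusip": "1"}, {"name": "B", "cusip": "2"}])
single = pd.Series({"name": "A", "cusip": "1"}).to_frame()

print(one_row_per_record(multi).shape)   # (2, 2)
print(one_row_per_record(single).shape)  # (1, 2)
```

Because the check only looks at the index dtype, it needs no second download; the same parsed frame is reshaped in place.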

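The other option is to use the underlying tools. There is no magic behind pandas_read_xml; it uses a package called xmltodict. Read the XML, convert it to dicts, then to pandas, then flatten. The only downside is that, because the tag name "invstOrSec" is retained, it becomes a prefix on the column names. You should be able to remove those easily.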

# Import package
import pandas as pd
import pandas_read_xml as pdx
import xmltodict
from pandas_read_xml import fully_flatten

# Example 4

url_components = ['1279392/000114554921008161', '1279394/000114554921008162']
xmldicts = []

for url_component in url_components:
    url = f'https://www.sec.gov/Archives/edgar/data/{url_component}/primary_doc.xml'
    xml = pdx.read_xml_from_url(url)
    xmldicts.append(xmltodict.parse(xml)['edgarSubmission']['formData']['invstOrSecs'])
    
df = pd.DataFrame.from_dict(xmldicts).pipe(fully_flatten)

df

Hope that helps.
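
Going one step beyond the answer above: if you parse with xmltodict directly, its `force_list` argument may remove the need to tell the two cases apart. The sketch below is a suggestion rather than part of the original answer; it assumes xmltodict is installed and uses a made-up one-record snippet in place of the real SEC file.

```python
import xmltodict

# Made-up snippet with a single invstOrSec element.
xml = """<invstOrSecs>
  <invstOrSec><name>A</name><cusip>1</cusip></invstOrSec>
</invstOrSecs>"""

# Default behaviour: a tag that occurs once is parsed as one mapping.
plain = xmltodict.parse(xml)["invstOrSecs"]["invstOrSec"]

# force_list: the named tags are always parsed as lists, even when alone.
forced = xmltodict.parse(xml, force_list=("invstOrSec",))["invstOrSecs"]["invstOrSec"]

print(isinstance(plain, list), isinstance(forced, list))  # False True
print(len(forced), forced[0]["name"])                     # 1 A
```

Since `forced` is always a list of records, it can be handed to pd.DataFrame the same way for files with one security or many.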

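Edit:

So I have updated the package (now version 0.2.0). pandas_read_xml should now treat the root tag as the rows of the resulting pandas DataFrame by default, so there is no need to distinguish between XML that sometimes has a single "row" and sometimes several.

If this is a problem in other cases, there is a new argument, root_is_rows, which defaults to True but can be set to False.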

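In fact, in the upcoming pandas 1.3, read_xml will let you migrate parsed nodes into a DataFrame. However, since XML can have many dimensions beyond the two dimensions of rows by columns, as the documentation notes:

This method is best designed to import shallow XML documents

Therefore, nested elements are not picked up right away, as shown below, where you get about 20 columns. Note that because of the default namespace in the document, you need to use the namespaces argument.

Pandas 1.3+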

url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec", 
                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})

print(df)
#                                                   name  lei                                              title      cusip  ...  fairValLevel  securityLending  assetCat debtSec
# 0                                       Tastemade Inc.  NaN                                     Tastemade Inc.  999999999  ...           3.0              NaN      None     NaN
# 1    Regatta XV Funding Ltd., Subordinated Note, Pr...  NaN  Regatta XV Funding Ltd., Subordinated Note, Pr...  75888PAC7  ...           2.0              NaN  ABS-CBDO     NaN
# 2                Hired, Inc., Series C Preferred Stock  NaN              Hired, Inc., Series C Preferred Stock        NaN  ...           3.0              NaN        EP     NaN
# 3                      WESTVIEW CAPITAL PARTNERS II LP  NaN                    WESTVIEW CAPITAL PARTNERS II LP  999999999  ...           NaN              NaN      None     NaN
# 4                       VOYAGER CAPITAL FUND III, L.P.  NaN                     VOYAGER CAPITAL FUND III, L.P.  999999999  ...           NaN              NaN      None     NaN
# ..                                                 ...  ...                                                ...        ...  ...           ...              ...       ...     ...
# 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  NaN              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  999999999  ...           NaN              NaN      None     NaN
# 159                       ALLOY MERCHANT PARTNERS L.P.  NaN                       ALLOY MERCHANT PARTNERS L.P.  999999999  ...           NaN              NaN      None     NaN
# 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  999999999  ...           NaN              NaN      None     NaN
# 161                   ABRY ADVANCED SECURITIES FUND LP  NaN                   ABRY ADVANCED SECURITIES FUND LP  999999999  ...           NaN              NaN      None     NaN
# 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  999999999  ...           NaN              NaN      None     NaN

# [163 rows x 20 columns]


url = "https://www.sec.gov/Archives/edgar/data/1279394/000114554921008162/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec", 
                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})

print(df)
#                                        name  lei                                     title      cusip  ...  invCountry  isRestrictedSec fairValLevel securityLending
# 0  Salient Private Access Master Fund, L.P.  NaN  Salient Private Access Master Fund, L.P.  999999999  ...          US                Y          NaN             NaN

# [1 rows x 18 columns]
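
To see why the namespaces mapping is needed, here is a small dependency-free sketch using the standard library's ElementTree instead of pandas. The XML is a hand-trimmed, made-up fragment that only borrows the default namespace URI from the examples above; the prefix name edgar is arbitrary, as long as it maps to that URI.

```python
import xml.etree.ElementTree as ET

# Made-up fragment with the same default namespace as the SEC documents.
xml = """<edgarSubmission xmlns="http://www.sec.gov/edgar/nport">
  <formData>
    <invstOrSecs>
      <invstOrSec><name>Fund A</name><cusip>999999999</cusip></invstOrSec>
    </invstOrSecs>
  </formData>
</edgarSubmission>"""

root = ET.fromstring(xml)
ns = {"edgar": "http://www.sec.gov/edgar/nport"}

# Without a prefix the path looks for elements in no namespace: no match.
unqualified = root.findall(".//invstOrSec")
# With a prefix mapped to the default namespace URI the element is found.
qualified = root.findall(".//edgar:invstOrSec", ns)

print(len(unqualified), len(qualified))  # 0 1
print(qualified[0].tag)  # {http://www.sec.gov/edgar/nport}invstOrSec
```

The fully qualified tag in the last line also shows why code that builds column names from tags has to strip the {...} part first.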
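
Fortunately, read_xml supports XSLT (a special-purpose language designed to transform XML documents) with the default lxml parser. With XSLT you can flatten the nodes you need for migration to retrieve 32 columns.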

xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                       xmlns:edgar="http://www.sec.gov/edgar/nport">
    <xsl:output method="xml" indent="yes" />
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="edgar:invstOrSec">
        <xsl:copy>
            <xsl:apply-templates select="*|*/*"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
"""

url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"
df = pd.read_xml(url, xpath="//edgar:invstOrSec", namespaces={"edgar": "http://www.sec.gov/edgar/nport"},
                 stylesheet=xsl)
print(df)
#                                                   name  lei                                              title      cusip  ...  annualizedRt  isDefault  areIntrstPmntsInArrs  isPaidKind
# 0                                       Tastemade Inc.  NaN                                     Tastemade Inc.  999999999  ...           NaN       None                  None        None
# 1    Regatta XV Funding Ltd., Subordinated Note, Pr...  NaN  Regatta XV Funding Ltd., Subordinated Note, Pr...  75888PAC7  ...        0.0624          N                     N           N
# 2                Hired, Inc., Series C Preferred Stock  NaN              Hired, Inc., Series C Preferred Stock        NaN  ...           NaN       None                  None        None
# 3                      WESTVIEW CAPITAL PARTNERS II LP  NaN                    WESTVIEW CAPITAL PARTNERS II LP  999999999  ...           NaN       None                  None        None
# 4                       VOYAGER CAPITAL FUND III, L.P.  NaN                     VOYAGER CAPITAL FUND III, L.P.  999999999  ...           NaN       None                  None        None
# ..                                                 ...  ...                                                ...        ...  ...           ...        ...                   ...         ...
# 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  NaN              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  999999999  ...           NaN       None                  None        None
# 159                       ALLOY MERCHANT PARTNERS L.P.  NaN                       ALLOY MERCHANT PARTNERS L.P.  999999999  ...           NaN       None                  None        None
# 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  999999999  ...           NaN       None                  None        None
# 161                   ABRY ADVANCED SECURITIES FUND LP  NaN                   ABRY ADVANCED SECURITIES FUND LP  999999999  ...           NaN       None                  None        None
# 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  NaN  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  999999999  ...           NaN       None                  None        None

# [163 rows x 32 columns]

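Just wanted to add... there doesn't seem to be a "standard" way to structure XML. For every new data source you come across, you will probably need to tinker with these things.

Thanks for the detailed explanation. The approach you suggest works. However, it requires me to first identify the URLs with a single instance or multiple instances of "invstOrSecs" and then use one of the two methods to convert them into DataFrames. I have a few thousand URLs to parse, and I currently parse them in a for loop. Do you know whether I can define a parameter that lets me filter the cases where "invstOrSecs" occurs once or several times, so that I can still parse
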
import lxml.etree as lx
import pandas as pd
import urllib.request as rq

url = "https://www.sec.gov/Archives/edgar/data/1279392/000114554921008161/primary_doc.xml"

xsl = """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                       xmlns:edgar="http://www.sec.gov/edgar/nport">
    <xsl:output method="xml" indent="yes" />
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="edgar:invstOrSec">
        <xsl:copy>
            <xsl:apply-templates select="*|*/*"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
"""

content = rq.urlopen(url)

# LOAD XML AND XSL
doc = lx.fromstring(content.read())
style = lx.fromstring(xsl)

# INITIALIZE AND TRANSFORM ORIGINAL DOC
transformer = lx.XSLT(style)
result = transformer(doc)

# RUN XPATH PARSING ON FLATTER XML
data = [{node.tag.split('}')[1]:node.text for node in inv.xpath("*")
        } for inv in result.xpath("//edgar:invstOrSec", 
                                 namespaces={"edgar": "http://www.sec.gov/edgar/nport"})]

# BIND DATA FOR DATA FRAME
df = pd.DataFrame(data)

print(df)
#                                                   name  lei                                              title  ... isDefault areIntrstPmntsInArrs  isPaidKind
# 0                                       Tastemade Inc.  N/A                                     Tastemade Inc.  ...       NaN                  NaN         NaN
# 1    Regatta XV Funding Ltd., Subordinated Note, Pr...  N/A  Regatta XV Funding Ltd., Subordinated Note, Pr...  ...         N                    N           N
# 2                Hired, Inc., Series C Preferred Stock  N/A              Hired, Inc., Series C Preferred Stock  ...       NaN                  NaN         NaN
# 3                      WESTVIEW CAPITAL PARTNERS II LP  N/A                    WESTVIEW CAPITAL PARTNERS II LP  ...       NaN                  NaN         NaN
# 4                       VOYAGER CAPITAL FUND III, L.P.  N/A                     VOYAGER CAPITAL FUND III, L.P.  ...       NaN                  NaN         NaN
# ..                                                 ...  ...                                                ...  ...       ...                  ...         ...
# 158              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  N/A              ARCLIGHT ENERGY PARTNERS FUND V, L.P.  ...       NaN                  NaN         NaN
# 159                       ALLOY MERCHANT PARTNERS L.P.  N/A                       ALLOY MERCHANT PARTNERS L.P.  ...       NaN                  NaN         NaN
# 160  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  N/A  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND V-F ...  ...       NaN                  NaN         NaN
# 161                   ABRY ADVANCED SECURITIES FUND LP  N/A                   ABRY ADVANCED SECURITIES FUND LP  ...       NaN                  NaN         NaN
# 162  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  N/A  ADVENT LATIN AMERICAN PRIVATE EQUITY FUND IV-F...  ...       NaN                  NaN         NaN

# [163 rows x 32 columns]