Python: error parsing XML from a URL response (text file) with an HTML block at the start
Tags: python, xml, edgar

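I am trying to scrape filings from the SEC EDGAR database. I can fetch the text file with requests. When I try to parse the file with the following code, I get a parse error. The same code works when I request the .xml URL instead of the .txt URL. The URL contains the following content: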

<SEC-HEADER>0001752724-20-203989.hdr.sgml : 20201001
<ACCEPTANCE-DATETIME>20201001132951
ACCESSION NUMBER:       0001752724-20-203989
CONFORMED SUBMISSION TYPE:  NPORT-P
PUBLIC DOCUMENT COUNT:      2
CONFORMED PERIOD OF REPORT: 20200831
FILED AS OF DATE:       20201001
PERIOD START:               20201130

-------------
**
-------------
    FORMER COMPANY: 
        FORMER CONFORMED NAME:  ASA LTD
        DATE OF NAME CHANGE:    20070301

    FORMER COMPANY: 
        FORMER CONFORMED NAME:  ASA BERMUDA LTD
        DATE OF NAME CHANGE:    20030505
</SEC-HEADER>
<DOCUMENT>
<TYPE>NPORT-P
<SEQUENCE>1
<FILENAME>primary_doc.xml
<TEXT>
<XML>
<?xml version="1.0" encoding="UTF-8"?><edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:com="http://www.sec.gov/edgar/common" xmlns:ncom="http://www.sec.gov/edgar/nportcommon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sec.gov/edgar/nport eis_NPORT_Filer.xsd">
  <headerData>
    <submissionType>NPORT-P</submissionType>
    <isConfidential>false</isConfidential>
    <filerInfo>

      <filer>
        <issuerCredentials>
          <cik>0001230869</cik>
          <ccc>XXXXXXXX</ccc>

Error:

Traceback (most recent call last):

  File "/usr/local/anaconda/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-83-cd4e6ed59b34>", line 3, in <module>
    root = ET.fromstring(response.content)

  File "/usr/local/anaconda/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
    parser.feed(text)

  File "<string>", line unknown
ParseError: not well-formed (invalid token): line 14, column 38

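Comment: "The same code works when I request the .xml URL instead of the .txt URL." So you are surprised that a text file cannot be parsed as an XML file when you request one?

Comment: I want it to work with the text version because not every URL has an .xml version available. Please see the question for more details.

Comment: There is nothing surprising here. You cannot use XML tools or parsers on data that is not XML, and what you posted is not XML. (That is what "ParseError: not well-formed (invalid token)" is telling you.) You could scan for the XML declaration and, from there to the end of the file, you might be able to extract a well-formed XML document (but we cannot be sure, because you only posted part of the top of the data).

Comment: That actually worked. I was able to extract the XML portion between the <XML> tags and load it into an XML parser.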
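
The approach described in the last comment could look roughly like the sketch below. It is not the asker's actual code, which was not posted. It assumes that each XML payload in the .txt submission is wrapped in a literal <XML> ... </XML> pair, as in the excerpt above, and that the payload itself is well-formed. The names XML_BLOCK and extract_xml_documents are hypothetical, chosen only for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: every XML payload sits between a literal <XML> and </XML> line,
# as in the excerpt from the question. DOTALL lets "." span multiple lines.
XML_BLOCK = re.compile(r"<XML>\s*(.*?)\s*</XML>", re.DOTALL)


def extract_xml_documents(submission_text):
    """Parse every <XML>...</XML> block in the text and return the root elements.

    Hypothetical helper: the surrounding SGML-style header is skipped, so only
    the well-formed XML payloads reach the parser.
    """
    roots = []
    for match in XML_BLOCK.finditer(submission_text):
        # group(1) starts at the XML declaration because the leading
        # whitespace after <XML> was consumed by \s* in the pattern.
        roots.append(ET.fromstring(match.group(1)))
    return roots


# Possible usage with requests (url is a placeholder, not taken from the question):
#   response = requests.get(url)
#   roots = extract_xml_documents(response.text)
#   ns = {"n": "http://www.sec.gov/edgar/nport"}
#   print(roots[0].find("n:headerData/n:submissionType", ns).text)
```

Because the root element declares a default namespace (http://www.sec.gov/edgar/nport), lookups with find or findall need a namespace mapping like the one in the usage comment. A submission with several documents yields one root per <XML> block, and a file without such a block yields an empty list.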