I want to parse SEC filings and create categories for each 'Item'/text section. How should I approach this?


Assume I have a database of the SEC filings I want (initially Form 10s). Most of the filings are HTML-tagged; they look like this:

<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>d445434d10k.htm
<DESCRIPTION>FORM 10-K
<TEXT>
<HTML><HEAD>
<TITLE>Form 10-K</TITLE>
</HEAD>
 <BODY BGCOLOR="WHITE">
<h5 align="left"><a href="#toc">Table of Contents</a></h5>
<div style="line-height:120%;font-size:8pt;"><font style="font-family:inherit;font-size:8pt;">&#160;</font></div><div style="line-height:120%;text-indent:32px;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;font-style:italic;">All references in this Form 10-K to the &#8220;Company&#8221;, &#8220;Contango&#8221;, &#8220;we&#8221;, &#8220;us&#8221; or &#8220;our&#8221; are to Contango Oil&#160;&amp; Gas Company and wholly-owned Subsidiaries. Unless otherwise noted, all information in this Form 10-K relating to natural gas and oil reserves and the estimated future net cash flows attributable to those reserves are based on estimates prepared by independent engineers and are net to our interest.</font></div>
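…and I want to be able to pull each section into a custom view (building custom, condensed versions; say, only Item 1. Business and the segment information) and get rid of the boilerplate. My model will hold the type, filename, and certain other metadata from this document.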
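For the metadata part, a minimal sketch of reading the wrapper fields shown in the sample above. It assumes the header tags are unclosed and that each value runs to the end of its line, as in the sample; `parseSgmlHeader` is a hypothetical helper name, not part of any library.

```javascript
// Sketch: read the SGML wrapper fields that precede <TEXT>.
// parseSgmlHeader is a hypothetical helper, not a library function.
function parseSgmlHeader(raw) {
  // Only look at the part before the embedded HTML body.
  const textStart = raw.search(/<TEXT>/i);
  const header = textStart === -1 ? raw : raw.slice(0, textStart);

  const meta = {};
  // Assumption: header tags are unclosed, so the value is the rest of the line.
  const fieldPattern = /^<(TYPE|SEQUENCE|FILENAME|DESCRIPTION)>(.*)$/gim;
  for (const match of header.matchAll(fieldPattern)) {
    meta[match[1].toLowerCase()] = match[2].trim();
  }
  return meta;
}

// The header lines from the sample filing above.
const sampleHeader = [
  "<DOCUMENT>",
  "<TYPE>10-K",
  "<SEQUENCE>1",
  "<FILENAME>d445434d10k.htm",
  "<DESCRIPTION>FORM 10-K",
  "<TEXT>",
  "<HTML><HEAD>",
].join("\n");

console.log(parseSgmlHeader(sampleHeader));
```

The resulting object could be stored alongside the parsed sections; other header fields would need to be added to the pattern.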

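How would you go about parsing the document so it can be stored the way I want? It would be great to store each paragraph in a separate section according to its topic.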
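One possible starting point for the per-Item split, sketched under the assumption that the filing has already been reduced to plain text and that each section begins on its own line with a heading such as "Item 1." or "Item 1A.". The function name `splitIntoItems` and the heading pattern are illustrative guesses; real filings vary and would likely need a looser pattern.

```javascript
// Sketch: split plain text into sections keyed by item number.
// splitIntoItems is a hypothetical helper; the heading pattern is an assumption.
function splitIntoItems(text) {
  const heading = /^[ \t]*Item[ \t\u00a0]+(\d{1,2}[A-Z]?)[ \t]*[.:]/gim;
  const hits = [...text.matchAll(heading)];
  const sections = new Map();

  hits.forEach((hit, i) => {
    // A section runs from its heading to the next heading (or end of text).
    const end = i + 1 < hits.length ? hits[i + 1].index : text.length;
    const key = hit[1].toUpperCase();
    const body = text.slice(hit.index, end).trim();
    // A table of contents repeats each heading; keep the longest candidate.
    if (!sections.has(key) || body.length > sections.get(key).length) {
      sections.set(key, body);
    }
  });
  return sections;
}

// Made-up text for illustration only.
const demoText = [
  "Item 1. Business",
  "Overview of the company.",
  "Item 1A. Risk Factors",
  "Risks are described here.",
  "Item 2. Properties",
  "Offshore leases.",
].join("\n");

const demoSections = splitIntoItems(demoText);
console.log([...demoSections.keys()]);
console.log(demoSections.get("1A"));
```

Keeping the longest candidate per key is a crude way to skip table-of-contents entries; paragraphs within a section could then be split further on blank lines.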

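Most of the filings are not exactly identical, but they have a lot in common. Finally, this question is not about XBRL or any quantitative data/tables, only plain text. I'm using NodeJS for this.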


Thanks for your help.

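Where are you stuck in the code? If it is not well-formed HTML (XHTML), parsing will be "fun". Gluckle

I have spent hundreds of man-hours parsing, extracting, and interpreting data from EDGAR, and I can assure you it is not a trivial task. 10-K, 10-K/A, 10-Q, and 10-Q/A filings are really SGML wrappers around auto-generated HTML. Regular expressions are (or should be) your best friend for this kind of work. To parse, identify, and extract information such as company names, dates, and locations, read up on "named entity recognition". If you intend to identify or classify relationships between businesses, you will need to research "part-of-speech tagging" and knowledge representation frameworks such as RDF.

In addition to (or combined with) the regular expressions @RobRaisch suggested, you could also use something like … and Bayesian logic.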
Overview</font></div><div style="line-height:120%;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;"><br></font></div><div style="line-height:120%;text-indent:48px;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;">Contango is a Houston, Texas based, independent natural gas and oil company.&#160; The Company's core business is to explore, develop, produce and acquire natural gas and oil properties offshore in the shallow waters of the Gulf of Mexico.&#160; Contango Operators, Inc. (&#8220;COI&#8221;), our wholly-owned subsidiary, acts as operator of our offshore properties.  Contango has additional onshore investments in i) Alta Resources Investments, LLC ("Alta"), whose primary area of focus is the liquids-rich Kaybob Duvernay in Alberta, Canada; ii) Exaro Energy III LLC ("Exaro"), which is primarily focused on the development of proved natural gas reserves.
Overview Contango is a Houston, Texas based, independent natural gas and oil company.&#160; The Company's core business is to explore, develop, produce and acquire natural gas and oil properties offshore in the shallow waters of the Gulf of Mexico.&#160; Contango Operators, Inc. (&#8220;COI&#8221;), our wholly-owned subsidiary, acts as operator of our offshore properties.  Contango has additional onshore investments in i) Alta Resources Investments, LLC ("Alta"), whose primary area of focus is the liquids-rich Kaybob Duvernay in Alberta, Canada; ii) Exaro Energy III LLC ("Exaro"), which is primarily focused on the development of proved natural gas reserves
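Going from the tagged fragment to plain text like the paragraph above can be sketched with regular expressions alone. This is a rough illustration, not a robust parser: `htmlToText` and `decodeEntities` are hypothetical helpers, only a handful of named entities are handled, and a `>` inside an attribute value would break the tag pattern. Unlike the paragraph above, this sketch also decodes entities such as `&#160;`.

```javascript
// Sketch: reduce a filing's HTML fragment to plain text.
// htmlToText and decodeEntities are hypothetical helpers, not library calls.
const NAMED_ENTITIES = new Map([
  ["amp", "&"], ["lt", "<"], ["gt", ">"], ["quot", '"'], ["nbsp", "\u00a0"],
]);

function decodeEntities(s) {
  // One pass, so already-decoded text is never decoded a second time.
  return s.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (whole, body) => {
    if (body[0] !== "#") return NAMED_ENTITIES.get(body.toLowerCase()) ?? whole;
    const isHex = body[1] === "x" || body[1] === "X";
    const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
    // Leave out-of-range references untouched instead of throwing.
    return code <= 0x10ffff ? String.fromCodePoint(code) : whole;
  });
}

function htmlToText(html) {
  // Turn block-level closers and <br> into line breaks before stripping tags.
  const withBreaks = html.replace(/<\/(div|p|tr|h[1-6])>|<br\s*\/?>/gi, "\n");
  const noTags = withBreaks.replace(/<[^>]+>/g, "");
  return decodeEntities(noTags)
    .replace(/\u00a0/g, " ")   // non-breaking spaces become plain spaces
    .replace(/[ \t]+/g, " ")   // collapse runs of spaces
    .replace(/ *\n\s*/g, "\n") // collapse blank lines
    .trim();
}

// A shortened fragment in the style of the sample filing.
const fragment =
  '<div style="font-size:10pt;"><font style="font-size:10pt;">' +
  "Contango Oil&#160;&amp; Gas Company (&#8220;COI&#8221;)</font></div>";

console.log(htmlToText(fragment));
```

For malformed markup, a real HTML parser would probably be a safer choice than these patterns.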