Python: How do I define a custom tag's properties in BeautifulSoup?


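I have an SGML file that mixes tags that need to be closed with tags that do not. BeautifulSoup can prettify HTML, but my tags are custom, and BeautifulSoup only closes them at the end of the file. Here is the source: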

from bs4 import BeautifulSoup
import requests

url = 'https://www.sec.gov/Archives/edgar/data/1122304/000119312515118890/0001193125-15-118890.hdr.sgml'
sgml = requests.get(url).text
soup = BeautifulSoup(sgml, 'html5lib')

Here is the file content:

0001193125-15-118890.hdr.sgml:20150403
20150403143902
0001193125-15-118890
DEF14A
37
20150515
20150403
20150403
20150403
AETNA INC /PA/
0001122304
6324
232229683
PA
1231
...

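Here, <FILER> and <COMPANY-DATA> need closing tags, while the others do not.

How can I make BeautifulSoup's parser close certain tags at the end of the line? Is this related to how BS treats <br> and <li> differently from tags such as <div>?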
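To make the problem concrete, here is a minimal sketch that uses the standard library's html.parser rather than BeautifulSoup itself. It logs the start and end events a generic HTML parser reports for a small sample. The sample text and the EventLogger class are illustrative assumptions, not part of the original file or of any library.

```python
from html.parser import HTMLParser

# Hypothetical three-line sample in the same style as the SEC header file.
SAMPLE = "<FILER>\n<CIK>0001122304\n</FILER>\n"


class EventLogger(HTMLParser):
    """Record every start and end tag event the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed(SAMPLE)
logger.close()

# html.parser lowercases tag names; note that no end event is reported for "cik".
print(logger.events)
# [('start', 'filer'), ('start', 'cik'), ('end', 'filer')]
```

The parser has no rule telling it that the custom tag ends with its line, so any tree built on top of these events has to decide for itself where the tag closes.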

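I could not find a way to control the tree builder in BeautifulSoup. I simply closed the open tags with a regular expression (as @ChristosPapoulas suggested) and ended up with an XML file.

Add this to the code in my question: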

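import re

# Find all opening tags
all_tags = re.findall(
    r'<([^>/]+)>',
    sgml
)

# Find the tags that have an explicit closing tag
closed_tags = set(re.findall(
    r'</([^>]+)>',
    sgml
))

# Deduce the tags that are never closed (deduplicated, in sorted order)
open_tags = sorted({x for x in all_tags if x not in closed_tags})

# Close each open tag at the end of its line, since each of them takes just one line
# (re.escape guards against regex metacharacters in tag names)
sgml_xml = re.sub(
    r'(<({})>.*)'.format('|'.join(map(re.escape, open_tags))),
    r'\1</\2>',
    sgml
)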
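As a quick check of this approach, the sketch below runs the same substitution on a small inline sample and parses the result with the standard library's xml.etree.ElementTree. The sample text and its tag names are assumptions chosen to resemble the SEC header, not a copy of the real file.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: two container tags and two one-line tags.
sample = (
    "<FILER>\n"
    "<COMPANY-DATA>\n"
    "<CIK>0001122304\n"
    "<FISCAL-YEAR-END>1231\n"
    "</COMPANY-DATA>\n"
    "</FILER>\n"
)

# Same idea as above: tags without a closing counterpart are "open" tags.
all_tags = re.findall(r'<([^>/]+)>', sample)
closed_tags = set(re.findall(r'</([^>]+)>', sample))
open_tags = sorted({t for t in all_tags if t not in closed_tags})

# Append a closing tag at the end of every line that starts an open tag.
pattern = r'(<({})>.*)'.format('|'.join(map(re.escape, open_tags)))
sample_xml = re.sub(pattern, r'\1</\2>', sample)

# The result is well-formed XML, so a regular XML parser can read it.
root = ET.fromstring(sample_xml)
cik = root.findtext('COMPANY-DATA/CIK')
print(open_tags)  # ['CIK', 'FISCAL-YEAR-END']
print(cik)        # 0001122304
```

One caveat: values that contain characters such as & or < would have to be escaped before the result can be parsed as XML.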


I am still curious about how to manipulate tag properties in the tree builder.

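BeautifulSoup is meant for parsing and extracting data from malformed HTML/XML, but when the broken markup is ambiguous it falls back on a set of rules to interpret the tags, which is not what you want here. Why not use regular expressions instead of BeautifulSoup to parse the file?

@ChristosPapoulas For custom tags, BeautifulSoup used to have a selfClosingTags parameter in the constructor (BeautifulSoup()). It is no longer there. BS4 says "the tree builder is responsible for understanding self-closing tags", but how do I set them there?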
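If the set of container tags is known in advance, one alternative is to skip the HTML parser and build the tree line by line. The sketch below is not a BeautifulSoup feature. The parse_lines function, its container_tags parameter, the synthetic ROOT wrapper, and the sample text are all assumptions made for illustration, using only the standard library.

```python
import re
import xml.etree.ElementTree as ET

# One tag per line: optional "/", the tag name, then the rest of the line.
LINE_RE = re.compile(r'<(/?)([^>]+)>(.*)')


def parse_lines(text, container_tags):
    """Build an ElementTree from line-oriented SGML.

    Tags named in container_tags stay open until their explicit closing tag;
    every other tag is treated as closed at the end of its line.
    """
    root = ET.Element('ROOT')  # synthetic wrapper element
    stack = [root]
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank lines and lines without a tag
        closing, name, rest = match.groups()
        if closing:
            # Pop back to the matching container; ignore unmatched closes.
            if any(el.tag == name for el in stack[1:]):
                while stack.pop().tag != name:
                    pass
            continue
        element = ET.SubElement(stack[-1], name)
        element.text = rest.strip() or None
        if name in container_tags:
            stack.append(element)
    return root


# Hypothetical sample in the style of the SEC header.
sample = (
    "<FILER>\n"
    "<COMPANY-DATA>\n"
    "<CIK>0001122304\n"
    "<FISCAL-YEAR-END>1231\n"
    "</COMPANY-DATA>\n"
    "</FILER>\n"
)

tree = parse_lines(sample, {'FILER', 'COMPANY-DATA'})
print(tree.findtext('FILER/COMPANY-DATA/CIK'))  # 0001122304
```

Passing the container tags explicitly plays the role the old selfClosingTags argument played in reverse: anything not listed is assumed to end with its line.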
