Python: Beautiful Soup handling of void tags and self-closing tags

python html web-scraping

I am trying to use BeautifulSoup to extract the HTML tags and remove the text. For example, take this HTML:

html_page = """
<html>
<body>
<table>
<tr class=tb1><td>Lorem Ipsum dolor Sit amet</td></tr>
<tr class=tb1><td>Consectetuer adipiscing elit</td></tr>
<tr><td>Aliquam Tincidunt mauris eu Risus</td></tr>
<tr><td>Vestibulum Auctor Dapibus neque</td></tr>
</table>
</body>
</html>
"""

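This works well until you have to deal with void tags and the self-closing tag variations between HTML5, HTML4, and XHTML. For example, [...] and [...] should both be output as [...] and [...], but the get_tags code produces [...]. I am not sure whether I need to modify the code or whether the problem is which parser to use.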
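One possible explanation, offered as a sketch rather than a confirmed diagnosis: Beautiful Soup serializes a tag in void form only while the tag has no contents, so assigning a string to a void tag such as br forces an explicit closing tag. The snippet assumes the bs4 package with Python's built-in html.parser; the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the same void tag written two ways.
sample = "<p>one</p><br><p>two</p><br />"
soup = BeautifulSoup(sample, "html.parser")

# Before any modification, check whether each <br> counts as an empty (void) element.
void_before = [tag.is_empty_element for tag in soup.find_all("br")]

# Giving a void tag a string child means it is no longer empty,
# so it is serialized with an explicit closing tag.
first_br = soup.find("br")
first_br.string = " "
rendered = str(soup)

print(void_before)
print(rendered)
```

If this matches what you see, the parser may not be the culprit; the text-replacement step would be.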

Duplicate. Use the xml solution.

<html>
<body>
<table>
<tr><td></td></tr>
<tr><td></td></tr>
<tr><td></td></tr>
<tr><td></td></tr>
</table>
</body>
</html>

from bs4 import BeautifulSoup

def get_tags(soup):
    # Note: this modifies soup in place; no copy is made.
    for element in soup.find_all():
        if not element.find(recursive=False):
            element.string = ' '  # leaf tag: replace its text with a single space
        element.attrs = {}  # remove all tag attributes
    #return str(soup).split()
    return soup.prettify()

soup = BeautifulSoup(html_page, 'html.parser')
print(get_tags(soup))
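A minimal sketch of one way to keep void tags intact, assuming the extra closing tag comes from assigning a string to tags like br. The function name get_tags_keep_void and the sample markup are hypothetical; is_empty_element is a property of bs4 tags, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

def get_tags_keep_void(markup, parser="html.parser"):
    """Strip text and attributes, leaving void tags such as <br> untouched."""
    soup = BeautifulSoup(markup, parser)
    for element in soup.find_all():
        is_leaf = element.find(recursive=False) is None
        # Only blank out leaf tags that are not void elements.
        if is_leaf and not element.is_empty_element:
            element.string = " "
        element.attrs = {}  # remove all tag attributes
    return str(soup)

# Hypothetical input mixing <br> and <br /> with ordinary tags.
result = get_tags_keep_void('<p class="x">one</p><br><p>two</p><br />')
print(result)
```

As in the original function, text inside a tag that also contains child tags is left in place, because only leaf tags are blanked.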