Python: Extracting text between HTML comments with BeautifulSoup
Using Python 3 and BeautifulSoup 4, I would like to be able to extract text from an HTML page that is identified only by the comment above it. For example:
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
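I have found various ways to extract a page's text or its comments, but no way to do what I want. Any help would be greatly appreciated.
Python's bs4 module has a Comment class. You can use it to extract the comments: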
from bs4 import BeautifulSoup, Comment
html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
# Comment is a NavigableString subclass, so a string filter can select comment nodes
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
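To see what this filter returns, here is a minimal, self-contained sketch. It assumes the same sample HTML as above and swaps in Python's built-in `html.parser` so that lxml is not required; the variable name `comment_texts` is only illustrative.

```python
from bs4 import BeautifulSoup, Comment

html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""

# Assumption: html.parser (bundled with Python) instead of 'lxml'
soup = BeautifulSoup(html, "html.parser")

# Select only the comment nodes, then convert them to plain strings
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]
print(comment_texts)
```

Note that this yields the comments themselves, not the text that follows them.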
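You just need to iterate over all the available comments, check whether each one is among the entries you need, and then display the text of the element that follows it, like this: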
from bs4 import BeautifulSoup, Comment
html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
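for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    # Keep only the comments of interest
    if comment in ['UNIQUE COMMENT', 'SECOND UNIQUE COMMENT']:
        # next_element is the text node that directly follows the comment
        print(comment.next_element.strip())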
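An improvement on Martin's answer: you can search for the specific comments directly. There is no need to iterate over all the comments and then check their values; it can be done in one go: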
comments_to_search_for = {'UNIQUE COMMENT', 'SECOND UNIQUE COMMENT'}
for comment in soup.find_all(string=lambda text: isinstance(text, Comment) and text in comments_to_search_for):
    print(comment.next_element.strip())
Prints:
I would like to get this text
I would also like to find this text
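The answers above rely on `next_element`, which returns only the single node right after the comment. If the text between two comments may contain inline tags, one possible variation is to walk `next_siblings` until the next comment is reached. The following is a rough sketch under that assumption; `text_after_comment` is a hypothetical helper name, the sample HTML is adapted from the question, and `html.parser` is used instead of lxml.

```python
from bs4 import BeautifulSoup, Comment


def text_after_comment(soup, marker):
    """Hypothetical helper: text between the comment `marker` and the next comment."""
    comment = soup.find(
        string=lambda s: isinstance(s, Comment) and s.strip() == marker
    )
    if comment is None:
        return None
    parts = []
    for sibling in comment.next_siblings:
        if isinstance(sibling, Comment):
            break  # stop at the next comment
        # Text nodes are str subclasses; for tags, take their inner text
        parts.append(sibling if isinstance(sibling, str) else sibling.get_text())
    # Collapse runs of whitespace and trim the ends
    return " ".join("".join(parts).split())


html = """
<body>
<!--UNIQUE COMMENT-->
I would like to <b>get</b> this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
"""

soup = BeautifulSoup(html, "html.parser")
first = text_after_comment(soup, "UNIQUE COMMENT")
second = text_after_comment(soup, "SECOND UNIQUE COMMENT")
print(first)
print(second)
```

This only looks at siblings of the comment, so it assumes the comments and the wanted text share the same parent element.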
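I think the OP is trying to extract the text between the comments, not the comments themselves. "I would like to get this text": this?
Yes, that. I can extract the comments just fine.
I was just about to do that. +1
This is exactly what I needed. Thank you very much.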