
Python: Extracting text between HTML comments with BeautifulSoup


Using Python 3 and BeautifulSoup 4, I want to extract text from an HTML page where the text is identified only by the comment above it. For example:

<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text


I have found various ways to extract a page's text or its comments, but no way to do what I want. Any help would be greatly appreciated.

Python's bs4 module has a Comment class. You can use it to extract the comments:

from bs4 import BeautifulSoup, Comment

html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
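
To see what that search returns, here is a minimal sketch that prints the matched comments as plain strings. It assumes the built-in 'html.parser' instead of 'lxml' so that no extra package is needed, and the name comment_texts is only illustrative.

```python
from bs4 import BeautifulSoup, Comment

html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""

# Assumption: 'html.parser' is used here so the sketch has no lxml dependency.
soup = BeautifulSoup(html, "html.parser")

# Comment is a subclass of NavigableString, so a string filter can select it.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Convert each Comment node to a plain str holding the comment body.
comment_texts = [str(c) for c in comments]
print(comment_texts)  # ['UNIQUE COMMENT', 'SECOND UNIQUE COMMENT']
```

This yields only the comments themselves, not the text that follows them.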

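You can simply iterate over all the comments, check whether each one is one of the entries you need, and then print the text of the element that follows it, like this: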

from bs4 import BeautifulSoup, Comment

html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')

# Python 3: print is a function, so the call needs parentheses.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if comment in ['UNIQUE COMMENT', 'SECOND UNIQUE COMMENT']:
        print(comment.next_element.strip())

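An improvement on Martin's answer: you can search for the specific comments directly. There is no need to iterate over all the comments and then check their values; it can be done in one pass: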

comments_to_search_for = {'UNIQUE COMMENT', 'SECOND UNIQUE COMMENT'}
for comment in soup.find_all(text=lambda text: isinstance(text, Comment) and text in comments_to_search_for):
    print(comment.next_element.strip())
Prints:

I would like to get this text
I would also like to find this text
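
Note that next_element returns only the single node right after the comment. If the text between two comments may span several nodes (for example, plain text mixed with tags), one possible approach is to walk next_siblings until the next comment is reached. The sketch below is an illustration under stated assumptions, not part of the answers above: the helper name text_after_comment and the extra <b> tag in the sample HTML are hypothetical, and 'html.parser' is assumed in place of 'lxml'.

```python
from bs4 import BeautifulSoup, Comment, Tag

html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<b>and this</b>
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""


def text_after_comment(soup, marker):
    """Return the text between the comment `marker` and the next comment.

    Hypothetical helper: returns None when no comment matches `marker`.
    """
    comment = soup.find(
        string=lambda s: isinstance(s, Comment) and s.strip() == marker
    )
    if comment is None:
        return None

    parts = []
    for node in comment.next_siblings:
        if isinstance(node, Comment):
            break  # the next comment ends the section
        # Tags contribute their rendered text; plain strings are used as-is.
        parts.append(node.get_text() if isinstance(node, Tag) else str(node))

    # Collapse runs of whitespace, including newlines, into single spaces.
    return " ".join("".join(parts).split())


soup = BeautifulSoup(html, "html.parser")
print(text_after_comment(soup, "UNIQUE COMMENT"))
# I would like to get this text and this
print(text_after_comment(soup, "SECOND UNIQUE COMMENT"))
# I would also like to find this text
```

The walk only covers siblings of the comment, so it assumes the comment and the text it labels share the same parent element.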

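I think the OP is trying to extract the text between the comments, not the comments themselves.
"I would like to get this text" - this one?
Yes, that one. I can extract the comments just fine.
I was just about to do that myself. +1
This is exactly what I needed. Thank you very much.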