Python: How can I strip comment tags from HTML using BeautifulSoup?

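I've been playing with BeautifulSoup, which is great. My end goal is to get just the text from a page. I only want the text from the body, with a special case for pulling the title and/or alt attributes from particular tags.

Straight from the documentation: you can easily strip comments (or anything else) with extract():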

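from BeautifulSoup import BeautifulSoup, Comment  # BeautifulSoup 3 (Python 2)
soup = BeautifulSoup("""1<!--The loneliest number-->
                        <a>2<!--Can be as bad as one--><b>3""")
# Find every Comment node, then detach each one from the tree
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>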
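
For anyone on the newer bs4 package and Python 3, here is a minimal sketch of the same extract() approach. It assumes bs4 is installed and uses the built-in "html.parser" backend. The sample markup is based on the one in this answer with the tags closed by hand, so treat it as an illustration rather than the documented example.

```python
from bs4 import BeautifulSoup, Comment

# Sample markup (closing tags added by hand for this sketch)
html = """1<!--The loneliest number-->
<a>2<!--Can be as bad as one--><b>3</b></a>"""

soup = BeautifulSoup(html, "html.parser")

# Collect every Comment node first, then detach each one from the tree.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()

print(soup)
```

A plain for loop is used instead of a list comprehension because extract() is called only for its side effect.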

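I am still trying to figure out why it doesn't find and strip comments like these:

<!-- Begin function popUp(URL) { day = new Date(); id = day.getTime(); eval("page" + id + " = window.open(URL, '" + id + "', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=300,height=330,left = 774,top = 518');"); } // End -->
<!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var whichlink=0 var whichimage=0 var blenddelay=(ie)? document.images.slide.filters[0].duration*1000 : 0 function slideit(){ if (!document.images) return if (ie) document.images.slide.filters[0].apply() document.images.slide.src=imageholder[whichimage].src if (ie) document.images.slide.filters[0].play() whichlink=whichimage whichimage=(whichimage<slideimages.length-1)? whichimage+1 : 0 setTimeout("slideit()",slidespeed+blenddelay) } slideit() //-->

Those backslashes cause certain tags to be overlooked.

This may be a problem with the underlying SGML parser. You can override it with a markupMassage regex, straight from the docs:

import re, copy
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

# Assumed sample input, inferred from the expected output below
badString = "Foo<!-This comment is malformed.-->Bar<br/>Baz"

# Rewrite a malformed "<!-" comment opener as "<!--" before parsing
myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
myNewMassage.extend(myMassage)

BeautifulSoup(badString, markupMassage=myNewMassage)
# Foo<!--This comment is malformed.-->Bar<br />Baz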
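
The markupMassage rule in this answer (the pattern `<!-([^-])`, rewritten to `<!--` plus the captured character) is an ordinary regex substitution applied to the markup before parsing. The sketch below tries the same substitution on a plain string with only the standard library, which may be handy if your parser does not accept a markupMassage argument. The names `repair_comment_openers` and `bad_string` are hypothetical, and the input string is an assumed example.

```python
import re

# Same pattern as the massage rule: "<!-" followed by a character that is not "-".
MALFORMED_COMMENT_OPEN = re.compile(r"<!-([^-])")


def repair_comment_openers(markup):
    """Turn a malformed '<!-x' comment opener into a well-formed '<!--x'."""
    return MALFORMED_COMMENT_OPEN.sub(lambda m: "<!--" + m.group(1), markup)


bad_string = "Foo<!-This comment is malformed.-->Bar<br/>Baz"  # assumed input
fixed = repair_comment_openers(bad_string)
print(fixed)
# Foo<!--This comment is malformed.-->Bar<br/>Baz
```

Well-formed openers are left alone, because in `<!--` the character after `<!-` is a dash and the pattern requires a non-dash there.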

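If you are looking for a solution in BeautifulSoup version 3:

import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""")
# Grab the comment node only to get at its class; the pattern must match the comment text
comment = soup.find(text=re.compile("nice"))
Comment = comment.__class__
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
print soup.prettify()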

If mutation isn't your bag, you can:

[t for t in soup.find_all(text=True) if not isinstance(t, Comment)]
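
To make the non-mutating version easier to try, here is a small self-contained sketch. It assumes the bs4 package with the built-in "html.parser" backend, and the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Made-up sample markup for this sketch
html = "<body>Hello! <!--I've got to be nice to get what I want.--><b>World</b></body>"
soup = BeautifulSoup(html, "html.parser")

# Keep every text node that is not a Comment; the tree itself is left untouched.
texts = [t for t in soup.find_all(string=True) if not isinstance(t, Comment)]

print(texts)
print(soup)  # the comment is still present in the parsed tree
```

Because nothing is extracted, the same soup can still be searched or serialized afterwards with its comments intact.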

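Is there a source document that is being used as a test case? It would be very helpful if you could provide something you want as a basis for comparison.

I don't know why I didn't see that. Thanks for waking me up!

Nice. But a list comprehension with side effects looks nasty :p. How about map(lambda x: x.extract(), comments)?

Did something change in BeautifulSoup? I tried 3.2.0 and it had no problem with comments like that.

That's a tough one, and this looks like a good workaround. It's too bad it still ends up parsing HTML with regular expressions. Stupid regex! Well, I'll re-compile a pattern to detect the messed-up comments I listed. I need to relearn my regex-fu, though. Blech.

@jathanism -- BeautifulSoup uses several regexes internally to massage the HTML before feeding it to sgmllib. It isn't pretty, but it isn't Lovecraftian either.

Just to update this old post: BeautifulSoup.MARKUP_MASSAGE has been deprecated. "The BeautifulSoup constructor no longer recognizes the markupMassage argument. It's now the parser's responsibility to handle markup correctly." (at the bottom of the page)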
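
One caveat on the map(lambda x: x.extract(), comments) suggestion: in Python 3, map() returns a lazy iterator, so nothing runs until it is consumed. The sketch below shows this with a hypothetical stand-in class (`Node` is not part of BeautifulSoup) so it runs without any third-party package.

```python
class Node:
    """Hypothetical stand-in for a parsed comment node with an extract() method."""

    def __init__(self, name, removed):
        self.name = name
        self._removed = removed

    def extract(self):
        # Record the call so the side effect is observable.
        self._removed.append(self.name)
        return self


removed = []
comments = [Node("first", removed), Node("second", removed)]

# In Python 3, map() only builds a lazy iterator; no extract() call has run yet.
lazy = map(lambda c: c.extract(), comments)
after_map = list(removed)
print(after_map)  # []

# A plain loop runs the side effect immediately and reads clearly.
for comment in comments:
    comment.extract()
print(removed)  # ['first', 'second']
```

So on Python 3 a plain for loop is the safer way to call extract() on each comment.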
