Python: Remove all inline styles using BeautifulSoup


I'm cleaning up HTML with BeautifulSoup. I'm a noob to both Python and BeautifulSoup. Based on an answer I found elsewhere on Stackoverflow, I already remove tags correctly like this:

[s.extract() for s in soup('script')]

But how do I remove inline styles? For example, the following:

<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>
<img class="some_image" href="somewhere.com">


Should become:

<p>Text</p>
<img href="somewhere.com">


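How do I remove the inline class, id, name and style attributes from all elements?

The answers to similar questions that I could find all suggest using a CSS parser for this rather than BeautifulSoup. Since the task is only to remove the attributes rather than manipulate them, and it is a blanket rule for all tags, I was hoping to find a way to do it all within BeautifulSoup.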

You don't need to parse any CSS if you just want to remove it all. BeautifulSoup provides a way to remove entire attributes, like so:

for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]
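
As a runnable illustration of the loop above, here is a minimal sketch applied to the HTML from the question. It assumes BeautifulSoup 4 with the built-in "html.parser"; the names `UNWANTED` and `cleaned` are made up for this example. It uses `tag.attrs.pop(...)` instead of `del` simply so that a missing attribute is an explicit no-op.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="author" id="author_id" name="author_name" '
    'style="color:red;">Text</p>'
    '<img class="some_image" href="somewhere.com">'
)
soup = BeautifulSoup(html, "html.parser")

# Attributes to strip from every tag (illustrative name, not from the answer)
UNWANTED = ("class", "id", "name", "style")

for tag in soup.find_all(True):  # True matches every tag
    for attribute in UNWANTED:
        # pop() with a default does nothing when the attribute is absent
        tag.attrs.pop(attribute, None)

cleaned = str(soup)
print(cleaned)
```

The `href` on the `img` tag survives because it is not in the list of unwanted attributes.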

Also, if you just want to delete entire tags (and their contents), you don't need extract(), which returns the tag. You just need decompose().
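
To make the difference concrete, here is a small sketch comparing the two calls. It assumes BeautifulSoup 4 with "html.parser"; the sample markup and the variable names are invented for the example.

```python
from bs4 import BeautifulSoup

html = (
    "<div><script>var a = 1;</script>"
    "<p>Keep me</p>"
    "<script>var b = 2;</script></div>"
)

# extract() detaches each tag and returns it, so the removed nodes can be kept.
soup_a = BeautifulSoup(html, "html.parser")
removed = [s.extract() for s in soup_a("script")]

# decompose() detaches each tag and destroys it; there is nothing to collect.
soup_b = BeautifulSoup(html, "html.parser")
for s in soup_b("script"):
    s.decompose()

print(len(removed))
print(soup_a)
print(soup_b)
```

Both documents end up without the script tags; only the extract() version still holds the removed nodes in a list.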

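It's not a big difference, just something else I found while looking through the docs. You can find more details about the API in the documentation, which provides many examples.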

I wouldn't do this in BeautifulSoup. You will spend a lot of time trying, testing, and working around edge cases. Bleach does exactly this for you.

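If you were to do this in BeautifulSoup, I'd suggest you go with the "whitelist" approach, like Bleach does. Decide which tags may have which attributes, and strip every tag/attribute that doesn't match.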
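
The whitelist idea could be sketched in BeautifulSoup roughly as follows. This is only an illustration, not a security-grade sanitizer and not how Bleach is implemented. The policy `ALLOWED`, the list `DROP_ENTIRELY` and the function `sanitize` are hypothetical names, and the choice to unwrap disallowed tags (keeping their text) while deleting script and style outright is an assumption made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical policy for illustration: tag name -> attributes it may keep.
ALLOWED = {"p": set(), "a": {"href", "title"}, "img": {"src", "alt"}}
# Tags whose contents should vanish too (assumed choice, not from the answer).
DROP_ENTIRELY = ["script", "style"]

def sanitize(html):
    soup = BeautifulSoup(html, "html.parser")
    # Remove these tags together with everything inside them.
    for tag in soup(DROP_ENTIRELY):
        tag.decompose()
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED:
            # Disallowed tag: drop the tag itself but keep its children.
            tag.unwrap()
            continue
        keep = ALLOWED[tag.name]
        # Allowed tag: keep only the attributes listed for it.
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in keep}
    return str(soup)

dirty = (
    '<div id="x"><p class="c" style="color:red">Hi '
    '<a href="/u" onclick="evil()">link</a></p>'
    "<script>alert(1)</script></div>"
)
result = sanitize(dirty)
print(result)
```

For untrusted input, the edge cases mentioned above (malformed markup, dangerous URL schemes inside allowed attributes, and so on) are exactly what this sketch does not cover.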
Based on jmk's function, I use this function to remove every attribute that is not on a whitelist.
It works in Python 2 with BeautifulSoup 3:
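def clean(tag, whitelist=[]):
    # Reset the attributes on the root tag itself
    tag.attrs = None
    for e in tag.findAll(True):
        # Iterate over a copy, because deleting mutates e.attrs
        for attribute in list(e.attrs):
            # In BeautifulSoup 3 each attribute is a (name, value) tuple
            if attribute[0] not in whitelist:
                del e[attribute[0]]
        # e.attrs = None     # delete all attributes
    return tag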

#example to keep only title and href
clean(soup,["title","href"])

Here is my solution for Python 3 and BeautifulSoup 4:
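def remove_attrs(soup, whitelist=tuple()):
    # find_all(True) matches every tag in the document
    for tag in soup.find_all(True):
        # Build the list first so tag.attrs is not modified while iterating
        for attr in [attr for attr in tag.attrs if attr not in whitelist]:
            del tag[attr]
    return soup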

It supports a whitelist of attributes that should be kept. :) If no whitelist is supplied, all attributes are removed.
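
As a usage illustration, the function can be run against the HTML from the question. The definition is repeated here only so the snippet runs on its own; the variable names `kept_href` and `stripped` are made up, and BeautifulSoup 4 with "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

def remove_attrs(soup, whitelist=tuple()):
    # Same logic as the answer, repeated so this example is standalone.
    for tag in soup.find_all(True):
        for attr in [attr for attr in tag.attrs if attr not in whitelist]:
            del tag[attr]
    return soup

html = (
    '<p class="author" id="author_id" name="author_name" '
    'style="color:red;">Text</p>'
    '<img class="some_image" href="somewhere.com">'
)

# Keep only href, as in the output the question asks for.
kept_href = str(remove_attrs(BeautifulSoup(html, "html.parser"), ("href",)))

# No whitelist: every attribute is removed.
stripped = str(remove_attrs(BeautifulSoup(html, "html.parser")))

print(kept_href)
print(stripped)
```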
Not perfect, but short:
' '.join([el.text for tag in soup for el in tag.findAllNext(whitelist)]);

What about lxml's Cleaner?
from lxml.html.clean import Cleaner

content_without_styles = Cleaner(style=True).clean_html(content)

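Cool, I didn't know about Bleach. I hadn't considered the use case, but if the goal is to sanitize untrusted HTML, then that is clearly a better approach. You get my vote!

Bleach is quite nice. I really like it.

I was using extract() in case I decided at any point to generate a list of the removed code, but decompose() is fine for just removing and destroying the tag and its contents completely. Thanks for the attribute removal snippet, it works great!

Makes sense. I'll leave the note about decompose() for anyone else who might stumble upon this.

You shouldn't pass mutable structures as default function argument values. As you can see.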
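
The last comment refers to the `whitelist=[]` default in the Python 2 answer. That function never modifies the list, so it behaves correctly, but the general pitfall can be sketched as follows. The function names here are invented for the example and have nothing to do with BeautifulSoup.

```python
def add_tag_buggy(name, seen=[]):
    # The same list object is reused on every call that omits `seen`.
    seen.append(name)
    return seen

def add_tag_safe(name, seen=None):
    # A fresh list is created per call when no argument is given.
    if seen is None:
        seen = []
    seen.append(name)
    return seen

first = add_tag_buggy("p")
second = add_tag_buggy("img")  # the earlier "p" is still in the list

safe_first = add_tag_safe("p")
safe_second = add_tag_safe("img")

print(second)
print(safe_first, safe_second)
```

Using an immutable default such as `tuple()`, as the Python 3 answer does, avoids the issue without the `None` check.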