Python: Adjusting all text in a DOM tree with BeautifulSoup

I am trying to upper-case all the (user-visible) text in an HTML file. The obvious approach is:

from bs4 import BeautifulSoup

def upcaseAll(str):
    soup = BeautifulSoup(str)
    for tag in soup.find_all(True):
        for s in tag.strings:
            s.replace_with(unicode(s).upper())
    return unicode(soup)

That crashes:

File "/Users/malvolio/flip.py", line 23, in upcaseAll
    for s in tag.strings:
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 827, in _all_strings
    for descendant in self.descendants:
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 1198, in descendants
    current = current.next_element
AttributeError: 'NoneType' object has no attribute 'next_element'

Every variation I could think of fails the same way. BS4 doesn't seem to like me replacing a lot of NavigableStrings. How should I do this?

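You shouldn't use str as a function argument name, because it shadows the Python built-in.

You should also be able to transform the visible elements by using prettify with a formatter, like this: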

...
return soup.prettify(formatter=lambda x: unicode(x).upper())

I have now tested it, and it works:

from bs4 import BeautifulSoup
import requests

r = requests.get('http://www.stackoverflow.com')
soup = BeautifulSoup(r.content)
print soup.prettify(formatter=lambda x: unicode(x).upper())[:200]
<!DOCTYPE html>
<html>
 <head>
  <title>
   STACK OVERFLOW
  </title>
  <link href="//CDN.SSTATIC.NET/STACKOVERFLOW/IMG/FAVICON.ICO?V=00A326F96F68" rel="SHORTCUT ICON"/>
  <link href="//CDN.SSTATIC.NE
  ...
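For readers on Python 3, the same formatter idea might look like the sketch below. The sample markup and the name upper_html are made up for illustration, and html.parser is chosen only so that the snippet needs no extra parser installed. Note that the formatter appears to be applied to attribute values as well as text (the upper-cased href and rel in the output above suggest as much), so this approach changes more than the user-visible text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = '<p class="note">Hello <a href="/about">world</a></p>'

soup = BeautifulSoup(html, "html.parser")

# A callable formatter is handed each string as the document is written out
# and returns the text to emit in its place.
upper_html = soup.prettify(formatter=lambda s: s.upper())
print(upper_html)
```

If upper-cased attribute values such as href are not acceptable, replacing the text nodes in the tree is probably the safer route.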

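Why not use find_all(text=True) instead?

I'd like to point out that the documentation you refer to... uses str as a parameter name.

@Malvolio, very good spot... I believe they made a mistake :)

@Malvolio, actually I need to correct my comment above. I have just looked at the bs4 source code, and it is not a mistake but a rather clever way of implementing it. The formatter first checks whether the argument passed in is callable (str is, and so is a lambda), and in their example they cleverly override str to return str.upper(...) in order to do the upper-case conversion.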
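The find_all suggestion from the comments could be sketched roughly as follows, assuming the crash in the question comes from replacing strings while the tag.strings generator is still walking the tree. find_all builds its full result list before the loop starts, so replacing nodes inside the loop should not disturb the traversal. The function name upcase_all, the choice of html.parser, and the decision to skip comments, the doctype, and script/style content are assumptions made for this illustration. Recent bs4 releases spell the argument string= rather than text=.

```python
from bs4 import BeautifulSoup, Comment, Doctype


def upcase_all(html):
    """Return html with its text nodes upper-cased (illustrative sketch)."""
    soup = BeautifulSoup(html, "html.parser")

    # find_all returns a complete list up front, unlike the lazy
    # tag.strings generator, so the tree can be modified while looping.
    for s in soup.find_all(string=True):
        # Assumption: comments, the doctype, and script/style bodies
        # are not user-visible text, so they are left untouched.
        if isinstance(s, (Comment, Doctype)):
            continue
        if s.parent.name in ("script", "style"):
            continue
        s.replace_with(s.upper())

    return str(soup)


print(upcase_all("<p>Hello <b>world</b></p>"))
```

Unlike the formatter approach, this only touches text nodes, so attribute values such as href keep their original case.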
作为参数名。@Malvolio，非常好的位置。。。我相信他们犯了一个错误：）@Malvolio，实际上我需要纠正我的上述评论。我刚刚看过bs4源代码，这不是一个错误，而是一个非常聪明的实现方法。格式化程序将首先检查传递的参数是否可调用——str是，lambda也是，并且在它们的示例中，它们巧妙地重写str以返回str.upper（…），以实现大写转换。