Python: Replacing all smart quotes in Beautiful Soup


I have an HTML document, and I want to replace all smart quotes with regular quotes. I tried this:

for text_element in html.findAll():
    content = text_element.string
    if content:
        new_content = content \
            .replace(u"\u2018", "'") \
            .replace(u"\u2019", "'") \
            .replace(u"\u201c", '"') \
            .replace(u"\u201d", '"') \
            .replace("e", "x")
        text_element.string.replaceWith(new_content)

(The e/x substitution is only there to make it easy to see whether the replacement is working.)

But this is my output:

<p>
 This amount of investment is producing results: total final consumption in IEA countries is estimated to be
   <strong>
      60% lowxr
   </strong>
 today because of energy efficiency improvements over the last four decades. This has had the effect of
   <strong>
      avoiding morx xnxrgy consumption than thx total final consumption of thx Europxan Union in 2011
   </strong>
 .
</p>


It looks like BS is only reaching the innermost tags, but I need to get all the text in the whole page.

This works, but maybe there is a cleaner way:

for text_element in html.findAll():
    for child in text_element.contents:
        if child:
            content = child.string
            if content:
                new_content = remove_smart_quotes(content)
                child.string.replaceWith(new_content)

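Rather than selecting and filtering all the elements/tags, you can select the text nodes directly by passing True for the string argument, i.e. soup.find_all(string=True).

As the documentation states, the string argument is new in version 4.4.0, which means you may need to use the text argument instead, depending on your version: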

for text_node in soup.find_all(text=True):
  # do something with each text node
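
To see why the loop in the question skips the text that sits directly inside the parent tag, the following minimal sketch may help. The sample markup is made up for illustration, and the built-in html.parser is assumed so that no extra parser is needed. A tag's .string is None when the tag has more than one child, whereas find_all(string=True) returns every text node regardless of depth.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: a <p> with mixed content (text plus a child tag).
sample = "<p>\u2018outer\u2019 text <strong>\u201cinner\u201d</strong> tail</p>"
soup = BeautifulSoup(sample, "html.parser")

# <p> has several children, so .string is None and its own text is skipped.
print(soup.p.string)

# <strong> has a single text child, so .string returns that text.
print(soup.strong.string)

# Selecting text nodes directly reaches every piece of text, at any depth.
text_nodes = soup.find_all(string=True)
for node in text_nodes:
    print(repr(str(node)))
```

Here the three text nodes are the text before `<strong>`, the text inside it, and the text after it, which is why only the `<strong>` content changed in the output shown in the question.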

Here is the relevant code for replacing the values:

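from bs4 import BeautifulSoup

def remove_smart_quotes(text):
  # Replace curly single and double quotes with their ASCII equivalents.
  return text.replace(u"\u2018", "'") \
             .replace(u"\u2019", "'") \
             .replace(u"\u201c", '"') \
             .replace(u"\u201d", '"')

# html is the markup to parse (a string or an open file)
soup = BeautifulSoup(html, 'lxml')

# Visit every text node in the tree and swap in the cleaned text.
for text_node in soup.find_all(string=True):
  text_node.replace_with(remove_smart_quotes(text_node))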
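
As a self-contained variant of the same idea, the sketch below wraps the loop in a function and uses str.translate with a mapping table instead of chained replace calls. This is a swapped-in alternative rather than the answer's own code: the name straighten_quotes, the sample markup, and the choice of html.parser are assumptions for illustration. The replacement is built with the node's own class so that, for example, a comment stays a comment.

```python
from bs4 import BeautifulSoup

# Map each smart quote code point to its plain ASCII counterpart.
SMART_QUOTES = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
})

def straighten_quotes(markup):
    """Hypothetical helper: parse markup and fix quotes in every text node."""
    soup = BeautifulSoup(markup, "html.parser")
    # find_all returns a list, so replacing nodes while looping is safe.
    for text_node in soup.find_all(string=True):
        new_text = text_node.translate(SMART_QUOTES)
        # Rebuild with the same class (NavigableString, Comment, ...).
        text_node.replace_with(type(text_node)(new_text))
    return str(soup)

sample = "<p>\u2018outer\u2019 <strong>\u201cinner\u201d</strong></p>"
result = straighten_quotes(sample)
print(result)
```

Both the parent's text and the nested `<strong>` text are converted, since each text node is handled on its own.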

As a side note, the Beautiful Soup documentation actually has a section on smart quotes.
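
That part of the documentation covers the UnicodeDammit class, which can convert smart quotes while decoding bytes. Note that this addresses a slightly different situation from the loop above: it applies to raw bytes in a Windows encoding, not to text that has already been decoded. A minimal sketch, assuming input bytes in Windows-1252:

```python
from bs4 import UnicodeDammit

# Sample bytes in Windows-1252, where 0x93, 0x94 and 0x92 are smart quotes.
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

# smart_quotes_to="ascii" asks for plain ASCII quotes during decoding.
dammit = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii")
fixed = dammit.unicode_markup
print(fixed)
```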

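What happens if you call:

new_content = str(html).replace(u"\u2018", "'").replace(…

?

The problem isn't the replace part. It works fine on the child elements; it just doesn't affect the parent elements.

That is why I was wondering what would happen if you called it on the whole soup. Maybe I'm missing something?

I have a script on GitHub in which I use a dictionary to apply multiple changes to the lines in a file and save the results (in place): "hn.py" at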
