Python: How do I correctly extract special characters from a web page using BeautifulSoup?


python html utf-8


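I am trying to extract all the text from a web page at a given URL using BeautifulSoup. I tried running the code I found here:

Everything works fine except for special characters such as "é" or "á". I have tried a series of modifications, but none of them worked. Here is my code: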

from bs4 import BeautifulSoup
import requests
import re
import codecs

html = requests.get(yourWebsiteURL).content

unicode_str = html.decode('utf8')
encoded_str = unicode_str.encode("ascii",'ignore')
news_soup = BeautifulSoup(encoded_str, "html.parser")
a_text = news_soup.find_all('p')

y=[re.sub(r'<.+?>',r'',str(a)) for a in a_text]

file = codecs.open("textOutput.txt", "wb", encoding='utf-8')
file.write(str(y))
file.close()


However, I am fairly sure the problem comes from my use of bs4, because I have never run into it when writing to a file.

encoded_str = unicode_str.encode("ascii",'ignore')

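This line encodes the text as ASCII. ASCII does not contain special characters such as é or á, and with the 'ignore' error handler every character that cannot be represented is silently dropped. I don't see why you would take UTF-8 text that contains these characters and re-encode it as ASCII, which cannot hold them.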
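To make the effect concrete, here is a minimal sketch of what that line does. The sample string is made up for illustration and stands in for the decoded page content; it is not taken from the asker's page.

```python
# Hypothetical sample text standing in for the decoded page content.
unicode_str = "café déjà vu"

# errors="ignore" silently drops every character ASCII cannot represent.
ascii_bytes = unicode_str.encode("ascii", "ignore")

# Encoding as UTF-8 keeps every character and round-trips losslessly.
utf8_bytes = unicode_str.encode("utf-8")

print(ascii_bytes)                   # accented letters are gone
print(utf8_bytes.decode("utf-8"))    # accented letters are preserved
```

The accented letters never reach BeautifulSoup in the first case, which is why no change on the bs4 side could bring them back.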

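As a side note, use

[a.text for a in a_text]

to get the text between the <p> tags. You don't need a regular expression.

That was very silly advice given on that page. The asker there did not seem to understand what Unicode text is, and the answer you used is a very heavy-handed way of dealing with non-ASCII text. @KeyurPotdar: better still is

[a.get_text() for a in a_text]

because you can then specify options for how the sections are joined. The manual decoding could also be dropped and left to BeautifulSoup; the HTML page may contain metadata that specifies the correct codec.

Yes, I changed ascii to utf8 and now it works fine. Thanks!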
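Putting the suggestions from the comments together, a possible version of the script might look like the sketch below. It is an assumption-laden illustration, not the asker's final code: the inline `html` bytes are a hypothetical stand-in for `requests.get(yourWebsiteURL).content`, and the separator and `strip` arguments to `get_text()` are just one reasonable choice.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for requests.get(yourWebsiteURL).content (raw bytes).
html = (
    '<html><head><meta charset="utf-8"></head>'
    "<body><p>Café <b>déjà</b> vu</p><p>Señor</p></body></html>"
).encode("utf-8")

# Pass the raw bytes straight in; BeautifulSoup works out the encoding itself,
# using the page's own charset declaration when there is one.
news_soup = BeautifulSoup(html, "html.parser")

# get_text() replaces the regex: join nested pieces with a space and trim them.
y = [p.get_text(" ", strip=True) for p in news_soup.find_all("p")]

# Write real text (not str(list)) and let open() handle the UTF-8 encoding.
with open("textOutput.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(y))
```

Opening the file in text mode with an explicit `encoding="utf-8"` avoids the separate `codecs` import and keeps the special characters intact on disk.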