Python: Correctly detecting the encoding without any guessing when using Beautiful Soup


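I'm trying to improve the character-encoding support of a Python IRC bot that retrieves the titles of pages whose URLs are mentioned in a channel.

The process I currently use is as follows:

r = requests.get(url, headers={'User-Agent': '...'})

soup = bs4.BeautifulSoup(r.text, from_encoding=r.encoding)

title = soup.title.string.replace('\n', '').replace(...) etc.

Specifying from_encoding=r.encoding is a good start, because it lets us heed the charset from the Content-Type header when parsing the page.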

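Unlike the more general-purpose module ftfy, the approach taken by Unicode, Dammit is exactly what I'm looking for (see bs4/dammit.py). It heeds the information provided by any <meta> tag, rather than applying more blind guesswork to the problem.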

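When r.text is used, however, Requests tries to be helpful by automatically decoding the page with the charset from its Content-Type header, falling back to ISO 8859-1 when no charset is present. Unicode, Dammit, in turn, does not touch any markup that is already a Unicode string, so a wrong decoding at this stage is never corrected.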
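To make the failure mode concrete, here is a minimal standard-library sketch that mimics the decoding behaviour described above. It is not Requests' actual code: the helper names charset_from_content_type and decode_like_r_text are made up for illustration, and the sample page is an assumption.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def decode_like_r_text(body, content_type):
    """Hypothetical stand-in for the behaviour described above:
    use the header's charset, otherwise fall back to ISO 8859-1."""
    charset = charset_from_content_type(content_type) or "iso-8859-1"
    return body.decode(charset, errors="replace")


# A UTF-8 page, served once with and once without a charset in the header.
original = "<title>Привет</title>"
body = original.encode("utf-8")

good = decode_like_r_text(body, "text/html; charset=utf-8")
bad = decode_like_r_text(body, "text/html")

print(good == original)  # the header charset was honoured
print(bad == original)   # the ISO 8859-1 fallback silently produced mojibake
```

ISO 8859-1 maps every byte to a character, so the fallback never raises. The result is a perfectly valid Unicode string with the wrong contents, which is why nothing downstream notices.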

The solution I chose was to use r.content instead:

r = requests.get(url, headers={'User-Agent': '...'})

soup = bs4.BeautifulSoup(r.content)

title = soup.title.string.replace('\n', '').replace(...) etc.
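The difference between handing Beautiful Soup bytes and handing it an already-decoded string can be checked offline. The sketch below assumes a made-up windows-1251 page with a <meta charset> tag and uses the html.parser backend; raw stands in for r.content and wrong_text for an r.text that fell back to ISO 8859-1.

```python
from bs4 import BeautifulSoup

# Sample page (an assumption for illustration): Cyrillic text in windows-1251,
# with the encoding declared only in a <meta> tag.
raw = (
    '<html><head><meta charset="windows-1251">'
    "<title>Привет, мир</title></head><body></body></html>"
).encode("windows-1251")

# Bytes in: Beautiful Soup can read the <meta> declaration and decode correctly.
soup_from_bytes = BeautifulSoup(raw, "html.parser")
title_from_bytes = soup_from_bytes.title.string
detected = soup_from_bytes.original_encoding

# Text in: the string was already decoded (wrongly), and it is left untouched.
wrong_text = raw.decode("iso-8859-1")
soup_from_text = BeautifulSoup(wrong_text, "html.parser")
title_from_text = soup_from_text.title.string

print(detected, title_from_bytes)
print(soup_from_text.original_encoding, title_from_text)
```

With bytes, original_encoding reports the encoding that was used. With a string it is None, because no detection took place at all.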

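The only drawback I can see is that pages with only a charset in their Content-Type header will be subject to some guesswork by Unicode, Dammit, because passing BeautifulSoup the from_encoding=r.encoding argument would override Unicode, Dammit completely.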
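
It seems you want to prefer the encoding declared in the document over the one declared in the HTTP headers. UnicodeDammit (used internally by BeautifulSoup) does this the other way around if you just pass it the encoding from the header. You can overcome this by reading the encoding declared in the document and passing it as the one to try first. Roughly (untested!):
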
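import bs4
import requests
from bs4.dammit import EncodingDetector

r = requests.get(url, headers={'User-Agent': '...'})

# Only look for a <meta> declaration if the server says the body is HTML.
content_type_header = r.headers.get('Content-Type', '')
is_html = content_type_header.split(';', 1)[0].strip().lower().startswith('text/html')

# find_declared_encoding lives on EncodingDetector; give it the raw bytes.
declared_encoding = EncodingDetector.find_declared_encoding(r.content, is_html=is_html)

# from_encoding takes a single encoding name, not a list, so prefer the
# document's declaration and fall back to the encoding from the HTTP header.
encoding = declared_encoding or r.encoding
soup = bs4.BeautifulSoup(r.content, from_encoding=encoding)

title = soup.title...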
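
For anyone who wants to try the precedence rule without hitting the network, here is a small offline sketch. The helper name choose_encoding and the sample page are hypothetical. Unlike r.encoding, the helper returns None when the header carries no charset, so that Beautiful Soup's own detection still gets a chance in that case.

```python
from email.message import Message

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector


def choose_encoding(body, content_type):
    """Hypothetical helper: document-declared encoding first,
    then the charset from the Content-Type header, else None."""
    msg = Message()
    msg["Content-Type"] = content_type
    header_charset = msg.get_content_charset()  # None if no charset parameter
    is_html = msg.get_content_type() == "text/html"
    declared = EncodingDetector.find_declared_encoding(body, is_html=is_html)
    return declared or header_charset


# Sample page (an assumption): the <meta> tag is right, the header is wrong.
raw = (
    '<html><head><meta charset="windows-1251">'
    "<title>Привет</title></head></html>"
).encode("windows-1251")

encoding = choose_encoding(raw, "text/html; charset=utf-8")
soup = BeautifulSoup(raw, "html.parser", from_encoding=encoding)
print(encoding, soup.title.string)
```

Passing from_encoding=None is the same as not passing it, so the helper's result can be handed to BeautifulSoup unconditionally.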