Python BeautifulSoup gives me unicode+html symbols rather than straight unicode. Is this a bug or a misunderstanding?


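I'm using BeautifulSoup to scrape a website. The page renders fine in my browser:

Oxfam International’s report entitled “Offside!

In particular, the single and double quotes look fine. They look like HTML symbols rather than ASCII, though strangely, when I view the source in FF3, they appear to be normal ASCII.

Unfortunately, when I scrape the page I get something like this:

Oxfam International\xe2€™s report entitled \xe2€œOffside!

Oops, I mean this:

u'Oxfam International\xe2€™s report entitled \xe2€œOffside!

The page's metadata indicates 'iso-8859-1' encoding. I've tried different encodings, played with third-party unicode->ascii and html->ascii functions, and looked at the MS/iso-8859-1 discrepancy, but the fact of the matter is that ™ has nothing to do with a single quote, and I can't seem to turn the unicode+htmlsymbol combination into the right ASCII or HTML symbol. My knowledge here is limited, which is why I'm asking for help.

I'd be happy with an ASCII double quote, " or ".

The problem with the following is that I'm concerned there are other funny symbols decoded incorrectly:

\xe2€™

Below is some Python to reproduce what I'm seeing, followed by the things I've tried.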

import twill
from twill import get_browser
from twill.commands import go

from BeautifulSoup import BeautifulSoup as BSoup

url = 'http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271'
twill.commands.go(url)
soup = BSoup(twill.commands.get_browser().get_html())
ps = soup.body("p")
p = ps[52]

>>> p         
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 22: ordinal not in range(128)

>>> p.string
u'Oxfam International\xe2€™s report entitled \xe2€œOffside!<elided>\r\n'

This gives me:

u'<p>Oxfam International\xe2\u20ac\u2122s report entitled \xe2\u20ac\u0153Offside!

The best-case decode seems to give me the same results:

unicodedata.normalize('NFKC', p.decode()).encode('ascii','ignore')
'<p>Oxfam InternationalTMs report entitled Offside!


Edit 2:

I am running Mac OS X 4 with FF 3.0.7 and Firebug.

Python 2.5 (wow, I can't believe I didn't state this from the beginning).

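That's one seriously messed-up page, encoding-wise :-)

There's nothing really wrong with your approach at all. I would probably tend to do the conversion before passing the text to BeautifulSoup, just because I'm picky: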

import urllib
from BeautifulSoup import BeautifulSoup

html = urllib.urlopen('http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271').read()
h = html.decode('iso-8859-1')  # decode explicitly before parsing
soup = BeautifulSoup(h)

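In this case, though, the page's meta tag is wrong about the encoding. The page is actually UTF-8. Firefox's Page Info shows the real encoding, and you can see the same charset in the response headers returned by the server: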

curl -i http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271
HTTP/1.1 200 OK
Connection: close
Date: Tue, 10 Mar 2009 13:14:29 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Set-Cookie: COMPANYID=271;path=/
Content-Language: en-US
Content-Type: text/html; charset=UTF-8
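
As a rough illustration of trusting the header rather than the meta tag, the charset can be pulled out of a Content-Type value like the one above. This is a minimal sketch in Python 3 (the thread itself uses Python 2.5). The helper name charset_from_content_type is made up for the example, and it only parses a header string; it does not fetch the page.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the lowercased charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


# Header value copied from the curl output above.
header_charset = charset_from_content_type("text/html; charset=UTF-8")
# What the page's meta tag claims, according to the question.
meta_charset = "iso-8859-1"

print(header_charset)
print(header_charset == meta_charset)
```

When the two values disagree, as they do here, the header is usually the better guess, but that is a rule of thumb rather than a guarantee.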

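If you decode using 'utf-8' rather than 'iso-8859-1' (that is, html.decode('utf-8') in the snippet above), it will work for you (or at least it did for me).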
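
To see the difference without hitting the site, the same bytes can be decoded both ways. This is a small Python 3 sketch, assuming the server really sends UTF-8 bytes for the curly quotes. The sample string is taken from the question, and nothing here touches the network or BeautifulSoup.

```python
# UTF-8 bytes as the server would send them (assumed from the charset header).
expected = "Oxfam International\u2019s report entitled \u201cOffside!"
raw = expected.encode("utf-8")

as_latin1 = raw.decode("iso-8859-1")  # trusting the meta tag
as_utf8 = raw.decode("utf-8")         # trusting the HTTP header

print(repr(as_latin1))  # each curly quote shows up as three junk characters
print(as_utf8)          # the quotes come back as single characters
```

Decoding as iso-8859-1 never raises, because every byte maps to some character, which is why the wrong choice fails silently instead of loudly.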

It's actually UTF-8 that has been misdecoded as CP1252:

>>> print u'Oxfam International\xe2€™s report entitled \xe2€œOffside!'.encode('cp1252').decode('utf8')
Oxfam International’s report entitled “Offside!
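
The same round trip works in Python 3, where the u prefix and the print statement are no longer needed. A hedged sketch: the function name fix_mojibake is hypothetical, and it assumes the whole string was mangled the same way (UTF-8 bytes read as CP1252). Text that cannot be round-tripped this way is returned unchanged.

```python
def fix_mojibake(text):
    """Undo 'UTF-8 bytes decoded as CP1252'; return text unchanged on failure."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# The string from the question, written with escapes so the source stays ASCII.
mangled = ("Oxfam International\u00e2\u20ac\u2122s report entitled "
           "\u00e2\u20ac\u0153Offside!")

print(fix_mojibake(mangled))
```

This repairs text after the fact. Decoding the raw bytes with the right charset in the first place is still the cleaner fix.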

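What do you mean by "oops, I mean this"? Is your backspace key not working?

@S.Lott: is there a backspace key on Macs?

@SilentGhost: There's one on each of my Macs. "Oops, I mean this" is very, very annoying. Why not backspace? What's so important about repeating the same characters in different markup? Is it "funny"?

That was an accident. The first was a block quote, the second was code. I thought one of them might show the actual text I wrote rather than the rendered symbols. (I thought code did that in the preview; after seeing your comment, I was surprised to find it looked the same. No need to get upset.)

Thank you very much for the informative and gentle response. This does indeed work for me, too.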


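With the utf-8 decode, the scraping code from the answer becomes:

import urllib
from BeautifulSoup import BeautifulSoup

html = urllib.urlopen('http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271').read()
h = html.decode('utf-8')  # the HTTP header, not the meta tag, gives the real charset
soup = BeautifulSoup(h)
ps = soup.body("p")
p = ps[52]
print p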
