Python BeautifulSoup doesn't give me Unicode


python unicode character-encoding


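I'm using Beautiful Soup to scrape data. The BS documentation states that BS should always return Unicode, but I can't seem to get Unicode. Here is a code snippet: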

import urllib2
from libs.BeautifulSoup import BeautifulSoup

# Fetch and parse the data
url = 'http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents2007?skin=print.pattern'

data = urllib2.urlopen(url).read()
# Use % so the value is substituted into the format string
print 'Type of fetched HTML : %s' % type(data)

soup = BeautifulSoup(data)
print 'Encoding of souped up HTML : %s' % soup.originalEncoding

table = soup.table
print type(table.renderContents())

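The raw data returned from the page is a string. BS reports the original encoding as ISO-8859-1. I thought BS automatically converted everything to Unicode, so why is it that when I do this: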

table = soup.table
print type(table.renderContents())

...it gives me a string object rather than Unicode?

How do I get Unicode objects from BS?

I'm really, really lost. Any help? Thanks in advance.

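originalEncoding is exactly that: the encoding of the source. The fact that BS stores everything internally as unicode does not change that value. When you walk the tree, all text nodes are unicode, all tags are unicode, and so on, unless you convert them yourself (for example by using print, str, prettify or renderContents).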

Try doing something like:

soup = BeautifulSoup(data)
print type(soup.contents[0])

Unfortunately, everything else you have tried so far happens to hit the very few methods in BS that convert to strings.
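The difference between the recorded source encoding and the in-memory text can be illustrated without Beautiful Soup at all. The following is a minimal sketch in Python 3 that uses only the standard library's html.parser, not the BS API. The sample markup, the TextCollector class and the source_encoding variable are invented for illustration. Note that in Python 3, bytes plays the role of Python 2's str, and str plays the role of unicode.

```python
from html.parser import HTMLParser

# Hypothetical page bytes, encoded as ISO-8859-1 like the page in the question.
raw = "<table><tr><td>caf\u00e9</td></tr></table>".encode("iso-8859-1")
source_encoding = "iso-8859-1"  # analogous to what originalEncoding reports


class TextCollector(HTMLParser):
    """Collects the text nodes seen while parsing."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        self.texts.append(data)


collector = TextCollector()
collector.feed(raw.decode(source_encoding))  # decode once, on the way in
collector.close()

cell = collector.texts[0]
print(type(raw))                   # the fetched data is a byte string
print(type(cell))                  # the parsed text node is Unicode text
print(type(cell.encode("utf-8")))  # an explicit conversion yields bytes again
```

The source encoding is only a record of how the input was decoded; the parsed text stays Unicode until something explicitly encodes it.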

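As you may have noticed, renderContents returns (by default) a UTF-8 encoded string. If you really want a Unicode string representing the whole document, you can also use unicode(soup), or decode the output of renderContents/prettify with unicode(soup.prettify(), "utf-8").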
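To make the decoding step concrete, here is a small Python 3 sketch of the same idea, with a hand-made byte string standing in for the rendered output. The sample text is invented, and bytes.decode is used here as the Python 3 counterpart of calling unicode(rendered, "utf-8") in Python 2.

```python
# Stand-in for rendered output: UTF-8 encoded bytes (sample text is invented).
text = "Z\u00fcrich caf\u00e9"
rendered = text.encode("utf-8")

# Decoding with the encoding that was used for rendering restores the text.
decoded = rendered.decode("utf-8")
print(decoded == text)  # True

# Decoding with the page's original encoding instead garbles the text,
# because the rendered bytes are UTF-8, not ISO-8859-1.
garbled = rendered.decode("iso-8859-1")
print(garbled == text)  # False
```

The point of the second half is that the encoding to decode with is the one used when rendering, not the one the source page arrived in.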


It gives me [...] for type(soup.contents[0]) and [...] for type(soup.contents[2]).

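I looked at the BS source code and found that to get a Unicode string you have to call renderContents(None). That returns Unicode. I don't know why the documentation says otherwise.

@mridang: Yes, I should have given you a document to try it on. Your document is well-formed, so the first few elements in contents will be metadata, which create real BeautifulSoup objects. Try the examples in the documentation, or look up real tag names and text in the tree without using the methods called in the document, because those methods do not return unicode (renderContents, for example).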
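The renderContents(None) convention mentioned in the comments, where passing None as the encoding skips the encoding step, can be sketched as follows. This is a hypothetical helper written in Python 3 for illustration only, not the BS source; the name render and its signature are assumptions.

```python
def render(text, encoding="utf-8"):
    """Return text encoded to bytes, or unchanged if encoding is None."""
    if encoding is None:
        return text  # hypothetical: mirrors the "None means no encoding" idea
    return text.encode(encoding)


cell = "caf\u00e9"
print(type(render(cell)))        # bytes: the default encodes to UTF-8
print(type(render(cell, None)))  # str: no encoding applied, text stays Unicode
```

A default encoding of UTF-8 explains why the call in the question produced a byte string even though the tree itself held Unicode text.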