Python: BeautifulSoup findAll with a class attribute - Unicode encode error

I'm using BeautifulSoup to extract news stories (titles only) from Hacker News, and this is what I have so far:

import urllib2
from BeautifulSoup import BeautifulSoup

HN_url = "http://news.ycombinator.com"

def get_page():
    page_html = urllib2.urlopen(HN_url) 
    return page_html

def get_stories(content):
    soup = BeautifulSoup(content)
    titles_html =[]

    for td in soup.findAll("td", { "class":"title" }):
        titles_html += td.findAll("a")

    return titles_html

print get_stories(get_page())

However, when I run the code, it gives an error:

Traceback (most recent call last):
  File "terminalHN.py", line 19, in <module>
    print get_stories(get_page())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 131: ordinal not in range(128)

How can I make this work?

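The code itself works fine; it is the output step that breaks. Either encode explicitly to the console's character set, or run the code some other way (for example, from IDLE).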
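To illustrate the "encode explicitly to the console's character set" option, here is a minimal sketch. The helper name to_console_bytes and the sample title are made up for illustration, and falling back to UTF-8 when the console reports no encoding is an assumption, not something the traceback tells us.

```python
import sys


def to_console_bytes(text, encoding=None):
    """Encode text for output, replacing characters the target cannot represent.

    Hypothetical helper: `encoding` defaults to whatever sys.stdout reports,
    and falls back to UTF-8 (an assumption) when nothing is reported,
    which can happen when output is piped.
    """
    target = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # "replace" substitutes "?" for unencodable characters instead of raising.
    return text.encode(target, "replace")


# Sample title containing u'\xe2', the character named in the traceback.
title = u"Caf\xe2 society"

# Forcing ASCII here shows what an ASCII-only console would receive.
console_bytes = to_console_bytes(title, "ascii")
```

In Python 2 the returned byte string can be passed straight to print. The "replace" error handler trades fidelity for not crashing; "ignore" or "xmlcharrefreplace" are other standard handlers if a different trade-off suits you better.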

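This happens because BeautifulSoup uses unicode strings internally. Printing a unicode string to the console makes Python try to convert it to Python's default encoding (usually ASCII), which generally fails for websites with non-ASCII content. Searching for "Python + Unicode" will turn up the basics of how Python handles Unicode. In the meantime, convert your unicode strings to UTF-8 with: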

print some_unicode_string.encode('utf-8')
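For what it's worth, the failure and the fix can be reproduced without touching the network. This is a small sketch with a made-up sample string; it only shows that an ASCII encode of u'\xe2' raises the same exception as in the traceback, while a UTF-8 encode succeeds.

```python
# Sample text containing u'\xe2', the character from the traceback.
text = u"caf\xe2"

# Encoding to ASCII fails because u'\xe2' (226) is outside the 0-127 range.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding to UTF-8 succeeds and yields a byte string
# in which u'\xe2' becomes the two bytes 0xC3 0xA2.
utf8_bytes = text.encode("utf-8")
```

Note the direction: encode() turns unicode text into bytes, and decode() turns bytes back into unicode text.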

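One thing to note about your code is that findAll() returns a list (in this case a list of BeautifulSoup objects), whereas you only want the titles. You probably want find() instead. And rather than printing a list of BeautifulSoup objects, pull out just the titles, since that is what you said you want. For example, the following works fine: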

import urllib2
from BeautifulSoup import BeautifulSoup

HN_url = "http://news.ycombinator.com"

def get_page():
    page_html = urllib2.urlopen(HN_url) 
    return page_html

def get_stories(content):
    soup = BeautifulSoup(content)
    titles = []

    for td in soup.findAll("td", { "class":"title" }):
        a_element = td.find("a")
        if a_element:
            titles.append(a_element.string)

    return titles

print get_stories(get_page())

So now get_stories() returns a list of unicode objects, which prints as expected.
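If you want one title per line rather than the list's repr, each title gets printed as text and the encoding question comes back. Below is a rough sketch of one way to handle that. The helper name format_titles and the sample data are made up, and UTF-8 as the output encoding is an assumption about your terminal. It also skips None entries, since a tag's .string is None when the tag contains more than a single text node.

```python
def format_titles(titles, encoding="utf-8"):
    """Join unicode titles one per line and encode the result for output.

    Hypothetical helper; `encoding` is assumed to match the terminal.
    """
    # Drop None entries, which can appear when a tag has nested markup.
    cleaned = [title for title in titles if title is not None]
    return u"\n".join(cleaned).encode(encoding)


# Stand-in for the list returned by get_stories(); not real scraped data.
sample_titles = [u"Show HN: caf\xe2", None, u"Plain title"]
output_bytes = format_titles(sample_titles)
```

In Python 2, print format_titles(get_stories(get_page())) would then write the encoded titles directly.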

You want .encode('utf-8') to convert a unicode string into a UTF-8 encoded string.