BeautifulSoup character encoding error (UTF-8 codec)


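I'm using BeautifulSoup to scrape information from websites. Specifically, I want to collect information about patents (title, inventors, abstract, etc.) from Google search results. I have a list of URLs, one per patent, but BeautifulSoup has trouble with some of the pages and gives me the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 531: invalid continuation byte

Here is the error traceback: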

Traceback (most recent call last):
    soup = BeautifulSoup(the_page,from_encoding='utf-8')
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 172, in __init__
    self._feed()
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 185, in _feed
    self.builder.feed(self.markup)
  File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 195, in feed
    self.parser.close()
  File "parser.pxi", line 1209, in lxml.etree._FeedParser.close (src\lxml\lxml.etree.c:90597)
  File "parsertarget.pxi", line 142, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99984)
  File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99807)
  File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src\lxml\lxml.etree.c:9383)
  File "saxparser.pxi", line 259, in lxml.etree._handleSaxData (src\lxml\lxml.etree.c:95945)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 531: invalid continuation byte

I checked the encoding of the page, and it claims to be 'utf-8'. I also pass that encoding explicitly to BeautifulSoup. Here is my code:

import urllib, urllib2
from bs4 import BeautifulSoup

#url = 'https://www.google.com/patents/WO2001019016A1?cl=en'  # This one works
url = 'https://www.google.com/patents/WO2006016929A2?cl=en' # This one doesn't work

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name' : 'Somebody',
          'location' : 'Somewhere',
          'language' : 'Python' }
headers = { 'User-Agent' : user_agent }

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()

print response.headers['content-type']
print response.headers.getencoding()

soup = BeautifulSoup(the_page,from_encoding='utf-8')

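I've included two URLs. One causes the error and the other works fine (each is marked as such in the comments). In both cases I can print the HTML to the terminal without any problem, but BeautifulSoup crashes every time on the failing URL.

Any suggestions? This is my first time using BeautifulSoup.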

You should encode the string as UTF-8:

soup = BeautifulSoup(the_page.encode('UTF-8'))
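One caveat: in Python 2, `response.read()` already returns a byte string, so calling `.encode('UTF-8')` on it first triggers an implicit ASCII decode, which can itself fail on non-ASCII bytes. An alternative worth trying is to decode the bytes leniently and hand the resulting Unicode text to BeautifulSoup. The following is only a minimal sketch under that assumption: `decode_lenient` and the sample bytes are hypothetical, made up for illustration, and the snippet is written so that it runs on both Python 2.7 and Python 3.

```python
def decode_lenient(raw_bytes, encoding='utf-8'):
    """Decode bytes to Unicode text.

    Byte sequences that are invalid in the given encoding are replaced
    with U+FFFD (the Unicode replacement character) instead of raising
    UnicodeDecodeError.
    """
    return raw_bytes.decode(encoding, 'replace')


# Hypothetical sample: mostly valid UTF-8 with one stray 0xcc byte,
# similar in spirit to the byte reported in the traceback.
sample = b'<p>caf\xc3\xa9 \xcc patent</p>'

# Strict decoding fails on the stray byte...
try:
    sample.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# ...while lenient decoding keeps the rest of the text intact.
text = decode_lenient(sample)
```

The decoded `text` could then be passed directly, as in `BeautifulSoup(text)`. Because it is already Unicode, `from_encoding` should not be needed. The trade-off is that any undecodable bytes are lost and show up as replacement characters in the parsed output.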

I'm using Python 2.7 on Windows, with BeautifulSoup4.