Python: Beautiful Soup and Unicode

I am trying to crawl a page, but I get a UnicodeDecodeError. Here is my code:

def soup_def(link):
    req = urllib2.Request(link, headers={'User-Agent' : "Magic Browser"}) 
    usock = urllib2.urlopen(req)
    encoding = usock.headers.getparam('charset')
    page = usock.read().decode(encoding)
    usock.close()
    soup = BeautifulSoup(page)
    return soup

soup = soup_def("http://www.geekbuying.com/item/Ainol-Novo-10-Hero-II-Quad-Core--Tablet-PC-10-1-inch-IPS-1280-800-1GB-RAM-16GB-ROM-Android-4-1--HDMI-313618.html")
The error is:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 284: invalid start byte
I checked whether other users had hit the same error, but I could not find a solution.

Here is what I found about the byte 0xff: it is part of the byte order mark (BOM) of UTF-16.

UTF-16
In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.
If the 16-bit units are represented in big-endian byte order, this BOM character will appear in the sequence of bytes as 0xFE followed by 0xFF. This sequence appears as the ISO-8859-1 characters þÿ in a text display that expects the text to be ISO-8859-1.
If the 16-bit units use little-endian order, the sequence of bytes will have 0xFF followed by 0xFE. This sequence appears as the ISO-8859-1 characters ÿþ in a text display that expects the text to be ISO-8859-1.
Programs expecting UTF-8 may show these or error indicators, depending on how they handle UTF-8 encoding errors. In all cases they will probably display the rest of the file as garbage (a UTF-16 text containing ASCII only will be fairly readable).
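
The passage above can be checked directly. The following is a minimal sketch, not part of the original answer: the helper name sniff_utf16_bom and the sample bytes are made up for illustration, and it is written for Python 3 rather than the Python 2 used in the question. It shows how to test whether a byte string starts with a UTF-16 BOM, and that decoding such bytes as UTF-8 fails with the same "invalid start byte" reason. Note that a BOM only appears at the very start of a stream, while the error in the question is reported at position 284, so the BOM explanation may not be the whole story.

```python
import codecs

# UTF-16 byte order marks as defined in the standard library.
BOMS = {
    codecs.BOM_UTF16_LE: "utf-16-le",  # b'\xff\xfe'
    codecs.BOM_UTF16_BE: "utf-16-be",  # b'\xfe\xff'
}


def sniff_utf16_bom(data):
    """Return the UTF-16 variant if data starts with a BOM, else None."""
    for bom, name in BOMS.items():
        if data.startswith(bom):
            return name
    return None


# Illustrative sample: little-endian UTF-16 text with an explicit BOM.
sample = codecs.BOM_UTF16_LE + "hello".encode("utf-16-le")

print(sniff_utf16_bom(sample))
print(sniff_utf16_bom(b"<html>"))

# Decoding the same bytes as UTF-8 fails on the very first byte (0xff).
try:
    sample.decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason
print(reason)
```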
So I have two ideas:

(1) It may be that the page should be treated as utf-16 rather than utf-8.

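(2) The error occurs because you are trying to print the whole soup to the screen. In that case it comes down to whether your IDE (Eclipse/PyCharm) is smart enough to display these Unicode characters.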

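If I were you, I would try not printing the whole soup and only extract the piece you need, then see whether you still hit a problem at that step. If you do not, why worry that you cannot print the whole soup to the screen?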

If you really do want to print the soup to the screen, try:

print soup.prettify(encoding='utf-16')
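
For idea (1), the decoding step itself could be made more forgiving. The sketch below is only an illustration under stated assumptions: decode_page is a hypothetical helper, not something from the question or from Beautiful Soup, and it targets Python 3. It trusts a UTF-16 BOM first, then the charset declared by the server (which may be missing or wrong), then UTF-8, and finally falls back to replacing undecodable bytes instead of raising.

```python
import codecs


def decode_page(raw, declared=None):
    """Decode raw bytes: BOM first, then the declared charset, then UTF-8."""
    # A BOM is stronger evidence than a header; the "utf-16" codec consumes it.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")
    for enc in (declared, "utf-8"):
        if not enc:
            continue  # no charset was declared
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            pass  # wrong or unknown charset, try the next candidate
    # Last resort: keep going and mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


# Illustrative inputs, not taken from the page in the question.
utf16_bytes = codecs.BOM_UTF16_LE + "h\u00e9llo".encode("utf-16-le")
print(decode_page(utf16_bytes, "utf-8"))
print(decode_page(b"abc\xffdef", "utf-8"))
```

The returned string can then be handed to BeautifulSoup in place of the result of the plain decode call. Beautiful Soup also accepts raw bytes and attempts its own encoding detection, so passing the undecoded response body is another option worth trying.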

Another possibility is that you are trying to parse a hidden file (this is very common on a Mac).

Add a simple if statement so that you only create BeautifulSoup objects from files that are actually HTML files:

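import os
from bs4 import BeautifulSoup

# folderPath is the directory to scan; hidden files such as .DS_Store are skipped
for root, dirs, files in os.walk(folderPath, topdown=True):
    for fileName in files:
        if fileName.endswith(".html"):
            with open(os.path.join(root, fileName)) as f:
                soup = BeautifulSoup(f.read(), 'lxml')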

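For what it's worth: this code works for me (after importing BeautifulSoup and urllib2).

For me, it works 2 times out of 10. If I run it over and over, it sometimes works and other times it does not. I don't know why.

I am doing XML parsing. The same error occurs when I try BeautifulSoup(open(filepath), "xml") in Eclipse. The exact same code works in an IPython notebook! Both use Anaconda Python 3.6.

But I am not planning to print it. I just save it to the variable "soup". Maybe you are right about utf-16, but I cannot do that because I cannot save it to the variable in the first place.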
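
One possible explanation for the Eclipse versus IPython difference (an assumption, not confirmed anywhere in this thread): in Python 3, open() in text mode decodes the file with a locale-dependent default encoding when none is given, and that default can differ between environments. The sketch below uses a throwaway temporary file with made-up content to show the failure in text mode and how reading in binary mode avoids decoding at that step.

```python
import locale
import os
import tempfile

# The encoding open() falls back to when none is passed explicitly.
print(locale.getpreferredencoding(False))

# Write a small UTF-16 XML file to a temporary location (illustrative data).
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write('<?xml version="1.0" encoding="UTF-16"?><a>caf\u00e9</a>'.encode("utf-16"))

# Text mode with the wrong encoding fails as soon as the file is read.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    failed = False
except UnicodeDecodeError:
    failed = True

# Binary mode never decodes, so reading cannot raise UnicodeDecodeError.
with open(path, "rb") as f:
    raw = f.read()

# Decode explicitly once the real encoding is known; "utf-16" consumes the BOM.
text = raw.decode("utf-16")
os.remove(path)
print(failed, text)
```

If this is the cause, opening the file with open(filepath, "rb") or passing an explicit encoding argument to open() before handing it to BeautifulSoup should make both environments behave the same way.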