Python 3 UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d

python unicode

I want to build a search engine, and I am following a tutorial from a website. I want to test parsing HTML:

from bs4 import BeautifulSoup

def parse_html(filename):
    """Extract the Author, Title and Text from a HTML file
    which was produced by pdftotext with the option -htmlmeta."""
    with open(filename) as infile:
        html = BeautifulSoup(infile, "html.parser", from_encoding='utf-8')
        d = {'text': html.pre.text}
        if html.title is not None:
            d['title'] = html.title.text
        for meta in html.findAll('meta'):
            try:
                if meta['name'] in ('Author', 'Title'):
                    d[meta['name'].lower()] = meta['content']
            except KeyError:
                continue
        return d

parse_html("C:\\pdf\\pydf\\data\\muellner2011.html")

This is the error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 867: character maps to <undefined>

I have seen some solutions online that use encode(), but I don't know how to insert the encode() function into my code. Can anyone help me?

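In Python 3, files are opened as text (decoded to Unicode) for you; you don't need to tell BeautifulSoup what codec to decode from.

If decoding the data fails, that is because you didn't tell the open() call what codec to use when reading the file; add the correct codec with the encoding argument: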

with open(filename, encoding='utf8') as infile:
    html = BeautifulSoup(infile, "html.parser")

Otherwise, the file is opened with your system default codec, which depends on the operating system.
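To see why the default codec matters, here is a minimal sketch. It assumes the Windows default in the question was cp1252 (the 'charmap' wording in the error suggests a charmap-based codec like it, but the question does not say), and it uses U+201D purely as an illustrative character whose UTF-8 encoding happens to end in byte 0x9d.

import locale

# The codec open() falls back to when no encoding is given (OS-dependent).
default_codec = locale.getpreferredencoding(False)

# Illustrative character (an assumption, not taken from the asker's file):
# U+201D, the right double quotation mark, is b'\xe2\x80\x9d' in UTF-8.
data = "\u201d".encode("utf-8")

# cp1252 has no character assigned to byte 0x9d, so decoding fails.
try:
    data.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

# The same bytes decode cleanly with the codec they were written in.
decoded = data.decode("utf-8")

print("default codec:", default_codec)
print("cp1252 result:", cp1252_error)
print("utf-8 result:", repr(decoded))

If the first line printed on your machine is not UTF-8, any UTF-8 file containing such characters can trigger the error unless you pass encoding explicitly.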

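What is the full traceback of the exception?

If you are not sure whether the file is 'utf-8' and want to skip non-UTF-8 bytes, you can also add errors='ignore' to open() to avoid the "UnicodeDecodeError: 'utf-8' codec can't decode byte" error. From here:

@Altair7852 That is... a dangerous option, and it only works if your input is in some other codec that is an ASCII superset.

@Altair7852 The post you linked to is specifically about reading a PDF file, which is not even a text file but a binary format. Opening it as text is wrong.

Martijn Pieters, you are right, the linked post is not very relevant here apart from the flag, and yes, only use it when you know what you are doing. In my defence, I ran into UTF-8 problems while reading HTML files, hence the comment.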
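To illustrate why errors='ignore' is considered risky in the comments above, here is a small sketch using bytes.decode(), which accepts the same errors values as open(). The sample bytes are an assumption made for demonstration: the word "café" encoded as Latin-1 rather than UTF-8.

# Hypothetical sample: "café" written as Latin-1, so 0xe9 is not valid UTF-8.
raw = "café".encode("latin-1")  # b'caf\xe9'

# errors='ignore' silently drops the undecodable byte: data is lost.
ignored = raw.decode("utf-8", errors="ignore")

# errors='replace' keeps a visible U+FFFD marker where decoding failed.
replaced = raw.decode("utf-8", errors="replace")

# Decoding with the codec the data was actually written in loses nothing.
correct = raw.decode("latin-1")

print(repr(ignored))
print(repr(replaced))
print(repr(correct))

The ASCII letters survive in every case, which is why the flag appears to work on ASCII-superset input, but the accented character is gone without any warning when it is ignored.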