Python Unicode characters disappearing in html.parser

python unicode utf-8 python-3.x


I am fetching HTML that contains Unicode characters from some web pages, like this:

import urllib.request

def extract(url):
    """Adapted from Python3_Google_Search.py"""
    # Send a browser-like User-Agent header with the request.
    user_agent = ("Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
                  "AppleWebKit/525.13 (KHTML, like Gecko) "
                  "Chrome/0.2.149.29 Safari/525.13")
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    response = urllib.request.urlopen(request)
    # Decode the raw bytes; this assumes the page is served as UTF-8.
    html = response.read().decode("utf8")
    return html

As you can see, I am decoding correctly, so html is now a Unicode string. When I print html, I can see the Unicode characters.

I am using html.parser to parse the HTML, subclassing HTMLParser:

from html.parser import HTMLParser

class Parser(HTMLParser):
    def __init__(self):
        super().__init__()  # initialise the base parser's state
        ## some init stuff
    #### rest of class

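When the HTML is parsed through the class's handle_data method, the Unicode characters seem to be removed: they simply disappear. The documentation does not mention anything about encoding. Why is the HTML parser removing non-ASCII characters, and how can I fix this?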
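The symptom described above can be reproduced with a short sketch. The class name DataOnlyParser and the sample markup are made up for illustration. The parser is created with convert_charrefs=False on the assumption that this matches the behaviour in the question; that parameter's default only changed to True in Python 3.5.

```python
from html.parser import HTMLParser


class DataOnlyParser(HTMLParser):
    """Hypothetical parser that overrides handle_data and nothing else."""

    def __init__(self):
        # Assumption: convert_charrefs=False mimics the older behaviour,
        # where character references are reported through separate callbacks.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        # Only plain text between tags arrives here.
        self.chunks.append(data)


parser = DataOnlyParser()
parser.feed("<p>&Ouml;sterreich</p>")
parser.close()

text = "".join(parser.chunks)
print(text)
```

Here the reference &Ouml; goes to handle_entityref, whose default implementation does nothing, so only "sterreich" is collected.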

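Apparently, html.parser calls handle_entityref when it reaches one of these characters, because in the page source they are written as named character references such as &Ouml;. It passes the name of the reference; to convert that to a Unicode character, I use:

html.entities.html5[name]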
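A minimal sketch of how that lookup might fit into a parser subclass. The names TextParser and extract_text are hypothetical, and the numeric-reference handler is an addition that the answer does not mention. Most keys in html.entities.html5 end with a semicolon while handle_entityref receives the name without one, so the sketch tries both forms.

```python
import html
from html.entities import html5
from html.parser import HTMLParser


class TextParser(HTMLParser):
    """Hypothetical parser that keeps character references as text."""

    def __init__(self):
        # Assumption: convert_charrefs=False, so references reach the
        # handle_entityref / handle_charref callbacks instead of handle_data.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        # name arrives without the leading '&' and trailing ';'.
        char = html5.get(name + ";") or html5.get(name)
        # Unknown name: keep it visible as a reference instead of dropping it.
        self.chunks.append(char if char is not None else "&" + name + ";")

    def handle_charref(self, name):
        # name is e.g. "214" (decimal) or "xD6" (hexadecimal).
        self.chunks.append(html.unescape("&#" + name + ";"))


def extract_text(markup):
    parser = TextParser()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


print(extract_text("<p>&Ouml;sterreich &amp; &#214;</p>"))
```

On Python 3.5 and later, the default convert_charrefs=True makes the parser convert these references itself and pass the resulting characters to handle_data, so the two extra handlers are only needed when that option is turned off.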

Python's documentation does not mention this. I have never seen documentation worse than Python's.

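What program or tool are you using to view the output? 1. Are you 100% sure that the data your script receives contains the characters, and 2. how are you verifying that the non-ASCII characters have "disappeared"?

I used Emacs in a terminal (with Unicode encoding enabled), and then Mac TextEdit as well.

@MartijnPieters, when I print html before returning from the extract function, I see the following: &Ouml;sterreich. So yes, I am 100% sure that my script receives the correct Unicode characters. I verify that the Unicode characters disappear by opening the text file I write to and seeing that they are not there.

@Darksky: Those are HTML escape codes, which use only ASCII characters. Something else is removing them; so far this has nothing to do with Python. &Ouml; is 6 characters: an ampersand, a capital O, lowercase u, m and l, and then a semicolon.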
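The point made in the last comment can be checked directly. This is a small sketch with made-up variable names, using the string &Ouml;sterreich as sample input and html.unescape from the standard library to turn the reference into the character it stands for.

```python
import html

escaped = "&Ouml;sterreich"

# The reference itself is six plain ASCII characters.
reference = escaped[:6]
print(list(reference))

# Every character in the escaped string is ASCII.
print(all(ord(ch) < 128 for ch in escaped))

# Only after unescaping does the string contain a non-ASCII character.
decoded = html.unescape(escaped)
print(decoded)
```

So a string that prints as &Ouml;sterreich contains no non-ASCII characters yet; the "Ö" only exists once the reference has been converted.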