Python HTML to text file: UnicodeDecodeError?

python html python-2.7 unicode

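I'm writing a program that reads a web page with urllib and then uses "html2text" to write the plain text to a file. However, the raw content returned by urllib.urlopen(url).read() contains various characters, so it keeps raising UnicodeDecodeError.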

Of course, I googled for 3 hours and found plenty of answers, such as using HTMLParser or reload(sys), using external modules like pdfkit or BeautifulSoup, and of course .encode/.decode.

Reloading sys and then calling sys.setdefaultencoding("utf-8") gives the desired result, but it makes IDLE, and then the program, unresponsive.

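I tried every variation of .encode/.decode with "utf-8" and "ascii", along with arguments such as "replace" and "ignore". For some reason, the same error is raised every time, whatever arguments I pass to encode/decode.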

def download(self, url, name="WebPage.txt"):
    ## Saves only the text to file
    page = urllib.urlopen(url)
    content = page.read()
    with open(name, 'wb') as w:
        HP_inst = HTMLParser.HTMLParser()
        content = content.encode('ascii', 'xmlcharrefreplace')
        if True: 
            #w.write(HTT.html2text( (HP_inst.unescape( content ) ).encode('utf-8') ) )
            w.write( HTT.html2text( content) )#.decode('ascii', 'ignore')  ))
            w.close()
            print "Saved!"

There must be another method or encoding that I'm missing... Please help.

Side quest: I sometimes have to write to a file whose name contains unsupported characters, such as "G\u00e9za Teleki" + ".txt". How can I filter those characters out?

Notes:

This function is stored inside a class (hence the "self")
Using Python 2.7
I don't want to use BeautifulSoup
Windows 8 64-bit

You should decode the content you get from urllib using the correct encoding, for example utf-8 or latin1, depending on the page you fetched.

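There are various ways to detect the content's encoding: from the HTTP headers or from the meta tags in the HTML. I would use an encoding detection module; I forget its name, but you can google it.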
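As a rough illustration of the header and meta-tag approach, here is a minimal sketch that uses only the standard library. The helper name sniff_charset, its parameters, and the "utf-8" fallback are assumptions made for this example and are not part of the original answer. Real pages can declare one encoding and use another, so treat the result as a guess.

```python
import re

# Hypothetical helper for illustration only; not taken from the original answer.
_CHARSET_IN_HEADER = re.compile(r'charset=["\']?([\w.:-]+)', re.IGNORECASE)
_CHARSET_IN_META = re.compile(br'<meta[^>]+charset=["\']?([\w.:-]+)', re.IGNORECASE)


def sniff_charset(raw, content_type=None, default="utf-8"):
    """Guess the encoding of raw HTML bytes from a header value or a <meta> tag."""
    # 1. Prefer the charset declared in the HTTP Content-Type header, if given.
    if content_type:
        match = _CHARSET_IN_HEADER.search(content_type)
        if match:
            return match.group(1).lower()
    # 2. Otherwise look for a <meta ... charset=...> near the start of the page.
    match = _CHARSET_IN_META.search(raw[:2048])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Fall back to an assumed default when nothing is declared.
    return default
```

With Python 2.7's urllib, the header value could be taken from page.info().getheader('Content-Type') and passed in as content_type.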

Once it is decoded correctly, you can encode it to whatever encoding you like before writing it to the file.
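The decode-then-encode flow might look like the sketch below. It also suggests why the question's code fails: in Python 2, calling .encode() on a byte string first decodes it implicitly as ASCII, which raises UnicodeDecodeError whatever arguments are passed to encode. The function name save_decoded and its parameters are hypothetical, and the source encoding is assumed to be known already.

```python
import io


def save_decoded(raw, path, source_encoding, target_encoding="utf-8"):
    """Decode raw bytes once, then let the file wrapper encode on write.

    Hypothetical helper for illustration; source_encoding must be supplied
    by the caller (from the HTTP header, a <meta> tag, or a detector).
    """
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    text = raw.decode(source_encoding, "replace")
    # io.open encodes every write to target_encoding (works on 2.7 and 3.x).
    with io.open(path, "w", encoding=target_encoding) as out:
        out.write(text)
    return text
```

In the question's function, the HTT.html2text(...) call would sit between the decode and the write, so that it receives Unicode text and not raw bytes.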

======================================

Below is an example using

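You have to know the encoding used by the remote web page. There are many ways to do this, but the simplest is to use the Python requests library instead of urllib. requests returns a pre-decoded Unicode object.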

You can then use an encoding file wrapper to automatically encode every character you write.

import requests
import io

def download(self, url, name="WebPage.txt"):
    ## Saves only the text to file
    req = requests.get(url)
    content = req.text # Returns a Unicode object decoded using the server's header
    with io.open(name, 'w', encoding="utf-8") as w: # Everything written to w is encoded to UTF-8
        w.write( HTT.html2text( content) )

    print "Saved"
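For the side question about file names, one possible approach (not part of the original answers) is to strip accents and replace characters that Windows rejects. The helper name ascii_filename and the choice of "_" as the replacement are assumptions for this sketch. In Python 2.7 the name must be passed as a unicode object.

```python
import re
import unicodedata


def ascii_filename(name, replacement="_"):
    """Reduce a Unicode name to a conservative ASCII file name.

    Hypothetical helper for illustration; characters with no ASCII
    equivalent are dropped, not transliterated.
    """
    # NFKD splits accented letters into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    # Dropping non-ASCII code points removes the combining marks.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace characters Windows forbids in file names, plus control characters.
    return re.sub(r'[<>:"/\\|?*\x00-\x1f]', replacement, ascii_only)
```

For example, ascii_filename(u"G\u00e9za Teleki") + ".txt" yields "Geza Teleki.txt".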

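Can you give an example?
@ChrisNguyen It wasn't convenient for me at the time; I've added my example here.
Oh OK, I see how encoding works... you have to decode it using its original encoding? Is using an external library the only way to detect the encoding? Or is there a way without external modules?
How do I use chardet? I downloaded chardet.tar.gz and ran "python setup.py install", but there's no setuptools here... Is there any way around this?
See below; setuptools is a basic component for installing third-party modules.
Is an external module required? If so, how do I get it? Is there anything in the default Python library that can do this?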
