Encoding in Python with lxml: a complex solution


I need to download a web page, parse it with lxml, and build UTF-8 XML output. I think a schema in pseudocode illustrates it best:

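import urllib2

from lxml import etree

# url, my_process_text and outputfile are placeholders from the question
webfile = urllib2.urlopen(url)
# etree.parse() expects a file-like object (or a filename), not a byte string
root = etree.parse(webfile, parser=etree.HTMLParser(recover=True))

# xpath() returns a list of matches, so take the first <body> element
body = root.xpath('/html/body')[0]
txt = my_process_text(etree.tostring(body, encoding='utf-8'))

output = etree.Element("out")
output.text = txt

outputfile.write(etree.tostring(output, encoding='utf-8'))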

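So webfile can be in any encoding (lxml should handle that). The output file must be UTF-8. I'm not sure where to specify the encoding. Is this schema OK? (I can't find a good tutorial on lxml and encoding, but I can find plenty of questions about problems with it...) I need a robust solution.

Edit:

So, to send UTF-8 to lxml, I use: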

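        # BeautifulSoup 3 API names (isHTML, .unicode, .triedEncodings);
        # this fragment runs inside a loop over pages, hence `continue`
        converted = UnicodeDammit(webfile, isHTML=True)
        if not converted.unicode:
            # Use % so the tried encodings are formatted into the message
            print "ERR. UnicodeDammit failed to detect encoding, tried [%s]" % \
                ', '.join(converted.triedEncodings)
            continue
        webfile = converted.unicode.encode('utf-8')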

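lxml can be a bit flaky about input encodings. It is best to feed it UTF-8 and get UTF-8 out.

You may want to use a module such as chardet to detect the encoding and decode the actual data.

You would want to do something roughly like this: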

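import urllib2

import chardet
from lxml import html

content = urllib2.urlopen(url).read()
# chardet guesses the encoding from the raw bytes; the guess may be None
encoding = chardet.detect(content)['encoding']
# Normalize to UTF-8 unless the bytes are already reported as UTF-8
if encoding and encoding.lower() != 'utf-8':
    content = content.decode(encoding, 'replace').encode('utf-8')
doc = html.fromstring(content, base_url=url)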
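The decode-and-re-encode step above can be made a little more defensive. The following is only a sketch, written for Python 3 and the standard library (the thread's own snippets are Python 2). The helper name `to_utf8` and its `fallback` parameter are hypothetical, introduced here for illustration. It assumes the detector may return `None` or a name Python has no codec for, and in both cases falls back to a default.

```python
import codecs


def to_utf8(content, encoding, fallback="utf-8"):
    """Re-encode `content` (bytes) as UTF-8, given a guessed source encoding.

    `to_utf8` and `fallback` are illustrative names, not part of lxml or chardet.
    """
    try:
        # codecs.lookup() normalizes aliases such as "UTF8" or "utf_8".
        name = codecs.lookup(encoding or fallback).name
    except LookupError:
        # The detector reported a name Python has no codec for.
        name = codecs.lookup(fallback).name
    if name == "utf-8":
        # Reported as UTF-8 already: hand the bytes through untouched.
        return content
    # Undecodable bytes become U+FFFD instead of raising.
    return content.decode(name, "replace").encode("utf-8")
```

For example, `to_utf8(raw_bytes, chardet.detect(raw_bytes)['encoding'])` would replace the `if` block, with the result passed to `html.fromstring()` as before.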

I'm not sure why you are moving between lxml and etree, unless you are interacting with another library that already uses etree?

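lxml's own encoding detection is weak.

However, note that the most common problem with web pages is the lack of (or the existence of incorrect) encoding declarations. It is therefore often sufficient to only use the encoding detection of BeautifulSoup, called UnicodeDammit, and to leave the rest to lxml's own HTML parser, which is several times faster.

I recommend detecting the encoding with UnicodeDammit and parsing with lxml. You can also use the HTTP Content-Type header (you need to extract charset=ENCODING_NAME) to detect the encoding more precisely.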

In this example I use BeautifulSoup4 (you also have to install chardet for better autodetection, because UnicodeDammit uses chardet internally):

Or, to make the previous answer more complete, you can modify it to:

if ud.original_encoding != 'utf-8':
    content = content.decode(ud.original_encoding, 'replace').encode('utf-8')

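Why is this better than simply using chardet?

You do not ignore the Content-Type HTTP header:

Content-Type: text/html; charset=utf-8

You do not ignore the http-equiv meta tag. For example:

... http-equiv="Content-Type" content="text/html; charset=UTF-8"

On top of this, you also get the chardet, cjkcodecs and iconvcodec codecs and ...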
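To make the two declarations above concrete, here is a minimal standard-library sketch (Python 3) that reads a charset from a Content-Type header value and from a `<meta>` tag. It is not how UnicodeDammit is implemented. The function names, the regular expression and the `scan_bytes` limit are assumptions made for illustration, and the header is checked first on the assumption that it should win over the in-document declaration.

```python
import re
from email.message import Message

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""", re.IGNORECASE
)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased by the email package


def charset_from_meta(content, scan_bytes=2048):
    """Return the charset declared in a <meta> tag near the top, or None."""
    match = _META_CHARSET.search(content[:scan_bytes])
    return match.group(1).decode("ascii").lower() if match else None


def declared_encoding(content_type, content):
    """Prefer the HTTP header, then fall back to the <meta> declaration."""
    return charset_from_header(content_type) or charset_from_meta(content)
```

A declared encoding found this way is still only a claim made by the server or the page, so it may be wrong; a detector is a reasonable fallback when neither declaration is present. With Python 3's `urllib.request`, `response.headers.get_content_charset()` performs the same header lookup directly.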

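Looks good. You are right about etree; I have removed it from the code.

Why not pass the decoded string (unicode object) directly to html.fromstring() rather than re-encoding it to UTF-8?

I don't remember what the motivation was two and a half years ago, but I do vaguely remember lxml not liking unicode input in some cases. There is a good chance that, whatever the problem was, it has been fixed, so it is probably best to ignore that part now. libxml2 (which powers lxml) does like UTF-8 input, though, so if you are very sensitive to performance you may specifically want to avoid decoding that encoding.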
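One way to act on that last remark is to check whether the downloaded bytes are already valid UTF-8 and, if so, skip the decode/encode round trip entirely. This is a small Python 3 sketch under that assumption; `is_valid_utf8` is a hypothetical helper name, not part of lxml or libxml2.

```python
def is_valid_utf8(content):
    """Return True if `content` (bytes) decodes as UTF-8 without errors."""
    try:
        content.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True
```

A successful strict decode is strong evidence rather than proof: pure ASCII always passes, and a short byte string in another encoding can occasionally happen to be valid UTF-8 as well.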
