Python: how do I handle encoding in lxml to parse an HTML string correctly?
I have a file. Please download it and save it as blog.xml.
It is the export of my posts on Google Blogger. I wrote some code to parse it, and the version that uses lxml fails.
Code 1:
from stripogram import html2text
import feedparser

d = feedparser.parse('blog.xml')
for num, entry in enumerate(d.entries):
    # Encode the unicode content to UTF-8 bytes before converting it to text.
    string = entry.content[0]['value'].encode("utf-8")
    print html2text(string)
Code 1 gives the correct result.
Code 2:
import lxml.html
import feedparser

d = feedparser.parse('blog.xml')
for num, entry in enumerate(d.entries):
    # The content is passed to lxml as a unicode string, without encoding it first.
    string = entry.content[0]['value']
    myhtml = lxml.html.document_fromstring(string)
    print myhtml.text_content()
Code 2 fails with the following error:
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
File "parser.pxi", line 1569, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82659)
ValueError: Unicode strings with encoding declaration are not supported.
Code 3 fails with a different error:
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82748)
File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81546)
File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78216)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
File "parser.pxi", line 599, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74827)
lxml.etree.XMLSyntaxError: line 1395: Tag b:include invalid
How do I handle encoding in lxml to parse the HTML string correctly?
You can create a parser yourself instead of using document_fromstring:
from cStringIO import StringIO  # Python 2 only

import feedparser
from lxml import etree

d = feedparser.parse('blog.xml')
for num, entry in enumerate(d.entries):
    # Hand lxml UTF-8 bytes rather than a unicode string.
    text = entry.content[0]['value'].encode('utf8')
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(text), parser)
    print ''.join(tree.xpath('.//text()'))
For a Blogger.com Atom feed export, this prints the text content of each entry's .content[0].value with lxml.
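The ValueError from Code 2 says that lxml will not accept a unicode string that starts with an XML encoding declaration. Encoding to bytes, as above, is one way around it. Another is to remove the declaration and keep the unicode string. The following is a minimal sketch of that second option, not part of the original answer; strip_xml_declaration and html_to_text are hypothetical helper names, and the sketch assumes the declaration, if present, sits at the very start of the string.

```python
import re

# Matches a leading declaration such as <?xml version="1.0" encoding="UTF-8"?>
# (but not other processing instructions such as <?xml-stylesheet ...?>).
_XML_DECLARATION = re.compile(r'\A\s*<\?xml\s[^>]*\?>\s*')


def strip_xml_declaration(markup):
    """Return markup without a leading XML declaration (hypothetical helper)."""
    return _XML_DECLARATION.sub('', markup, count=1)


def html_to_text(markup):
    """Parse an already-decoded HTML string and return its text content."""
    # lxml is a third-party package; it is imported lazily so that
    # strip_xml_declaration can be used and tested without it.
    import lxml.html
    cleaned = strip_xml_declaration(markup)
    return lxml.html.document_fromstring(cleaned).text_content()
```

With this, html_to_text(entry.content[0]['value']) could replace the document_fromstring call in Code 2. Note that an entry whose content is empty after stripping would still make lxml raise an error.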
Check the output of this code:
import lxml.html
import feedparser

def test():
    # Parse an empty string and print whatever error lxml reports.
    try:
        lxml.html.document_fromstring('')
    except Exception as e:
        print e

d = feedparser.parse('blog.xml')
e = d.entries[0].content[0]['value'].encode('utf-8')
test()  # XMLSyntaxError: None
lxml.html.document_fromstring(e)
test()  # XMLSyntaxError: line 1407: Tag b:include invalid
So the error message is misleading: the real reason parsing fails is that you pass an empty string to document_fromstring. The message reported for the empty string is left over from the entry that was parsed before it.
Try the following code, which skips empty entries:
import lxml.html
import feedparser

d = feedparser.parse('blog.xml')
for num, entry in enumerate(d.entries):
    string = entry.content[0]['value'].encode("utf-8")
    if not string:
        # Skip entries with empty content; lxml cannot parse an empty document.
        continue
    myhtml = lxml.html.document_fromstring(string)
    print myhtml.text_content()
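The skip-empty-entries idea can also be pulled out of the loop into a small generator. This is only a sketch under stated assumptions: non_empty_contents is a hypothetical name, entries are assumed to behave like dictionaries (as feedparser entries do), and whitespace-only content is treated as empty, which goes slightly beyond the check in the answer.

```python
def non_empty_contents(entries):
    """Yield the first content value of each entry, skipping blank ones.

    Hypothetical helper: entries are assumed to be dict-like, as the
    objects in feedparser's d.entries are.
    """
    for entry in entries:
        contents = entry.get('content') or []
        value = contents[0].get('value', '') if contents else ''
        # Skip missing, empty, or whitespace-only content so that the
        # HTML parser never receives an empty document.
        if value and value.strip():
            yield value


def print_entry_texts(path):
    """Print the text of every non-empty entry in a feed file."""
    # Third-party packages, imported lazily; the path is supplied by the caller.
    import feedparser
    import lxml.html
    feed = feedparser.parse(path)
    for value in non_empty_contents(feed.entries):
        html = lxml.html.document_fromstring(value.encode('utf-8'))
        print(html.text_content())
```

Calling print_entry_texts('blog.xml') would then correspond to the loop in the answer, with the filtering kept separate from the parsing.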
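Comments:
1. Add from lxml import etree. 2. Maybe it should be print tree.text_content(). 3. But that gives the wrong output: Traceback (most recent call last): File "<stdin>", line 5, in <module> AttributeError: 'lxml.etree._ElementTree' object has no attribute 'text_content'
Traceback (most recent call last): File "<stdin>", line 5, in <module> AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'. Still a problem.
@it_is_a_literature: Sorry, that method indeed does not exist.
I suspect there is a parse error in the entry, but lxml ignores the exception at the wrong point. Python C-API exception handling requires code to check for exceptions at certain points; if it fails to do so, the exception surfaces later, when another exception occurs and is handled properly. What happens if you omit the first test call? Does the same XMLSyntaxError occur? In any case, this should definitely be reported to the lxml project.
@Martijn Pieters: Yes, the same error occurs. The first test call is only there to show that the XMLSyntaxError message changes after parsing e.
On second thought, the error still reflects an earlier error that was not handled; this should certainly be reported to the developers.
I found it in their bug tracker.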