Python: disable the check for '--' in lxml comments
Tags: python, web-scraping, lxml, html5lib

Parsing with lxml (through html5lib) fails with:
...
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/html5parser.py:468: in processComment
self.tree.insertComment(token, self.tree.openElements[-1])
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/etree_lxml.py:312: in insertCommentMain
super(TreeBuilder, self).insertComment(data, parent)
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/_base.py:262: in insertComment
parent.appendChild(self.commentClass(token["data"]))
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/etree.py:148: in __init__
self._element = ElementTree.Comment(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
- src/lxml/lxml.etree.pyx:3017: ValueError: Comment may not contain '--' or end with '-'
The error comes from lxml: it found a malformed comment in the page.
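To locate the offending comment before handing the page to html5lib, something like the following sketch may help. It uses only the standard library's `html.parser` (Python 3), which is not the html5lib tokenizer, so edge cases may be split differently. The names `violates_lxml_comment_rule`, `CommentCollector`, `find_bad_comments` and the `sample` markup are hypothetical and exist only for illustration; the rule itself is copied from the ValueError message above.

```python
from html.parser import HTMLParser


def violates_lxml_comment_rule(text):
    """Mirror the rule quoted in the ValueError: no '--' inside, no trailing '-'."""
    return "--" in text or text.endswith("-")


class CommentCollector(HTMLParser):
    """Collect the text of every comment that breaks the rule above."""

    def __init__(self):
        super().__init__()
        self.bad_comments = []

    def handle_comment(self, data):
        # `data` is the comment body without the <!-- and --> delimiters.
        if violates_lxml_comment_rule(data):
            self.bad_comments.append(data)


def find_bad_comments(html_text):
    """Return the bodies of all comments that lxml would reject."""
    collector = CommentCollector()
    collector.feed(html_text)
    collector.close()
    return collector.bad_comments


# Hypothetical sample markup: the second comment contains a double hyphen.
sample = "<html><body><!-- ok --><!-- a -- b --><p>text</p></body></html>"
print(find_bad_comments(sample))
```

If this reports any comments, those are the likely triggers for the error when the tree is built with the lxml treebuilder.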
import html5lib
from lxml.html import html5parser, fromstring

# `document` is the asker's own object holding the downloaded page.
context = fromstring(document.content)  # works
context = html5parser.fromstring(document.content)  # does not work
context = html5lib.parse(  # does not work
    document.content,
    treebuilder="lxml",
    namespaceHTMLElements=document.namespace,
    encoding=document.encoding
)
Versions:
- html5lib==0.9999999
- lxml==3.5.0 (downgrading lxml is not a solution either)
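Update: I got feedback that this appears to be an html5lib bug, and that it is already fixed in the latest development version on GitHub.

Since this is HTML data you are trying to parse, use lxml.html instead of lxml.etree. Works for me: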
>>> import requests
>>> import lxml.html
>>>
>>> data = requests.get("https://www.banca-romaneasca.ro/en/tools-and-resources/").content
>>> tree = lxml.html.fromstring(data)
>>> tree.xpath("//title/text()")
['Tools and resources - Banca Romaneasca']
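Based on @opottone's comment on GitHub, a solution has been found:
I tried installing the latest development version of html5parser. Now I only get a warning instead of an error.

Comments:
I updated the question and provided more details.
@alecxe, can I also get HTML5 tags such as audio, video, section?
@Andrei.Danciuc I don't see why not, but try it.
Thanks. No :( It is not compatible with HTML5. (html5lib is a Python package that implements the HTML5 parsing algorithm, which is heavily influenced by current browsers and based on the WHATWG HTML5 specification.)
Given the severity of the problem, a release should probably be made, actually... I wonder why he didn't do so before.
@gsnedders Who are you referring to as "he"?
I ran into this problem too (it could not parse Amazon's book metadata because of a double hyphen in a comment). Upgrading html5lib resolved it: sudo pip2 install html5lib --upgrade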
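The "warning instead of an error" behavior described above can be pictured with a small sketch: rewrite the comment text so that it satisfies lxml's rule, and warn that data was altered. This is not html5lib's actual code; `coerce_comment` is a hypothetical helper, and inserting a space between the dashes is just one assumed way of breaking up the sequence.

```python
import warnings


def coerce_comment(text):
    """Hypothetical helper: make comment text acceptable to lxml.

    Breaks up every '--' and pads a trailing '-', warning each time
    because the stored comment no longer matches the source exactly.
    """
    while "--" in text:
        warnings.warn("comment contains adjacent dashes; inserting a space")
        text = text.replace("--", "- -")
    if text.endswith("-"):
        warnings.warn("comment ends with a dash; appending a space")
        text += " "
    return text


# The result contains no '--' and does not end with '-'.
print(repr(coerce_comment(" a -- b ")))
```

The trade-off is that the parsed tree no longer reproduces the original comment byte for byte, which is why a warning seems more appropriate here than silent rewriting.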