Python: disabling the check for '--' in comments; lxml use case

Tags: python, web-scraping, lxml, html5lib


Parsing with lxml fails:

...
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/html5parser.py:468: in processComment
    self.tree.insertComment(token, self.tree.openElements[-1])
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/etree_lxml.py:312: in insertCommentMain
    super(TreeBuilder, self).insertComment(data, parent)
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/_base.py:262: in insertComment
    parent.appendChild(self.commentClass(token["data"]))
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/etree.py:148: in __init__
    self._element = ElementTree.Comment(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

- src/lxml/lxml.etree.pyx:3017: ValueError: Comment may not contain '--' or end with '-'
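The check behind this error can be reproduced in isolation, which helps confirm that the failure comes from comment construction rather than from the page as a whole. The following is a minimal sketch, assuming lxml is installed; `comment_is_rejected` is a hypothetical helper name used only for illustration and is not part of the original traceback.

```python
from lxml import etree


def comment_is_rejected(text):
    """Return True if lxml refuses to build a comment node from `text`."""
    try:
        etree.Comment(text)
    except ValueError:
        # lxml raises ValueError when the text contains '--' or ends with '-'
        return True
    return False


if __name__ == "__main__":
    for sample in ("a plain comment", "a -- b", "trailing hyphen-"):
        print(repr(sample), comment_is_rejected(sample))
```

html5lib hits this check because the HTML5 parsing algorithm accepts comments that XML does not allow, and the lxml tree builder passes the comment text through unchanged.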

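The error comes from lxml.

It found a bad comment in the page.

Is there a way to disable this check without monkey patching (changing code, injecting ...)?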

Update 1: I am using html5lib because I want to get tags such as audio, section, video, ... that are provided in HTML5.

import html5lib
from lxml.html import html5parser, fromstring

# `document` is the asker's own object holding the downloaded page
context = fromstring(document.content)  # works
context = html5parser.fromstring(document.content)  # does not work

context = html5lib.parse(  # does not work
    document.content,
    treebuilder="lxml",
    namespaceHTMLElements=document.namespace,
    encoding=document.encoding
)

Versions:

html5lib==0.9999999
lxml==3.5.0 (downgrading lxml is not a solution either)

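Update 2: this seems to be an enhancement/issue in lxml.

Waiting for feedback from the lxml developers.

Update 3:

Got feedback: it appears to be an html5lib bug, and it is already fixed in the latest development version on GitHub.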

Since this is HTML data you are trying to parse, use lxml.html instead of lxml.etree.

Works for me:

>>> import requests
>>> import lxml.html
>>> 
>>> data = requests.get("https://www.banca-romaneasca.ro/en/tools-and-resources/").content
>>> tree = lxml.html.fromstring(data)
>>> tree.xpath("//title/text()")
['Tools and resources - Banca Romaneasca']

Based on @opottone's comment on GitHub, a solution has been found:

I tried installing the latest version of html5parser. Now I only get a warning instead of an error.

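Comments:

I updated the question with more details. @alecxe, can I also get HTML5 tags such as audio, video, section this way?

@Andrei.Danciuc I don't see why not, but give it a try.

Thanks. No :( it is not compatible with HTML5 (html5lib is a Python package that implements the HTML5 parsing algorithm, which is heavily influenced by current browsers and based on the WHATWG HTML5 specification).

Given the severity of the issue, a release should probably be pushed out, actually... I wonder why he hasn't done so before.

@gsnedders Who are you referring to as "he"?

I ran into this problem as well (it could not parse Amazon's book metadata because of a double hyphen in a comment). Upgrading html5lib resolved the issue: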

sudo pip2 install html5lib --upgrade
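Where upgrading html5lib is not an option, one possible workaround is to clean up the comments in the markup before handing it to the parser. This is a minimal sketch that is not taken from the thread: `sanitize_comments` is a hypothetical helper, it uses only the standard library, and it relies on a regular expression, so it will also touch comment-like text inside scripts or other raw-text elements.

```python
import re

# Non-greedy match of an HTML comment body, across line breaks.
_COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def sanitize_comments(html):
    """Rewrite comment bodies so they contain no '--' and do not end with '-'."""

    def _fix(match):
        body = match.group(1)
        # Collapse any run of two or more hyphens into a single hyphen.
        body = re.sub(r"-{2,}", "-", body)
        # Pad a trailing hyphen so the body no longer ends with '-'.
        if body.endswith("-"):
            body += " "
        return "<!--" + body + "-->"

    return _COMMENT_RE.sub(_fix, html)


if __name__ == "__main__":
    print(sanitize_comments("<p>a</p><!-- x -- y -->"))
```

The cleaned string can then be passed to the parser in place of the raw page content. The comment text is altered, so this only suits cases where the comments themselves are not needed.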
