Python: disabling the check for '--' in lxml comments

Parsing with lxml fails:

...
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/html5parser.py:468: in processComment
    self.tree.insertComment(token, self.tree.openElements[-1])
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/etree_lxml.py:312: in insertCommentMain
    super(TreeBuilder, self).insertComment(data, parent)
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/_base.py:262: in insertComment
    parent.appendChild(self.commentClass(token["data"]))
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/etree.py:148: in __init__
    self._element = ElementTree.Comment(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

- src/lxml/lxml.etree.pyx:3017: ValueError: Comment may not contain '--' or end with '-'
The error is raised by lxml.

It found a malformed comment in the page.
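
To see which comments in a page would trip this check, the rule quoted in the error message can be restated in plain Python and applied to every comment the standard-library `html.parser` reports. This is only a sketch for Python 3: the names `violates_comment_rule`, `CommentAuditor` and `find_offending_comments` are made up for illustration, and the predicate mirrors the wording of the error message rather than lxml's actual source.

```python
from html.parser import HTMLParser


def violates_comment_rule(text):
    """Return True if text contains '--' or ends with '-'.

    These are the two cases named in the ValueError message above.
    """
    return "--" in text or text.endswith("-")


class CommentAuditor(HTMLParser):
    """Collects the bodies of comments that would violate the rule."""

    def __init__(self):
        super().__init__()
        self.offending = []

    def handle_comment(self, data):
        # `data` is the text between "<!--" and "-->"
        if violates_comment_rule(data):
            self.offending.append(data)


def find_offending_comments(markup):
    """Return the list of comment bodies in `markup` that break the rule."""
    auditor = CommentAuditor()
    auditor.feed(markup)
    auditor.close()
    return auditor.offending


if __name__ == "__main__":
    sample = "<p>ok</p><!-- fine --><!-- price -- discount -->"
    for body in find_offending_comments(sample):
        print(repr(body))
```

Running it over the downloaded page text should point at the comment that makes the html5lib tree builder fail.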

  • Monkey patching (changing the code, injecting…)

  • Update 1: I am using html5lib because I want tags such as audio, section, video, ... which HTML5 provides

    import html5lib
    from lxml.html import html5parser, fromstring

    # `document` is the fetched page object (its definition is not shown in the question)
    context = fromstring(document.content)  # works
    context = html5parser.fromstring(document.content)  # does not work

    context = html5lib.parse(  # does not work
        document.content,
        treebuilder="lxml",
        namespaceHTMLElements=document.namespace,
        encoding=document.encoding
    )
    
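    Versions:

    • html5lib==0.9999999
    • lxml==3.5.0 (downgrading lxml is not a solution either)
    Update 2: This appears to be an improvement/issue in lxml.

    Waiting for feedback from the lxml developers.

    Update 3:
    Got feedback: it appears to be an html5lib bug, and it is already fixed in the latest development version on GitHub.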

    Since this is HTML data you are trying to parse, use lxml.html instead of lxml.etree.

    Works for me:

    >>> import requests
    >>> import lxml.html
    >>> 
    >>> data = requests.get("https://www.banca-romaneasca.ro/en/tools-and-resources/").content
    >>> tree = lxml.html.fromstring(data)
    >>> tree.xpath("//title/text()")
    ['Tools and resources - Banca Romaneasca']
    

    Based on @opottone's comment on GitHub, a solution has been found:


    I tried installing the latest html5parser from GitHub. Now I only get a warning instead of an error.
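
One way a parser can turn this error into a warning is to rewrite the comment text before it reaches lxml, and warn that data was changed. The sketch below illustrates that idea only; `coerce_comment` is a hypothetical helper, and nothing here is claimed to be html5lib's actual code.

```python
import warnings


def coerce_comment(text):
    """Rewrite comment text so it has no '--' and does not end with '-'.

    Emits a UserWarning whenever the text has to be altered.
    """
    if "--" in text or text.endswith("-"):
        warnings.warn("comment text altered to satisfy the '--' rule", UserWarning)
    # A single pass can leave a new "--" behind (e.g. "---"), so loop until none remain.
    while "--" in text:
        text = text.replace("--", "- -")
    # A trailing dash would sit right next to the closing "-->".
    if text.endswith("-"):
        text += " "
    return text


if __name__ == "__main__":
    print(repr(coerce_comment(" price -- discount ")))
```

The trade-off is that the comment no longer round-trips byte for byte, which is why a warning rather than silence seems appropriate.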

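    I updated the question with more details. @alecxe, can I also get HTML5 tags such as audio, video, section?

    @Andrei.Danciuc I don't see why not, but give it a try.

    Thanks, but no :( It is not compatible with HTML5 (html5lib is a Python package that implements the HTML5 parsing algorithm, which is heavily influenced by current browsers and based on the WHATWG HTML5 specification).

    Given the severity of the problem, a release should probably be pushed out, really… I wonder why he didn't do so before.

    @gsnedders Who are you referring to as "he"?

    Ran into this problem too (it could not parse Amazon's book metadata because of a double hyphen in a comment). Upgrading the html5lib version solved it:
    sudo pip2 install html5lib --upgrade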