Python: stripping information from an XPath result?

python python-2.7 xpath

I use the following line of code to get the CVE ids from a web page:

  project.cve_information = "".join(xpath_parse(tree, '//div[@id="references"]/a/text()')).split()

However, the problem is that the markup looks like this:

            <div id='references'>
            <b>References:</b>
            <a href='https://access.redhat.com/security/cve/CVE-2011-3256' target='_blank'>CVE-2011-3256&nbsp;<i class='icon-external-link'></i></a>
            <a href='https://rhn.redhat.com/errata/RHSA-2011-1402.html' target='_blank'>RHSA-2011-1402&nbsp;<i class='icon-external-link'></i></a><br />
        </div>

The form validates CVE values only, and it errors out because my code also picks up the RHSA value.

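You can use contains() in the XPath predicate, either keeping the links whose href contains "CVE" or excluding the links whose href contains "RHSA" (both expressions are in the lxml code at the end of this page).

Both will give you: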

 ['https://access.redhat.com/security/cve/CVE-2011-3256']

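Hmm, I think I did not explain my problem correctly. I am parsing the "references" field with the XPath expression, and I then use the "CVE-xxxx-xxxx" id elsewhere. With the current solution I get a warning that information for "CVE CVE-2011-3256" could not be found, with an extra "CVE" in front of CVE-2011-3256.

You want CVE-2011-3256? If the ids are always at the end, just rsplit and extract the last piece. If they can be anywhere, you will need a regex, or split and use str.startswith to find the substring you want.

Change the XPath from /@href to /text().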
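The three string-based suggestions in the comment above (rsplit, a regex, or split plus str.startswith) can be sketched as follows. This is a minimal illustration using only the standard library; the function names and the `urls` list are hypothetical, and the regex assumes the usual CVE-YYYY-NNNN id shape with four or more digits in the last group.

```python
import re

# Hypothetical input: the hrefs taken from the question's markup.
urls = [
    "https://access.redhat.com/security/cve/CVE-2011-3256",
    "https://rhn.redhat.com/errata/RHSA-2011-1402.html",
]

def id_from_end(url):
    """Assumes the id is always the last path segment of the URL."""
    return url.rsplit("/", 1)[-1]

# Assumed id shape: "CVE-", a 4-digit year, "-", then 4 or more digits.
CVE_RE = re.compile(r"CVE-\d{4}-\d{4,}")

def ids_anywhere(text):
    """Finds ids wherever they appear in the string."""
    return CVE_RE.findall(text)

def ids_by_prefix(url):
    """Splits on "/" and keeps the segments that start with "CVE-"."""
    return [part for part in url.split("/") if part.startswith("CVE-")]

print(id_from_end(urls[0]))
print(ids_anywhere(urls[0]))
print(ids_by_prefix(urls[0]))
```

The rsplit variant is the shortest but returns whatever the last segment is, so it only makes sense on URLs already filtered down to CVE links. The regex and prefix variants return an empty list for the RHSA URL.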

Awesome! Thanks. I just edited it a bit to get what I wanted.

from lxml import html

# Sample markup from the question
h = """ <div id='references'>
            <b>References:</b>
            <a href='https://access.redhat.com/security/cve/CVE-2011-3256' target='_blank'>CVE-2011-3256&nbsp;<i class='icon-external-link'></i></a>
            <a href='https://rhn.redhat.com/errata/RHSA-2011-1402.html' target='_blank'>RHSA-2011-1402&nbsp;<i class='icon-external-link'></i></a><br />
        </div>"""

xml = html.fromstring(h)

# Option 1: keep only the links whose href contains "CVE"
urls = xml.xpath('//div[@id="references"]/a[contains(@href, "CVE")]/@href')

# Option 2: drop the links whose href contains "RHSA"
urls = xml.xpath('//div[@id="references"]/a[not(contains(@href, "RHSA"))]/@href')

 ['https://access.redhat.com/security/cve/CVE-2011-3256']
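The follow-up suggestion in the comments, switching the expression from /@href to /text(), might look like the sketch below. It assumes lxml is installed and Python 3 string semantics; the `cve_ids` helper and the `SNIPPET` name are hypothetical, and this is not necessarily the exact code the asker ended up with. The link text ends in a non-breaking space (from `&nbsp;`), so each result is stripped.

```python
from lxml import html

# Markup copied from the question.
SNIPPET = """<div id='references'>
    <b>References:</b>
    <a href='https://access.redhat.com/security/cve/CVE-2011-3256' target='_blank'>CVE-2011-3256&nbsp;<i class='icon-external-link'></i></a>
    <a href='https://rhn.redhat.com/errata/RHSA-2011-1402.html' target='_blank'>RHSA-2011-1402&nbsp;<i class='icon-external-link'></i></a><br />
</div>"""

def cve_ids(markup):
    """Return the link text of every reference whose href contains "CVE"."""
    tree = html.fromstring(markup)
    texts = tree.xpath(
        '//div[@id="references"]/a[contains(@href, "CVE")]/text()'
    )
    # str.strip() also removes the trailing non-breaking space in Python 3.
    return [t.strip() for t in texts if t.strip()]

ids = cve_ids(SNIPPET)
print(ids)
```

Selecting the text node instead of the attribute returns the bare id rather than the full URL, so no further splitting of the href is needed.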