XPath to find a link containing HTML in a page

html xpath


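This is not the same as the related question. I have a link of the form <a>foo <em class="bar">baz</em>.</a> and need to find it by its complete content, foo <em class="bar">baz</em>., including the trailing period.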

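As far as I know, XPath cannot see the raw HTML markup; it works on an abstract model of the HTML document. Folding as much of the information carried by the markup as possible into an XPath expression gives something like this: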

//a[
    node()[1][self::text() and .='foo ']
    /following-sibling::node()[1][self::em[@class='bar' and .='baz']]
    /following-sibling::node()[1][self::text() and .='.']
]

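Brief explanation of the predicates used:

node()[1][self::text() and .='foo ']

: the first child node is a text node whose value equals "foo "

/following-sibling::node()[1][self::em[@class='bar' and .='baz']]

: immediately followed by an <em> element whose class equals "bar" and whose value equals "baz"

/following-sibling::node()[1][self::text() and .='.']

: immediately followed by a text node whose value equals "."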
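To see what those three predicates are testing, it can help to list the child nodes of the link directly. The sketch below is only an illustration, not part of the answer: it assumes the sample markup <a href="http://example.com">foo <em class="bar">baz</em>.</a> and uses Python's standard xml.etree.ElementTree, which stores text nodes as .text and .tail rather than as separate nodes. The helper name child_nodes is made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumed sample markup (well-formed, so the XML parser accepts it).
SAMPLE = '<a href="http://example.com">foo <em class="bar">baz</em>.</a>'


def child_nodes(element):
    """List an element's text and element children in document order,
    roughly the sequence that XPath's node() steps walk through."""
    nodes = []
    if element.text:
        nodes.append(("text", element.text))
    for child in element:
        value = "".join(child.itertext())  # string value of the child element
        nodes.append(("element", child.tag, dict(child.attrib), value))
        if child.tail:
            # Text that follows the child is a sibling text node in XPath terms.
            nodes.append(("text", child.tail))
    return nodes


link = ET.fromstring(SAMPLE)
nodes = child_nodes(link)
for node in nodes:
    print(node)
```

Each printed entry lines up with one predicate: the leading text node, the <em class="bar"> element, and the trailing text node.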

This is not 100% exact, because calling string() strips any other HTML tags, but for my purposes it looks good enough:

//a[string() = 'foo baz.']/em[@class='bar' and .='baz']
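The reason string() "strips" tags is that the string value of an element is the concatenation of all its descendant text nodes. A minimal sketch of that idea, assuming two made-up sample links and using the standard library's ElementTree (string_value is a hypothetical helper, not an XPath or library function):

```python
import xml.etree.ElementTree as ET


def string_value(element):
    """Approximate XPath's string() for an element: concatenate every
    descendant text node in document order."""
    return "".join(element.itertext())


# Assumed sample markup; the second link wraps the same text differently.
plain = ET.fromstring('<a>foo <em class="bar">baz</em>.</a>')
other = ET.fromstring('<a>foo <b>ba</b>z.</a>')

print(string_value(plain))
print(string_value(other))
```

Both links have the same string value even though their markup differs, which is why the string comparison alone is not exact and the extra step on em[@class='bar'] is there to narrow the match.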

Note: I'm following up on OP's comment.

A (visually) simpler variation of OP's own answer could be:

//a[. = "foo baz."][em[@class = "bar"] = "baz"]

Or even:

//a[.="foo baz." and em[@class="bar"]="baz"]

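(assuming you want to select the <a> element)

From the XPath 1.0 specification:
chapter[title="Introduction"]
selects the chapter children of the context node that have one or more title children with string-value equal to Introduction
Later, about boolean tests:
If one object to be compared is a node-set and the other is a string, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the string-value of the node and the other string is true.
In OP's answer, //a[string() = 'foo baz.']/em[@class='bar' and .='baz'], the . is needed because the test against 'baz' is applied to the context node, which is the <em> itself.
Note that my answer is a bit naive: it assumes that <a> has only one <em> child.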
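The node-set versus string rule can be mimicked in plain Python to see why em[@class="bar"]="baz" needs no dot, and why extra <em> children do not stop the match. This is a rough sketch under assumptions: it uses the markup from the Scrapy test in this thread, the standard library's ElementTree, and a hypothetical helper named nodeset_equals.

```python
import xml.etree.ElementTree as ET

# Assumed markup, matching the link used in the Scrapy test:
# two <em class="bar"> children, the second one empty.
link = ET.fromstring(
    '<a href="http://example.com">foo '
    '<em class="bar">baz</em><em class="bar"></em>.</a>'
)


def nodeset_equals(nodes, text):
    """XPath 1.0 'node-set = string': true if at least one node in the
    set has a string value equal to `text`."""
    return any("".join(node.itertext()) == text for node in nodes)


# em[@class='bar'] selects the matching <em> children (a node-set).
ems = link.findall("em[@class='bar']")

print(len(ems))
print(nodeset_equals(ems, "baz"))   # some <em> has string value "baz"
print(nodeset_equals(ems, ""))      # the empty <em> matches "" as well
print(nodeset_equals([], "baz"))    # an empty node-set never matches
```

The comparison is existential: one matching <em> is enough, so a link with additional <em class="bar"> children still satisfies the predicate.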

This was tested with a Scrapy selector:

>>> import scrapy
>>> s = scrapy.Selector(text="""<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>.""")
>>> s.xpath('//a[.="foo baz." and em[@class="bar"]="baz"]').extract_first()
u'<a href="http://example.com">foo <em class="bar">baz</em><em class="bar"></em>.</a>'
>>>


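The XPath matches, but you may not want it to.

You could even write

//a[.="foo baz."][em[@class="bar"]="baz"]

(to select the <a> node)

@paultrmbrth If this were an answer, I would accept it. But why does this work, and why does

[em[]=

not need a dot?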
