Discrepancy in Python's lxml HTML parsing methods: cssselect vs xpath


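I am trying to parse a page with both xpath and cssselect, but it seems that either I don't understand how xpath works or lxml's xpath is broken, because it is missing matches.

Here is the quick and dirty code: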

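from lxml.html import parse

mySearchTree = parse('http://www.example.com').getroot()

# CSS selector: every 'a' element that is a descendant of a 'tr'
for a in mySearchTree.cssselect('tr a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))

print('-' * 8 + 'Now for Xpath' + 8 * '-')
# Find all 'a' elements inside 'tr' table rows with xpath
for a in mySearchTree.xpath('.//tr/*/a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))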

Results:

found "About" link to href "/about/"
found "Presentations" link to href "/about/presentations/"
found "Performance" link to href "/about/performance/"
found "Reports" link to href "/reports/"
found "Domains" link to href "/domains/"
found "Root Zone" link to href "/domains/root/"
found ".INT" link to href "/domains/int/"
found ".ARPA" link to href "/domains/arpa/"
found "IDN Repository" link to href "/domains/idn-tables/"
found "Protocols" link to href "/protocols/"
found "Number Resources" link to href "/numbers/"
found "Abuse Information" link to href "/abuse/"
found "Internet Corporation for Assigned Names and Numbers" link to href "http://www.icann.org/"
--------Now for Xpath--------
found "Presentations" link to href "/about/presentations/"
found "Performance" link to href "/about/performance/"
found "Reports" link to href "/reports/"
found "Root Zone" link to href "/domains/root/"
found ".INT" link to href "/domains/int/"
found ".ARPA" link to href "/domains/arpa/"
found "IDN Repository" link to href "/domains/idn-tables/"
found "Abuse Information" link to href "/abuse/"
found "Internet Corporation for Assigned Names and Numbers" link to href "http://www.icann.org/"

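Basically, xpath found every link it was supposed to find, except for the ones that are in bold on Example.com. But shouldn't the asterisk wildcard in the xpath match './/tr/*/a' allow for that?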

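There may be other reasons (I haven't checked the example document closely), but your CSS selector and your XPath are not equivalent:

'tr a' -> '//tr//a'

The CSS selector

tr a

matches an a at any depth below a tr, which in XPath is

//tr//a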
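To make the mapping concrete, here is a minimal sketch of a translator for selectors that consist only of tag names, whitespace and ">". The function simple_css_to_xpath is a hypothetical helper written for this illustration; it is not how lxml or cssselect perform the translation, and it ignores classes, ids, attributes and everything else CSS offers.

```python
def simple_css_to_xpath(selector: str) -> str:
    """Translate a selector made only of tag names, spaces and '>' to XPath.

    Hypothetical helper for illustration only; not the cssselect algorithm.
    """
    # Make sure '>' is its own token, so "tr>a" and "tr > a" behave alike.
    tokens = selector.replace(">", " > ").split()
    xpath = ""
    separator = "//"            # the first tag may appear anywhere in the document
    for token in tokens:
        if token == ">":
            separator = "/"     # child combinator: exactly one level down
        else:
            xpath += separator + token
            separator = "//"    # plain whitespace is the descendant combinator
    return xpath


print(simple_css_to_xpath("tr a"))        # //tr//a
print(simple_css_to_xpath("tr > * > a"))  # //tr/*/a
```

Read the other way round, the XPath from the question corresponds to the much stricter selector 'tr > * > a', not to 'tr a'.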

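The expression from the question,

.//tr/*/a

means (conceptually, not exactly):

.  : the current node
// : all descendants of the current node
tr : all tr elements among those descendants
/  : all children of the tr elements found
*  : any element among the children of the tr elements found
/  : all children of those elements
a  : all a elements that are children of an element child of a tr element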

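In other words, in an HTML list whose links sit at different nesting depths,

//ul/*/a

will only match link1, the link that is a child of a direct child of the ul.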
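The snippet below sketches that behaviour. The markup is made up for this illustration (the second link and the b element wrapping it are assumptions), and it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. Both paths are also valid XPath 1.0, so the same strings can be passed to an lxml element's xpath() method.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: link1 sits directly in an <li>, link2 is wrapped in a <b>.
html = """
<div>
  <ul>
    <li><a href="link1">first</a></li>
    <li><b><a href="link2">second</a></b></li>
  </ul>
</div>
"""
root = ET.fromstring(html)

# a elements exactly two levels below a ul
grandchildren = [a.get("href") for a in root.findall(".//ul/*/a")]
# a elements at any depth below a ul
descendants = [a.get("href") for a in root.findall(".//ul//a")]

print(grandchildren)  # ['link1']
print(descendants)    # ['link1', 'link2']
```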

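XPath primer

An XPath is really a series of location steps separated by slashes. A location step consists of:

an axis (for example the child axis, child::)
a node test (a node name or one of the special node types, for example node() or text())
optional predicates (enclosed in []; a node matches only if all predicates are true)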
If we break .//tr/*/a into its location steps, it looks like this:

.
(the empty step between the two slashes of "//")
tr
*
a
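It may not be clear yet what I am getting at. That is because XPath has an abbreviated syntax. Here is the expression with the abbreviations expanded (axis and node test are separated by ::, steps by /):

self::node()/descendant-or-self::node()/child::tr/child::*/child::a

(Note that self::node() is redundant.)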
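The expansion can be sketched as a few string rewrites. The function expand_abbreviated is a hypothetical helper that covers only the abbreviations used in this thread ('.', '//' and the implicit child axis); predicates, '..', '@attr' and absolute paths are deliberately left out.

```python
def expand_abbreviated(path: str) -> str:
    """Expand '.', '//' and bare node tests into unabbreviated XPath steps.

    Hypothetical helper for relative paths only; not a full XPath parser.
    """
    # '//' is short for '/descendant-or-self::node()/'
    path = path.replace("//", "/descendant-or-self::node()/")
    steps = []
    for step in path.split("/"):
        if step == ".":
            steps.append("self::node()")    # '.' is short for self::node()
        elif "::" in step:
            steps.append(step)              # already has an explicit axis
        else:
            steps.append("child::" + step)  # child is the default axis
    return "/".join(steps)


print(expand_abbreviated(".//tr/*/a"))
# self::node()/descendant-or-self::node()/child::tr/child::*/child::a
```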
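Conceptually, what happens in a step is:

Given a set of context nodes (by default the current node, or the root node for a path starting with "/"):
for each context node, build the set of nodes that satisfy the location step;
merge all the per-context-node sets into one node-set;
pass that set to the next location step as its context nodes;
repeat until there are no steps left. The set remaining after the last step is the result of the whole path.

Note that this is still a simplification. Read the XPath specification for the full details if you need them.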
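The procedure above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not how lxml or libxml2 evaluate XPath: it only knows element nodes, the helper names (descendant_or_self, child, evaluate) and the sample markup are invented for the illustration, and it uses the standard library's ElementTree so it runs without lxml. The redundant self::node() step is left out.

```python
import xml.etree.ElementTree as ET


def descendant_or_self(node):
    """descendant-or-self::node(), restricted to element nodes."""
    return list(node.iter())


def child(name):
    """Build a child::name step; '*' matches any element."""
    def step(node):
        return [c for c in node if name == "*" or c.tag == name]
    return step


def evaluate(steps, context):
    """Run each step over every context node and merge the results."""
    for step in steps:
        merged = []
        for node in context:
            for found in step(node):
                if found not in merged:   # a node-set holds each node once
                    merged.append(found)
        context = merged                  # context for the next step
    return context


# Hypothetical table: one link directly in a td, one wrapped in a span.
root = ET.fromstring(
    "<table><tr>"
    "<td><a href='direct'>A</a></td>"
    "<td><span><a href='nested'>B</a></span></td>"
    "</tr></table>"
)
steps = [descendant_or_self, child("tr"), child("*"), child("a")]
result = evaluate(steps, [root])
print([a.get("href") for a in result])  # ['direct']
```

The nested link is lost at the last step: after child::* the context is the two td elements, and only one of them has an a as a direct child.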