Python: How do I use XPath to retrieve both nested and non-nested children from a parent HTML element?

python html xml xpath

I am building a web crawler in Python. The HTML being parsed has some strings that sit directly inside the parent tag, like this:

<div class="chapter-content3">
<noscript>...stuff here filtered successfully</noscript>
<center>...stuff here filtered successfully</center>
<h4>..stuff here shows</h4>
<p>...stuff here shows</p>
<br>
"this stuff here doesnt show"
<br>
"this neither"
<p>..stuff here shows</p>
</div>

My XPath returns everything except the strings that sit directly inside the parent.

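How should I construct the XPath so that it returns everything directly inside the parent, including the strings?

You were almost correct. This: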

//div[@class="chapter-content3"]/*[
   not(self::noscript) and not(self::center) and not(@class="row")
]

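selects only actual elements, because `*` matches element nodes and skips bare text nodes. You want to select all nodes, which would be: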

//div[@class="chapter-content3"]//node()[
   not(self::noscript) and not(self::center) and not(@class="row")
]

Or, slightly shorter:

//div[@class="chapter-content3"]//node()[
   not(self::noscript or self::center or @class="row")
]
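
To see the difference between the two selections, here is a minimal sketch. It assumes the `lxml` library (the question only says "Python"), and the `SAMPLE` markup and the `describe` helper are made up for illustration. Whitespace-only text nodes are dropped so the lists are easier to read.

```python
from lxml import html

# Hypothetical sample, modelled on the markup in the question.
SAMPLE = """<div class="chapter-content3">
<noscript>hidden</noscript>
<center>ad</center>
<h4>Title</h4>
<p>first</p>
<br>
bare one
<br>
bare two
<p>last</p>
</div>"""

# Element children only: bare strings are skipped.
ELEMENTS_ONLY = '''//div[@class="chapter-content3"]/*[
   not(self::noscript or self::center or @class="row")
]'''

# Every descendant node, text nodes included.
ALL_NODES = '''//div[@class="chapter-content3"]//node()[
   not(self::noscript or self::center or @class="row")
]'''


def describe(nodes):
    """Render elements as '<tag>' and text nodes as stripped strings."""
    out = []
    for node in nodes:
        if isinstance(node, str):  # lxml returns text nodes as str subclasses
            text = node.strip()
            if text:  # ignore whitespace-only text nodes
                out.append(text)
        else:
            out.append("<%s>" % node.tag)
    return out


tree = html.fromstring(SAMPLE)
elements_only = describe(tree.xpath(ELEMENTS_ONLY))
all_nodes = describe(tree.xpath(ALL_NODES))

print(elements_only)
print(all_nodes)
```

The bare strings show up only in the second list. Note that the predicate tests each node itself, so the text inside `noscript` and `center` also comes through in the second list, even though those elements are excluded.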

Or, another way to think about it: select all text nodes, except those with an unwanted ancestor:

//div[@class="chapter-content3"]//text()[
   not(ancestor::noscript or ancestor::center or ancestor::*/@class="row")
]
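
As a rough sketch of how the text-node expression behaves, again assuming `lxml` and a made-up `SAMPLE` (the `div class="row"` block is added here only to exercise the `@class="row"` test):

```python
from lxml import html

# Hypothetical sample, modelled on the markup in the question.
SAMPLE = """<div class="chapter-content3">
<noscript>hidden</noscript>
<center>ad</center>
<div class="row">nav</div>
<h4>Title</h4>
<p>first</p>
<br>
bare one
<br>
bare two
<p>last</p>
</div>"""

# All text nodes under the div, minus those inside an unwanted ancestor.
TEXT_ONLY = '''//div[@class="chapter-content3"]//text()[
   not(ancestor::noscript or ancestor::center or ancestor::*/@class="row")
]'''

tree = html.fromstring(SAMPLE)

# XPath returns node-sets in document order, so input order is preserved.
pieces = [t.strip() for t in tree.xpath(TEXT_ONLY)]
pieces = [t for t in pieces if t]  # drop whitespace-only text nodes

print(pieces)
```

Because the filter looks at ancestors rather than at the node itself, the text inside `noscript`, `center`, and the `row` block is excluded, while nested text (inside `h4` and `p`) and the bare strings are both kept.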

Do you want everything to be covered by a single XPath?

@Edwin, as long as the resulting HTML is in the same order as the input HTML, any solution is fine.
