Get text content of an HTML element using XPath?

Consider this HTML:

<div>
    <p>
    <span class="abc">Monitor</span> <b>$300</b>
    </p>
    <a href="/add">Add to cart</a>
</div>
<div>
    <p>
    <span class="abc">Keyboard</span> $20 
    </p>
    <a href="/add">Add to cart</a>
</div>
But it selected Monitor$300. I don't want the tags. How do I get only the text?

To select all descendant text nodes, not just the child text nodes:

//div[a[contains(., "Add to cart")]]/p//text()
Note the double slash between p and text().

The result may also include a lot of whitespace from between the tags, though, which you will need to clean up. An example using lxml:

>>> import lxml.etree as ET
>>> tree = ET.fromstring('''<div>
... <div>
...     <p>
...     <span class="abc">Monitor</span> <b>$300</b>
...     </p>
...     <a href="/add">Add to cart</a>
... </div>
... <div>
...     <p>
...     <span class="abc">Keyboard</span> $20 
...     </p>
...     <a href="/add">Add to cart</a>
... </div>
... </div>''')
>>> tree.xpath('//div[a[contains(., "Add to cart")]]/p//text()')
['\n    ', 'Monitor', ' ', '$300', '\n    ', '\n    ', 'Keyboard', ' $20 \n    ']
>>> res = _
>>> [txt for txt in (txt.strip() for txt in res) if txt]
['Monitor', '$300', 'Keyboard', '$20']
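
To make the effect of the double slash concrete, the sketch below runs both forms of the expression against the same markup. The variable names (HTML, base, child_text, descendant_text) are chosen here for illustration and are not from the original answer. It assumes lxml is installed and that the markup is well-formed enough for the XML parser.

```python
import lxml.etree as ET

# Same markup as in the question, wrapped in one root element so it parses as XML.
HTML = """<div>
<div>
    <p>
    <span class="abc">Monitor</span> <b>$300</b>
    </p>
    <a href="/add">Add to cart</a>
</div>
<div>
    <p>
    <span class="abc">Keyboard</span> $20
    </p>
    <a href="/add">Add to cart</a>
</div>
</div>"""

tree = ET.fromstring(HTML)
base = '//div[a[contains(., "Add to cart")]]/p'


def clean(nodes):
    """Strip each text node and drop the ones that were only whitespace."""
    return [text.strip() for text in nodes if text.strip()]


# Single slash: only text nodes that are direct children of <p>.
child_text = clean(tree.xpath(base + '/text()'))

# Double slash: text nodes at any depth below <p>, including <span> and <b>.
descendant_text = clean(tree.xpath(base + '//text()'))

print(child_text)       # ['$20']
print(descendant_text)  # ['Monitor', '$300', 'Keyboard', '$20']
```

With the single slash, only "$20" survives, because it is the only non-whitespace text sitting directly inside a p element. Everything else lives inside span or b.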
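text() should not select elements. What XML parser are you using?
@choroba scrapy.selector.lxmlsel.HtmlXPathSelector
How do you access the value? In the DOM Level 3 world you can select the p elements, e.g. with //div[a[contains(., "Add to cart")]]/p, and then access the textContent property to get the plain text content.
@MartinHonnen I'm using XPathSelector.
Wow! That double / saved me a lot of time.
Glad it worked for you. :-) I just wanted you to understand where the whitespace comes from and how to clean it up.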
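
The textContent suggestion in the comments has a rough counterpart in lxml: select the p elements themselves, then collapse each one to a single string. The sketch below is one possible way to do that, not what the commenters ran. The names HTML, paragraphs, via_itertext and via_xpath are made up for this example, and it uses plain lxml rather than the Scrapy selector the asker mentions.

```python
import lxml.etree as ET

# Same markup as in the question, wrapped in one root element so it parses as XML.
HTML = """<div>
<div>
    <p>
    <span class="abc">Monitor</span> <b>$300</b>
    </p>
    <a href="/add">Add to cart</a>
</div>
<div>
    <p>
    <span class="abc">Keyboard</span> $20
    </p>
    <a href="/add">Add to cart</a>
</div>
</div>"""

tree = ET.fromstring(HTML)

# Select the <p> elements, not their text nodes.
paragraphs = tree.xpath('//div[a[contains(., "Add to cart")]]/p')

# Option 1: join every text fragment under each <p>, then collapse whitespace runs.
via_itertext = [' '.join(''.join(p.itertext()).split()) for p in paragraphs]

# Option 2: let XPath do the same with normalize-space() on each element.
via_xpath = [p.xpath('normalize-space(.)') for p in paragraphs]

print(via_itertext)  # ['Monitor $300', 'Keyboard $20']
print(via_xpath)     # ['Monitor $300', 'Keyboard $20']
```

Either option returns one string per product, which may be easier to work with than a flat list of fragments when the name and price need to stay together.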