Getting the HTML text between tags (Python, lxml, urllib, XPath)


python html xpath


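I am trying to parse some HTML and want to retrieve the actual HTML between tags, but my code returns what I believe are the elements' memory locations.

Here is my current code: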

import urllib.request, http.cookiejar
from lxml import etree
import io
site = "http://somewebsite.com"


cj = http.cookiejar.CookieJar()
request = urllib.request.Request(site)
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
request.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0')
html = etree.HTML(opener.open(request).read())

xpath = "//li[1]//cite[1]"
filtered_html = html.xpath(xpath)
print(filtered_html)

Here is a snippet of the HTML:

<div class="f kv">
<cite>
www.
<b>hello</b>
online.com/
</cite>
<span class="vshid">
</div>



Currently my code returns:

[<Element cite at 0x36a65e8>, <Element cite at 0x36a6510>, <Element cite at 0x36a64c8>]


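How do I extract the actual HTML between the cite tags? If I append "/text()" to the end of the XPath, it gets me closer, but it ignores the content of the b tag. My ultimate goal is for my code to give me "www.helloonline.com/".

Thanks.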

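Use .//text() to get all the text nodes below a given element. Note that xpath() returns a list, so pick one element first, and keep the leading dot: a bare //text() is evaluated from the document root and returns every text node on the page.

text = filtered_html[0].xpath('.//text()')
print(''.join(t.strip() for t in text))  # prints "www.helloonline.com/"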
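
The approach above can be tried without any network access. The following is a minimal sketch, assuming lxml is installed; the `SNIPPET` string is adapted from the question's markup (wrapped in a hypothetical `<ul><li>` so the original XPath matches, and with the unclosed `<span>` dropped), and `cite_text` is a helper name invented for illustration.

```python
from lxml import etree

# Sample markup adapted from the question; the <ul><li> wrapper is an
# assumption added so that the XPath from the question has something to match.
SNIPPET = """
<ul><li><div class="f kv">
<cite>
www.
<b>hello</b>
online.com/
</cite>
</div></li></ul>
"""

def cite_text(markup):
    """Return the joined, whitespace-stripped text of the first matching <cite>."""
    root = etree.HTML(markup)
    cites = root.xpath("//li[1]//cite[1]")   # a list of elements, possibly empty
    if not cites:
        return ""
    # The leading dot keeps the search relative to this <cite> element.
    parts = cites[0].xpath(".//text()")
    return "".join(p.strip() for p in parts)

result = cite_text(SNIPPET)
print(result)  # prints "www.helloonline.com/"
```

Stripping each piece before joining is what removes the newlines around "www." and "online.com/"; it would also remove meaningful spaces between words, so it probably only suits compact values such as URLs.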

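Do you want html or text? Do you want ['www.', 'hello', 'online.com/']?

I thought I would have to get the HTML first and strip the tags, but what I actually want is to join your result to get "www.helloonline.com/".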
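
Since the comments touch on both the text and the HTML, here is a hedged sketch of the difference, again assuming lxml is installed. It is not part of the original answer: it uses a compact one-line version of the cite markup, and the variable names are illustrative. `itertext()` and the XPath `string()` function are alternative ways to reach the joined text, while `etree.tostring()` serializes the element with its markup.

```python
from lxml import etree

# Compact version of the question's <cite> element (an assumption made
# so the whitespace handling is easy to follow).
cite = etree.HTML("<cite>www.<b>hello</b>online.com/</cite>").xpath("//cite")[0]

# itertext() yields the element's text nodes in document order,
# including the text inside <b>.
joined = "".join(s.strip() for s in cite.itertext())

# XPath string() concatenates all descendant text in a single call;
# unlike the line above it does not strip whitespace.
as_string = cite.xpath("string()")

# tostring() returns the element serialized with its tags, which is the
# "actual HTML" rather than the text.
markup = etree.tostring(cite, encoding="unicode", with_tail=False)

print(joined)     # prints "www.helloonline.com/"
print(as_string)
print(markup)
```

If only the text is needed, there is no need to serialize the HTML and strip tags afterwards; the text-based calls avoid that step.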