Python: extracting a hyperlink from a link tag using XPath

Consider the following HTML:

<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>

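After parsing this with etree.HTML(), querying the text of the link element with //item/link/text() returns an empty list. However, the following does return a link element: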

links=x.xpath('//item/link')        #returns <Element link at 0xb6b0ae0c>

Can anyone suggest how to extract the URLs from the link tag?

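When the content is parsed with etree (here, etree.HTML()), the <link> tag gets closed immediately, so the link element has no text value.

Demo: the interpreter session listed near the end of this page shows the serialized tree, in which <link/> is empty and the URL follows it as plain text.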
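To make the point above concrete, here is a minimal sketch, assuming lxml is installed and using the sample markup from the question. It suggests that the URL is not lost by the HTML parser; it appears to end up as the trailing ("tail") text of the empty link element. Reading `.tail` is offered only as a possible workaround, not as the approach recommended in the answers.

```python
from lxml import etree

content = """<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>"""

# Parse with the HTML parser, as in the question.
x = etree.HTML(content)
link = x.xpath('//item/link')[0]

# The HTML parser treats <link> as an empty element, so it carries no text.
print(link.text)

# The URL is attached to the element as trailing ("tail") text instead.
url = (link.tail or "").strip()
print(url)
```

This relies on the HTML parser's recovery behaviour, so parsing the document as XML is likely the more robust choice.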

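You are using the wrong parser for the job; you don't have HTML, you have XML.

A proper HTML parser will ignore the contents of a <link> tag, because in the HTML specification that tag is always empty. The <head> listing near the end of this page shows a typical HTML <link>, which carries only attributes.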

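Use the etree.parse() function to parse the URL stream as XML (no separate .read() call is needed); the final listing on this page shows this together with the XPath queries.

You could also use etree.fromstring(page), but leaving the reading to the parser is easier.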
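The two XML-parsing options mentioned above can be sketched as follows, assuming lxml is installed. To keep the example runnable offline, an `io.BytesIO` object stands in for the response returned by opening the URL; that stand-in and the variable names are assumptions made for illustration.

```python
import io
from lxml import etree

xml_bytes = b"""<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>"""

# Option 1: hand a file-like object to etree.parse().
# io.BytesIO is a stand-in for the response object of a real URL request.
stream = io.BytesIO(xml_bytes)
tree = etree.parse(stream)
titles = tree.xpath('//item/title/text()')
links = tree.xpath('//item/link/text()')
print(titles, links)

# Option 2: parse content that has already been read into memory.
root = etree.fromstring(xml_bytes)
links_from_string = root.xpath('//item/link/text()')
print(links_from_string)
```

With the XML parser, <link> is an ordinary element, so its text is available to the text() query.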

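item and link are not valid HTML elements; why are you using etree.HTML() here?

Yes, you are right about the tag being closed. Is there any way to keep its content intact? @Vivek Sable

OK, I think all parsers are implemented according to the HTML rules. I will check, or use string processing instead. Take the string as s; then we can use s[s.find("<link>") + len("<link>"):s.find("</link>")]. This is basic string slicing.

If you need more help, feel free to ask on Stack Overflow.

@Taranjeet: Again, why are you using etree.HTML() here? You have XML, not HTML.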
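The string-slicing fallback suggested in the comments could look roughly like the sketch below. The helper name `extract_between` and its default arguments are hypothetical, introduced only for illustration. It returns the first match only and does no real XML handling, so it is best seen as a last resort next to a proper XML parser.

```python
def extract_between(s, start_tag="<link>", end_tag="</link>"):
    """Return the text between the first start_tag/end_tag pair, or None.

    Hypothetical helper illustrating plain string slicing.
    """
    start = s.find(start_tag)
    if start == -1:
        return None
    start += len(start_tag)
    # Search for the closing tag only after the opening tag.
    end = s.find(end_tag, start)
    if end == -1:
        return None
    return s[start:end]


content = """<item>
<title>this is the title</title>
<link>www.linktoawebsite.com</link>
</item>"""

print(extract_between(content))
```

Checking the result of `find()` for -1 avoids silently slicing from the wrong position when a tag is missing, which the one-line version in the comment does not guard against.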

>>> from lxml import etree
>>> content = """<item>
... <title>this is the title</title>
... <link>www.linktoawebsite.com</link>
... </item>"""
>>> x = etree.HTML(content)
>>> etree.tostring(x)
'<html><body><item>\n<title>this is the title</title>\n<link/>www.linktoawebsite.com\n</item></body></html>'
>>>

<head>
<link rel="stylesheet" type="text/css" href="theme.css">
</head>

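import urllib  # Python 2; on Python 3 use: from urllib.request import urlopen
from lxml import etree

response = urllib.urlopen(url)
tree = etree.parse(response)  # parses the stream as XML; no separate .read() call

titles = tree.xpath('//item/title/text()')
links = tree.xpath('//item/link/text()')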