Extracting href URLs with Python requests
I want to extract a URL from an XPath using the requests package in Python. I can get the text, but none of my attempts return the URL. Can anyone help?
ipdb> webpage.xpath(xpath_url + '/text()')
['Text of the URL']
ipdb> webpage.xpath(xpath_url + '/a()')
*** lxml.etree.XPathEvalError: Invalid expression
ipdb> webpage.xpath(xpath_url + '/href()')
*** lxml.etree.XPathEvalError: Invalid expression
ipdb> webpage.xpath(xpath_url + '/url()')
*** lxml.etree.XPathEvalError: Invalid expression
I used this tutorial to get started:
It seems like it should be easy, but nothing turned up in my searching.
Thanks.
You're better off using BeautifulSoup: you can print each link, append it to a list, and so on. To loop over the links, use:
# find_all('a href') would look for a tag literally named "a href";
# pass href=True to match only <a> tags that have an href attribute
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])
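If BeautifulSoup isn't available, the same extraction can be sketched with the standard library's `html.parser` (the sample HTML below is made up for illustration; in practice you would feed it `page.content` decoded to text):

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)


# Made-up HTML standing in for a fetched page body
html_doc = (
    '<html><body>'
    '<a href="/ex/002.html">Next</a>'
    '<a href="/ex/003.html">Later</a>'
    '</body></html>'
)

parser = LinkExtractor()
parser.feed(html_doc)
print(parser.links)  # hrefs in document order
```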
Have you tried
webpage.xpath(xpath_url + '/@href')
Here is the complete code:
from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
webpage = html.fromstring(page.content)
webpage.xpath('//a/@href')
The result should be:
[
'http://econpy.pythonanywhere.com/ex/002.html',
'http://econpy.pythonanywhere.com/ex/003.html',
'http://econpy.pythonanywhere.com/ex/004.html',
'http://econpy.pythonanywhere.com/ex/005.html'
]
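The same pattern can be checked offline by parsing an inline snippet instead of a live page; this sketch (with made-up URLs) also pairs each link's text with its href, which is often what you actually want:

```python
from lxml import html

# Made-up HTML standing in for page.content
doc = html.fromstring(
    '<div>'
    '<a href="http://example.com/1">first</a>'
    '<a href="http://example.com/2">second</a>'
    '</div>'
)

hrefs = doc.xpath('//a/@href')   # attribute nodes come back as plain strings
texts = doc.xpath('//a/text()')  # text nodes come back as plain strings

for text, href in zip(texts, hrefs):
    print(text, '->', href)
```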
A benefit of using requests_html with a context manager:
import requests
import requests_html

with requests_html.HTMLSession() as s:
    try:
        r = s.get('http://econpy.pythonanywhere.com/ex/001.html')
        links = r.html.links
        for link in links:
            print(link)
    except requests.exceptions.RequestException:
        pass  # swallow network errors here; log them in real code
You can do it easily with Selenium:
link = webpage.find_element_by_xpath('<xpath to the element containing the link>')
url = link.get_attribute('href')
Can you share the value of xpath_url? The XPath in the first line seems to be interpreted correctly, but the XPath statements below it are probably not valid.
@jeedo Your comment helped me realize my XPath ends with "div/h2/a", so per jeremija's answer, appending /@href
is enough. Thanks!
Thanks, @href
works. Now I need to understand why text is text()
while href is @href
. I believe it's because @
refers to an element's attribute, while text()
returns the content of the selected node. bs4 seems to be a popular approach; in this case I'd like to stick with python requests, but it's definitely useful for future reference. Thank you very much.
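That reading is correct: in XPath, `@name` selects an attribute node (any attribute, not just href), while `text()` selects an element's direct text children. A small sketch with made-up element and attribute values:

```python
from lxml import html

doc = html.fromstring(
    '<html><body>'
    '<p id="intro" class="lead">Hello <b>world</b></p>'
    '</body></html>'
)

print(doc.xpath('//p/@id'))     # attribute node selected with @  -> ['intro']
print(doc.xpath('//p/@class'))  # @ works for any attribute       -> ['lead']
print(doc.xpath('//p/text()'))  # direct text children only; the text
                                # inside <b> belongs to <b>, not <p>
```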