Python 3.x: Extracting an image URL as a string using XPath

python-3.x xpath web-scraping


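I am unable to extract the product image URL from Flipkart using XPath.

URL:

The goal is to extract the image URL held in the src attribute.

In this case, the output should be:

The XPath I used is: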

//*[@class="_2rDnao"]//img[@src]

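Running this XPath in the Chrome XPath Helper extension returns the desired output, but the same expression returns an empty result when used in a Python script.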

import requests
from lxml import html

# Browser-like headers sent with the request
request_headers = {
    "Accept-Language": "en-US,en;q=0.5",
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0.15063; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Referer": "http://thewebsite.com",
    "Connection": "keep-alive",
}

webpage = requests.get(
    "https://www.flipkart.com/savehatke/p/itmea2aspwcaxuaz?pid=ACCEA2ASHNDGV4DP",
    headers=request_headers,
)
tree = html.fromstring(webpage.content)
# Returns a list of <img> elements (empty in this case), not URL strings
raw_img = tree.xpath('//*[@class="_2rDnao"]//img')

Edit: added the Python code.

Note: this solution is based on Selenium. Your XPath is correct. You have to call get_attribute on the element to read the value of src.

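from selenium.webdriver.common.by import By
# `driver` is assumed to be a WebDriver with the product page already loaded
imgElement = driver.find_element(By.XPATH, "//*[@class='_2rDnao']//img")
print(imgElement.get_attribute('src'))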

The output is:
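With lxml, the counterpart of get_attribute is to select the attribute node itself, or to call .get("src") on each matched element. The sketch below runs against a small inline HTML snippet that is purely hypothetical (the markup and the example.com URL are assumptions, not the real Flipkart page). It only helps when the element is actually present in the HTML that requests receives; if the page builds that element with JavaScript, the result stays empty.

```python
from lxml import html

# Hypothetical markup standing in for the product page; the real page may differ.
SAMPLE_HTML = """
<div class="_2rDnao">
  <div><img src="https://example.com/images/product.jpeg" alt="product"></div>
</div>
"""

tree = html.fromstring(SAMPLE_HTML)

# Selecting the element yields lxml element objects, not strings.
img_elements = tree.xpath('//*[@class="_2rDnao"]//img[@src]')

# Selecting the attribute node yields the src values as strings.
img_urls = tree.xpath('//*[@class="_2rDnao"]//img/@src')

# Equivalent: read the attribute from each matched element.
img_urls_via_get = [img.get("src") for img in img_elements]

print(img_urls)
```

Both forms return a list, so an empty list means the XPath matched nothing in the fetched HTML.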

The image URL is also available in a script near the bottom of the page that contains JSON.

import requests
from bs4 import BeautifulSoup
import json

r = requests.get('https://www.flipkart.com/f-d-f550x-56-w-bluetooth-home-theatre/p/itmea2aspwcaxuaz?pid=ACCEA2ASHNDGV4DP')
soup = BeautifulSoup(r.text, 'html.parser')
# The JSON data sits in the <script> element with id "jsonLD"
script = soup.find(id='jsonLD')
data = json.loads(script.text)
for obj in data:
    if obj['@type'] == 'Product':
        url = obj['image']
        print(url)

The output is:

http://rukmini1.flixcart.com/image/128/128/speaker/home-audio-speaker/4/d/p/f-d-a550x-original-imaea2ftzywquzrz.jpeg?q=70

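I don't see an image with the same dimensions even when checking the page through XPath. If you don't mind some variation in size (you can always resize), it is easy to pull the URL out of response.text with a regex.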

import requests, re

r = requests.get('https://www.flipkart.com/f-d-f550x-56-w-bluetooth-home-theatre/p/itmea2aspwcaxuaz?pid=ACCEA2ASHNDGV4DP')
p = re.compile(r'image":"(.*?)"')
print(p.findall(r.text)[0])
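Indexing the result with [0] raises IndexError when the pattern matches nothing, for example if the site returns a different page. A slightly more defensive variant could look like the sketch below. The helper name first_image_url and the sample strings are hypothetical, and the pattern additionally tolerates whitespace around the colon, which is an assumption about how the JSON might be formatted.

```python
import re

# Matches "image":"<url>" and also tolerates spaces around the colon.
IMAGE_PATTERN = re.compile(r'"image"\s*:\s*"(.*?)"')


def first_image_url(page_text):
    """Return the first image URL found in page_text, or None if absent."""
    match = IMAGE_PATTERN.search(page_text)
    return match.group(1) if match else None


# Hypothetical response bodies used only for illustration.
sample_text = '{"@type":"Product","image":"http://example.com/image/128/128/sample.jpeg?q=70","name":"x"}'
print(first_image_url(sample_text))
print(first_image_url("<html>no json here</html>"))
```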

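Where is the rest of the code? Are you using Selenium?

@QHarr I have added the code.

If we don't use Beautiful Soup, do we have other options?

I could not find an API on the site that serves this data, so this is the simplest solution I could think of. You can always use Selenium (see the other answer), but that is more cumbersome, and this approach is easier to implement. You only need to run

pip install requests

and

pip install bs4

on the command line for this solution to work.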
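For readers who would rather not install bs4, the same JSON lookup can probably be done with the standard library alone. The sketch below is one possible approach, not the answer's method: the class ScriptByIdParser, the function product_images, and the sample page are all hypothetical, and it assumes the page still carries a script element with id "jsonLD" holding either a single JSON object or a list of objects.

```python
import json
from html.parser import HTMLParser


class ScriptByIdParser(HTMLParser):
    """Collects the text of the <script> element with a given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def product_images(page_html, script_id="jsonLD"):
    """Return the 'image' value of every Product object in the JSON script."""
    parser = ScriptByIdParser(script_id)
    parser.feed(page_html)
    parser.close()
    raw = "".join(parser.chunks).strip()
    if not raw:
        return []
    data = json.loads(raw)
    # The JSON may be a single object or a list of objects.
    objects = data if isinstance(data, list) else [data]
    return [
        obj["image"]
        for obj in objects
        if isinstance(obj, dict) and obj.get("@type") == "Product" and "image" in obj
    ]


# Hypothetical page used only for illustration.
SAMPLE_PAGE = """
<html><body>
<script id="jsonLD" type="application/ld+json">
[{"@type": "BreadcrumbList"},
 {"@type": "Product", "image": "http://example.com/image/128/128/sample.jpeg?q=70"}]
</script>
</body></html>
"""

print(product_images(SAMPLE_PAGE))
```

With a live page, the text passed in would be r.text from requests.get, as in the answer above.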