How do I scrape images that are not fully loaded at page load, using Python/Selenium/BeautifulSoup?
I am trying to scrape an e-commerce site. I can successfully scrape all the data except the images: when I try to grab them, I get the first 3 or 4 image URLs, but the rest come back as placeholders. Here is my code:
import requests
import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'https://pages.daraz.com.bd/'
offers = url + 'wow/gcp/daraz/megascenario/bd/ramadan_eidcampaign_april21/grocery_free_shipping'
driver = webdriver.Chrome(executable_path=r'D:\Py\Hive-Ecommerce\static\chromedriver.exe')
driver.get(offers)
output = []
wait = WebDriverWait(driver, 30)
wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "product2-in-a-row-item")))
html = driver.page_source
soup = bs4.BeautifulSoup(html, 'html.parser')
driver.close()
for product in soup.find_all("div", {"class": "product2-in-a-row-item"}):
    image = product.find("img", {"class": "rax-image"})
    title = product.find("span", {"class": "product-item-bottom-title"})
    price = product.find_all("div", {"class": "lzd-price"})
    discount = product.find_all("span", {"class": "text"})
    link = product.find("a", {"class": "lzd-item"})
    image = image['src']
    productName = title.text
    price = price[0].text if len(price) else 0
    discount = discount[0].text if len(discount) else 0
    link = link['href']
    print(image)
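(For reference: lazy-loading pages typically ship a tiny placeholder, often a `data:` URI, in `src`, and only swap in the real URL once the image scrolls into view; some sites also stash the real URL in a separate attribute such as `data-src`. The attribute name varies per site and is an assumption here. A small helper that prefers the real URL when one is present:)

```python
def real_image_url(attrs):
    """Return the best image URL from an <img> tag's attribute dict.

    Prefers a lazy-load attribute (assumed here to be 'data-src') over
    'src' when 'src' still holds a data: URI placeholder.
    """
    src = attrs.get("src", "")
    lazy = attrs.get("data-src", "")
    if src.startswith("data:") and lazy:
        return lazy
    return src or lazy

# placeholder in src, real URL stashed in data-src
print(real_image_url({"src": "data:image/gif;base64,R0lGOD",
                      "data-src": "https://img.example/1.jpg"}))
```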
Is there any way to scrape all the images correctly?

The data you see is loaded via Ajax from an external URL. You can use this example to see how the images are loaded. Note: the script may need fresh cookie values. When you open the Firefox developer tools -> Network tab, you will see the request there with all its parameters/cookies:
import json
import requests

api_url = "https://acs-m.daraz.com.bd//h5/mtop.lazada.kangaroo.core.service.route.drzaldlampservice/1.0/"

cookies = {
    "_m_h5_tk": "82c8b6ce7a958daa1f7ce6279854d666_1620564702086",
    "_m_h5_tk_enc": "1A493ABF3BD09BC1254FDBD0974ECB",
}

params = {
    "jsv": "2.5.1",
    "appKey": "24936599",
    "t": "1620555452560",
    "sign": "DF68466735D67D47D4FD3CD15E4D58DBA7A",
    "api": "mtop.lazada.kangaroo.core.service.route.drzAldLampService",
    "v": "1.0",
    "type": "originaljson",
    "isSec": "1",
    "AntiCreep": "true",
    "timeout": "20000",
    "dataType": "json",
    "sessionOption": "AutoLoginOnly",
    "x-i18n-language": "en-BD",
    "x-i18n-regionID": "BD",
    "data": '{"pageNo":1,"pageSize":30,"pageId":80065583,"platform":"pc","appId":"1472729","bizId":"1000003","terminalType":0,"language":"en","currency":"pkr","regionId":"BD","cna":"","backupParams":"currency,regionId,terminalType,language,id","uPVuUID":"","curPageUrl":"https://pages.daraz.com.bd/wow/gcp/daraz/megascenario/bd/ramadan_eidcampaign_april21/grocery_free_shipping","isbackup":true}',
}

data = requests.get(api_url, params=params, cookies=cookies).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for d in data["data"]["resultValue"]["1472729"]["data"]:
    print(d)  # the original formatted print statement was truncated in the source

Here is your solution using selenium. The data is loaded from an external URL and takes some time to load:
import requests
import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
url = 'https://pages.daraz.com.bd/'
offers = url + 'wow/gcp/daraz/megascenario/bd/ramadan_eidcampaign_april21/grocery_free_shipping'
driver = webdriver.Firefox(executable_path=r'*/*/geckodriver')
driver.get(offers)
wait = WebDriverWait(driver, 30)
wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "product2-in-a-row-item")))
scrolls = 7
while True:
    scrolls -= 1
    print(scrolls)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(5)
    if scrolls < 0:
        break
html = driver.page_source
output = []
driver.close()
soup = bs4.BeautifulSoup(html, 'html.parser')
for product in soup.find_all("div", {"class": "product2-in-a-row-item"}):
    image = product.find("img", {"class": "rax-image"})
    title = product.find("span", {"class": "product-item-bottom-title"})
    price = product.find_all("div", {"class": "lzd-price"})
    discount = product.find_all("span", {"class": "text"})
    links = product.select('img.rax-image[src]')[0]['src']
    print(links)
    if links.startswith(" https"):
        print('link : ', links)
    image = image['src']
    productName = title.text
    price = price[0].text if len(price) else 0
    discount = discount[0].text if len(discount) else 0
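The fixed seven-scroll loop above works, but a more robust pattern is to keep scrolling until the document height stops growing. Here is a driver-agnostic sketch, with callables standing in for the `driver.execute_script(...)` calls, so the stopping logic itself can be exercised without a browser:

```python
def scroll_until_stable(get_height, scroll_to_bottom, max_rounds=10):
    """Scroll to the bottom repeatedly until the page height stops changing.

    get_height:       callable returning the current document height
    scroll_to_bottom: callable performing one scroll to the bottom
    max_rounds:       safety cap on the number of scrolls
    """
    last = get_height()
    for _ in range(max_rounds):
        scroll_to_bottom()
        height = get_height()
        if height == last:  # nothing new was lazy-loaded
            break
        last = height
    return last
```

With Selenium this would be wired up as `get_height=lambda: driver.execute_script("return document.body.scrollHeight")` and `scroll_to_bottom=lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")`, ideally with a short `time.sleep` inside the scroll callable so newly loaded items have time to render.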
Hey, this is actually a nice solution that doesn't change my code much, and I needed that scroll-down feature. But it still prints some placeholder images.

Hey, this works perfectly. Just one question, though: how do I figure out which API is responsible for populating the page with products?
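On that last comment: open the browser dev tools, filter the Network tab by XHR/Fetch, reload the page, and look for the response whose JSON body contains the product titles; that request is the one to replicate. Here the interesting part is the `data` query parameter, which is itself a JSON string, so fetching pages beyond the first 30 items is a matter of decoding it, bumping `pageNo`, and re-encoding. A sketch, using a trimmed version of the parameter shown in the first answer:

```python
import json

def next_page(data_param):
    """Return a copy of the mtop `data` query parameter with pageNo advanced by one."""
    payload = json.loads(data_param)
    payload["pageNo"] += 1
    return json.dumps(payload, separators=(",", ":"))

# trimmed example of the `data` parameter from the first answer
data_param = '{"pageNo":1,"pageSize":30,"pageId":80065583,"platform":"pc"}'
print(next_page(data_param))  # pageNo becomes 2
```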