How can I scrape images that are not fully loaded at page load, using Python/Selenium/BeautifulSoup?


I am trying to scrape an e-commerce site, and I can successfully scrape all the data except the images. When I try to scrape the images, I get the first 3 or 4 image URLs, but the rest show placeholders. Here is my code:

import requests
import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://pages.daraz.com.bd/'
offers = url + 'wow/gcp/daraz/megascenario/bd/ramadan_eidcampaign_april21/grocery_free_shipping'
driver = webdriver.Chrome(executable_path=r'D:\Py\Hive-Ecommerce\static\chromedriver.exe')
driver.get(offers)
output = []
wait = WebDriverWait(driver, 30)
wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "product2-in-a-row-item")))
html = driver.page_source
soup = bs4.BeautifulSoup(html, 'html.parser')
driver.close()
for product in soup.find_all("div", {"class": "product2-in-a-row-item"}):
    image = product.find("img", {"class": "rax-image"})
    title = product.find("span", {"class": "product-item-bottom-title"})
    price = product.find_all("div", {"class": "lzd-price"})
    discount = product.find_all("span", {"class": "text"})
    link = product.find("a", {"class": "lzd-item"})
    image = image['src']
    productName = title.text
    price = price[0].text if len(price) else 0
    discount = discount[0].text if len(discount) else 0
    link = link['href']
    print(image)

Is there any way to scrape all the images correctly?

The data you see is loaded via Ajax from an external URL. You can use this example to see how the images are loaded.

Note: the script may need fresh cookie values. If you open Firefox Developer Tools -> Network tab, you will see the request there with all of its parameters/cookies:

import json
import requests

api_url = "https://acs-m.daraz.com.bd//h5/mtop.lazada.kangaroo.core.service.route.drzaldlampservice/1.0/"

cookies = {
    "_m_h5_tk": "82c8b6ce7a958daa1f7ce6279854d666_1620564702086",
    "_m_h5_tk_enc": "1A493ABF3BD09BC1254FDBD0974ECB",
}

params = {
    "jsv": "2.5.1",
    "appKey": "24936599",
    "t": "1620555452560",
    "sign": "DF68466735D67D47D4FD3CD15E4D58DBA7A",
    "api": "mtop.lazada.kangaroo.core.service.route.drzAldLampService",
    "v": "1.0",
    "type": "originaljson",
    "isSec": "1",
    "AntiCreep": "true",
    "timeout": "20000",
    "dataType": "json",
    "sessionOption": "AutoLoginOnly",
    "x-i18n-language": "en_BD",
    "x-i18n-regionID": "BD",
    "data": '{"pageNo":1,"pageSize":30,"pageId":80065583,"platform":"pc","appId":"1472729","bizId":"1000003","terminalType":"0","language":"en","currency":"pkr","regionId":"BD","cna":"","backupParams":"currency,regionID,terminalType,language,id","uPVuUID":"","curPageUrl":"https://pages.daraz.com.bd/wow/gcp/daraz/megascenario/bd/ramadan_eidcampaign_april21/grocery_free_shipping","isbackup":true}',
}

data = requests.get(api_url, params=params, cookies=cookies).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for d in data["data"]["resultValue"]["1472729"]["data"]:
    # the original formatted print call was truncated in the source;
    # printing the raw item instead
    print(d)
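Since the `data` query parameter is itself a JSON string, it can be less error-prone to build it as a Python dict and serialize it with `json.dumps` rather than editing the string by hand. A minimal sketch, with the field values copied from the captured request above (only a subset of fields shown):

```python
import json

# Request payload rebuilt as a dict; pageNo/pageSize can be varied
# to page through results.
page_request = {
    "pageNo": 1,
    "pageSize": 30,
    "pageId": 80065583,
    "platform": "pc",
    "appId": "1472729",
}

# compact separators reproduce the un-spaced JSON seen in the captured request
data_param = json.dumps(page_request, separators=(",", ":"))
print(data_param)
```

The resulting string can then be passed as `params["data"]`.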

Here is your solution using selenium. The data is loaded from an external URL and takes some time to load:

import requests
import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
url = 'https://pages.daraz.com.bd/'
offers = url + 'wow/gcp/daraz/megascenario/bd/ramadan_eidcampaign_april21/grocery_free_shipping'
driver = webdriver.Firefox(executable_path=r'*/*/geckodriver')
driver.get(offers)

wait = WebDriverWait(driver, 30)
wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "product2-in-a-row-item")))
scrolls = 7
while True:
    scrolls -= 1
    print(scrolls)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(5)
    if scrolls < 0:
        break
html = driver.page_source
output = []
driver.close()
soup = bs4.BeautifulSoup(html, 'html.parser')
for product in soup.find_all("div", {"class": "product2-in-a-row-item"}):
    image = product.find("img", {"class": "rax-image"})
    title = product.find("span", {"class": "product-item-bottom-title"})
    price = product.find_all("div", {"class": "lzd-price"})
    discount = product.find_all("span", {"class": "text"})

    links = product.select('img.rax-image[src]')[0]['src']
    print(links)
    # no leading space before "https" -- the original ' https' check never matched
    if links.startswith("https"):
        print('link : ', links)

    image = image['src']
    productName = title.text
    price = price[0].text if len(price) else 0
    discount = discount[0].text if len(discount) else 0
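Since the comments below note that some placeholder images still get printed, a small helper can separate real image URLs from lazy-load placeholders after scraping. This is a sketch under the assumption that real images use absolute https URLs while placeholders are `data:` URIs or other non-https values (check the actual `src` values on the page):

```python
def real_image_urls(srcs):
    """Return only the src values that look like fully loaded images.

    Assumption: real images are absolute https URLs, while lazy-load
    placeholders are data: URIs or other non-https values.
    """
    cleaned = (s.strip() for s in srcs)
    return [s for s in cleaned if s.startswith("https")]


srcs = [
    "https://static-01.daraz.com.bd/p/item.jpg",
    "data:image/gif;base64,R0lGODlhAQABAAAAACw=",
    " https://static-01.daraz.com.bd/p/item2.jpg",  # stray leading space
]
print(real_image_urls(srcs))
```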

Hey, this is actually a nice solution that doesn't require many changes to my code, and I need that scroll-down feature. But it still prints some placeholder images.

Hey, this works perfectly. Just one question, though: how can I tell which API is responsible for populating the page with products?