Web scraping problem with an infinite-scroll page in Python

python selenium-webdriver web-scraping


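I have to scrape an e-commerce website that loads 45 products on the first page and then loads another 45 products each time I scroll to the end of the page.

I am using Python with Selenium WebDriver to scrape this page.

Ajax appears to replace the container on every subsequent load, so I cannot extract all the data once all the products have been loaded.

My code is attached for reference. Please guide me on how to scrape all the products.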

import pandas
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://www.ajio.com/women-jackets-coats/c/830316012")
assert 'Ajio' in driver.title
content = driver.find_elements(By.CLASS_NAME, 'item')
# The counter text contains the label "Items Found"; drop the label and the thousands separators.
totalitems = int(driver.find_element(By.CLASS_NAME, 'length').text.replace('Items Found', '').replace(',', '').strip())

# Estimated number of scrolls, assuming every scroll loads as many items as the first page.
loop_count = int((totalitems - len(content)) / len(content))

print(loop_count)

data = []
row = ['Brand', 'Description', 'Offer_Price', 'Original_Price', 'Discount']
data.append(row)

for i in range(1, loop_count):
    # Re-read the items that are present in the DOM at this point.
    content = driver.find_elements(By.CLASS_NAME, 'item')
    print(i)
    print(len(content))

    for item in content:
        row = []
        row.append(item.find_element(By.CLASS_NAME, 'brand').text.strip())
        row.append(item.find_element(By.CLASS_NAME, 'name').text.strip())
        row.append(item.find_element(By.CLASS_NAME, 'price').text.strip().strip('Rs. '))
        try:
            row.append(item.find_element(By.CLASS_NAME, 'orginal-price').text.strip('Rs. '))
        except NoSuchElementException:
            # No separate original price: fall back to the offer price.
            row.append(item.find_element(By.CLASS_NAME, 'price').text.strip('Rs. '))

        try:
            row.append(item.find_element(By.CLASS_NAME, 'discount').text.strip())
        except NoSuchElementException:
            row.append("No Discount")

        data.append(row)

    # Scroll close to the bottom of the page to trigger loading of the next batch.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight-850);")
    try:
        # Wait up to 30 seconds for the loader element to be present in the DOM.
        myElem = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CLASS_NAME, 'loader')))
    except TimeoutException:
        print("Loading took too much time!")

df = pandas.DataFrame(data)
df.to_csv(r"C:\Ajio.csv", sep=',', index=False, header=False, mode='w')   # mode='a' for append

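It sounds like the problem you are facing is that the data you scrape is inconsistent across subsequent reloads/scrolls.

One solution is to keep a data structure, scoped above this function, that records the items you have seen so far. Each time the page reloads or scrolls, check whether each item is already in that structure and add it if it is not, and keep going until you are sure you have hit every possible item on the page.

Good luck.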
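The idea above could be sketched roughly as follows. This is only an illustration, not tested against the site: the names `merge_new_rows`, `collect_all`, `scrape_visible`, `scroll_down` and `max_idle_rounds` are hypothetical, and the sketch assumes that a whole row (brand, description, prices, discount) is enough to identify a product.

```python
def merge_new_rows(seen, rows):
    """Add rows that are not yet in `seen`; return how many were new."""
    added = 0
    for row in rows:
        key = tuple(row)          # assumption: the full row identifies a product
        if key not in seen:
            seen[key] = row
            added += 1
    return added


def collect_all(scrape_visible, scroll_down, max_idle_rounds=3):
    """Scrape, scroll, repeat until several rounds in a row yield nothing new.

    scrape_visible: callable returning the rows currently present in the DOM
    scroll_down:    callable that scrolls the page (and waits for loading)
    """
    seen = {}                     # lives across scrolls, so replaced items are kept
    idle_rounds = 0
    while idle_rounds < max_idle_rounds:
        added = merge_new_rows(seen, scrape_visible())
        idle_rounds = 0 if added else idle_rounds + 1
        scroll_down()
    return list(seen.values())
```

With Selenium, `scrape_visible` would wrap the inner `for item in content` loop from the question and `scroll_down` would wrap the `driver.execute_script(...)` call plus the wait. Stopping after a few rounds without new items avoids relying on the estimated `loop_count`.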

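Hi, it looks like the data is not fully available even when I scrape all the items after scrolling all the way down in the loop. I am fine with getting duplicate records, because I can remove the duplicates in Excel. When I scroll down and press Ctrl+F to search for the first item, I cannot find it; the data seems to be reloaded inside the div. Similarly, when I scroll back to the top of the page, I cannot find the last item (product). I am not a web developer, so I do not know what this technique is called. It would help if someone could identify it and point me in the right direction.

I checked again: if I turn off the network/internet and then scroll the page up or down, all the previously downloaded data still shows up on screen, which means the data is held somewhere in the page and filled in on screen. Or does it come from the cache? I have not found a solution yet.

I have managed to solve this. Instead of using Selenium, I used requests to call the API, which returns JSON, and then read the JSON. That solved my scraping problem. Reading JSON is much faster than reading the website through Selenium, and I pass the page number as a parameter to the JSON API to load the next page of data.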
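The comment does not give the API URL or the JSON layout, so the following is only a sketch of the paging approach under stated assumptions: the endpoint `https://example.com/api/products`, the `page` query parameter, the `products` key and zero-based page numbering are all placeholders. The real values would have to be read from the browser's network tab.

```python
def fetch_page_json(page, base_url="https://example.com/api/products"):
    """Fetch one page of results as a dict (placeholder URL and parameter name)."""
    # Third-party dependency, imported here so the paging logic below
    # can be exercised without it.
    import requests

    response = requests.get(base_url, params={"page": page}, timeout=30)
    response.raise_for_status()
    return response.json()


def fetch_all_products(fetch_page, products_key="products", max_pages=1000):
    """Call fetch_page(0), fetch_page(1), ... until a page has no products."""
    products = []
    for page in range(max_pages):
        payload = fetch_page(page)
        batch = payload.get(products_key, [])   # assumed key holding the item list
        if not batch:
            break                                # empty page: nothing left to load
        products.extend(batch)
    return products
```

Calling `fetch_all_products(fetch_page_json)` would then return one list with every product, which can be passed to `pandas.DataFrame(...)` and written out with `to_csv` as in the question.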