Python Selenium infinite scroll - re-scraping

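I have built a script that uses Selenium, and it works well, but the site I am scraping loads content infinitely, so I built in something to handle that.

However, every time it scrolls down, it re-scrapes the data it has already scraped.

How can I change the script so that it only scrapes data that has not been scraped yet?

I have seen some similar questions and added some code based on them, but I think my situation is slightly different.

Thanks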

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
import time
import os
import csv

browser = webdriver.Chrome(executable_path="/chromedriver")
browser.get("***url***")

filename ="fileName.csv"
f = open(filename, 'w')
headers = "Title,Date,Time\n"
f.write(headers)

browser.find_element_by_css_selector('').click()
time.sleep(3)
page = browser.find_elements_by_class_name('')

# Get scroll height
last_height = browser.execute_script("return document.body.scrollHeight")

t_end = time.time() + 60
while time.time() < t_end:
    try:

        for items in page:

            title = items.find_element_by_class_name('').text.replace(',', '|')
            date = items.find_element_by_class_name('').text

            print('Name:',title)
            print('Date:',date)
            print("")

            f.write(title + "," + date.split(" ")[0] + "," + date.split(" ")[1] + "\n")

        # Scroll down to bottom
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(5)
        page = browser.find_elements_by_class_name('')

    except:

        break

f.close()

browser.quit()

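Here is an example that keeps scrolling until all the dynamically loaded rows are present, and only then scrapes the page. Make sure to add this import:

import time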

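# Assumes `driver` is an already-created WebDriver instance.
driver.get("https://anilist.co/user/Agusmaris/animelist/Completed")
time.sleep(3)
footer = driver.find_element_by_css_selector("div.footer")
preY = 0
# Keep bringing the footer into view until its position stops changing,
# which means no further rows are being loaded.
while footer.rect['y'] != preY:
    preY = footer.rect['y']
    footer.location_once_scrolled_into_view  # reading this property triggers the scroll
    time.sleep(1)
print(driver.page_source)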
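
If the page is too long to load completely before scraping, another option is to keep scraping while scrolling and simply skip rows that have already been written. The following is only a rough sketch of that idea, not a drop-in fix: `unseen_rows`, `scrape_incrementally` and the `read_rows` callback are hypothetical names introduced here, and `read_rows(browser)` is assumed to return one `(title, date, time)` tuple for every item currently in the DOM. The only WebDriver call it relies on is `execute_script`, which the question's script already uses.

```python
import time


def unseen_rows(rows, seen):
    """Yield only the rows not already in `seen`, recording each as it is yielded."""
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row


def scrape_incrementally(browser, read_rows, writer, timeout=60, pause=5):
    """Scroll and scrape until the page stops growing or `timeout` seconds pass.

    browser   -- a WebDriver-like object exposing execute_script()
    read_rows -- hypothetical callback: read_rows(browser) -> iterable of tuples
    writer    -- any object with a writerow() method, e.g. csv.writer(f)
    Returns the number of distinct rows written.
    """
    seen = set()
    deadline = time.time() + timeout
    last_height = browser.execute_script("return document.body.scrollHeight")

    while time.time() < deadline:
        # Write only the rows that were not written on an earlier pass.
        for row in unseen_rows(read_rows(browser), seen):
            writer.writerow(row)

        # Scroll to the bottom and give the site time to load more items.
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)

        # If the document height did not change, assume nothing more will load.
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # Final pass in case the timeout expired right after a scroll.
    for row in unseen_rows(read_rows(browser), seen):
        writer.writerow(row)
    return len(seen)
```

In the question's script, `read_rows` would wrap the existing `find_elements_by_class_name` loop and return tuples instead of printing, and `writer` could be `csv.writer(f)`, which also removes the need to replace commas in the title by hand. Note that this sketch treats two items with identical title, date and time as duplicates; if the site can legitimately repeat rows, a unique attribute such as an element id or link would be a safer key.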

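Can you share the link? I think I have a solution, but I need to test it on something.

Why can't you just keep scrolling down until the whole page has loaded, and then scrape it?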