Python: Can I pause the scroll function in Selenium, scrape the current data, and then continue scrolling later in the script?

python selenium selenium-webdriver web-scraping

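I'm a student working on a scraping project, and I'm having trouble finishing my script because it fills up my computer's memory with all the data it stores.

It currently holds all of my data until the very end, so my solution is to break the scrape into smaller chunks and write the data out periodically, so that it doesn't keep building one big list and only write it out at the end.

To do this, I need to stop my scroll method, scrape the loaded profiles, write out the data I've collected, and then repeat the process without duplicating data. I'd appreciate it if someone could show me how to do this. Thanks for your help :)

Here is my current code: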

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep
from selenium.common.exceptions import NoSuchElementException


Data = []

driver = webdriver.Chrome()
driver.get("https://directory.bcsp.org/")
count = int(input("Number of Pages to Scrape: "))

body = driver.find_element_by_xpath("//body") 
profile_count = driver.find_elements_by_xpath("//div[@align='right']/a")

while len(profile_count) < count:   # Get links up to "count"
    body.send_keys(Keys.END)
    sleep(1)
    profile_count = driver.find_elements_by_xpath("//div[@align='right']/a")

for link in profile_count:   # Calling up links
    temp = link.get_attribute('href')   # temp for
    driver.execute_script("window.open('');")   # open new tab
    driver.switch_to.window(driver.window_handles[1])   # focus new tab
    driver.get(temp)

    # scrape code

    Name = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[1]/div[2]/div').text
    IssuedBy = "Board of Certified Safety Professionals"
    CertificationorDesignaationNumber = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[1]/td[3]/div[2]').text
    CertfiedorDesignatedSince = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[3]/td[1]/div[2]').text
    try:
        AccreditedBy = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[3]/div[2]/a').text

    except NoSuchElementException:
        AccreditedBy = "N/A"

    try:
        Expires = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[1]/div[2]').text

    except NoSuchElementException:
        Expires = "N/A"

    info = Name, IssuedBy, CertificationorDesignaationNumber, CertfiedorDesignatedSince, AccreditedBy, Expires + "\n"

    Data.extend(info)
    driver.close()
    driver.switch_to.window(driver.window_handles[0])


with open("Spredsheet.txt", "w") as output:
    output.write(','.join(Data))

driver.close()

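Try the following approach using requests and BeautifulSoup. The script below uses the search URL that the website itself requests.

On the first iteration it builds the URL (the first URL the script prints) and adds the headers and the data to the .csv file.

On the second iteration it builds the URL again (the second URL the script prints) with two additional parameters, start_on_page=20 and show_per_page=100. start_on_page is incremented by 20 on every iteration, and show_per_page is fixed at 100 so that each iteration extracts 100 records, and so on until all of the data has been dumped to the .csv file.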
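The URL construction described above can be sketched with the standard library. This is only a minimal sketch, not the answer's own code: `build_page_url` and `BASE_PARAMS` are hypothetical names, the parameter names `start_on_page` and `show_per_page` are taken from the answer's script, and how the site actually interprets them is not verified here.

```python
from urllib.parse import urlencode

MAIN_URL = "https://directory.bcsp.org/search_results.php?"

# Search filters copied from the answer's script (all blank or 0).
BASE_PARAMS = {
    "first_name": "", "last_name": "", "city": "", "state": "", "country": "",
    "certification": "", "unauthorized": 0, "retired": 0,
    "specialties": "", "industries": "",
}


def build_page_url(iteration, step=20, page_size=100):
    """Hypothetical helper: build the search URL for a given iteration.

    Iteration 0 sends no paging parameters; later iterations add
    start_on_page (iteration * step) and show_per_page (page_size).
    """
    params = {}
    if iteration > 0:
        params["start_on_page"] = iteration * step
        params["show_per_page"] = page_size
    params.update(BASE_PARAMS)
    return MAIN_URL + urlencode(params)
```

Calling `build_page_url(0)` yields the plain search URL, and `build_page_url(1)` prepends `start_on_page=20&show_per_page=100` to the same filters. No request is sent; the helper only builds strings.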

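The script dumps four things: number, name, location, and profile URL.

On every iteration the data is appended to the .csv file, so this approach should solve your memory problem.
Before running the script, don't forget to set the file_path variable to the system path where you want the .csv file to be created.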
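The append-per-iteration idea can be isolated in a few lines. The following is a minimal sketch under stated assumptions: `append_rows` is a hypothetical helper, the column names are the ones the answer's script uses, and the rows in the demo are made-up placeholders rather than real directory data.

```python
import csv
import os
import tempfile

CSV_HEADERS = ["#", "Name", "Location", "Profile URL"]


def append_rows(csv_path, rows):
    """Append a batch of row dicts; write the header only if the file is new or empty."""
    needs_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=CSV_HEADERS)
        if needs_header:
            writer.writeheader()
        writer.writerows(rows)


# Demo with placeholder rows, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "credential_list.csv")
    append_rows(demo_path, [{"#": 1, "Name": "Example Person", "Location": "Example City",
                             "Profile URL": "https://example.com/profile/1"}])
    append_rows(demo_path, [{"#": 2, "Name": "Another Person", "Location": "Other City",
                             "Profile URL": "https://example.com/profile/2"}])
    with open(demo_path, newline="", encoding="utf-8") as handle:
        demo_lines = handle.read().splitlines()
```

Because each batch is written and then discarded by the caller, only one batch needs to be held in memory at a time.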

import csv

import requests
from bs4 import BeautifulSoup as bs
from urllib3.exceptions import InsecureRequestWarning

# The requests below skip certificate verification, so silence the warning that triggers.
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)


def scrap_directory_data():
    file_path = ''  # directory in which the .csv file should be created
    file_name = 'credential_list.csv'
    count = 0  # number of pages fetched so far
    page_number = 0  # value sent as start_on_page
    page_size = 100  # value sent as show_per_page
    main_url = 'https://directory.bcsp.org/search_results.php?'
    first_iteration_url = 'first_name=&last_name=&city=&state=&country=&certification=&unauthorized=0&retired=0&specialties=&industries='
    number_of_records = 0
    csv_headers = ['#', 'Name', 'Location', 'Profile URL']

    while True:
        if count == 0:
            create_url = main_url + first_iteration_url
            print('-' * 100)
            print('1 iteration URL created: ' + create_url)
            print('-' * 100)
        else:
            create_url = main_url + 'start_on_page=' + str(page_number) + '&show_per_page=' + str(page_size) + '&' + first_iteration_url
            print('-' * 100)
            print('Other then first iteration URL created: ' + create_url)
            print('-' * 100)

        page = requests.get(create_url, verify=False)
        extracted_text = bs(page.text, 'lxml')
        result = extracted_text.find_all('tr')

        # The first <tr> is the table header. Assumption: a page past the end of
        # the results has no data rows, so that is where the loop stops.
        if len(result) <= 1:
            break

        # Collect only the current page's rows.
        list_of_credentials = []
        for data in result[1:]:
            number_of_records += 1
            name = data.contents[1].text
            location = data.contents[3].text
            profile_url = data.contents[5].contents[0].attrs['href']
            list_of_credentials.append({
                '#': number_of_records,
                'Name': name,
                'Location': location,
                'Profile URL': profile_url
            })

        # Append this page to the file; the list is rebuilt on the next iteration.
        with open(file_path + file_name, 'a+', newline='') as cred_CSV:
            csvwriter = csv.DictWriter(cred_CSV, delimiter=',', lineterminator='\n', fieldnames=csv_headers)
            if count == 0:
                print('Writing CSV header now...')
                csvwriter.writeheader()
            print('Writing data rows now..')
            csvwriter.writerows(list_of_credentials)

        count += 1
        page_number += 20


scrap_directory_data()
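If you would rather stay with the Selenium scroll loop from the question, the "without duplicating data" part could be handled by remembering which profile links have already been processed. This is a sketch only: `take_new_links` is a hypothetical helper, and the Selenium calls appear in comments as an assumption about how it would be wired in, not as tested code.

```python
def take_new_links(loaded_hrefs, seen):
    """Return the hrefs that have not been processed yet, in page order.

    `seen` is a set that is updated in place, so calling this again after
    another scroll returns only the links that appeared since the last call.
    """
    fresh = []
    for href in loaded_hrefs:
        if href not in seen:
            seen.add(href)
            fresh.append(href)
    return fresh


# Assumed usage inside the scroll loop (not run here):
#   seen = set()
#   body.send_keys(Keys.END); sleep(1)
#   anchors = driver.find_elements(By.XPATH, "//div[@align='right']/a")
#   batch = take_new_links([a.get_attribute('href') for a in anchors], seen)
#   ...scrape each URL in `batch`, append the rows to the file, then scroll again.

# Simulated scrolls: each list is what the page would show after a scroll.
seen_links = set()
first_batch = take_new_links(["p1", "p2"], seen_links)
second_batch = take_new_links(["p1", "p2", "p3", "p4"], seen_links)
```

The set holds only URL strings, while the scraped fields for each batch can be written out and dropped before the next scroll.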