How to scrape all 25 comics from all pages within pagination using Selenium and Python


I am working with this website.

If you scroll all the way down, you will find a Browse Comics section
with pagination.

I want to scrape all 25 comics from pages 1-5.

This is the code I currently have:

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

class Scraper():
    comics_url = "https://www.dccomics.com/comics"
    driver = webdriver.Chrome("C:\\laragon\\www\\Proftaak\\chromedriver.exe")
    # driver = webdriver.Chrome("C:\\laragon\\www\\proftaak-2020\\proftaak scraper\\chromedriver.exe")
    driver.get(comics_url)
    driver.implicitly_wait(500)
    current_page = 2

    def GoToComic(self):
        for i in range(1, 26):
            time.sleep(2)
            goToComic = self.driver.find_element_by_xpath(f'//*[@id="dcbrowseapp"]/div/div/div/div[3]/div[3]/div[2]/div[{i}]/a/img')
            self.driver.execute_script("arguments[0].click();", goToComic)
            self.ScrapeComic()
            self.driver.back()
            self.ClearFilter()
            if self.current_page != 6:
                if i == 25:
                    self.current_page += 1
                    self.ToNextPage()

    def ScrapeComic(self):
        self.driver.implicitly_wait(250)
        title = [my_elem.text for my_elem in WebDriverWait(self.driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'page-title')]")))]
        price = [my_elem.text for my_elem in WebDriverWait(self.driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'buy-container-price')]/span[contains(@class, 'price')]")))]
        available = [my_elem.text for my_elem in WebDriverWait(self.driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'sale-status-container')]/span[contains(@class, 'sale-status')]")))]
        try:
            description = [my_elem.text for my_elem in WebDriverWait(self.driver, 5).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, 'field-items')))]
        except:
            return

    def ToNextPage(self):
        if self.current_page != 6:
            nextPage = self.driver.find_element_by_xpath(f'//*[@id="dcbrowseapp"]/div/div/div[3]/div[3]/div[1]/ul/li[{self.current_page}]/a')
            self.driver.execute_script("arguments[0].click();", nextPage)
            self.GoToComic()

    def AcceptCookies(self):
        self.driver.implicitly_wait(250)
        cookies = self.driver.find_element_by_xpath('/html/body/div[1]/div[2]/div[4]/div[2]/div/button')
        self.driver.execute_script("arguments[0].click();", cookies)
        self.driver.implicitly_wait(100)

    def ClearFilter(self):
        self.driver.implicitly_wait(500)
        clear_filter = self.driver.find_element_by_class_name('clear-all-action')
        self.driver.execute_script("arguments[0].click();", clear_filter)

    def QuitDriver(self):
        self.driver.quit()

scraper = Scraper()
scraper.AcceptCookies()
scraper.ClearFilter()
scraper.GoToComic()
scraper.QuitDriver()
Right now it scrapes the first 25 comics of the first page just fine, but the problem arises when I go to the second page: it scrapes the first comic of page 2 fine, but when I navigate back from that comic to the list, the filter resets and it starts again from page 1.


How can I make it resume from the correct page, or make sure the filter is always cleared before returning to the comics page? I have tried things like sessions/cookies, but it seems the filter state is not saved in any way.

The browser back function takes you to the previously visited URL. On the site you mention, all pages live under a single URL (it looks like they are loaded into the same page by JS, so a new comics page does not need a new URL).

That is why, when you come back from the first comic of the second page, you simply reload
https://www.dccomics.com/comics
which loads the first page by default.

I can also see that there is no dedicated control to go back from the comic details to the list.

So the only way is to store the number of the current page somewhere in your code, and switch back to that specific page after returning from the comic details page.
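A minimal sketch of that idea, assuming `driver` is a Selenium WebDriver and that the pagination list matches `//ul[@class='pagination']` (the same locator used in the snippet below; verify it against the live markup):

```python
import time

def pagination_link_xpath(page_number):
    """Build the XPath of the pagination link for a 1-based page number.
    The 'pagination' class name is an assumption about the site's markup."""
    return f"//ul[@class='pagination']//li[{page_number}]/a"

def return_to_page(driver, page_number):
    """After driver.back() has reset the list to page 1, re-click the
    pagination link for the page we were on before opening the comic."""
    if page_number <= 1:
        return  # page 1 is what the back navigation lands on anyway
    link = driver.find_element_by_xpath(pagination_link_xpath(page_number))
    driver.execute_script("arguments[0].click();", link)
    time.sleep(2)  # crude wait for the JS-driven list to re-render
```

Calling `return_to_page(driver, self.current_page)` right after `self.driver.back()` would replace the `ClearFilter()` call in the question's loop.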

The Browse Comics section of the web page does not have 5 pages within the pagination, only 3 pages in total. To iterate through them and scrape the name of each comic, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following locator strategy:

  • Code block:

    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import TimeoutException, ElementClickInterceptedException
    import time
    
    options = webdriver.ChromeOptions() 
    options.add_argument("start-maximized")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    driver.get('https://www.dccomics.com/comics')
    while True:
        try:
            time.sleep(5)
            print([my_elem.text for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class, 'browse-result')]/a//p[not(contains(@class, 'result-date'))]")))])
            WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//ul[@class='pagination']//li[@class='active']//following::li[1]/a"))).click()
            print("Navigating to the next page")
        except (TimeoutException, ElementClickInterceptedException):
            print("No more pages to browse")
            break
    driver.quit()
    
  • Console output:

    ['PRIMER', 'DOOMSDAY CLOCK PART 2', 'CATWOMAN #22', 'ACTION COMICS #1022', 'BATMAN/SUPERMAN #9', 'BATMAN: GOTHAM NIGHTS #7', 'BATMAN: THE ADVENTURES CONTINUE #5', 'BIRDS OF PREY #1', 'CATWOMAN 80TH ANNIVERSARY 100-PAGE SUPER SPECTACULAR #1', 'DC GOES TO WAR', "DCEASED: HOPE AT WORLD'S END #2", 'DETECTIVE COMICS #1022', 'FAR SECTOR #6', "HARLEY QUINN: MAKE 'EM LAUGH #1", 'HOUSE OF WHISPERS #21', 'JOHN CONSTANTINE: HELLBLAZER #6', 'JUSTICE LEAGUE DARK #22', 'MARTIAN MANHUNTER: IDENTITY', 'SCOOBY-DOO, WHERE ARE YOU? #104', 'SHAZAM! #12', 'TEEN TITANS GO! TO CAMP #15', 'THE JOKER: 80 YEARS OF THE CLOWN PRINCE OF CRIME THE DELUXE EDITION', 'THE LAST GOD: TALES FROM THE BOOK OF AGES #1', 'THE TERRIFICS VOL. 3: THE GOD GAME', 'WONDER WOMAN #756']
    Navigating to the next page
    ['YOUNG JUSTICE VOL. 2: LOST IN THE MULTIVERSE', 'AMETHYST #3', 'BATMAN #92', 'DC CLASSICS: THE BATMAN ADVENTURES #1', 'DC COMICS: THE ASTONISHING ART OF AMANDA CONNER', 'DIAL H FOR HERO VOL. 2: NEW HEROES OF METROPOLIS', 'HARLEY QUINN #73', "HARLEY QUINN: MAKE 'EM LAUGH #2", 'JUSTICE LEAGUE #46', 'JUSTICE LEAGUE ODYSSEY #21', 'LEGION OF SUPER-HEROES #6', 'LOIS LANE #11', 'NIGHTWING #71', 'TEEN TITANS GO! TO CAMP #16', "THE BATMAN'S GRAVE #7", 'THE FLASH #755', 'THE FLASH VOL. 12: DEATH AND THE SPEED FORCE', 'THE JOKER 80TH ANNIVERSARY 100-PAGE SUPER SPECTACULAR #1', 'YEAR OF THE VILLAIN: HELL ARISEN', 'YOUNG JUSTICE #15', 'SUPERMAN #22', 'BATMAN SECRET FILES #3', 'WONDER WOMAN: TEMPEST TOSSED', 'HAWKMAN #24', 'JOKER: THE DELUXE EDITION']
    Navigating to the next page
    ['METAL MEN #7', 'NIGHTWING ANNUAL #3', 'BATGIRL VOL. 7: ORACLE RISING', 'BATMAN & THE OUTSIDERS #13', 'BATMAN: GOTHAM NIGHTS #9', 'CATWOMAN VOL. 3: FRIEND OR FOE?', 'DAPHNE BYRNE #5', "DCEASED: HOPE AT WORLD'S END #3", 'STRANGE ADVENTURES #2', 'THE FLASH ANNUAL (REBIRTH) #3', 'THE GREEN LANTERN SEASON TWO #4', 'THE QUESTION: THE DEATHS OF VIC SAGE #3', 'WONDER WOMAN #757', 'WONDER WOMAN: AGENT OF PEACE #6', 'WONDER WOMAN: DEAD EARTH #3', 'DARK NIGHTS: DEATH METAL #1', 'YOU BROUGHT ME THE OCEAN']
    No more pages to browse
    

Yes, this does seem to scrape the pagination + titles correctly, but will this still work when I click into a comic and scrape the price + status + description, return to the web page, and keep scraping each comic individually until all 25 comics are scraped, then move on to the next page and repeat the process? — @nielsvanhoof That is achievable, but it would require additional lines of code and more structured logic.
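One way that more structured logic could be sketched, under clear assumptions: `driver` is a Selenium WebDriver already on https://www.dccomics.com/comics, the pagination and thumbnail XPaths are borrowed from the answer's snippet above, and `scrape_detail` is a hypothetical caller-supplied function standing in for the price/status/description scraping. Only the visit-order helper is pure logic; the driver-facing part is untested:

```python
import time

def comic_visit_order(pages, per_page):
    """Yield (page_number, comic_index) pairs in scraping order:
    every comic on page 1, then every comic on page 2, and so on."""
    for page in range(1, pages + 1):
        for idx in range(1, per_page + 1):
            yield page, idx

def select_page(driver, page):
    """Click the pagination link for `page` (locator assumed from the
    answer's snippet; li indices may need adjusting on the real site)."""
    link = driver.find_element_by_xpath(
        f"//ul[@class='pagination']//li[{page}]/a")
    driver.execute_script("arguments[0].click();", link)
    time.sleep(2)  # crude wait for the JS-rendered list

def scrape_all(driver, scrape_detail, pages=3, per_page=25):
    """Outline of the full crawl: for each comic, re-select its page
    (back() always resets the list to page 1), open it, scrape, go back."""
    for page, idx in comic_visit_order(pages, per_page):
        if page != 1:
            select_page(driver, page)
        thumb = driver.find_element_by_xpath(
            f"(//div[contains(@class, 'browse-result')]/a)[{idx}]")
        driver.execute_script("arguments[0].click();", thumb)
        scrape_detail(driver)  # hypothetical detail-page scraper
        driver.back()
        time.sleep(2)
```

Separating the visit order from the browser actions keeps the "which comic comes next" bookkeeping testable without a browser.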