Python Scrapy does not click through all pages

python selenium scrapy

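I am crawling an online shop with Scrapy. The products are loaded dynamically, which is why I use Selenium to step through the pages. I start by scraping all the categories, and each category is then handed to the main parsing function.

The problem shows up while crawling each category: the spider is supposed to scrape all the data from the first page and then click the button to go to the next page, until no button is left. If I put a single category URL into start_urls, the code works fine. Strangely, when I run it as part of the main code, it does not click through all the pages. It switches to a new category at a seemingly random point, before it has finished clicking all the next buttons.

I have no idea why this happens.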

import scrapy
from scrapy.http import TextResponse
from horni.items import HorniItem

from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By


class horniSpider(scrapy.Spider):
    name = "final"
    allowed_domains = ["example.com"]
    start_urls = ['https://www.example.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # One shared browser instance renders the dynamically loaded pages.
        self.driver = webdriver.Chrome()

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes (shortcut for the
        # spider_closed signal), so the browser is shut down here.
        self.driver.quit()

    def parse(self, response):
        # Follow every main-category link found on the start page.
        for href in response.xpath('//li[@class="sub"]/a/@href'):
            yield scrapy.Request(response.urljoin(href.extract()),
                                 callback=self.parse_subcategories)

    def parse_subcategories(self, response):
        # Follow every subcategory link; each one is paginated in parse_articles.
        for href in response.xpath('//li[@class="sub"]/a/@href'):
            yield scrapy.Request(response.urljoin(href.extract()),
                                 callback=self.parse_articles)

    def rendered_response(self):
        # Wrap the browser's current DOM so Scrapy selectors can be used on it.
        return TextResponse(url=self.driver.current_url,
                            body=self.driver.page_source, encoding='utf-8')

    def extract_articles(self, response):
        # Collect ids, prices and names from the rendered page and pair them up.
        ids = response.xpath('//a[@class="title-link"]/@href').extract()
        prices = response.xpath('//span[@class="price ng-binding"]/text()').extract()
        articles = response.xpath('//a[@class="title-link"]/span[normalize-space()]/text()').extract()
        ids = [i.split('/')[-2] for i in ids]
        prices = [x for x in prices if x != u'\xa0']
        articles = [w.replace(u'\n', '') for w in articles]
        for id_, price, article in zip(ids, prices, articles):
            item = HorniItem()
            item['id'] = id_
            item['price'] = price
            item['name'] = article
            yield item

    def parse_articles(self, response):
        # Load the category page in the browser so its JavaScript runs.
        self.driver.get(response.url)
        response = self.rendered_response()
        item = HorniItem()
        item['title'] = response.xpath('//div[@id="article-list-headline"]/div/h1/text()').extract()
        yield item
        yield from self.extract_articles(response)

        # Click "next" until the button is gone; the URL itself never changes.
        while True:
            try:
                next_button = self.driver.find_element(
                    By.XPATH, '//div[@class="paging-wrapper"]/a[@class="paging-btn right"]')
                next_button.click()
            except WebDriverException:
                # No (clickable) next button left: this category is finished.
                break
            yield from self.extract_articles(self.rendered_response())

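Update

So the problem seems to lie in the DOWNLOAD_DELAY setting. Because the "next" button on the site does not actually produce a new URL and only executes JavaScript, the site URL does not change.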
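I found an answer:

The problem is that, because the page content is generated dynamically, clicking the NEXT button does not actually change the URL. Combined with the project's DOWNLOAD_DELAY setting, this means the spider stays on a page for a fixed amount of time, regardless of whether it managed to click every available NEXT button.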
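
To make the click-until-no-button behaviour concrete, here is a minimal sketch of such a loop, written so that it only needs an object that behaves like a Selenium driver. The names `find_next_button`, `iter_rendered_pages` and `max_pages` are hypothetical helpers introduced for illustration; the XPath is the one from the question. It is only a sketch under those assumptions, not the spider's actual fix.

```python
NEXT_XPATH = '//div[@class="paging-wrapper"]/a[@class="paging-btn right"]'


def find_next_button(driver):
    """Return the NEXT button element, or None when no such button exists."""
    # Imported lazily so the loop below can also be tried with a stub driver.
    from selenium.webdriver.common.by import By

    # find_elements returns an empty list instead of raising when nothing matches.
    buttons = driver.find_elements(By.XPATH, NEXT_XPATH)
    return buttons[0] if buttons else None


def iter_rendered_pages(driver, find_next=find_next_button, max_pages=500):
    """Yield (url, html) for the current page and every page reached via NEXT."""
    yield driver.current_url, driver.page_source
    for _ in range(max_pages):  # hypothetical hard cap against an endless loop
        button = find_next(driver)
        if button is None:
            return  # last page reached
        button.click()
        # The URL is expected to stay the same; only the rendered HTML changes.
        yield driver.current_url, driver.page_source
```

Every tuple yielded for one category carries the same URL, which is the point made above: from Scrapy's side nothing distinguishes page 1 from page 5. In a real spider an explicit wait after `click()` would probably be needed so that the new products have rendered before `page_source` is read.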
Setting DOWNLOAD_DELAY high enough lets the spider stay on each URL long enough to scrape every page.
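
As a rough illustration of that workaround, the setting can be raised in the project's settings.py. The value of 10 seconds is an assumption chosen for the example and does not come from the original post; a suitable value would depend on how many pages a category has.

```python
# settings.py (fragment)

# Seconds Scrapy waits between consecutive requests to the same site.
# 10 is an illustrative value only, not taken from the original post.
DOWNLOAD_DELAY = 10

# By default Scrapy randomizes the wait to between 0.5x and 1.5x of
# DOWNLOAD_DELAY; disabling that keeps the delay fixed.
RANDOMIZE_DOWNLOAD_DELAY = False
```

The same keys can also be placed in a spider's `custom_settings` dictionary if the delay should apply to this spider only.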
The drawback is that this forces the spider to wait the configured time on every URL, even when there is no NEXT button to click.