Python: How to determine whether the generator returned from 'yield scrapy.Request' yields any data?

In the example spider below, the spider extracts the next-page link from the class="next" element and crawls it:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
In my case, I cannot find a next-page link anywhere in the file downloaded from the web server, but I know the URL format is response.url concatenated with /page/[page number]/. For example, requesting a page that yields no quotes still returns a response. Since there are usually fewer than 20 pages, I could replace the spider's last 3 lines with:

for page_num in range(2, 20):
    yield response.follow(f"/page/{page_num}/", callback=self.parse)
However, this forces the spider to request pages that yield no quotes (e.g. all the way up to page 20). How can I adjust the spider so that it terminates the page_num loop after requesting the first page that yields no quotes?

Pseudocode:

    page_num = 2
    while (quotes are yielded from the response):
        yield response.follow(f"/page/{page_num}/", callback=self.parse)
        page_num += 1
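As an aside on the title question, a generic way to test whether a generator yields anything is to pull its first item with next() and chain it back on. A minimal sketch, independent of Scrapy (the peek helper is illustrative, not part of any library):

```python
from itertools import chain

def peek(gen):
    """Return (has_items, equivalent_iterator) for a generator.

    next() consumes the first item, so chain() puts it back in front,
    leaving an iterator equivalent to the original.
    """
    sentinel = object()
    first = next(gen, sentinel)
    if first is sentinel:
        return False, iter(())
    return True, chain([first], gen)

has_items, quotes = peek(q for q in [])
print(has_items)                 # False: the generator was empty

has_items, quotes = peek(q for q in ["a", "b"])
print(has_items, list(quotes))   # True ['a', 'b']
```

Inside a Scrapy callback this is usually unnecessary, because the SelectorList returned by response.css() is list-like and its truthiness already tells you whether the page had any matches.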

You can use the result of response.css(...) as the condition for following the next page. In that case your code would look like this:

import re

import scrapy


def get_pagenumber_from_url(url):
    # Helper sketched in for completeness: extract the page number from a
    # URL like http://quotes.toscrape.com/page/3/, defaulting to 1.
    match = re.search(r'/page/(\d+)/', url)
    return int(match.group(1)) if match else 1


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        page_num = get_pagenumber_from_url(response.url)

        quotes_sel = response.css('div.quote')
        # quotes_sel is a SelectorList: non-empty (truthy) if the page
        # has item data, empty (falsy) if it does not.
        for quote in quotes_sel:
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Only follow the next page if this page yielded quotes.
        if quotes_sel:
            next_page_url = f"/page/{page_num + 1}/"
            yield response.follow(next_page_url, callback=self.parse)


Thanks @Georgiy, great answer! One small point: if the page has no item data, quotes_sel is [] (an empty SelectorList), not None. Since an empty SelectorList is falsy, the if quotes_sel check still works as intended.
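The distinction can be checked without running a spider. A minimal plain-Python sketch, using a hypothetical stand-in for the real scrapy.selector.SelectorList (which likewise subclasses list):

```python
class SelectorList(list):
    """Hypothetical stand-in for scrapy.selector.SelectorList,
    which also subclasses list."""

empty = SelectorList()      # what a non-matching response.css() returns
assert empty == []          # it equals the empty list...
assert not empty            # ...and is falsy, so `if quotes_sel:` works
assert empty is not None    # ...but it is not None
```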