
Python: how do I stop a Scrapy spider after a certain number of requests?

Tags: python, python-2.7, loops, python-3.x, scrapy

I am working on a simple scraper to fetch 9GAG posts and their images, but due to some technical difficulty I can't stop the scraper; it keeps crawling, which I don't want. I want to increment a counter value and stop after 100 posts. But the 9GAG page is designed so that each response returns only 10 posts, and after each iteration my counter value resets to 10. In that case my loop runs forever and never stops.


The code for items.py is here:

from scrapy.item import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()
So I want to increment a global count value, and tried passing it as a third argument to the parse function, which gives this error:

TypeError: parse() takes exactly 3 arguments (2 given)
So, is there a way to pass a global count value, return it after each iteration, and stop after (say) 100 posts?

The whole project is available here. The infinite loop happens even if I set POST_LIMIT=100; see the command I run here:

scrapy crawl first -s POST_LIMIT=10 --output=output.json
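Note that a -s NAME=value flag sets a Scrapy setting rather than a spider attribute, so POST_LIMIT has no effect unless the spider reads it. A minimal sketch of how the spider could pick it up (the default of 100 is an assumption, not from the original project):

import scrapy


class FirstSpider(scrapy.Spider):
    name = "first"

    def parse(self, response):
        # Settings passed with -s are exposed on self.settings;
        # fall back to 100 if POST_LIMIT was not supplied (assumed default).
        post_limit = self.settings.getint('POST_LIMIT', 100)
        self.logger.debug('POST_LIMIT = %d', post_limit)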

The count variable is local to the parse() method, so it is not preserved between pages. Change every occurrence of count to self.count to make it an instance variable of the class, and it will persist across pages.
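For contrast, a minimal sketch of that change (the spider name here is illustrative, not from the original project):

import scrapy


class CounterSpider(scrapy.Spider):
    name = "counter_demo"
    count = 0  # initialized once, outside parse()

    def parse(self, response):
        # self.count persists across callbacks; a local `count = 0`
        # inside parse() would be reset to 0 on every response.
        self.count += 1
        self.logger.info('posts seen so far: %d', self.count)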

Spider arguments are passed through the crawl command using the -a option. First, use self.count and initialize it outside of parse(). Then, rather than preventing the parsing of items, only generate new requests while the count is below the limit. See the following code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):

    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )

    last_gag_id = None   # id of the last post seen, used to build the next page URL
    COUNT_MAX = 30       # stop requesting new pages after this many posts
    count = 0            # number of posts scraped so far

    def parse(self, response):

        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
            ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            self.count = self.count + 1
            yield ninegag_item

        # Request the next page only while we are under the post limit.
        if self.count < self.COUNT_MAX:
            next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)
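Since the answer mentions the -a option: a hedged sketch of how COUNT_MAX could be made a spider argument instead of a hard-coded class attribute (the count_max argument name is illustrative, not from the original project):

import scrapy


class FirstSpider(scrapy.Spider):
    name = "first"

    def __init__(self, count_max=30, *args, **kwargs):
        super(FirstSpider, self).__init__(*args, **kwargs)
        # Values passed with -a arrive as strings, e.g.
        #   scrapy crawl first -a count_max=100
        self.COUNT_MAX = int(count_max)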

There is a built-in setting, CLOSESPIDER_PAGECOUNT, that can be passed via the command-line -s argument or changed in settings:

scrapy crawl first -s CLOSESPIDER_PAGECOUNT=100

One small caveat: if you have caching enabled, cache hits are counted toward the page count as well.
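Since each response here carries only about 10 posts, the related CLOSESPIDER_ITEMCOUNT setting may map more directly onto "stop after 100 posts" than a page count:

scrapy crawl first -s CLOSESPIDER_ITEMCOUNT=100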

You can use custom_settings with CLOSESPIDER_PAGECOUNT, as shown below:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):

    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )
    last_gag_id = None

    COUNT_MAX = 30

    # Built-in CloseSpider extension setting: stop after COUNT_MAX pages.
    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': COUNT_MAX
    }

    def parse(self, response):

        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            yield ninegag_item

        # Request the next page; the CloseSpider extension stops the crawl
        # once CLOSESPIDER_PAGECOUNT pages have been downloaded.
        next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse)

Comments: Is there a way to find out how long the scrape takes to finish? Works very well, thanks! @Frank Wouldn't it be better to set the counter as an instance variable rather than a class variable?
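On the timing question: Scrapy's stats collector records a start_time for every crawl, so one hedged way to measure the run is the spider's closed() hook:

import datetime

import scrapy


class FirstSpider(scrapy.Spider):
    name = "first"

    def closed(self, reason):
        # 'start_time' is stored by the stats collector as a UTC datetime.
        start = self.crawler.stats.get_value('start_time')
        elapsed = datetime.datetime.utcnow() - start
        self.logger.info('Scrape finished (%s) after %s', reason, elapsed)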