Scrapy: How can I schedule a Scrapy spider to crawl again after a certain time?

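I want to schedule my spider to run again 1 hour after a crawl finishes. In my code, the spider_closed method is called when the crawl ends. How can I run the spider again from that method? Or is there a setting available for scheduling a Scrapy spider?

Here is my basic spider code: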

import scrapy
from scrapy import signals


class A2iSpider(scrapy.Spider):
    name = "notice"
    allowed_domains = ["prothom-alo.com"]

    # Read the start URLs once, when the class is defined; skip blank lines.
    with open("urls.txt") as f:
        start_urls = [url.strip() for url in f if url.strip()]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Connect spider_closed through the crawler's signal manager; the
        # scrapy.xlib.pydispatch dispatcher was removed from current Scrapy.
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            print("*" * 70)
            print(url)
            print("\n\n")
            yield scrapy.Request(url, callback=self.parse_page,
                                 meta={"depth": 2, "url": url})

    def parse_page(self, response):
        filename = "response.txt"
        depth = response.meta["depth"]

        # Append the depth and URL of every visited page.
        with open(filename, "a") as f:
            f.write(str(depth))
            f.write("\n")
            f.write(response.meta["url"])
            f.write("\n")

        for href in response.css("a::attr(href)").getall():
            url = response.urljoin(href)
            yield scrapy.Request(url, callback=self.parse_page,
                                 meta={"depth": depth + 1, "url": url})

    def spider_closed(self, spider):
        # Called once the crawl has finished.
        print("$" * 2000)

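You can use

crontab -e

to create a schedule in the current user's crontab (root's, if you run it as root), or

crontab -u [user] -e

to have the job run as a specific user.

At the bottom, you can add

0 * * * * cd /path/to/your/scrapy && scrapy crawl [your_spider] >> /path/to/log/scrapy_log.log

0 * * * *

runs the command at minute 0 of every hour. I'm sure you can find more details about the schedule syntax online.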
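
Note that a cron entry fires at fixed clock times, whereas the question asks for a delay counted from the end of the crawl. A minimal sketch of that alternative, assuming it is acceptable to keep a small wrapper script running: it starts the crawl as a subprocess, waits for it to exit, sleeps, and repeats. The function name `run_repeatedly`, its parameters, and the `run`/`sleep` hooks are hypothetical names chosen for illustration, not part of Scrapy.

```python
import subprocess
import time


def run_repeatedly(command, delay_seconds, max_runs=None,
                   run=subprocess.run, sleep=time.sleep):
    """Run `command`, wait for it to finish, sleep, and repeat.

    Returns the list of exit codes, one per run. With max_runs=None the
    loop never ends. `run` and `sleep` are injectable (an assumption made
    for this sketch) so the loop can be tried without a real crawl.
    """
    exit_codes = []
    runs = 0
    while max_runs is None or runs < max_runs:
        # subprocess.run blocks until the crawl process has exited.
        result = run(command)
        exit_codes.append(result.returncode)
        runs += 1
        if max_runs is None or runs < max_runs:
            # The delay is counted from the end of the crawl.
            sleep(delay_seconds)
    return exit_codes
```

Started from the Scrapy project directory, `run_repeatedly(["scrapy", "crawl", "notice"], delay_seconds=60 * 60)` would crawl, wait an hour after the crawl ends, and crawl again; `notice` is the spider name from the question.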

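You can run the spider with the JOBDIR setting, which saves the requests loaded in the scheduler to disk so that an interrupted crawl can be resumed: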

scrapy crawl somespider -s JOBDIR=crawls/somespider-1
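
JOBDIR persists the state of a single job rather than scheduling anything, so a repeated crawl likely wants a separate directory per run. A small sketch of building the command above for each run, assuming the `crawls/<spider>-<n>` naming shown in the example; `build_crawl_command`, `run_id`, and `jobs_root` are hypothetical names used only for illustration.

```python
def build_crawl_command(spider_name, run_id, jobs_root="crawls"):
    """Return the argument list for one crawl with its own JOBDIR."""
    # One job directory per run, e.g. crawls/somespider-1
    job_dir = f"{jobs_root}/{spider_name}-{run_id}"
    return ["scrapy", "crawl", spider_name, "-s", f"JOBDIR={job_dir}"]
```

The returned list can be passed to `subprocess.run` from the project directory; reusing the same `run_id` would resume that job instead of starting a fresh one.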