
Python: How to use APScheduler with Scrapy


I have this code that runs a Scrapy crawler from a script (), but it does not work:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from spiders.egov import EgovSpider
from scrapy.utils.project import get_project_settings

def run():
    spider = EgovSpider()
    settings = get_project_settings()
    crawler = Crawler(settings)
    # stop the reactor when the spider closes
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()  # set up the crawler (old, pre-1.0 Scrapy API)
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # blocks here; a stopped reactor cannot be restarted for later runs


from apscheduler.schedulers.twisted import TwistedScheduler
sched = TwistedScheduler()
sched.add_job(run, 'interval', seconds=10)
sched.start()
My spider:

import scrapy
from datetime import datetime


class EgovSpider(scrapy.Spider):
    name = 'egov'
    start_urls = ['http://egov-buryatia.ru/index.php?id=1493']

    def parse(self, response):
        data = response.xpath("//div[@id='main_wrapper_content_news']//tr//text()").extract()
        print data
        print response.url
        with open("vac.txt", "a") as f:
            for d in data:
                f.write(d.encode("UTF-8") + "\n")
            f.write(str(datetime.now()))  # assuming 'now' is meant to be the current time
If I move the reactor.run() line to the end of the script, the spider starts once after 10 seconds, but only once:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from spiders.egov import EgovSpider
from scrapy.utils.project import get_project_settings

def run():
    spider = EgovSpider()
    settings = get_project_settings()
    # reactor.stop on spider_closed shuts down the whole reactor after the
    # first crawl, which also kills the scheduler
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()  # set up the crawler (old, pre-1.0 Scrapy API)
    crawler.crawl(spider)
    crawler.start()
    log.start()

from apscheduler.schedulers.twisted import TwistedScheduler
sched = TwistedScheduler()
sched.add_job(run, 'interval', seconds=10)
sched.start()
reactor.run()

I don't have much experience with Python or English :) Please help me.

I ran into the same problem today. Here is some information.

The Twisted reactor cannot be restarted once it has been run and stopped. You should start one long-running reactor and add crawl jobs to it periodically.
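As a minimal sketch of why: the second reactor.run() below raises twisted.internet.error.ReactorNotRestartable, which is exactly what happens when the scheduler tries to run the crawl job a second time.

from twisted.internet import reactor

reactor.callLater(0, reactor.stop)  # schedule an immediate stop
reactor.run()   # runs, then stops
reactor.run()   # raises twisted.internet.error.ReactorNotRestartable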

To simplify the code further, you can use CrawlerProcess.start(), which includes reactor.run().


OK, so what is wrong with the last code snippet?

You said it started after 10 seconds, as it should. It only started once, though, not every 10 seconds.

If you are still looking for an answer, I wrote a blog post a while ago on how to achieve this. You also need version 0.24 for it to work.

How can I use multiple Scrapy spiders? And how can I run only one instance at a time? max_instances=1 does not seem to work: a new crawl starts even if the previous crawl is still running. (One way to handle both is sketched after the code below.)
from scrapy.crawler import CrawlerProcess
from spiders.egov import EgovSpider
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler

process = CrawlerProcess(get_project_settings())
sched = TwistedScheduler()
sched.add_job(process.crawl, 'interval', args=[EgovSpider], seconds=10)
sched.start()
process.start(stop_after_crawl=False)    # do not stop the reactor after the spider closes
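On the follow-up questions above: running a second spider is just another add_job call, and because process.crawl() only schedules the crawl and returns a Deferred immediately, APScheduler's max_instances never sees a "running" job, so overlap has to be prevented manually. A minimal sketch, assuming the CrawlerProcess setup above; OtherSpider is a hypothetical second spider:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler
from spiders.egov import EgovSpider

process = CrawlerProcess(get_project_settings())
running = {'egov': False}  # tracks whether a crawl is still in flight

def crawl_egov():
    # skip this interval if the previous crawl has not finished yet
    if running['egov']:
        return
    running['egov'] = True
    d = process.crawl(EgovSpider)  # Deferred fires when the spider closes
    d.addBoth(lambda _: running.update(egov=False))

sched = TwistedScheduler()
sched.add_job(crawl_egov, 'interval', seconds=10)
# a second spider would be one more job, e.g.:
# sched.add_job(process.crawl, 'interval', args=[OtherSpider], seconds=60)
sched.start()
process.start(stop_after_crawl=False)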