Using Scrapy as a RabbitMQ consumer


I am trying to use Scrapy as a consumer of RabbitMQ messages.

Here is my code snippet:

def runTester(body):
    spider = MySpider(domain=body["url"], body=body)
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()


def callback(ch, method, properties, body):
    body = json.loads(body)
    runTester(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

if __name__ == '__main__':
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=settings.RABBITMQ_HOST))
    channel = connection.channel()
    channel.queue_declare(queue=settings.RABBITMQ_TESTER_QUEUE, durable=True)
    channel.basic_qos(prefetch_count=1)
    channel.basic_consume(callback, queue=settings.RABBITMQ_TESTER_QUEUE)
    channel.start_consuming()

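As you can see, the problem is that the reactor stops once the first message has been consumed and the spider has run. What is the workaround?

I want to keep the reactor running while messages arrive from RabbitMQ, and start a new crawler for each one.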

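A better approach is to start the spiders through the scrapyd API. After receiving a crawl request, you would use something like this: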

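import json
import logging
import subprocess

logger = logging.getLogger(__name__)


# The function name is hypothetical; the original snippet showed only the body.
def schedule_spider(flat_args):
    # flat_args: extra curl arguments, e.g. ['-d', 'spider=somespider']
    reply = {}
    args = ['curl',
            'http://localhost:6800/schedule.json',
            '-d', 'project=myproject', ] + flat_args
    # Run curl and capture scrapyd's JSON reply from stdout.
    json_reply = subprocess.Popen(args, stdout=subprocess.PIPE).communicate()[0]
    try:
        reply = json.loads(json_reply)
        if reply['status'] != 'ok':
            logger.error('Error in spider: %r: %r.', args, reply)
        else:
            logger.debug('Started spider: %r: %r.', args, reply)
    except Exception:
        logger.error('Error starting spider: %r: %r.', args, json_reply)
    return reply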

The command that the subprocess actually runs is:

$ curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider
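The same request can also be sent without shelling out to curl. Below is a minimal sketch that uses only the Python standard library. The function names `build_schedule_request` and `schedule_spider_http` are hypothetical, and the `domain` argument in the usage note is only an illustration. The URL, project, and spider names are the ones from the curl example.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example above.
SCRAPYD_URL = "http://localhost:6800/schedule.json"


def build_schedule_request(project, spider, **spider_args):
    """Build the form-encoded POST request that the curl command sends."""
    fields = {"project": project, "spider": spider}
    # scrapyd passes any additional fields on to the spider as arguments.
    fields.update(spider_args)
    data = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(SCRAPYD_URL, data=data, method="POST")


def schedule_spider_http(project, spider, **spider_args):
    """Send the request and return scrapyd's decoded JSON reply."""
    request = build_schedule_request(project, spider, **spider_args)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With scrapyd running locally, `schedule_spider_http("myproject", "somespider", domain="example.com")` should return a dict whose `status` field can be checked in the same way as in the snippet above.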

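The Scrapy daemon (scrapyd) is built to manage the launching of spiders, and it has many other useful features, such as deploying a new spider version with a simple

scrapy deploy

command (provided as scrapyd-deploy by the scrapyd-client package in later releases), monitoring and balancing multiple spiders, and so on.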
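As an illustration of the monitoring side, scrapyd also exposes a `listjobs.json` endpoint that reports pending, running, and finished jobs for a project. The sketch below assumes scrapyd is listening on `localhost:6800` as in the earlier example. The helper names `list_jobs_url`, `count_jobs`, and `fetch_job_counts` are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def list_jobs_url(project, base="http://localhost:6800"):
    """Return the listjobs.json URL for the given project."""
    return base + "/listjobs.json?" + urllib.parse.urlencode({"project": project})


def count_jobs(reply):
    """Count the jobs in each state of a decoded listjobs.json reply."""
    states = ("pending", "running", "finished")
    return {state: len(reply.get(state, [])) for state in states}


def fetch_job_counts(project):
    """Query scrapyd and return the number of jobs per state."""
    with urllib.request.urlopen(list_jobs_url(project), timeout=10) as response:
        return count_jobs(json.loads(response.read().decode("utf-8")))
```

Calling `fetch_job_counts("myproject")` periodically gives a rough view of how many spiders are queued or running.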

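This works, but it does not run the Scrapy process immediately; it runs after some delay. Could you tell me how to run the spider right away once it has been scheduled?

No, I was wrong. It works fine. Thanks for the suggestion. :)

The question is about consuming from RabbitMQ, not about submitting jobs to scrapyd.

I would also like to see a good example of how to combine scrapyd with RabbitMQ.

scrapyd spawns a process for every curl request; spawning one process per URL request does not seem like a good idea.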
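For readers looking for a way to combine the two, here is one possible sketch, not a tested production setup. A pika consumer forwards each message to scrapyd over HTTP, so no Twisted reactor runs inside the consumer process at all. It assumes pika 1.x, where `basic_consume` takes `queue` and `on_message_callback`, and a message body of the form `{"url": ...}` as in the question. The queue name `tester`, the `domain` spider argument, and the helpers `schedule` and `make_callback` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

SCRAPYD_URL = "http://localhost:6800/schedule.json"
QUEUE = "tester"  # hypothetical queue name


def schedule(url):
    """Ask scrapyd to run the spider for one URL and return its JSON reply."""
    fields = {"project": "myproject", "spider": "somespider", "domain": url}
    data = urllib.parse.urlencode(fields).encode("utf-8")
    with urllib.request.urlopen(SCRAPYD_URL, data=data, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def make_callback(schedule_func=schedule):
    """Build a pika message callback that schedules one crawl per message."""
    def callback(ch, method, properties, body):
        try:
            message = json.loads(body)
            ok = schedule_func(message["url"]).get("status") == "ok"
        except Exception:
            ok = False
        if ok:
            ch.basic_ack(delivery_tag=method.delivery_tag)
        else:
            # Rejected without requeue; the message is dropped unless the
            # queue has a dead-letter exchange configured.
            ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
    return callback


def main():
    import pika  # third-party; imported here so the helpers work without it

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    channel.basic_qos(prefetch_count=1)
    channel.basic_consume(queue=QUEUE, on_message_callback=make_callback())
    channel.start_consuming()
```

Calling `main()` starts the consumer. On the concern about one process per request: scrapyd queues scheduled jobs and caps how many run at once through its `max_proc` and `max_proc_per_cpu` settings, so a burst of messages should translate into queued jobs rather than an unbounded number of processes.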