Running multiple Scrapy spiders from a script in a loop

python web-scraping scrapy

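I have more than 100 spiders, and I want to run 5 of them at a time from a script. To do this, I created a table in a database that records each spider's status: finished, running, or waiting to run.
I know how to run multiple spiders in a single script: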

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
for i in range(10):  #this range is just for demo instead of this i 
                    #find the spiders that are waiting to run from database
    process.crawl(spider1)  #spider name changes based on spider to run
    process.crawl(spider2)
    print('-------------this is the-----{}--iteration'.format(i))
    process.start()

But this is not allowed, because the following error occurs:

Traceback (most recent call last):
File "test.py", line 24, in <module>
  process.start()
File "/home/g/projects/venv/lib/python3.4/site-packages/scrapy/crawler.py", line 285, in start
  reactor.run(installSignalHandlers=False)  # blocking call
File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 1242, in run
  self.startRunning(installSignalHandlers=installSignalHandlers)
File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 1222, in startRunning
  ReactorBase.startRunning(self)
File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 730, in startRunning
  raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

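I have searched for this error but could not resolve it. The spiders could be managed with ScrapyD, but we do not want to use ScrapyD because many of the spiders are still in development.

Any solution for the above scenario is welcome.

Thanks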

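You need ScrapyD for this purpose.

You can run as many spiders as you like at the same time, and you can keep checking whether a spider is running or not through its API.

You can set max_proc=5 in the ScrapyD configuration file so that at most 5 spiders run at once.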
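As a rough illustration of the status check described above, the sketch below counts running jobs from a `listjobs.json` response, the Scrapyd endpoint that lists pending, running, and finished jobs for a project. The default address `http://localhost:6800`, the helper names, and the sample payload are assumptions made for this example; the limit itself would be set as `max_proc = 5` under the `[scrapyd]` section of the Scrapyd configuration file.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Scrapyd is listening on its default address.
SCRAPYD_URL = "http://localhost:6800"


def count_running(listjobs_payload):
    """Return how many jobs a listjobs.json response reports as running."""
    return len(listjobs_payload.get("running", []))


def fetch_jobs(project, base_url=SCRAPYD_URL):
    """Fetch the job lists for one project from Scrapyd's listjobs.json endpoint."""
    query = urlencode({"project": project})
    with urlopen(f"{base_url}/listjobs.json?{query}") as response:
        return json.load(response)


# Hypothetical payload in the shape Scrapyd returns (job ids shortened).
sample = {
    "status": "ok",
    "pending": [{"id": "a1", "spider": "spider3"}],
    "running": [
        {"id": "b2", "spider": "spider1"},
        {"id": "c3", "spider": "spider2"},
    ],
    "finished": [],
}
print(count_running(sample))  # 2
```

Against a live server, `count_running(fetch_jobs("myproject"))` would give the same count for a project, where `myproject` is a placeholder name.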

Anyway, coming to your code: it should work if you write it like this

process = CrawlerProcess(get_project_settings())
for i in range(10):  #this range is just for demo instead of this i 
                    #find the spiders that are waiting to run from database
    process.crawl(spider1)  #spider name changes based on spider to run
    process.crawl(spider2)
    print('-------------this is the-----{}--iteration'.format(i))
process.start()

You need to put process.start() outside the loop.
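The error in the question arises because a Twisted reactor cannot be started again once it has stopped, so calling process.start() a second time in the same Python process fails. If the goal is to run the spiders in groups of 5 rather than all at once, one possible workaround (not something proposed in this thread) is to give every batch its own child process, and therefore its own fresh reactor. The sketch below assumes it is run from inside a Scrapy project; `chunked`, `crawl_batch`, and `run_in_batches` are hypothetical helper names.

```python
from multiprocessing import Process


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def crawl_batch(spider_names):
    """Run one batch of spiders to completion inside the current process."""
    # Imported here so that every child process sets up its own reactor.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    for name in spider_names:
        process.crawl(name)  # spiders are looked up by name in the project
    process.start()  # blocks until every spider in this batch has finished


def run_in_batches(spider_names, batch_size=5, target=crawl_batch):
    """Run the batches one after another, each in a separate child process."""
    for batch in chunked(spider_names, batch_size):
        child = Process(target=target, args=(batch,))
        child.start()
        child.join()  # wait for the whole batch before starting the next one
```

A call such as `run_in_batches(waiting_spider_names)`, placed under an `if __name__ == "__main__":` guard, would then work through the list 5 spiders at a time. Note that this waits for the slowest spider of each batch, so it is a fixed-batch scheme rather than a sliding window of 5.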

To run multiple spiders at the same time, you can use this

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished

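The answers to a related question may also help you.

For more information, see the Scrapy documentation on running multiple spiders in the same process.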

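I was able to achieve similar functionality by removing the loop from the script and setting up a scheduler that runs every 3 minutes.

The loop behaviour is achieved by keeping a record of how many spiders are currently running and checking whether more spiders need to be started. In the end, only 5 spiders (this number can be changed) run at the same time.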
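The answer does not show its scheduler, so the following is only a minimal sketch of what a single scheduler pass could look like. It assumes a SQLite table named `spider_status` with `name` and `status` columns, and it launches each spider with the `scrapy crawl` command in a separate process; the table layout and all function names are hypothetical.

```python
import sqlite3
import subprocess

MAX_RUNNING = 5  # the limit mentioned in the answer; adjust as needed


def create_table(conn):
    """Create the (hypothetical) status table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spider_status ("
        "name TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )


def spiders_to_start(conn, max_running=MAX_RUNNING):
    """Return the names of waiting spiders that fit into the free slots."""
    (running,) = conn.execute(
        "SELECT COUNT(*) FROM spider_status WHERE status = 'running'"
    ).fetchone()
    free_slots = max(max_running - running, 0)
    rows = conn.execute(
        "SELECT name FROM spider_status WHERE status = 'waiting' "
        "ORDER BY name LIMIT ?",
        (free_slots,),
    ).fetchall()
    return [name for (name,) in rows]


def tick(conn, launch):
    """One scheduler pass: mark the chosen spiders as running, then launch them."""
    started = spiders_to_start(conn)
    for name in started:
        conn.execute(
            "UPDATE spider_status SET status = 'running' WHERE name = ?",
            (name,),
        )
        launch(name)
    conn.commit()
    return started


def launch_with_scrapy(name):
    """Start one spider in its own OS process, so no reactor is restarted."""
    subprocess.Popen(["scrapy", "crawl", name])
```

Calling `tick(conn, launch_with_scrapy)` from a job that fires every 3 minutes (cron, for instance) would top the pool back up to 5. Marking a spider as running at launch time, rather than waiting for the spider to do it, is a choice made here so that two passes in quick succession do not start the same spider twice.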

I don't want to use ScrapyD as suggested here, because most of the spiders are still in development. By putting process.start() outside the for loop, it will start 20 spiders at the same time.

@Gaur93 That doesn't matter; at least with ScrapyD you get clear access to the spiders' logs and items, and you can install ScrapyD on localhost.

I have tried all of this, but it does not work for me, because I want the spiders to run in a loop, 5 at a time, rather than all at once.

What do you mean by "it does not work for me, because I want the spiders to run in a loop, 5 at a time, rather than all at once"? If you add the spiders with process.crawl as shown above, they will run simultaneously.

I am looking for exactly this kind of solution too. What library did you use for scheduling, and how do you query how many spiders are currently running?

@NFB I did not use any library for scheduling. I wrote a scheduler myself. To track the number of spiders currently running, I store each spider's status in the database. When a spider starts, the start_requests method (if you have one) is called first, so that is where the spider's status is changed to running. When the spider closes or finishes, the closed method is called, so there you can change the status to finished or not running. Before running a spider, you can check how many spiders are currently running and how many you want to have running.

Thanks for the information. How did you prevent the code from blocking during the crawl?

@NFB I limited concurrent requests to 2 and increased the download delay.
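The status updates described in the comments (set the status to running in start_requests, and to finished in closed) could be packaged as a small mixin. This is a sketch only: `STATUS` is an in-memory stand-in for the database table, and `set_status` and `StatusTrackingMixin` are hypothetical names. It relies on the spider base class providing `start_requests`, which may differ between Scrapy versions.

```python
# Stand-in for the database table that stores each spider's status.
STATUS = {}


def set_status(name, status):
    """Record the status of one spider (a real version would write to the DB)."""
    STATUS[name] = status


class StatusTrackingMixin:
    """Mix into a spider class, listed before the spider base class."""

    def start_requests(self):
        # Runs when the engine starts consuming the spider's initial requests.
        set_status(self.name, "running")
        yield from super().start_requests()

    def closed(self, reason):
        # Scrapy calls closed(reason) on a spider when it finishes.
        set_status(self.name, "finished")
```

A spider would then be declared as `class MySpider1(StatusTrackingMixin, scrapy.Spider)`. The throttling mentioned in the last comment corresponds to the `CONCURRENT_REQUESTS` and `DOWNLOAD_DELAY` settings, which can be set per spider through `custom_settings`.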