Running spiders in the background every n minutes while the Django server is running
I have a Django project in which several spiders crawl data from some websites and store it in the database; Django then displays the crawled data. This is the project structure:
-prj
    db.sqlite3
    manage.py
    -prj
        __init__.py
        settings.py
        urls.py
        wsgi.py
    -prj_app
        __init__.py
        prj_spider.py
        admin.py
        apps.py
        models.py
        runner.py
        urls.py
        views.py
I want all the spiders to run in the background every 5 minutes while the Django server is running. In views.py I import runner.py, and in runner.py all the spiders start crawling.
views.py:
from . import runner
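Worth noting how django-background-tasks executes work, since it matters for this setup: decorating a function with `@background` means that calling it only queues a task record in the database; a separate worker started with `python manage.py process_tasks` actually runs it, and `repeat=60` reschedules it 60 seconds after each run. A minimal sketch of that calling convention, with the decorator stubbed out so the example runs without Django installed:

```python
# Sketch of the django-background-tasks calling convention. The real
# decorator writes a task row to the database; this stub appends to a
# list so the shape of the API is visible without Django.
QUEUE = []

def background(schedule=0):
    def decorator(fn):
        def proxy(*args, repeat=0, **kwargs):
            # Calling the decorated function QUEUES it; it does not run here.
            QUEUE.append((fn.__name__, args, repeat))
        proxy.now = fn  # the real library also exposes .now() to run inline
        return proxy
    return decorator

@background()
def fetch_data():
    return "crawled"

fetch_data(repeat=60)  # queued; a `process_tasks` worker would execute it
```

This is why importing runner from views.py at module level only enqueues the tasks; nothing crawls unless the `process_tasks` worker is also running.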
runner.py:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from multiprocessing import Process, Queue
from .prj_spider import PrjSpider
from background_task import background

@background()
def run_spider(spider):
    def f(q):
        try:
            configure_logging()
            runner = CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result

for spider in spiders:
    run_spider(DivarSpider, repeat=60)
When I run the server, I get the following error:
TypeError: Object of type is not JSON serializable
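The TypeError most likely comes from django-background-tasks serializing every task argument to JSON before storing it in its task table, and a spider class such as DivarSpider is not JSON-serializable. One hedged workaround (the `SPIDERS` registry and `resolve_spider` are hypothetical names, not from the post) is to pass the spider's name as a plain string and resolve it back to the class inside the task body:

```python
import json

class DivarSpider:  # stand-in for the real spider class
    pass

# A class object cannot be stored as JSON, which is what the background
# task queue tries to do with its arguments:
failed = False
try:
    json.dumps([DivarSpider])
except TypeError:
    failed = True

# Hypothetical workaround: queue a JSON-safe string, look the class up
# again inside the worker process.
SPIDERS = {"divar": DivarSpider}

def resolve_spider(spider_name):
    return SPIDERS[spider_name]
```

With this pattern the task signature becomes `run_spider("divar", repeat=60)`, which serializes cleanly.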
And with this version of runner.py, I get a different error:
runner.py:
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from .prj_spider import PrjSpider
from background_task import background

@background()
def fetch_data():
    runner = CrawlerRunner()
    runner.crawl(PrjSpider)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    reactor.run()

fetch_data(repeat=60)
The error:
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
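ReactorNotRestartable happens because Twisted's reactor can be started at most once per process: the first scheduled run of fetch_data stops the reactor, and the next run calls reactor.run() again in the same worker process. A common workaround is to start each crawl in a fresh subprocess so every run gets its own reactor. A sketch of that shape, with the actual Scrapy crawl replaced by a stand-in so the example is self-contained:

```python
from multiprocessing import Process

def _crawl():
    # In the real task this would build a scrapy.crawler.CrawlerProcess,
    # call process.crawl(PrjSpider) and process.start(). Because it runs
    # in a brand-new process, it gets a brand-new reactor every time.
    pass  # stand-in for the Scrapy crawl

def run_crawl():
    p = Process(target=_crawl)
    p.start()
    p.join()           # block until this crawl finishes
    return p.exitcode  # 0 on success

# Unlike reactor.run(), this is safe to invoke repeatedly, e.g. from a
# background task scheduled every 5 minutes:
codes = [run_crawl(), run_crawl()]
```

The background task body then only calls run_crawl(), so the scheduling process never touches an already-stopped reactor.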
Have you started the spiders together with the reactor before? I mean, did this error appear suddenly? I ask because I tried to run all the spiders with the reactor and it did not go well. — @MuratDemir Yes, I can start all the spiders together in the background once... but I don't know how to schedule them to repeat.