Running Scrapy from a Python script
I have been trying to run Scrapy from a Python script file, because I need to fetch the data and save it into my database. But when I run it with the scrapy command
scrapy crawl argos
it runs fine.
But when I try to run it from a script, following this link:
http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
I get this error:
$ python pricewatch/pricewatch.py update
Traceback (most recent call last):
  File "pricewatch/pricewatch.py", line 39, in <module>
    main()
  File "pricewatch/pricewatch.py", line 31, in main
    update()
  File "pricewatch/pricewatch.py", line 24, in update
    setup_crawler("argos.co.uk")
  File "pricewatch/pricewatch.py", line 13, in setup_crawler
    settings = get_project_settings()
  File "/Library/Python/2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/utils/project.py", line 58, in get_project_settings
    settings_module = import_module(settings_module_path)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named settings
I have fixed it: simply placing pricewatch.py in the project's top-level directory and running it from there solves the problem.

This answer is largely copied from this one, which I believe answers your question, and additionally provides a decent example.
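The reason the top-level placement works: get_project_settings() locates the project via scrapy.cfg and then imports the settings module with a plain import, so that module must be importable from the directory you launch the script from. A stdlib-only sketch of the same failure mode (the module name here matches the traceback, not any installed package):

```python
import importlib

# get_project_settings() ultimately calls import_module() on the
# settings module path. If the current working directory is not the
# project root, that import fails with the ImportError shown above.
try:
    importlib.import_module("settings")  # not on sys.path here -> fails
except ImportError as exc:
    print(exc)
```

Launching the script from the project root (or setting SCRAPY_SETTINGS_MODULE explicitly) makes the module importable and the error disappears.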
Consider a project with the following structure:
my_project/
    main.py                  # Where we are running Scrapy from
    scraper/
        run_scraper.py       # Call from main goes here
        scrapy.cfg           # deploy configuration file
        scraper/             # project's Python module, you'll import your code from here
            __init__.py
            items.py         # project items definition file
            pipelines.py     # project pipelines file
            settings.py      # project settings file
            spiders/         # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py  # Contains the QuotesSpider class
Basically, the command

scrapy startproject scraper

was executed in the my_project folder; then I added the run_scraper.py file to the outer scraper folder, the main.py file to my root folder, and quotes_spider.py to the spiders folder.
My main.py file:
from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()
My run_scraper.py file:
from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os


class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings'  # The path seen from root, i.e. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider  # The spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished
Also note that you may need to review the settings, since the paths there have to be relative to the root folder (my_project, not scraper).
In my case:
SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'
and so on.

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference; link-only answers can become invalid if the linked page changes. – Got it, I've added the essential parts of the answer. Thanks for the feedback.
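Since the original goal of the question was to save the scraped data into a database, an item pipeline is the usual Scrapy hook for that. A minimal sketch using sqlite3 (the database file, table, and field names are all assumptions; it would be registered via ITEM_PIPELINES in settings.py):

```python
import sqlite3


class SQLitePipeline:
    """Stores each scraped item in a local SQLite database.

    Hypothetical registration in settings.py:
        ITEM_PIPELINES = {'scraper.scraper.pipelines.SQLitePipeline': 300}
    """

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB and ensure the table exists.
        self.conn = sqlite3.connect("quotes.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT)"
        )

    def close_spider(self, spider):
        # Called once when the spider finishes: persist and release the connection.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.conn.execute(
            "INSERT INTO quotes VALUES (?, ?)",
            (item.get("text"), item.get("author")),
        )
        return item
```

Pipelines are plain classes, so this works regardless of whether the crawl is started via the scrapy command or via CrawlerProcess as above.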