Python 2.7 — running Scrapy from a script won't export data
I am trying to run Scrapy from a script, but I cannot get the program to create the export file. I have tried to export the file in two different ways:

1. With a pipeline
2. With a feed exporter

Both methods work when I run Scrapy from the command line, but neither works when I run it from a script. I am not the only one with this problem; there are two other similar, unanswered questions that I only noticed after posting this one.

Below is the code I use to run Scrapy from a script. It includes the settings for writing the output file with both the pipeline and the feed exporter:
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.xlib.pydispatch import dispatcher
import logging
from external_links.spiders.test import MySpider
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
#manually set settings here
settings.set('ITEM_PIPELINES',{'external_links.pipelines.FilterPipeline':100,'external_links.pipelines.CsvWriterPipeline':200},priority='cmdline')
settings.set('DEPTH_LIMIT',1,priority='cmdline')
settings.set('LOG_FILE','Log.log',priority='cmdline')
settings.set('FEED_URI','output.csv',priority='cmdline')
settings.set('FEED_FORMAT', 'csv',priority='cmdline')
settings.set('FEED_EXPORTERS',{'csv':'external_links.exporter.CsvOptionRespectingItemExporter'},priority='cmdline')
settings.set('FEED_STORE_EMPTY',True,priority='cmdline')
def stop_reactor():
    reactor.stop()
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider()
crawler = Crawler(settings)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start(loglevel=logging.DEBUG)
log.msg('reactor running...')
reactor.run()
log.msg('Reactor stopped...')
After I run this code, the log says "Stored csv feed (341 items) in: output.csv", but output.csv is nowhere to be found.
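One likely culprit (my assumption, not something confirmed in the post) is that a relative FEED_URI such as 'output.csv' is resolved against the current working directory of the Python process, which can differ from the project directory when the crawl is launched from a script. A quick way to check where the feed would actually land:

```python
import os

# A relative feed URI is written relative to the process's current
# working directory, not the script's or the Scrapy project's directory.
feed_uri = 'output.csv'  # the relative URI used in the settings above
print(os.path.abspath(feed_uri))  # the path where the feed file actually lands
```

If that path is not where you are looking, switching to an absolute file:// URI (as in the answer below) removes the ambiguity.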
Here is my feed exporter code:
from scrapy.contrib.exporter import CsvItemExporter
from scrapy.utils.project import get_project_settings

settings = get_project_settings()

class CsvOptionRespectingItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        delimiter = settings.get('CSV_DELIMITER', ',')
        kwargs['delimiter'] = delimiter
        super(CsvOptionRespectingItemExporter, self).__init__(*args, **kwargs)
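As an aside, the delimiter override works because CsvItemExporter forwards its keyword arguments to the underlying csv writer. A minimal stdlib illustration (not Scrapy code; ';' stands in for a CSV_DELIMITER value):

```python
import csv
import io

# csv.writer accepts a delimiter kwarg, which is what the exporter
# above ultimately forwards from the CSV_DELIMITER setting.
buf = io.StringIO()
writer = csv.writer(buf, delimiter=';')
writer.writerow(['all_links', 'current_url', 'start_url'])
print(buf.getvalue().strip())  # all_links;current_url;start_url
```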
Here is my pipeline code:
import csv

class CsvWriterPipeline(object):

    def __init__(self):
        self.csvwriter = csv.writer(open('items2.csv', 'wb'))

    def process_item(self, item, spider):  # item needs to be second in this list, otherwise you get the spider object
        self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])
        return item
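Note that this pipeline opens the file in __init__ and never closes it, so rows can sit in the write buffer and never reach disk if the process is stopped abruptly (e.g. when the reactor is shut down from a script). A sketch of a safer variant (my suggestion, not from the original post; written for Python 3 — under Python 2.7 the file would be opened with mode 'wb' instead):

```python
import csv

# Sketch: open the file in open_spider and close it in close_spider so
# buffered rows are flushed to disk even when the crawl is run from a
# script and the reactor is stopped programmatically.
class SafeCsvWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open('items2.csv', 'w', newline='')
        self.csvwriter = csv.writer(self.file)

    def process_item(self, item, spider):
        self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])
        return item

    def close_spider(self, spider):
        self.file.close()
```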
I had the same problem. Here is what worked for me:
In settings.py, use an absolute file URI:

FEED_URI = 'file:///tmp/feeds/filename.jsonlines'

Then create a scrape.py script next to scrapy.cfg:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
process.crawl('yourspidername') #'yourspidername' is the name of one of the spiders of the project.
process.start() # the script will block here until the crawling is finished
Run it with: python scrape.py
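For what it's worth, newer Scrapy releases (2.1 and later) replaced FEED_URI and FEED_FORMAT with a single FEEDS dictionary; under that API the equivalent settings.py fragment would look roughly like this (not applicable to the Python 2.7 setup above):

```python
# settings.py fragment (sketch for Scrapy 2.1+, where FEEDS supersedes
# FEED_URI and FEED_FORMAT)
FEEDS = {
    'file:///tmp/feeds/output.csv': {'format': 'csv'},
}
```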
Also: the "common pitfalls" section of the docs is what helped me resolve my issue.

Did you ever solve this problem?