Python 2.7: Scrapy from a script will not export data

I am trying to run scrapy from a script, but I cannot get the program to create the export file.

I have tried to export the file in two different ways:

  • with a pipeline
  • with a feed exporter

Both methods work when I run scrapy from the command line, but neither works when I run scrapy from a script. I am not the only one with this problem; there are two other similar, unanswered questions here, which I did not notice until after I posted this one.

Below is the code I use to run scrapy from my script. It includes the settings for writing the output file with both the pipeline and the feed exporter.

    from twisted.internet import reactor
    
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.xlib.pydispatch import dispatcher
    import logging
    
    from external_links.spiders.test import MySpider
    from scrapy.utils.project import get_project_settings
    settings = get_project_settings()
    
    #manually set settings here
    settings.set('ITEM_PIPELINES',{'external_links.pipelines.FilterPipeline':100,'external_links.pipelines.CsvWriterPipeline':200},priority='cmdline')
    settings.set('DEPTH_LIMIT',1,priority='cmdline')
    settings.set('LOG_FILE','Log.log',priority='cmdline')
    settings.set('FEED_URI','output.csv',priority='cmdline')
    settings.set('FEED_FORMAT', 'csv',priority='cmdline')
    settings.set('FEED_EXPORTERS',{'csv':'external_links.exporter.CsvOptionRespectingItemExporter'},priority='cmdline')
    settings.set('FEED_STORE_EMPTY',True,priority='cmdline')
    
    def stop_reactor():
        reactor.stop()
    
    dispatcher.connect(stop_reactor, signal=signals.spider_closed)
    spider = MySpider()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start(loglevel=logging.DEBUG)
    log.msg('reactor running...')
    reactor.run()
    log.msg('Reactor stopped...')
    
After I run this code, the log says "Stored csv feed (341 items) in: output.csv", but output.csv is nowhere to be found.
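One possible lead (an aside, not something taken from the log itself): a relative FEED_URI such as output.csv resolves against the current working directory of the process running the crawl, which can differ when scrapy is launched from a script rather than from the command line. A quick sanity check:

    import os

    # Where would a relative feed path actually land? Relative FEED_URI
    # values resolve against the current working directory of the process.
    print 'cwd:', os.getcwd()
    print 'output.csv would be at:', os.path.abspath('output.csv')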

Here is my feed exporter code:

    from scrapy.utils.project import get_project_settings
    settings = get_project_settings()
    
    #manually set settings here
    settings.set('ITEM_PIPELINES',   {'external_links.pipelines.FilterPipeline':100,'external_links.pipelines.CsvWriterPipeline': 200},priority='cmdline')
    settings.set('DEPTH_LIMIT',1,priority='cmdline')
    settings.set('LOG_FILE','Log.log',priority='cmdline')
    settings.set('FEED_URI','output.csv',priority='cmdline')
    settings.set('FEED_FORMAT', 'csv',priority='cmdline')
    settings.set('FEED_EXPORTERS',{'csv':'external_links.exporter.CsvOptionRespectingItemExporter'},priority='cmdline')
    settings.set('FEED_STORE_EMPTY',True,priority='cmdline')
    
    
    from scrapy.contrib.exporter import CsvItemExporter
    
    
    class CsvOptionRespectingItemExporter(CsvItemExporter):
    
        def __init__(self, *args, **kwargs):
            delimiter = settings.get('CSV_DELIMITER', ',')
            kwargs['delimiter'] = delimiter
            super(CsvOptionRespectingItemExporter, self).__init__(*args, **kwargs)
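Since the exporter reads CSV_DELIMITER through get_project_settings(), the delimiter would be picked up from the project's settings.py; a hypothetical example (CSV_DELIMITER is not actually set anywhere above):

    # settings.py (hypothetical): emit semicolon-separated output
    CSV_DELIMITER = ';'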
    
Here is my pipeline code:

    import csv

    class CsvWriterPipeline(object):

        def __init__(self):
            self.csvwriter = csv.writer(open('items2.csv', 'wb'))

        def process_item(self, item, spider):  # item must be the second argument, otherwise you get the spider object
            self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])
            return item
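A side note: one common reason a pipeline like this appears to write nothing is that the file handle is never closed, so buffered rows may never be flushed to disk. A minimal sketch (an assumption, not the code above) that ties the file's lifetime to the spider:

    import csv

    # Sketch: open the file when the spider starts and close it when the
    # spider finishes, so buffered rows are flushed even if the process
    # exits via reactor.stop().
    class ClosingCsvWriterPipeline(object):

        def open_spider(self, spider):
            self.csvfile = open('items2.csv', 'wb')
            self.csvwriter = csv.writer(self.csvfile)

        def process_item(self, item, spider):
            self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])
            return item

        def close_spider(self, spider):
            self.csvfile.close()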
    

I had the same problem.

Here is what worked for me:

  • Put the export URI in settings.py (a fuller sketch follows the list):

    FEED_URI = 'file:///tmp/feeds/filename.jsonlines'

  • Create a scrape.py script next to scrapy.cfg with the following content:

     
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    
    
    process = CrawlerProcess(get_project_settings())
    
    process.crawl('yourspidername') #'yourspidername' is the name of one of the spiders of the project.
    process.start() # the script will block here until the crawling is finished
    
    
  • Run: python scrape.py

  • Result: the file gets created.

    Note: my project has no pipelines, so I cannot say whether a pipeline would filter your results.
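For reference, a minimal sketch of the relevant settings.py lines (the path and format are the ones from the steps above):

    # settings.py -- an absolute file:// URI does not depend on the current
    # working directory of whichever process launches the crawl
    FEED_URI = 'file:///tmp/feeds/filename.jsonlines'
    FEED_FORMAT = 'jsonlines'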


    Also: here is the common pitfalls section that helped me fix my problem.

    Did you ever solve this problem?