Python: saving a spider's output in a variable instead of a file

I'm looking for a way to store a spider's output in a Python variable, instead of saving it to a json file and reading it back in the program.

import os

import scrapy
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.wikipedia.org']

    def parse(self, response):
        yield {
                'text' : response.css(".jsl10n.localized-slogan::text").extract_first()
             }

if __name__ == "__main__":
    # Remove stale output from an earlier run, if there is any
    if os.path.exists('result.json'):
        os.remove('result.json')
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'FEED_FORMAT': 'json',
        'FEED_URI': 'result.json'
    })

    process.crawl(TestSpider)
    process.start()

I'd like to avoid the step below and read the value directly, without saving it to disk first:

import io
import json

with io.open('result.json', encoding='utf-8') as json_data:
    d = json.load(json_data)
    text = d[0]['text']

I ended up using a global variable to store the output, which solved my problem:

import scrapy
from scrapy.crawler import CrawlerProcess

outputResponse = {}

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.wikipedia.org']

    def parse(self, response):
        global outputResponse
        outputResponse['text'] = response.css(".jsl10n.localized-slogan::text").extract_first()

if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    })

    process.crawl(TestSpider)
    process.start()
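
A side note on the `global` statement used above: it is only required when a function rebinds a module-level name. Since `parse` only mutates the dictionary, the statement could arguably be left out. Here is a minimal sketch in plain Python, with no Scrapy involved; the names `results`, `record`, `rebind`, and `snapshot` are hypothetical and chosen only for illustration.

```python
# Module-level container, standing in for outputResponse above.
results = {}


def record(value):
    # Mutating the existing dict needs no `global` statement:
    # the name is only read, and the object it refers to is changed.
    results['text'] = value


def rebind(value):
    # Rebinding the name to a new object does need `global`;
    # without it, `results` would become a new local variable.
    global results
    results = {'text': value}


record('first')
snapshot = results  # keep a reference to the original dict
rebind('second')
```

After this runs, `snapshot` still holds the dictionary that `record` filled in, while `results` points to the new dictionary created by `rebind`.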

You can also pass an object to the spider and mutate it, like this:

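import scrapy
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.wikipedia.org']

    def parse(self, response):
        # outputResponse is set as an attribute from the crawl() keyword argument
        self.outputResponse['text'] = response.css(".jsl10n.localized-slogan::text").extract_first()

if __name__ == "__main__":
    outputResponse = {}
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    })

    process.crawl(TestSpider, outputResponse=outputResponse)
    process.start()

This works because every named argument passed to the spider constructor is assigned as an attribute on the instance, which is why you can use self.outputResponse in the parse method and reach the outer object.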
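
To illustrate why the passed dictionary is reachable as `self.outputResponse`, here is a minimal stand-in that mimics how the base spider constructor stores keyword arguments. It is a simplified sketch under that assumption, not Scrapy's actual class; `FakeSpider` and the hard-coded string are hypothetical.

```python
class FakeSpider:
    """Hypothetical stand-in for scrapy.Spider; only the kwargs handling is mimicked."""

    def __init__(self, **kwargs):
        # Every named argument becomes an instance attribute.
        self.__dict__.update(kwargs)

    def parse(self, response):
        # self.outputResponse is the very same dict the caller passed in,
        # so writing to it is visible outside the spider.
        self.outputResponse['text'] = response


outputResponse = {}
spider = FakeSpider(outputResponse=outputResponse)
spider.parse('example slogan')  # placeholder for a real response value
```

Because the dictionary is shared by reference rather than copied, the caller can read `outputResponse['text']` once the call returns.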

I think this is helpful.