
How do I make this spider export a JSON file for each list of items?


In my file Reddit.py below, there is a spider:

import scrapy

class RedditSpider(scrapy.Spider):
    name = 'Reddit'
    allowed_domains = ['reddit.com']
    start_urls = ['https://old.reddit.com']

    def parse(self, response):
        # Follow the "comments" link of every topic on the front page
        for link in response.css('li.first a.comments::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(link), callback=self.parse_topics)

    def parse_topics(self, response):
        topics = {}
        topics["title"] = response.css('a.title::text').extract_first()
        topics["author"] = response.css('p.tagline a.author::text').extract_first()

        if response.css('div.score.likes::attr(title)').extract_first() is not None:
            topics["score"] = response.css('div.score.likes::attr(title)').extract_first()
        else:
            topics["score"] = "0"

        # For high-scoring topics, visit the author's profile,
        # passing the topic along in the request meta
        if int(topics["score"]) > 10000:
            author_url = response.css('p.tagline a.author::attr(href)').extract_first()
            yield scrapy.Request(url=response.urljoin(author_url), callback=self.parse_user, meta={'topics': topics})
        else:
            yield topics

    def parse_user(self, response):
        topics = response.meta.get('topics')

        users = {}
        users["name"] = topics["author"]
        users["karma"] = response.css('span.karma::text').extract_first()

        yield users
        yield topics
It grabs all the URLs from the old.reddit front page, then scrapes each URL's title, author, and score.

What I added is the second part: it checks whether the score is higher than 10000, and if it is, the spider goes to the user's page and scrapes their karma from it.

I know I could scrape the karma from the topic's page, but I want to do it this way, because there are other parts of the user's page that I scrape which don't exist on the topic's page.

What I want to do is export the topics list containing title, author, and score into a JSON file named topics.json, and then, whenever a topic's score is higher than 10000, export the users list containing name and karma into a JSON file named users.json.

All I know how to use is the command line:

scrapy runspider Reddit.py -o Reddit.json

which exports all the lists into a single JSON file named Reddit.json, but structured as follows:

[
  {"name": "Username", "karma": "00000"},
  {"title": "ExampleTitle1", "author": "Username", "score": "11000"},
  {"name": "Username2", "karma": "00000"},
  {"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
  {"name": "Username3", "karma": "00000"},
  {"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
  {"title": "ExampleTitle4", "author": "Username4", "score": "9000"},
  ....
]

I'm completely clueless about Scrapy's Item Pipeline, Item Exporters, and Feed Exporters, how to implement them in my spider, or how to use them at all. I've tried to understand them from the documentation, but I can't figure out how to use them in my spider.
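For orientation, a minimal item pipeline doing this kind of split might look like the sketch below (SplitJsonPipeline, the output file names, and the key-based routing are illustrative assumptions, not Scrapy built-ins; the sketch opens one JsonItemExporter per item kind and routes each item by the keys it carries):

# pipelines.py -- a minimal sketch with hypothetical names
from scrapy.exporters import JsonItemExporter

class SplitJsonPipeline:
    def open_spider(self, spider):
        # JsonItemExporter expects binary file handles
        self.handles = {
            "topics": open("topics.json", "wb"),
            "users": open("users.json", "wb"),
        }
        self.exporters = {kind: JsonItemExporter(handle)
                          for kind, handle in self.handles.items()}
        for exporter in self.exporters.values():
            exporter.start_exporting()

    def close_spider(self, spider):
        for exporter in self.exporters.values():
            exporter.finish_exporting()
        for handle in self.handles.values():
            handle.close()

    def process_item(self, item, spider):
        # A user item carries "karma"; everything else counts as a topic
        kind = "users" if "karma" in item else "topics"
        self.exporters[kind].export_item(item)
        return item

It would be enabled with something like ITEM_PIPELINES = {'yourproject.pipelines.SplitJsonPipeline': 300} in settings.py; the dotted path depends on your project layout.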


The end result I want is two files:

topics.json

[
{"title": "ExampleTitle1", "author": "Username", "score": "11000"},
{"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
{"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
{"title": "ExampleTitle4", "author": "Username4", "score": "9000"}
]
users.json

[
{"name": "Username", "karma": "00000"},
{"name": "Username2", "karma": "00000"},
{"name": "Username3", "karma": "00000"}
]

with the duplicates in the lists cleared.

Your spider yields two items when it crawls a user page. What if, instead:

def parse_user(self, response):
    topics = response.meta.get('topics')

    users = {}
    users["name"] = topics["author"]
    users["karma"] = response.css('span.karma::text').extract_first()
    topics["users"] = users

    yield topics
You can then post-process the JSON however you need.
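For instance, a small script along these lines could split the combined Reddit.json back into the two desired files (a sketch, assuming the feed was written as a JSON array via -o Reddit.json and that high-score topics carry the nested "users" dict produced above):

import json

with open("Reddit.json") as f:
    records = json.load(f)

topics, users = [], []
for record in records:
    user = record.pop("users", None)  # detach the nested user, if any
    if user is not None:
        users.append(user)
    topics.append(record)

# Drop duplicate users by name, keeping the first occurrence
seen = set()
users = [u for u in users if not (u["name"] in seen or seen.add(u["name"]))]

with open("topics.json", "w") as f:
    json.dump(topics, f, indent=1)
with open("users.json", "w") as f:
    json.dump(users, f, indent=1)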


By the way, I don't understand why you use the plural ("topics") when you're dealing with a single element (a single "topic").

Applying the approach from the thread below, I created a sample scraper:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        yield {"type": "unknown item"}
        yield {"title": "ExampleTitle1", "author": "Username", "score": "11000"}
        yield {"name": "Username", "karma": "00000"}
        yield {"name": "Username2", "karma": "00000"}
        yield {"someothertype": "unknown item"}

        yield {"title": "ExampleTitle2", "author": "Username2", "score": "12000"}
        yield {"title": "ExampleTitle3", "author": "Username3", "score": "13000"}
        yield {"title": "ExampleTitle4", "author": "Username4", "score": "9000"}
        yield {"name": "Username3", "karma": "00000"}
Then, in exporters.py:

from scrapy.exporters import JsonItemExporter
from scrapy.extensions.feedexport import FileFeedStorage


class JsonMultiFileItemExporter(JsonItemExporter):
    types = ["topics", "users"]

    def __init__(self, file, **kwargs):
        super().__init__(file, **kwargs)
        self.files = {}
        self.kwargs = kwargs

        # Open one extra JSON exporter per known item type
        for itemtype in self.types:
            storage = FileFeedStorage(itemtype + ".json")
            file = storage.open(None)
            self.files[itemtype] = JsonItemExporter(file, **self.kwargs)

    def start_exporting(self):
        super().start_exporting()
        for exporter in self.files.values():
            exporter.start_exporting()

    def finish_exporting(self):
        super().finish_exporting()
        for exporter in self.files.values():
            exporter.finish_exporting()
            exporter.file.close()

    def export_item(self, item):
        # Route each item based on the keys it carries
        if "title" in item:
            itemtype = "topics"
        elif "karma" in item:
            itemtype = "users"
        else:
            itemtype = "self"

        if itemtype == "self" or itemtype not in self.files:
            # Unrecognized items fall through to the default feed file
            super().export_item(item)
        else:
            self.files[itemtype].export_item(item)
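The idea is to piggyback on Scrapy's normal feed export: the exporter registered for the json format keeps writing unrecognized items to the main feed file, while anything identified as a topic or a user is diverted into its own FileFeedStorage-backed file.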
Add the following to settings.py:

FEED_EXPORTERS = {
    'json': 'testing.exporters.JsonMultiFileItemExporter',
}
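Note that the custom exporter is only instantiated when the spider is run with a JSON output feed, so the -o flag is still required, e.g. (assuming the sample spider above is saved as example.py):

scrapy runspider example.py -o example.json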
Running the scraper, I get 3 files generated:

example.json

[
{"type": "unknown item"},
{"someothertype": "unknown item"}
]
topics.json

[
{"title": "ExampleTitle1", "author": "Username", "score": "11000"},
{"title": "ExampleTitle2", "author": "Username2", "score": "12000"},
{"title": "ExampleTitle3", "author": "Username3", "score": "13000"},
{"title": "ExampleTitle4", "author": "Username4", "score": "9000"}
]
users.json

[
{"name": "Username", "karma": "00000"},
{"name": "Username2", "karma": "00000"},
{"name": "Username3", "karma": "00000"}
]

What's the desired output format? Also, I take it you want at most one item output per topic found?

@Apalala I actually want each yield output to go into its own JSON file rather than all of them ending up in one file. My goal with the plural names topics and users is to put them in different files; that's why I used the plural, just to tell which file each list belongs to.

Scrapy will produce a stream of JSON records. That stream can easily be post-processed with tools like jq. If you want two types of records in the stream, add a type field to each record so they are easy to distinguish.

So I made the following changes to the files exporters.py, settings.py, and Reddit.py, then ran scrapy runspider Reddit.py, but nothing happened. Did I miss something, or should I move these files into a folder?

You need to use -o example.json.

I get this error: ModuleNotFoundError: No module named 'testing'.

That's because testing is the name of the incomplete project I created. You need to update it to your own project's name.
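In other words, the dotted path in FEED_EXPORTERS has to point into your own package. For a hypothetical project package named myproject with the class in myproject/exporters.py, the setting would read:

FEED_EXPORTERS = {
    'json': 'myproject.exporters.JsonMultiFileItemExporter',
}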