Python Scrapy: creating a nested JSON object


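I'm learning how to use Scrapy while refreshing the Python and general coding knowledge I picked up at school.

At the moment I'm working with lists, but I'm struggling with the JSON output file.

My current code is: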

# -*- coding: utf-8 -*-
import scrapy

from top250imdb.items import Top250ImdbItem


class ActorsSpider(scrapy.Spider):
    name = "actors"
    allowed_domains = ["imdb.com"]
    start_urls = ['http://www.imdb.com/chart/top']

    # Parsing each movie and preparing the url for the actors list
    def parse(self, response):
        for film in response.css('.titleColumn'):
            url = film.css('a::attr(href)').extract_first()
            actors_url = 'http://imdb.com' + url[:17] + 'fullcredits?ref_=tt_cl_sm#cast'
            yield scrapy.Request(actors_url, self.parse_actor)

    # Finding all actors and storing them on item
    # Refer to items.py
    def parse_actor(self, response):
        final_list = []
        item = Top250ImdbItem()
        item['poster'] = response.css('#main img::attr(src)').extract_first()
        item['title'] = response.css('h3[itemprop~=name] a::text').extract()
        item['photo'] = response.css('#fullcredits_content .loadlate::attr(loadlate)').extract()
        item['actors'] = response.css('td[itemprop~=actor] span::text').extract()

        final_list.append(item)

        updated_list = []

        for item in final_list:
            for i in range(len(item['title'])):
                sub_item = {}
                sub_item['movie'] = {}
                sub_item['movie']['poster'] = [item['poster']]
                sub_item['movie']['title'] = [item['title'][i]]
                sub_item['movie']['photo'] = [item['photo']]
                sub_item['movie']['actors'] = [item['actors']]
                updated_list.append(sub_item)
        return updated_list
My output file gives me the following JSON structure:

[
  {
    "movie": {
      "poster": ["https://images-na.ssl-images-amazon.com/poster..."], 
      "title": ["The Shawshank Redemption"], 
      "photo": [["https://images-na.ssl-images-amazon.com/photo..."]], 
      "actors": [["Tim Robbins","Morgan Freeman",...]]}
    },{
    "movie": {
      "poster": ["https://images-na.ssl-images-amazon.com/poster..."], 
      "title": ["The Godfather"], 
      "photo": [["https://images-na.ssl-images-amazon.com/photo..."]], 
      "actors": [["Alexandre Rodrigues", "Leandro Firmino", "Phellipe Haagensen",...]]}
  }
]
But I would like to achieve this:

{
  "movies": [{
    "poster": "https://images-na.ssl-images-amazon.com/poster...",
    "title": "The Shawshank Redemption",
    "actors": [
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Tim Robbins"},
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Morgan Freeman"},...
    ]
  },{
    "poster": "https://images-na.ssl-images-amazon.com/poster...",
    "title": "The Godfather",
    "actors": [
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Marlon Brando"},
      {"photo": "https://images-na.ssl-images-amazon.com/photo...",
      "name": "Al Pacino"},...
    ]
  }]
}
In my items.py file I have the following:

import scrapy


class Top250ImdbItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Items from actors.py
    poster = scrapy.Field()
    title = scrapy.Field()
    photo = scrapy.Field()
    actors = scrapy.Field()
    movie = scrapy.Field()
    pass
I'm aware of the following:

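  • My results don't come out in order: the first movie on the web page's list is always the first movie in my output file, but the rest are not. I'm still working on that.

  • I could probably do the same thing using Top250ImdbItem(); I'm still looking into how to do that in more detail.

  • This may not be the ideal layout for my JSON. Suggestions are welcome, and if the layout is fine, please let me know. I realize there is no perfect way or "only way".

  • Some actors have no photo, and the page actually uses a different CSS selector for them. For now I'd like to avoid picking up the "no picture" thumbnail, so it's fine to leave those entries empty. For example:

    {"photo": "", "name": "Al Pacino"}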
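On the ordering point in the list above: Scrapy handles requests concurrently, so responses are not guaranteed to arrive in the order the requests were issued. One possible approach, sketched here in plain Python without Scrapy, is to carry each film's chart position along with its data and sort once at the end. The `rank` key and the `order_movies` helper are hypothetical names introduced only for this illustration.

```python
import json


def order_movies(scraped):
    """Return a {'movies': [...]} dict sorted by chart position.

    `scraped` is a list of dicts that each carry a hypothetical 'rank' key
    (the film's position on the chart page) next to the movie data.
    """
    ordered = sorted(scraped, key=lambda movie: movie["rank"])
    # Drop the helper key so the output matches the desired layout.
    return {
        "movies": [
            {key: value for key, value in movie.items() if key != "rank"}
            for movie in ordered
        ]
    }


# Responses may arrive in arbitrary order, e.g. rank 2 before rank 1.
scraped = [
    {"rank": 2, "title": "The Godfather", "actors": []},
    {"rank": 1, "title": "The Shawshank Redemption", "actors": []},
]
result = order_movies(scraped)
print(json.dumps(result, indent=2))
```

In a spider, the rank could come from `enumerate()` over the films in `parse` and travel to the callback through the request's `meta` dict; that wiring is left out here so the sketch stays runnable on its own.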
    
Question: ... struggling with the JSON output file

Note: I can't run your ActorsSpider; I get the error "Pseudo-elements are not supported".



Don't use (scrapy.Item); use a dict and start with movies: [].

Hey @stovfl, can you elaborate a bit?

OK, I'll check. It's a bit odd that you can't run it, since I'm still using exactly the same code. I'll look into that problem too and update the question so you can see whether it runs. I'll try those suggestions. And no, the photos and actors are actually not in sync yet; I'm still working out how to do that, but your help is really great. Should I post my modified working code here as a comment, edit the current code, or leave it as is?

Edit your question and add only the changed parts.

# Define a `dict` **once**
top250ImdbItem = {'movies': []}

def parse_actor(self, response):
    # Selectors are elided in the original answer
    poster = response.css(...)
    title = response.css(...)
    photos = response.css(...)
    actors = response.css(...)

    # Assuming the list of actors is in sync with the list of photos
    actors_list = []
    for i, actor in enumerate(actors):
        actors_list.append({"name": actor, "photo": photos[i]})

    one_movie = {"poster": poster,
                 "title": title,
                 "actors": actors_list
                }

    # Append one movie to the Top250 'movies' list
    top250ImdbItem['movies'].append(one_movie)
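The loop above assumes the actor and photo lists have the same length; if there are fewer photos, `photos[i]` raises an `IndexError`. A minimal sketch of one way around that, in plain Python with already-extracted strings, uses `itertools.zip_longest` to pad missing photos with an empty string. The `build_movie` helper and the example.com URLs are placeholders for illustration. Note that this only pads at the end of the list: it does not realign the pairs when a photo is missing in the middle, which would require extracting name and photo row by row.

```python
from itertools import zip_longest
import json


def build_movie(poster, title, actors, photos):
    """Build one entry of the 'movies' list from parallel lists of strings."""
    actors_list = [
        {"photo": photo, "name": name.strip()}
        # Pad the shorter list with "" so a missing photo stays empty.
        for name, photo in zip_longest(actors, photos, fillvalue="")
        if name  # skip surplus photos that have no matching actor
    ]
    return {"poster": poster, "title": title, "actors": actors_list}


top250 = {"movies": []}
top250["movies"].append(
    build_movie(
        poster="https://example.com/poster.jpg",    # placeholder URL
        title="The Godfather",
        actors=["Marlon Brando", "Al Pacino"],
        photos=["https://example.com/brando.jpg"],  # one photo missing
    )
)
print(json.dumps(top250, indent=2))
```

Because the whole result is a single dict rather than one item per movie, it would have to be written out once after all callbacks have run, for example with `json.dump` when the spider closes; that step is an assumption about how the answer would be completed, not something the answer itself shows.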