Python: How to convert rough JSON output into a dataframe


I have written a Scrapy script that creates a JSON file, but I am struggling to get it into a dataframe. I am considering two approaches:

a) Post-process my output somehow and try to convert it into a df structure. I don't know how to turn it into a df with the structure described in b); a sketch of this is shown below the sample output.

At this point my JSON output looks like the following, where the letters stand for actors and the numbers for characters:

{"title": {"episode1": {"actor": {"x": "1", "y": "2", "z": "3"}}}},
{"title": {"episode2": {"actor": {"x": "1", "y": "2", "z": "3"}}}}]
b) Structure the JSON file as follows, which would make it very easy to convert into a df using json_normalize:

dictionary = [{'title':'episode1', 'actors': {'actor': 'x', 'char': 1}},
             {'title':'episode1', 'actors': {'actor': 'y', 'char': 2}},
             {'title':'episode1', 'actors': {'actor': 'z', 'char': 3}},
             {'title':'episode2', 'actors': {'actor': 'x', 'char': 1}}]
which can be converted to a df like this:

from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

df = json_normalize(dictionary)
df.columns = df.columns.map(lambda x: x.split(".")[-1])
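
With the sample dictionary above this yields one row per actor-character pair; the second line strips the "actors." prefix that json_normalize puts on the nested column names:

#       title actor  char
# 0  episode1     x     1
# 1  episode1     y     2
# 2  episode1     z     3
# 3  episode2     x     1
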
Any pointers on how to create this more useful JSON format would be greatly appreciated. This is my code:

import unidecode
import scrapy
import re

class ActorsSpider(scrapy.Spider):

    name = 'actors'
    start_urls = ['https://ozark-netflix.fandom.com/wiki/Category:Episodes']

    def parse(self, response):
        episode_urls = response.xpath('//a[contains(@class, "category-page__member-link") and not (contains(@title, "Season"))]/@href')
        yield from response.follow_all(episode_urls, self.parse_episode)

    def parse_episode(self, response):
        episode_title = response.xpath('//h1[@id="firstHeading"]/text()').get()
        episode_title = unidecode.unidecode(episode_title)
        
        # Only download list items from cast table. This includes html tags
        if episode_title != 'Sugarwood' and episode_title != 'BFF': # Sugarwood and BFF have a different structure
            chars_and_actors2 = response.xpath('//div[@class="mw-parser-output"]//ul').getall()[1]
        elif episode_title == 'Sugarwood':
            chars_and_actors2 = response.xpath('//div[@class="mw-parser-output"]//ul').getall()[3]
        elif episode_title == 'BFF':
            chars_and_actors2 = response.xpath('//div[@class="mw-parser-output"]//ul').getall()[0]
        
        # Remove html tags
        chars_and_actors2_regex = ["".join(x) for x in re.findall('href="/wiki/(.*?)"|title="(.*?)"|<li>(.*?)<a|Harner</a>(.*?)</li>|</a>(.*?)<small>', chars_and_actors2)]
        
        # Remove empty strings
        chars_and_actors2_clean = list(filter(None, chars_and_actors2_regex))
        
        # Remove duplicate elements and create [actor, character] list
        even_cleaner = []
        for i in chars_and_actors2_clean:
            i = i.replace('_', ' ').replace(' as ', '').replace('</a>', '').replace('<a href=\"/wiki/Roy Petty\" title=\"Roy Petty\">', '').rstrip()
            if i not in even_cleaner:
                even_cleaner.append(i)
        
        # Bring actor-character relationship into dictionary format
        zipped = list(zip(even_cleaner[0::2], even_cleaner[1::2]))
        actor_char = dict(zipped)
        
        yield {'title': {episode_title: {'actor': actor_char}}}
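
Alternatively, format b) can be produced directly in the spider by replacing the final yield in parse_episode with one flat item per actor-character pair, a minimal sketch:

        # Emit one flat item per actor-character pair (format b), so the
        # feed can go straight into json_normalize without unwrapping
        for actor, char in actor_char.items():
            yield {'title': episode_title, 'actors': {'actor': actor, 'char': char}}
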