Python: how to convert rough JSON output into a DataFrame
I have written a Scrapy script that creates a JSON file, but I am struggling to get it into a DataFrame. I am considering two approaches: a) process my output somehow and try to convert it into a df structure. I don't know how to convert it into a df with the structure described in b).
At this point, my JSON output looks like this, where the letters represent actors and the numbers represent characters:
[{"title": {"episode1": {"actor": {"x": "1", "y": "2", "z": "3"}}}},
{"title": {"episode2": {"actor": {"x": "1", "y": "2", "z": "3"}}}}]
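For what it's worth, format a) can also be flattened with a couple of loops before building a DataFrame; a minimal sketch (the `flatten_record` helper name is my own):

```python
def flatten_record(record):
    # record has the shape {"title": {"episode1": {"actor": {"x": "1", ...}}}}
    rows = []
    for episode, inner in record["title"].items():
        for actor, char in inner["actor"].items():
            rows.append({"title": episode, "actor": actor, "char": char})
    return rows

rows = flatten_record({"title": {"episode1": {"actor": {"x": "1", "y": "2"}}}})
print(rows)
# [{'title': 'episode1', 'actor': 'x', 'char': '1'},
#  {'title': 'episode1', 'actor': 'y', 'char': '2'}]
```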
b) Structure the JSON file as follows, which would make it very easy to convert to a df with `json_normalize`:
dictionary = [{'title':'episode1', 'actors': {'actor': 'x', 'char': 1}},
{'title':'episode1', 'actors': {'actor': 'y', 'char': 2}},
{'title':'episode1', 'actors': {'actor': 'z', 'char': 3}},
{'title':'episode2', 'actors': {'actor': 'x', 'char': 1}}]
which can be converted to a df in this format:
from pandas import json_normalize

df = json_normalize(dictionary)
df.columns = df.columns.map(lambda x: x.split(".")[-1])
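The `split(".")[-1]` mapping strips the `actors.` prefix that `json_normalize` prepends to nested keys, leaving plain column names. A quick illustration on plain strings:

```python
# Column names as json_normalize would produce them for the nested 'actors' dict
cols = ["title", "actors.actor", "actors.char"]

# Keep only the part after the last dot, mirroring df.columns.map(...)
renamed = [c.split(".")[-1] for c in cols]
print(renamed)  # ['title', 'actor', 'char']
```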
Any pointers on how to create a more useful JSON format would be greatly appreciated. Here is my code:
import unidecode
import scrapy
import re

class ActorsSpider(scrapy.Spider):
    name = 'actors'
    start_urls = ['https://ozark-netflix.fandom.com/wiki/Category:Episodes']

    def parse(self, response):
        episode_urls = response.xpath('//a[contains(@class, "category-page__member-link") and not(contains(@title, "Season"))]/@href')
        yield from response.follow_all(episode_urls, self.parse_episode)

    def parse_episode(self, response):
        episode_title = response.xpath('//h1[@id="firstHeading"]/text()').get()
        episode_title = unidecode.unidecode(episode_title)
        # Only download list items from the cast table. This includes html tags
        if episode_title != 'Sugarwood' and episode_title != 'BFF':  # Sugarwood and BFF have a different structure
            chars_and_actors2 = response.xpath('//div[@class="mw-parser-output"]//ul').getall()[1]
        elif episode_title == 'Sugarwood':
            chars_and_actors2 = response.xpath('//div[@class="mw-parser-output"]//ul').getall()[3]
        elif episode_title == 'BFF':
            chars_and_actors2 = response.xpath('//div[@class="mw-parser-output"]//ul').getall()[0]
        # Remove html tags
        chars_and_actors2_regex = ["".join(x) for x in re.findall('href="/wiki/(.*?)"|title="(.*?)"|<li>(.*?)<a|Harner</a>(.*?)</li>|</a>(.*?)<small>', chars_and_actors2)]
        # Remove empty strings
        chars_and_actors2_clean = list(filter(None, chars_and_actors2_regex))
        # Remove duplicate elements and create [actor, character] list
        even_cleaner = []
        for i in chars_and_actors2_clean:
            i = i.replace('_', ' ').replace(' as ', '').replace('</a>', '').replace('<a href="/wiki/Roy Petty" title="Roy Petty">', '').rstrip()
            if i not in even_cleaner:
                even_cleaner.append(i)
        # Bring the actor-character relationship into dictionary format
        zipped = list(zip(even_cleaner[0::2], even_cleaner[1::2]))
        actor_char = dict(zipped)
        yield {'title': {episode_title: {'actor': actor_char}}}
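One way to get format b) directly from the spider is to yield one flat record per actor instead of a single nested dict. A sketch, assuming `actor_char` is built as above (the `flatten_cast` helper name is my own):

```python
def flatten_cast(episode_title, actor_char):
    # Turn {'x': '1', 'y': '2'} into one flat record per actor,
    # matching the json_normalize-friendly structure in b)
    return [{'title': episode_title, 'actor': actor, 'char': char}
            for actor, char in actor_char.items()]

# Inside parse_episode, the final yield could then become:
#     for record in flatten_cast(episode_title, actor_char):
#         yield record
```

Yielding flat records means the resulting JSON is already a list of rows, so `json_normalize` (or even a plain `pd.DataFrame`) can consume it without any column renaming.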