Python: Reading sub-level JSON data with Pandas
I'm stuck trying to read sub-level data with Pandas.
Background:
I downloaded a series of data using the NYT Archive API and stored it in a JSON file, which actually contains a list of JSON objects.
Procedure:
I read the JSON file with the read_json method:
pandas_df = pd.read_json("data.json")
When I look at a sample of the result using head, it looks like this:
pandas_df.head()
copyright \
0 Copyright (c) 2013 The New York Times Company....
1 Copyright (c) 2013 The New York Times Company....
2 Copyright (c) 2013 The New York Times Company....
3 Copyright (c) 2013 The New York Times Company....
4 Copyright (c) 2013 The New York Times Company....
response
0 {'docs': [{'subsection_name': None, 'slideshow...
1 {'docs': [{'subsection_name': None, 'slideshow...
2 {'docs': [{'subsection_name': None, 'slideshow...
3 {'docs': [{'subsection_name': None, 'slideshow...
4 {'docs': [{'subsection_name': None, 'slideshow...
I only need the information inside response, so I changed the code as follows:
print(pandas_df["response"].head())
0 {'docs': [{'subsection_name': None, 'slideshow...
1 {'docs': [{'subsection_name': None, 'slideshow...
2 {'docs': [{'subsection_name': None, 'slideshow...
3 {'docs': [{'subsection_name': None, 'slideshow...
4 {'docs': [{'subsection_name': None, 'slideshow...
Name: response, dtype: object
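Question:
How can I get at the data for the elements inside docs, such as subsection_name, slideshow_credits, and so on? Can I view it in tabular form, for example as a DataFrame?
Let me know if you need more information.
Thanks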
Edit 1:
Adding the first element from the JSON file. The file is very large, about 1 GB.
{
"copyright": "Copyright (c) 2013 The New York Times Company. All Rights Reserved.",
"response": {
"meta": {
"hits": 7652
},
"docs": [
{
"web_url": "http://www.nytimes.com/interactive/2016/technology/personaltech/cord-cutting-guide.html",
"snippet": "We teamed up with The Wirecutter to come up with cord-cutter bundles for movie buffs, sports addicts, fans of premium TV shows, binge watchers and families with children.",
"lead_paragraph": "We teamed up with The Wirecutter to come up with cord-cutter bundles for movie buffs, sports addicts, fans of premium TV shows, binge watchers and families with children.",
"abstract": null,
"print_page": null,
"blog": [],
"source": "The New York Times",
"multimedia": [
{
"width": 190,
"url": "images/2016/10/13/business/13TECHFIX/06TECHFIX-thumbWide.jpg",
"height": 126,
"subtype": "wide",
"legacy": {
"wide": "images/2016/10/13/business/13TECHFIX/06TECHFIX-thumbWide.jpg",
"wideheight": "126",
"widewidth": "190"
},
"type": "image"
},
{
"width": 600,
"url": "images/2016/10/13/business/13TECHFIX/06TECHFIX-articleLarge.jpg",
"height": 346,
"subtype": "xlarge",
"legacy": {
"xlargewidth": "600",
"xlarge": "images/2016/10/13/business/13TECHFIX/06TECHFIX-articleLarge.jpg",
"xlargeheight": "346"
},
"type": "image"
},
{
"width": 75,
"url": "images/2016/10/13/business/13TECHFIX/06TECHFIX-thumbStandard.jpg",
"height": 75,
"subtype": "thumbnail",
"legacy": {
"thumbnailheight": "75",
"thumbnail": "images/2016/10/13/business/13TECHFIX/06TECHFIX-thumbStandard.jpg",
"thumbnailwidth": "75"
},
"type": "image"
}
],
"headline": {
"main": "The Definitive Guide to Cord-Cutting in 2016, Based on Your Habits",
"kicker": "Tech Fix"
},
"keywords": [
{
"rank": "1",
"is_major": "N",
"name": "subject",
"value": "Video Recordings, Downloads and Streaming"
},
{
"rank": "2",
"is_major": "N",
"name": "subject",
"value": "Television Sets and Media Devices"
},
{
"rank": "1",
"is_major": "Y",
"name": "subject",
"value": "Television"
}
],
"pub_date": "2016-01-01T05:00:00Z",
"document_type": "multimedia",
"news_desk": "Technology / Personal Tech",
"section_name": "Technology",
"subsection_name": "Personal Tech",
"byline": {
"person": [
{
"firstname": "Brian",
"middlename": "X.",
"lastname": "CHEN",
"rank": 1,
"role": "reported",
"organization": ""
}
],
"original": "By BRIAN X. CHEN"
},
"type_of_material": "Interactive Feature",
"_id": "57fdfb9895d0e022439c2b57",
"word_count": null,
"slideshow_credits": null
}]}}
You should be able to extract all of the elements under the docs list, which is nested in the response dictionary, into a DataFrame:
import json
import pandas as pd

with open('data.json') as f:
    data = json.load(f)
# One row per entry in the docs list
df = pd.DataFrame(data['response']['docs'])
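A DataFrame built this way still holds nested values in some columns: in the sample above, headline and byline are dictionaries. As a rough sketch, pd.json_normalize (available in pandas 1.0 and later) can expand nested dictionaries into dotted column names. The docs list below is a trimmed, hand-copied stand-in for data['response']['docs'], not the real file.

```python
import pandas as pd

# Trimmed stand-in for data['response']['docs'];
# the values are copied from the sample element in the question.
docs = [
    {
        "section_name": "Technology",
        "subsection_name": "Personal Tech",
        "headline": {
            "main": "The Definitive Guide to Cord-Cutting in 2016, Based on Your Habits",
            "kicker": "Tech Fix",
        },
        "byline": {"original": "By BRIAN X. CHEN"},
        "word_count": None,
    }
]

# Nested dictionaries become columns named "<parent>.<child>",
# for example "headline.main" and "byline.original".
flat_df = pd.json_normalize(docs)
print(sorted(flat_df.columns))
```

List-valued fields such as multimedia and keywords are not expanded by this call; they stay as Python lists inside their cells.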
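Could you post the entire raw JSON for the first few rows?
Added, please take a look. I want to read most of the values in "docs".
The last line gives me an error: TypeError: list indices must be integers or slices, not str. Do you know why that happens? Is it because I'm reading a file that contains multiple JSON objects in a list?
I modified the JSON input slightly by adding a closing bracket and two closing braces. Copy that exact JSON directly into a file and run my code again. It should work.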
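The TypeError quoted above is what Python raises when a list is indexed with a string, which would be consistent with data.json holding a list of response objects rather than a single one. A minimal sketch under that assumption follows. docs_to_frame is a hypothetical helper name, and the inline sample is made-up placeholder data standing in for the result of json.load.

```python
import pandas as pd


def docs_to_frame(data):
    """Collect every entry of response['docs'] into one DataFrame.

    Accepts either a single API response (dict) or a list of them.
    """
    responses = [data] if isinstance(data, dict) else data
    rows = []
    for item in responses:
        # Each response carries its own list of article records.
        rows.extend(item["response"]["docs"])
    return pd.DataFrame(rows)


# Placeholder data shaped like the file described in the question.
sample = [
    {"copyright": "...", "response": {"docs": [
        {"_id": "a", "section_name": "Technology"},
    ]}},
    {"copyright": "...", "response": {"docs": [
        {"_id": "b", "section_name": "Sports"},
        {"_id": "c", "section_name": None},
    ]}},
]

df = docs_to_frame(sample)
print(df.shape)
```

With the real file, the same helper would be called on the object returned by json.load(f). For a file of about 1 GB this loads everything into memory at once, so it may need a machine with enough RAM.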