Python: loading JSON from a URL

python json pandas google-cloud-platform

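I am preparing learning materials for my students. For convenience, I would like to access the data from a URL instead of asking them to download it in advance. In this example, I am trying to access the Quick, Draw! dataset from Google.

Below is a working example that reads the locally stored data, with the results shown in comments: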

import pandas as pd
import os
import json
from glob import glob

# Convert top row to one dict
top_row_dict = lambda in_df: list(in_df.head(1).T.to_dict().values())[0]
# Load file from computer
base_dir = os.path.join('input', 'quickdraw_simplified')
obj_files = glob(os.path.join(base_dir, '*.ndjson'))
print(obj_files[0])
# input\quickdraw_simplified\full_simplified_bird.ndjson

c_json = pd.read_json(obj_files[0], lines = True, chunksize = 1)
# <pandas.io.json._json.JsonReader at 0x158ae631f10>

f_row = next(c_json)
# word  countrycode     timestamp   recognized  key_id  drawing
# 0     bird    US  2017-03-09 00:28:55.637750+00:00    True    4926006882205696    [[[0, 11, 23, 50, 72, 96, 97, 132, 158, 224, 2...

f_dict = top_row_dict(f_row)
# {'word': 'bird',
#  'countrycode': 'US',
#  'timestamp': Timestamp('2017-03-09 00:28:55.637750+0000', tz='UTC'),
#  'recognized': True,
#  'key_id': 4926006882205696,
#  'drawing': [[[0, 11, 23, 50, 72, 96, 97, 132, 158, 224, 255],
#    [22, 9, 2, 0, 26, 45, 71, 40, 27, 10, 9]]]}

However, when I try the same approach with a URL, it fails:

import pandas as pd
import json

top_row_dict = lambda in_df: list(in_df.head(1).T.to_dict().values())[0]

url = 'https://console.cloud.google.com/storage/browser/quickdraw_dataset/full/simplified/bird.ndjson'
# Load dataset
c_json = pd.read_json(url, lines = True, chunksize = 1)
# <pandas.io.json._json.JsonReader at 0x24980a20a90>
f_row = next(c_json)
# __
f_dict = top_row_dict(f_row)
# IndexError: list index out of range
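
Note that the IndexError is raised by top_row_dict rather than by read_json: the first chunk comes back as an empty DataFrame (the "__" above), so to_dict() yields an empty dict and indexing [0] on its values fails. A small sketch of a more defensive helper makes that failure explicit. The name first_row_as_dict is hypothetical and not part of the original code:

```python
import pandas as pd


def first_row_as_dict(in_df: pd.DataFrame) -> dict:
    """Return the first row of in_df as a plain dict.

    Hypothetical replacement for top_row_dict that fails with a clear
    message when the chunk has no rows.
    """
    if in_df.empty:
        raise ValueError("chunk has no rows; the source may not be NDJSON")
    # iloc[0] selects the first row as a Series keyed by column name.
    return in_df.iloc[0].to_dict()
```

With this helper, an empty chunk raises a ValueError that points at the data source instead of a bare IndexError.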

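The URL you are trying to use requires a login, because it links to the Cloud Console.

However, the dataset is stored in a publicly accessible Google Cloud Storage bucket.

This means you can use the google-cloud-storage package to load the file directly from the bucket.

For example: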

from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('quickdraw_dataset')
blob = bucket.get_blob('full/simplified/bird.ndjson')
c_json = pd.read_json(blob, lines=True, chunksize=1)
...
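
Because the bucket is public, the object should also be reachable over plain HTTPS without any client library. The sketch below assumes the usual public URL pattern https://storage.googleapis.com/<bucket>/<object>. The helper names public_gcs_url, first_record, and first_record_from_url are made up for illustration. It reads only the first line with the standard library, which avoids fetching the whole file just to inspect one record:

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def public_gcs_url(bucket: str, blob_name: str) -> str:
    """Build the HTTPS URL of a public object (assumed URL pattern)."""
    # quote() leaves '/' untouched, so nested object names stay readable.
    return f"https://storage.googleapis.com/{bucket}/{quote(blob_name)}"


def first_record(lines) -> dict:
    """Parse the first non-empty NDJSON line from an iterable of lines."""
    for line in lines:
        if line.strip():
            return json.loads(line)
    raise ValueError("no records found")


def first_record_from_url(url: str) -> dict:
    """Stream the HTTP response and stop after the first record."""
    with urlopen(url) as response:
        return first_record(response)
```

Calling first_record_from_url(public_gcs_url('quickdraw_dataset', 'full/simplified/bird.ndjson')) should then return a dict with the same keys as f_dict in the question, except that the timestamp stays a plain string. Passing the same URL to pd.read_json(url, lines=True, chunksize=1) is expected to work as well, since it points at the raw file rather than at the console page.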

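Unfortunately, it is still asking for credentials. :( DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see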
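
That error appears because storage.Client() looks for Application Default Credentials. For public data, the google-cloud-storage library also provides storage.Client.create_anonymous_client(), which should skip the credentials lookup. The following is a minimal sketch, assuming a reasonably recent version of the library (blob.open is not available in old releases). The helper names make_anonymous_client and first_record_from_bucket are hypothetical:

```python
import json


def make_anonymous_client():
    """Create a client that sends unauthenticated requests."""
    # Imported lazily so this module loads even without the package installed.
    from google.cloud import storage

    return storage.Client.create_anonymous_client()


def first_record_from_bucket(client, bucket_name: str, blob_name: str) -> dict:
    """Read the first NDJSON record of an object through the given client."""
    # bucket() and blob() only build local references; no request is sent yet.
    blob = client.bucket(bucket_name).blob(blob_name)
    # Text-mode handle that fetches the object in chunks as it is read.
    with blob.open("r") as handle:
        return json.loads(handle.readline())
```

A call such as first_record_from_bucket(make_anonymous_client(), 'quickdraw_dataset', 'full/simplified/bird.ndjson') should then work without setting GOOGLE_APPLICATION_CREDENTIALS, provided the bucket allows anonymous reads.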