Python: How to load nested JSON data into BigQuery
Tags: python, json, google-cloud-platform, google-bigquery
I am trying to load JSON data from an API into a BigQuery table on GCP, but I have run into a problem: the JSON data seems to be missing a square bracket, so the insert fails with the error "Repeated record with name trip_update added outside of an array." I do not know how to fix this. Here is a sample of the data:
{
  "header": {
    "gtfs_realtime_version": "1.0",
    "timestamp": 1607630971
  },
  "entity": [
    {
      "id": "65.5.17-120-cm1-1.18.O",
      "trip_update": {
        "trip": {
          "trip_id": "65.5.17-120-cm1-1.18.O",
          "start_time": "18:00:00",
          "start_date": "20201210",
          "schedule_relationship": "SCHEDULED",
          "route_id": "17-120-cm1-1"
        },
        "stop_time_update": [
          {
            "stop_sequence": 1,
            "departure": {
              "delay": 0
            },
            "stop_id": "8220B1351201",
            "schedule_relationship": "SCHEDULED"
          },
          {
            "stop_sequence": 23,
            "arrival": {
              "delay": 2340
            },
            "departure": {
              "delay": 2340
            },
            "stop_id": "8260B1025301",
            "schedule_relationship": "SCHEDULED"
          }
        ]
      }
    }
  ]
}
Here are the schema and the code:
Schema
Function (following the Google guide)
Your schema definition is wrong: trip_update is not a repeated struct but a nullable record (or a required one, but in either case not repeated).
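To see where a schema and a row disagree, a small stdlib-only check along the following lines might help. It is only a sketch: `find_mode_mismatches` is a hypothetical helper, not part of any BigQuery library, and the schema and row below are trimmed-down versions of the ones in the question.

```python
def find_mode_mismatches(schema, row, path=""):
    """Return paths where the row's shape disagrees with the schema's mode."""
    problems = []
    for field in schema:
        name = field["name"]
        full = f"{path}.{name}" if path else name
        if name not in row or row[name] is None:
            continue
        value = row[name]
        # A field without an explicit mode is treated as NULLABLE.
        repeated = field.get("mode", "NULLABLE").upper() == "REPEATED"
        if repeated and not isinstance(value, list):
            problems.append(f"{full}: declared REPEATED but value is not an array")
            continue
        if not repeated and isinstance(value, list):
            problems.append(f"{full}: value is an array but mode is not REPEATED")
            continue
        if field["type"].lower() in ("record", "struct"):
            children = value if repeated else [value]
            for child in children:
                problems.extend(find_mode_mismatches(field["fields"], child, full))
    # Drop duplicates (several entities may share one problem), keep order.
    return list(dict.fromkeys(problems))


def make_schema(trip_update_mode):
    """Trimmed-down schema from the question with a configurable mode."""
    return [{
        "name": "entity", "type": "record", "mode": "REPEATED",
        "fields": [
            {"name": "id", "type": "string"},
            {"name": "trip_update", "type": "record", "mode": trip_update_mode,
             "fields": [
                 {"name": "trip", "type": "record", "mode": trip_update_mode,
                  "fields": [{"name": "trip_id", "type": "string"}]},
             ]},
        ],
    }]


row = {"entity": [{"id": "x", "trip_update": {"trip": {"trip_id": "x"}}}]}

broken = find_mode_mismatches(make_schema("REPEATED"), row)
fixed = find_mode_mismatches(make_schema("NULLABLE"), row)
print(broken)
print(fixed)
```

With the REPEATED modes from the question the check flags entity.trip_update; once both records are declared NULLABLE, the same row passes.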
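One limitation of loading JSON data into BigQuery is that it does not support maps or dictionaries in JSON. I think the "trip_update" and "trip" fields must contain an array of values (indicated by square brackets), the same as "stop_time_update". I am not sure whether that alone is enough to load your data cleanly.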
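If the schema is to stay as it is and the API format cannot be changed, one option is to wrap the offending objects in arrays before inserting. The following is a minimal sketch of that idea, not a tested fix for the full feed: `wrap_repeated` is a hypothetical helper, and the schema here is a shortened stand-in for the one in the question.

```python
def wrap_repeated(schema, row):
    """Return a copy of row in which every REPEATED field holds a list."""
    out = dict(row)  # keys that the schema does not mention are kept unchanged
    for field in schema:
        name = field["name"]
        if name not in row or row[name] is None:
            continue
        value = row[name]
        repeated = field.get("mode", "NULLABLE").upper() == "REPEATED"
        is_record = field["type"].lower() in ("record", "struct")
        if repeated:
            # A bare object or scalar becomes a one-element array.
            items = value if isinstance(value, list) else [value]
            out[name] = [
                wrap_repeated(field["fields"], item) if is_record else item
                for item in items
            ]
        elif is_record and isinstance(value, dict):
            out[name] = wrap_repeated(field["fields"], value)
    return out


# Shortened stand-in for the schema in the question (modes as the asker wrote them).
schema = [{
    "name": "entity", "type": "record", "mode": "REPEATED",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "trip_update", "type": "record", "mode": "REPEATED",
         "fields": [
             {"name": "trip", "type": "record", "mode": "REPEATED",
              "fields": [{"name": "trip_id", "type": "string"}]},
         ]},
    ],
}]

api_row = {"entity": [{"id": "x", "trip_update": {"trip": {"trip_id": "x"}}}]}
wrapped = wrap_repeated(schema, api_row)
print(wrapped)
```

Changing the schema to NULLABLE records is usually the simpler route; wrapping the data is only worth considering when the table schema is fixed.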
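Your sample row has many line breaks in the middle of the JSON record. When you load data from a JSON file, the records must be newline-delimited: BigQuery expects a newline-delimited JSON file to contain exactly one record per line (the parser tries to interpret each line as a separate JSON record).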
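For a file-based load, the pretty-printed response can be collapsed to one record per line with the standard json module. This is a small sketch assuming the whole feed is one table row, as in the asker's schema; `to_ndjson` is a hypothetical helper name. Note that the function in the question passes already-parsed dicts to insert_rows_json, where line breaks in the source text no longer matter.

```python
import json


def to_ndjson(rows):
    """Serialize an iterable of dicts as newline-delimited JSON, one record per line."""
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in rows) + "\n"


# A pretty-printed response, shortened from the sample in the question.
pretty = """{
  "header": {
    "gtfs_realtime_version": "1.0",
    "timestamp": 1607630971
  },
  "entity": [
    {"id": "a"},
    {"id": "b"}
  ]
}"""

feed = json.loads(pretty)
ndjson = to_ndjson([feed])  # the whole feed becomes a single line
print(ndjson, end="")
```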
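This is what your JSON data file should look like.
Yes, the JSON data is missing a square bracket after trip_update, but that is the original format I receive from a public API. So I am looking for a solution that can read the given format.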
[
  { "name": "header",
    "type": "record",
    "fields": [
      { "name": "gtfs_realtime_version",
        "type": "string",
        "description": "version of speed specification"
      },
      { "name": "timestamp",
        "type": "integer",
        "description": "The moment where this dataset was generated on server e.g. 1593102976"
      }
    ]
  },
  { "name": "entity",
    "type": "record",
    "mode": "REPEATED",
    "description": "Multiple entities can be included in the feed",
    "fields": [
      { "name": "id",
        "type": "string",
        "description": "unique identifier for the entity"
      },
      { "name": "trip_update",
        "type": "struct",
        "mode": "REPEATED",
        "description": "Data about the realtime departure delays of a trip. At least one of the fields trip_update, vehicle, or alert must be provided - all these fields cannot be empty.",
        "fields": [
          { "name": "trip",
            "type": "record",
            "mode": "REPEATED",
            "fields": [
              { "name": "trip_id",
                "type": "string",
                "description": "selects which GTFS entity (trip) will be affected"
              },
              { "name": "start_time",
                "type": "string",
                "description": "The initially scheduled start time of this trip instance 13:30:00"
              },
              { "name": "start_date",
                "type": "string",
                "description": "The start date of this trip instance in YYYYMMDD format. Whether start_date is required depends on the type of trip: e.g. 20200625"
              },
              { "name": "schedule_relationship",
                "type": "string",
                "description": "The relation between this trip and the static schedule e.g. SCHEDULED"
              },
              { "name": "route_id",
                "type": "string",
                "description": "The route_id from the GTFS feed that this selector refers to e.g. 10-263-e16-1"
              }
            ]
          }
        ]
      },
      { "name": "stop_time_update",
        "type": "record",
        "mode": "REPEATED",
        "description": "Updates to StopTimes for the trip (both future, i.e., predictions, and in some cases, past ones, i.e., those that already happened). The updates must be sorted by stop_sequence, and apply for all the following stops of the trip up to the next specified stop_time_update. At least one stop_time_update must be provided for the trip unless the trip.schedule_relationship is CANCELED - if the trip is canceled, no stop_time_updates need to be provided.",
        "fields": [
          { "name": "stop_sequence",
            "type": "string",
            "description": "Must be the same as in stop_times.txt in the corresponding GTFS feed e.g 3"
          },
          { "name": "arrival",
            "type": "record",
            "mode": "REPEATED",
            "fields": [
              { "name": "delay",
                "type": "string",
                "description": "Delay (in seconds) can be positive (meaning that the vehicle is late) or negative (meaning that the vehicle is ahead of schedule). Delay of 0 means that the vehicle is exactly on time e.g 5"
              }
            ]
          },
          { "name": "departure",
            "type": "record",
            "mode": "REPEATED",
            "fields": [
              { "name": "delay",
                "type": "integer"
              }
            ]
          },
          { "name": "stop_id",
            "type": "string",
            "description": "Must be the same as in stops.txt in the corresponding GTFS feed e.g. 8430B2552301"
          },
          { "name": "schedule_relationship",
            "type": "string",
            "description": "The relation between this StopTime and the static schedule e.g. SCHEDULED , SKIPPED or NO_DATA"
          }
        ]
      }
    ]
  }
]
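import json

from google.api_core import retry

# CS (Cloud Storage client), BQ (BigQuery client), BQ_DATASET, BQ_TABLE and
# BigQueryError are assumed to be defined elsewhere in the module.
def _insert_into_bigquery(bucket_name, file_name):
    blob = CS.get_bucket(bucket_name).blob(file_name)
    row = json.loads(blob.download_as_string())
    table = BQ.dataset(BQ_DATASET).table(BQ_TABLE)
    # insert_rows_json expects a sequence of row dicts, so this call assumes
    # the file decodes to a list of rows.
    errors = BQ.insert_rows_json(table,
                                 json_rows=row,
                                 ignore_unknown_values=True,
                                 retry=retry.Retry(deadline=30))
    if errors != []:
        raise BigQueryError(errors)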
{"name": "trip_update",
"type": "record",
"mode": "NULLABLE",
"trip_update": [
{
"trip": [
{
"trip_id