How to write multiple nested JSON objects to a BigQuery table using Apache Beam (Python)

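I am using Python to write a complex collection of JSON objects from Dataflow to a BigQuery table. Creating the table schema by hand, as shown below, is too cumbersome because my JSON objects are nested several levels deep.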

from apache_beam.io.gcp.internal.clients import bigquery

table_schema = bigquery.TableSchema()

id_schema = bigquery.TableFieldSchema()
id_schema.name = 'ID'
id_schema.type = 'integer'
id_schema.mode = 'nullable'
table_schema.fields.append(id_schema)
...

So I tried the approach that was recommended. First, I ran the following command in the Cloud Console to get the schema:

bq --format=json show project:dataset.table > output_schema.json

Then I ran the following code to get the table schema:

table_schema = parse_table_schema_from_json(json.dumps(json.load(open("output_schema.json"))["schema"]))
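Before handing the exported schema to the pipeline, it can help to list every field with its type and mode, including the nested ones. The following is a minimal sketch in plain Python that does not use Beam. The helper name iter_schema_fields and the schema excerpt are assumptions made for illustration; the real content of output_schema.json will differ.

```python
def iter_schema_fields(fields, prefix=""):
    """Yield (dotted_path, type, mode) for every field, descending into RECORDs."""
    for field in fields:
        path = prefix + field["name"]
        # A field without an explicit mode is treated as NULLABLE.
        yield path, field["type"], field.get("mode", "NULLABLE")
        # The sub-fields of a RECORD are stored under its "fields" key.
        yield from iter_schema_fields(field.get("fields", []), path + ".")


# Hypothetical excerpt of the "schema" object from output_schema.json.
example_schema = {
    "fields": [
        {"name": "ID", "type": "INTEGER", "mode": "NULLABLE"},
        {
            "name": "SectionTokens",
            "type": "RECORD",
            "mode": "NULLABLE",
            "fields": [
                {
                    "name": "documents",
                    "type": "RECORD",
                    "mode": "NULLABLE",
                    "fields": [{"name": "id", "type": "STRING"}],
                },
                {"name": "modelVersion", "type": "STRING", "mode": "NULLABLE"},
            ],
        },
    ]
}

schema_rows = list(iter_schema_fields(example_schema["fields"]))
for path, field_type, mode in schema_rows:
    print(f"{path}: {field_type} ({mode})")
```

With the real file, the same function could be called on json.load(open("output_schema.json"))["schema"]["fields"].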

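This works exactly as expected. The table was originally created from a Jupyter notebook, where I could write to BigQuery using bigquery.LoadJobConfig with autodetect, without providing a schema.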

Now I am trying to write to BigQuery with this schema from an Apache Beam pipeline, but I get errors such as:

WARNING:apache_beam.io.gcp.bigquery:There were errors inserting to BigQuery. Will retry. Errors were [<InsertErrorsValueListEntry
 errors: [<ErrorProto
 debugInfo: ''
 location: 'sectiontokens.documents'
 message: 'Array specified for non-repeated field.'
 reason: 'invalid'>]
 index: 0>, <InsertErrorsValueListEntry
 errors: [<ErrorProto
 debugInfo: ''
 location: 'sectiontokens.errors'
 message: 'Array specified for non-repeated field.'
 reason: 'invalid'>]
 index: 1>, <InsertErrorsValueListEntry
 errors: [<ErrorProto
 debugInfo: ''
 location: 'sectiontokens.documents'
 message: 'Array specified for non-repeated field.'
 reason: 'invalid'>]
 index: 2>]

Here is some sample data:

{'ID': 123, 'SourceResourceID': 'Resource/3c81b4d2-3ee9-11eb-8bf6-0242ac100303', 'DocumentText': 'EXAM:  CT CHEST IC  \n\n\nPROCEDURE DATE:  12/11/2020  \n', 'DocumentName': 'CT CHEST IC', 'EncounterNumber': None, 'EncounterResourceID': 'Encounter/123', 'DocumentId': '123', 'DocumentDate': '2020-12-15 10:21:00 UTC', 'SectionTitle': 'physical_exam', 'SectionHeader': 'EXAM:', 'SectionText': 'EXAM:  CT CHEST IC  \n\n\nPROCEDURE DATE:  12/11/2020  \n \n\n\n', 'SectionTokens': {'documents': [{'id': '1', 'entities': [{'id': '0', 'offset': 7, 'length': 11, 'text': 'CT CHEST IC', 'category': 'ExaminationName', 'confidenceScore': 0.98, 'isNegated': False}]}], 'errors': [], 'modelVersion': '2020-09-03'}}

Can someone help me figure out what I am doing wrong? Thanks.

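In your schema, sectiontokens.documents and sectiontokens.errors are specified as type RECORD with a non-repeated mode. That means BigQuery expects a single record in each of those fields, but in your data both keys actually hold a list of objects.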
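The mismatch can be reproduced locally before the pipeline runs. The following is a minimal sketch, not a Beam or BigQuery API: find_mode_mismatches is a hypothetical helper, the schema excerpt is an assumed shape (including the "message" sub-field), and the record is trimmed from the sample data in the question. Keys are compared case-insensitively because the error log reports the paths in lowercase.

```python
def find_mode_mismatches(record, fields, prefix=""):
    """Return dotted paths where a list value meets a non-REPEATED field."""
    mismatches = []
    by_name = {field["name"].lower(): field for field in fields}
    for key, value in record.items():
        field = by_name.get(key.lower())
        if field is None:
            continue  # keys missing from the schema are out of scope here
        path = prefix + key
        repeated = field.get("mode", "NULLABLE").upper() == "REPEATED"
        if isinstance(value, list) and not repeated:
            mismatches.append(path)
        # Descend into nested records: a single dict, or each dict in a list.
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, dict):
                mismatches.extend(
                    find_mode_mismatches(child, field.get("fields", []), path + ".")
                )
    return mismatches


# Assumed schema excerpt: documents and errors are RECORDs that are not REPEATED.
schema_fields = [
    {"name": "ID", "type": "INTEGER", "mode": "NULLABLE"},
    {"name": "SectionTokens", "type": "RECORD", "mode": "NULLABLE", "fields": [
        {"name": "documents", "type": "RECORD", "mode": "NULLABLE", "fields": [
            {"name": "id", "type": "STRING", "mode": "NULLABLE"},
            {"name": "entities", "type": "RECORD", "mode": "REPEATED", "fields": [
                {"name": "id", "type": "STRING", "mode": "NULLABLE"},
                {"name": "text", "type": "STRING", "mode": "NULLABLE"},
            ]},
        ]},
        {"name": "errors", "type": "RECORD", "mode": "NULLABLE", "fields": [
            {"name": "message", "type": "STRING", "mode": "NULLABLE"},
        ]},
        {"name": "modelVersion", "type": "STRING", "mode": "NULLABLE"},
    ]},
]

# Record trimmed from the sample data in the question.
record = {
    "ID": 123,
    "SectionTokens": {
        "documents": [{"id": "1", "entities": [{"id": "0", "text": "CT CHEST IC"}]}],
        "errors": [],
        "modelVersion": "2020-09-03",
    },
}

mismatches = find_mode_mismatches(record, schema_fields)
print(mismatches)
```

The two reported paths correspond to the two locations named in the insert errors.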

If you want to define a column that accepts a list of objects, you need to have

"mode": "REPEATED"
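One way to apply this without editing output_schema.json by hand is to patch the loaded schema dictionary before it is serialized. The following is a minimal sketch under stated assumptions: mark_repeated is a hypothetical helper, and the schema excerpt (including the "message" sub-field) is illustrative only, not the real table schema.

```python
import copy
import json


def mark_repeated(schema, dotted_paths):
    """Return a copy of `schema` with each dotted field path set to REPEATED."""
    patched = copy.deepcopy(schema)
    for dotted in dotted_paths:
        fields = patched["fields"]
        field = None
        for name in dotted.split("."):
            # Case-insensitive lookup, since the error log uses lowercase paths.
            matches = [f for f in fields if f["name"].lower() == name.lower()]
            if not matches:
                raise KeyError(f"no field named {name!r} in path {dotted!r}")
            field = matches[0]
            fields = field.get("fields", [])
        field["mode"] = "REPEATED"
    return patched


# Assumed excerpt of the "schema" object from output_schema.json.
schema = {
    "fields": [
        {"name": "ID", "type": "INTEGER", "mode": "NULLABLE"},
        {"name": "SectionTokens", "type": "RECORD", "mode": "NULLABLE", "fields": [
            {"name": "documents", "type": "RECORD", "mode": "NULLABLE", "fields": [
                {"name": "id", "type": "STRING", "mode": "NULLABLE"},
            ]},
            {"name": "errors", "type": "RECORD", "mode": "NULLABLE", "fields": [
                {"name": "message", "type": "STRING", "mode": "NULLABLE"},
            ]},
        ]},
    ]
}

patched = mark_repeated(
    schema, ["sectiontokens.documents", "sectiontokens.errors"]
)
schema_json = json.dumps(patched)  # JSON string form of the patched schema
print(schema_json)
```

The resulting JSON string has the same form as the argument passed to parse_table_schema_from_json in the question. Note that changing the schema used by the pipeline does not alter a table that already exists with the old column modes.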
