Python: Is this the correct way to write an Apache Beam PCollection to multiple sinks?
I'm building a pipeline that ultimately writes to two sinks (MongoDB and BigQuery). I've included a snippet of the pipeline below, and it's giving me some trouble. Here's what happens: the contents of files (JSON objects) are read into a PCollection called 'elements', then a series of transforms is applied, producing another PCollection called 'transformed'. This 'transformed' PCollection is written to MongoDB without any issues. Then, before writing it to BigQuery, I apply one additional transform to the 'transformed' PCollection. This is where the error occurs when the pipeline executes:

TypeError: Could not convert ObjectId('5ee110559926384724ff5a83') to JSON value. [while running 'WriteToBigQuery/_StreamToBigQuery/StreamInsertRows/ParDo(BigQueryWriteFn)']
I found that when writing to MongoDB, it automatically adds an '_id' attribute to every document it inserts (no problem there). But somehow, by the time I try to write to BigQuery, the elements in the 'transformed' PCollection now carry this extra '_id' attribute. How strange is that? PCollections are supposed to be immutable, right?
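A minimal sketch of one way to test this (an assumption on my part, with an illustrative 'CopyForMongo' label): deep-copy each element just before the MongoDB sink, so that even if WriteToMongoDB adds '_id' to the dicts in place, the objects shared with the BigQuery branch stay untouched.

import copy
import apache_beam as beam

# Assumption: WriteToMongoDB mutates its input dicts in place by adding '_id'.
# Copying first means the sink only ever sees (and mutates) the copies.
(transformed
 | 'CopyForMongo' >> beam.Map(copy.deepcopy)
 | 'WriteToMongo' >> beam.io.WriteToMongoDB(uri='mongoURI',
                                            db='mongoDB',
                                            coll='mongoCollection'))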
What I've tried so far: commenting out the part that writes to BigQuery to see what happens. When I do that, it successfully writes the 'transformed' PCollection to MongoDB, but a different strange error appears:
Exception in thread Thread-18:
Traceback (most recent call last):
  File "/Users/user/anaconda3/envs/project/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/user/anaconda3/envs/project/lib/python3.7/threading.py", line 1177, in run
    self.function(*self.args, **self.kwargs)
  File "/Users/user/anaconda3/envs/project/lib/python3.7/site-packages/apache_beam/runners/direct/sdf_direct_runner.py", line 467, in initiate_checkpoint
    checkpoint_state.residual_restriction = tracker.checkpoint()
AttributeError: '_SDFBoundedSourceRestrictionTracker' object has no attribute 'checkpoint'
elements, files_read = (
    p
    | 'ReadFromGCS' >> beam.io.ReadFromTextWithFilename(file_pattern=file_pattern, coder=JsonCoder())
    | 'aTransformWithTaggedOutput' >> beam.ParDo(aTransform()).with_outputs(
        'taggedOutputFilesRead', main='elements')
)
deferred_side_input_1 = beam.pvalue.AsIter(
    p
    | 'QueryFromBigQueryTable' >> beam.io.Read(beam.io.BigQuerySource(query='SELECT col1 from dataset.table'))
)
deferred_side_input_2 = beam.pvalue.AsIter(
    p
    | 'ReadFromBigQueryTable' >> beam.io.Read(beam.io.BigQuerySource(dataset='bq_dataset', table='bq_table'))
)
transformed, tagged_output = (
    elements
    | 'Series' >> beam.ParDo(aTransform())
    | 'of' >> beam.ParDo(anotherTransform())
    | 'transforms' >> beam.ParDo(anotherTransform())
    | '...' >> beam.ParDo(anotherTransform())
    | '...' >> beam.ParDo(anotherTransform())
    | '...' >> beam.ParDo(anotherTransform(), deferred_side_input_1)
    | 'transformWithTaggedOutput' >> beam.ParDo(transformWithTaggedOutput(), deferred_side_input_2).with_outputs(
        'tagged_output', main='transformed')
)
"""Write `transformed` PCollection to MongoDB"""
transformed | 'WriteToMongo' >> beam.io.WriteToMongoDB(uri='mongoURI',
                                                       db='mongoDB',
                                                       coll='mongoCollection')
"""Perform an additional transform to `transformed` PCollection, Write to BigQuery"""
_ = (
    transformed
    | 'AdditionalTransform' >> beam.ParDo(additionalTransform())
    | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
        table='bigqueryTable',
        dataset='bigqueryDataset',
        schema=beam.io.gcp.bigquery_tools.parse_table_schema_from_json(bq_schema),
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        validate=True)
)
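An alternative sketch (again an assumption on my part, with a hypothetical drop_mongo_id helper): strip the Mongo-generated '_id' before each row reaches the BigQuery sink, since an ObjectId is not a JSON-serializable value.

def drop_mongo_id(row):
    # Return a copy of the row without '_id' so the BigQuery sink never
    # receives an ObjectId value.
    return {k: v for k, v in row.items() if k != '_id'}

# Would slot in ahead of 'AdditionalTransform' in the branch above:
#   transformed | 'DropMongoId' >> beam.Map(drop_mongo_id) | ...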
"""(No issues with this write to BQ) Write `tagged_output` PCollection to BigQuery"""
tagged_output | 'WriteTaggedOutputToBigQuery' >> beam.io.WriteToBigQuery(
    table='other_bq_table',
    dataset='bq_dataset',
    schema=beam.io.gcp.bigquery_tools.parse_table_schema_from_json(other_bq_schema),
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    validate=True)
@Pablo could you lend your expertise here?