Python: Is this the correct way to write an Apache Beam PCollection to multiple sinks?

I am building a pipeline that eventually writes to two sinks (MongoDB and BigQuery). I've included a snippet of the pipeline below that is giving me some trouble. Here is what happens: file contents (JSON objects) are read into a PCollection called `elements`, then a series of transforms is applied, producing another PCollection called `transformed`. This `transformed` PCollection is written to MongoDB without any issue. Now, before writing it to BigQuery, I apply one additional transform to the `transformed` PCollection. Here is where the error occurs when the pipeline is executed:

TypeError: Cannot convert ObjectId('5ee110559926384724ff5a83') to a JSON value. [while running 'WriteToBigQuery/_StreamToBigQuery/StreamInsertRows/ParDo(BigQueryWriteFn)']

I figured out that when I write to MongoDB, it automatically adds an '_id' attribute to every document it inserts (no problem there). But somehow, later, when I try to write to BigQuery, the elements in the `transformed` PCollection now carry this extra '_id' attribute. How strange is that? PCollections are supposed to be immutable, right?
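
For reference, the workaround I'm considering: copy each element and drop the Mongo-added '_id' before the BigQuery branch, so neither branch can see the other's mutation. This is only a minimal sketch (the `_strip_mongo_id` helper and the `transformed_for_bq` name are mine, and it assumes each element is a plain dict):

import copy

def _strip_mongo_id(doc):
    # Deep-copy first so the object shared with the Mongo branch stays untouched,
    # then drop the '_id' that the MongoDB write adds to each document.
    doc = copy.deepcopy(doc)
    doc.pop('_id', None)
    return doc

transformed_for_bq = transformed | 'StripMongoId' >> beam.Map(_strip_mongo_id)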

What I have tried so far: commenting out the part that writes to BigQuery to see what happens. When I do that, it successfully writes the `transformed` PCollection to MongoDB, but another strange error shows up:

Exception in thread Thread-18:
Traceback (most recent call last):
  File "/Users/user/anaconda3/envs/project/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/user/anaconda3/envs/project/lib/python3.7/threading.py", line 1177, in run
    self.function(*self.args, **self.kwargs)
  File "/Users/user/anaconda3/envs/project/lib/python3.7/site-packages/apache_beam/runners/direct/sdf_direct_runner.py", line 467, in initiate_checkpoint
    checkpoint_state.residual_restriction = tracker.checkpoint()
AttributeError: '_SDFBoundedSourceRestrictionTracker' object has no attribute 'checkpoint'

elements, files_read = (
    p
    | 'ReadFromGCS' >> beam.io.ReadFromTextWithFilename(file_pattern=file_pattern, coder=JsonCoder())
    | 'aTransformWithTaggedOutput' >> beam.ParDo(aTransform()).with_outputs(
        'taggedOutputFilesRead', main='elements')
)
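
(`JsonCoder` isn't defined in the snippet; it's a small custom coder roughly along these lines, sketched under the assumption of one JSON object per line of the input files:)

import json
import apache_beam as beam

class JsonCoder(beam.coders.Coder):
    # Encode dicts to JSON bytes, and decode each line read from the
    # file back into a dict.
    def encode(self, value):
        return json.dumps(value).encode('utf-8')

    def decode(self, value):
        return json.loads(value.decode('utf-8'))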

deferred_side_input_1 = beam.pvalue.AsIter(
    p
    | 'QueryFromBigQueryTable' >> beam.io.Read(beam.io.BigQuerySource(query='SELECT col1 from dataset.table'))
)

deferred_side_input_2 = beam.pvalue.AsIter(
    p
    | 'ReadFromBigQueryTable' >> beam.io.Read(beam.io.BigQuerySource(dataset='bq_dataset', table='bq_table'))
)

transformed, tagged_output = (
    elements
    | 'Series' >> beam.ParDo(aTransform())
    | 'of' >> beam.ParDo(anotherTransform())
    | 'transforms' >> beam.ParDo(anotherTransform())
    | '...(1)' >> beam.ParDo(anotherTransform())
    | '...(2)' >> beam.ParDo(anotherTransform())
    | '...(3)' >> beam.ParDo(anotherTransform(), deferred_side_input_1)
    | 'transformWithTaggedOutput' >> beam.ParDo(transformWithTaggedOutput(), deferred_side_input_2).with_outputs(
        'tagged_output', main='transformed')
)


"""Write `transformed` PCollection to MongoDB"""
transformed | 'WriteToMongo' >> beam.io.WriteToMongoDB(uri='mongoURI',
                                                       db='mongoDB',
                                                       coll='mongoCollection')

"""Perform an additional transform to `transformed` PCollection, Write to BigQuery"""
_ = (
transformed
| 'AdditionalTransform' >> beam.ParDo(additionalTransform())
| 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
                      table='bigqueryTable',
                      dataset='bigqueryDataset',
                      schema=beam.io.gcp.bigquery_tools.parse_table_schema_from_json(bq_schema),
                      create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                      write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                      validate=True)
            )
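
(`bq_schema` and `other_bq_schema` are JSON strings; for reference, this is the shape that `parse_table_schema_from_json` expects. The field names here are made up:)

bq_schema = '''
{"fields": [
    {"name": "col1", "type": "STRING", "mode": "NULLABLE"},
    {"name": "col2", "type": "INTEGER", "mode": "NULLABLE"}
]}
'''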

"""(No issues with this write to BQ) Write `tagged_output` PCollection to BigQuery"""
    tagged_output | 'WriteTaggedOutputToBigQuery' >> beam.io.WriteToBigQuery(
        table='other_bq_table,
        dataset='bq_dataset,
        schema=beam.io.gcp.bigquery_tools.parse_table_schema_from_json(other_bq_schema),
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        validate=True)

@Pablo, could I borrow your expertise?