Python: Is this the correct way to write an Apache Beam PCollection to multiple sinks?
I'm building a pipeline that ultimately writes to two sinks (MongoDB and BigQuery). I've included a snippet of the pipeline below, and it's giving me some trouble. Here's what happens: the contents of files (JSON objects) are read into a PCollection called 'elements', then a series of transforms is applied, producing another PCollection called 'transformed'. This 'transformed' PCollection is written to MongoDB without any issues. Then, before writing it to BigQuery, I apply one additional transform to the 'transformed' PCollection. This is where the error occurs when the pipeline executes:

TypeError: Could not convert ObjectId('5ee110559926384724ff5a83') to JSON value. [while running 'WriteToBigQuery/_StreamToBigQuery/StreamInsertRows/ParDo(BigQueryWriteFn)']
I found that when writing to MongoDB, it automatically adds an '_id' attribute to every document it inserts (no problem there). But somehow, by the time I try to write to BigQuery, the elements in the 'transformed' PCollection now carry this extra '_id' attribute. How strange is that? PCollections are supposed to be immutable, right?
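A minimal sketch of one way to test this (an assumption on my part, with an illustrative 'CopyForMongo' label): deep-copy each element just before the MongoDB sink, so that even if WriteToMongoDB adds '_id' to the dicts in place, the objects shared with the BigQuery branch stay untouched.

import copy
import apache_beam as beam

# Assumption: WriteToMongoDB mutates its input dicts in place by adding '_id'.
# Copying first means the sink only ever sees (and mutates) the copies.
(transformed
 | 'CopyForMongo' >> beam.Map(copy.deepcopy)
 | 'WriteToMongo' >> beam.io.WriteToMongoDB(uri='mongoURI',
                                            db='mongoDB',
                                            coll='mongoCollection'))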
What I've tried so far: commenting out the part that writes to BigQuery to see what happens. When I do that, it successfully writes the 'transformed' PCollection to MongoDB, but a different strange error appears:
Exception in thread Thread-18:
Traceback (most recent call last):
  File "/Users/user/anaconda3/envs/project/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/user/anaconda3/envs/project/lib/python3.7/threading.py", line 1177, in run
    self.function(*self.args, **self.kwargs)
  File "/Users/user/anaconda3/envs/project/lib/python3.7/site-packages/apache_beam/runners/direct/sdf_direct_runner.py", line 467, in initiate_checkpoint
    checkpoint_state.residual_restriction = tracker.checkpoint()
AttributeError: '_SDFBoundedSourceRestrictionTracker' object has no attribute 'checkpoint'
elements, files_read = (
    p
    | 'ReadFromGCS' >> beam.io.ReadFromTextWithFilename(file_pattern=file_pattern, coder=JsonCoder())
    | 'aTransformWithTaggedOutput' >> beam.ParDo(aTransform()).with_outputs(
        'taggedOutputFilesRead', main='elements')
)
deferred_side_input_1 = beam.pvalue.AsIter(
    p
    | 'QueryFromBigQueryTable' >> beam.io.Read(beam.io.BigQuerySource(query='SELECT col1 from dataset.table'))
)
deferred_side_input_2 = beam.pvalue.AsIter(
    p
    | 'ReadFromBigQueryTable' >> beam.io.Read(beam.io.BigQuerySource(dataset='bq_dataset', table='bq_table'))
)
transformed, tagged_output = (
    elements
    | 'Series' >> beam.ParDo(aTransform())
    | 'of' >> beam.ParDo(anotherTransform())
    | 'transforms' >> beam.ParDo(anotherTransform())
    | '...' >> beam.ParDo(anotherTransform())
    | '...' >> beam.ParDo(anotherTransform())
    | '...' >> beam.ParDo(anotherTransform(), deferred_side_input_1)
    | 'transformWithTaggedOutput' >> beam.ParDo(transformWithTaggedOutput(), deferred_side_input_2).with_outputs(
        'tagged_output', main='transformed')
)
"""Write `transformed` PCollection to MongoDB"""
transformed | 'WriteToMongo' >> beam.io.WriteToMongoDB(uri='mongoURI',
                                                       db='mongoDB',
                                                       coll='mongoCollection')
"""Perform an additional transform to `transformed` PCollection, Write to BigQuery"""
_ = (
    transformed
    | 'AdditionalTransform' >> beam.ParDo(additionalTransform())
    | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
        table='bigqueryTable',
        dataset='bigqueryDataset',
        schema=beam.io.gcp.bigquery_tools.parse_table_schema_from_json(bq_schema),
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        validate=True)
)
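An alternative sketch (again an assumption on my part, with a hypothetical drop_mongo_id helper): strip the Mongo-generated '_id' before each row reaches the BigQuery sink, since an ObjectId is not a JSON-serializable value.

def drop_mongo_id(row):
    # Return a copy of the row without '_id' so the BigQuery sink never
    # receives an ObjectId value.
    return {k: v for k, v in row.items() if k != '_id'}

# Would slot in ahead of 'AdditionalTransform' in the branch above:
#   transformed | 'DropMongoId' >> beam.Map(drop_mongo_id) | ...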
"""(No issues with this write to BQ) Write `tagged_output` PCollection to BigQuery"""
tagged_output | 'WriteTaggedOutputToBigQuery' >> beam.io.WriteToBigQuery(
    table='other_bq_table',
    dataset='bq_dataset',
    schema=beam.io.gcp.bigquery_tools.parse_table_schema_from_json(other_bq_schema),
    create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    validate=True)
@Pablo could you lend your expertise here?