Google Cloud Dataflow: adding a dependency between two DoFns in Apache Beam


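Is there any way to create a dependency between two DoFns, so that the second DoFn only runs after the first DoFn has finished?

I'd just like to know how this use case can be implemented.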

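There may be a cleaner way to do this, but I've found that the following achieves what you want:

Route the output of the first DoFn into a counter, then pass that counter's output as a side input to the ParDo of the second DoFn.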

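import apache_beam

class DoFn2(apache_beam.DoFn):
    def process(self, element, count_do_fn_1_output, *args, **kwargs):
        # count_do_fn_1_output is only available once DoFn1 has emitted
        # all of its output, so this method cannot start before then.
        # ...
        yield element  # placeholder: replace with the real per-element logic

# do_fn_1_input (the upstream PCollection) and DoFn1 are defined elsewhere.
do_fn_1_output = do_fn_1_input | 'do fn 1' >> apache_beam.ParDo(DoFn1())

# Count.Globally() produces a single value once all of DoFn1's output is known.
count_do_fn_1_output = (
    do_fn_1_output
    | 'count do_fn_1_output' >> apache_beam.combiners.Count.Globally())

# Passing the count as a side input makes DoFn2 wait for DoFn1 to finish.
do_fn_2_output = (
    do_fn_1_output
    | 'do fn 2' >> apache_beam.ParDo(
        DoFn2(),
        count_do_fn_1_output=apache_beam.pvalue.AsSingleton(count_do_fn_1_output)))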

For the Java SDK, I'd suggest taking a look at the Wait transform. It does something similar to what I'm guessing you want to achieve.

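What do you mean by "finished"?

I mean the second DoFn should wait until the first DoFn has finished.

By "finished", do you mean that all elements go through the first DoFn before the second DoFn processes any element? If so, the side-input answer is fine. Another option is to insert beam.Reshuffle() between the two DoFns.

If your DoFn output is very large, sending that result as a side input makes no sense, and it may hurt your overall pipeline performance.

The side input is the count of the output, not the entire PCollection. Waiting for DoFn1 will generally hurt your performance in any case.

You can even "process" your PCollection with a beam.FlatMap(lambda x: None) that never emits anything, and use the result as the side input.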