Google Cloud Dataflow: building a DAG data flow (Apache Beam)


I'm building a pipeline on Dataflow (Apache Beam) that reads data from and writes data to Google BigQuery, but I'm having trouble building the DAG the way I would in Airflow.

Here is an example from my code:

# define pipeline
p = beam.Pipeline(argv=pipeline_args)
# execute query_1
query_result_gps = ( p | 'ReadFromBQ GPS_data' >> ... )
# write result from query_1 on BigQuery
output_gps = ( query_result_gps | 'WriteToBQ GPS_data' >> ... )
# execute query_2
query_result_temperature = ( output_gps | 'ReadFromBQ temperature_data' >> ... )
# write result from query_2
output_temperature = ( query_result_temperature | 'WriteToBQ temperature_data' >> ... )
I would like these tasks to run sequentially, but Dataflow runs them in parallel instead.


How can I make them run sequentially?

I assume you are reading from BigQuery like this:

count = (p | 'read' >> beam.io.Read(beam.io.BigQuerySource(known_args.input_table)))
I dug into the apache_beam source code, and it looks like the Source transforms ignore the input PCollection, which is why they get set up in parallel.

See the last line of def expand(self, pbegin) in the Read transform, reproduced at the end of this post:


Since you want to output the intermediate step to BigQuery and also flow the data between the two transforms, I think branching will achieve the result you are looking for.

PCollection_1 = (Read from BQ).apply(Transform_1)

PCollection_1.apply(Write to BQ)

PCollection_1.apply(Transform_2).apply(Write to BQ)


This lets you apply Transform_2 to the elements after they have passed through Transform_1, while also writing the intermediate step to BQ. Applying multiple ParDos to the same PCollection is what creates the different branches in the DAG.
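
As a rough illustration of that branching pattern, here is a minimal Beam Python sketch. The query strings, table names and transform_1/transform_2 functions are placeholders invented for the example (pipeline_args is the variable from the question), not anything taken from the original code:

import apache_beam as beam

def transform_1(row):
    # placeholder for the real per-row logic
    return row

def transform_2(row):
    # placeholder for the real per-row logic
    return row

with beam.Pipeline(argv=pipeline_args) as p:
    # single read from BigQuery; both branches below hang off this PCollection
    gps_rows = (
        p
        | 'ReadFromBQ GPS_data' >> beam.io.Read(beam.io.BigQuerySource(
            query='SELECT * FROM dataset.gps_raw'))   # placeholder query
        | 'Transform_1' >> beam.Map(transform_1))

    # branch 1: persist the intermediate result to BigQuery
    _ = gps_rows | 'WriteToBQ GPS_data' >> beam.io.WriteToBigQuery(
        'project:dataset.gps_intermediate')            # placeholder table; schema options omitted

    # branch 2: keep processing the same elements instead of re-reading the table
    _ = (gps_rows
         | 'Transform_2' >> beam.Map(transform_2)
         | 'WriteToBQ temperature_data' >> beam.io.WriteToBigQuery(
             'project:dataset.temperature_output'))    # placeholder table; schema options omitted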

Comment (asker): The reason I need sequential execution is that at a certain point I have to read from a table that is generated by the previous step.

Comment (answerer): But if the previous step generated that table, doesn't that data already exist inside the Dataflow pipeline? It feels like you should reuse the existing PCollection rather than reading it back in.

Comment (asker): I want the intermediate steps (tables) to be available on BigQuery as well.

For reference, here is the Read transform from the apache_beam source mentioned above:
class Read(ptransform.PTransform):
  """A transform that reads a PCollection."""

  def __init__(self, source):
    """Initializes a Read transform.

    Args:
      source: Data source to read from.
    """
    super(Read, self).__init__()
    self.source = source

  def expand(self, pbegin):
    from apache_beam.options.pipeline_options import DebugOptions
    from apache_beam.transforms import util

    assert isinstance(pbegin, pvalue.PBegin)
    self.pipeline = pbegin.pipeline

    debug_options = self.pipeline._options.view_as(DebugOptions)
    if debug_options.experiments and 'beam_fn_api' in debug_options.experiments:
      source = self.source

      def split_source(unused_impulse):
        total_size = source.estimate_size()
        if total_size:
          # 1MB = 1 shard, 1GB = 32 shards, 1TB = 1000 shards, 1PB = 32k shards
          chunk_size = max(1 << 20, 1000 * int(math.sqrt(total_size)))
        else:
          chunk_size = 64 << 20  # 64mb
        return source.split(chunk_size)

      return (
          pbegin
          | core.Impulse()
          | 'Split' >> core.FlatMap(split_source)
          | util.Reshuffle()
          | 'ReadSplits' >> core.FlatMap(lambda split: split.source.read(
              split.source.get_range_tracker(
                  split.start_position, split.stop_position))))
    else:
      # Treat Read itself as a primitive.
      return pvalue.PCollection(self.pipeline)

# ... other methods
So it looks like a Read subclass that instead pays attention to its input, rather than starting from PBegin, is the kind of hook you would need to chain reads sequentially, along the lines of:

class SequentialRead(Read):
  def expand(self, pbegin):
    return pbegin
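
Purely as a wiring sketch (the stub above returns its input and does not actually perform a read, and the label and query are placeholders), the idea would be to apply such a transform to the output of the previous step rather than to the pipeline root:

# hypothetical wiring only: attach the second read to the result of the first
# write instead of to the pipeline root, mirroring the question's pipeline
query_result_temperature = (
    output_gps
    | 'ReadFromBQ temperature_data' >> SequentialRead(beam.io.BigQuerySource(
        query='SELECT * FROM dataset.gps_intermediate')))   # placeholder query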