How to combine these two results in Python and pass them to the next step in an Apache Beam pipeline
Please see the code snippet below. I want ["metric1", "metric2"] to be the input to RunTask.process. However, RunTask.process was instead run twice, once with "metric1" and once with "metric2":
def run():
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
    p = beam.Pipeline(options=pipeline_options)
    root = p | 'Get source' >> beam.Create([
        "source_name"  # maybe ["source_name"] makes more sense since my process function takes an array as an input?
    ])
    metric1 = root | "compute1" >> beam.ParDo(RunLongCompute(myarg="1"))  # let's say it returns ["metric1"]
    metric2 = root | "compute2" >> beam.ParDo(RunLongCompute(myarg="2"))  # let's say it returns ["metric2"]
    metric3 = (metric1, metric2) | beam.Flatten() | beam.ParDo(RunTask())  # I want ["metric1", "metric2"] to be the input for RunTask.process, but it runs twice, with "metric1" and "metric2" respectively
I understand that you want to join two PCollections so that the result looks like ['element1', 'element2']. To achieve this, you can use CoGroupByKey instead of Flatten. Given your code snippet, the syntax would be:
def run():
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
    p = beam.Pipeline(options=pipeline_options)
    root = p | 'Get source' >> beam.Create([
        "source_name"
    ])
    metric1 = root | "compute1" >> beam.ParDo(RunLongCompute(myarg="1"))  # returns ["metric1"]
    metric2 = root | "compute2" >> beam.ParDo(RunLongCompute(myarg="2"))  # returns ["metric2"]
    metric3 = (
        (metric1, metric2)
        | beam.CoGroupByKey()
        | beam.ParDo(RunTask())
    )
I would like to point out the difference between Flatten() and CoGroupByKey():
1) Flatten() takes two or more PCollections that store the same data type and merges them into one logical PCollection. For example:
import apache_beam as beam

p = beam.Pipeline()
address_list = [
    ('leo', 'George St. 32'),
    ('ralph', 'Pyrmont St. 30'),
    ('mary', '10th Av.'),
    ('carly', 'Marina Bay 1'),
]
city_list = [
    ('leo', 'Sydney'),
    ('ralph', 'Sydney'),
    ('mary', 'NYC'),
    ('carly', 'Brisbane'),
]
street = p | 'CreateAddresses' >> beam.Create(address_list)
city = p | 'CreateCities' >> beam.Create(city_list)
result = (
    (street, city)
    | beam.Flatten()
    | beam.ParDo(print)
)
p.run()
And the output:
('leo', 'George St. 32')
('ralph', 'Pyrmont St. 30')
('mary', '10th Av.')
('carly', 'Marina Bay 1')
('leo', 'Sydney')
('ralph', 'Sydney')
('mary', 'NYC')
('carly', 'Brisbane')
Note that both PCollections appear in the output; one is simply appended to the other.
2) CoGroupByKey() performs a relational join between two or more key-value PCollections that share the same key type. With this method you join by key rather than appending as in Flatten(). Here is an example:
import apache_beam as beam

p = beam.Pipeline()
address_list = [
    ('leo', 'George St. 32'),
    ('ralph', 'Pyrmont St. 30'),
    ('mary', '10th Av.'),
    ('carly', 'Marina Bay 1'),
]
city_list = [
    ('leo', 'Sydney'),
    ('ralph', 'Sydney'),
    ('mary', 'NYC'),
    ('carly', 'Brisbane'),
]
street = p | 'CreateAddresses' >> beam.Create(address_list)
city = p | 'CreateCities' >> beam.Create(city_list)
results = (
    (street, city)
    | beam.CoGroupByKey()
    | beam.ParDo(print)
    # | beam.io.WriteToText('delete.txt')
)
p.run()
And the output:
('leo', (['George St. 32'], ['Sydney']))
('ralph', (['Pyrmont St. 30'], ['Sydney']))
('mary', (['10th Av.'], ['NYC']))
('carly', (['Marina Bay 1'], ['Brisbane']))
Note that you need a common key to join the results, and this output is what you expect in your case. Alternatively, you can use a side input:
metric3 = metric1 | beam.ParDo(RunTask(), metric2=beam.pvalue.AsIter(metric2))
And then in RunTask.process():