How to combine these two results in Python and pass them to the next step in an Apache Beam pipeline
Please see the code snippet below. I want ["metric1", "metric2"] to be the input to RunTask.process. However, RunTask.process was instead run twice, once with "metric1" and once with "metric2":
def run():
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
    p = beam.Pipeline(options=pipeline_options)
    root = p | 'Get source' >> beam.Create([
        "source_name"  # maybe ["source_name"] makes more sense since my process function takes an array as an input?
    ])
    metric1 = root | "compute1" >> beam.ParDo(RunLongCompute(myarg="1"))  # let's say it returns ["metric1"]
    metric2 = root | "compute2" >> beam.ParDo(RunLongCompute(myarg="2"))  # let's say it returns ["metric2"]
    metric3 = (metric1, metric2) | beam.Flatten() | beam.ParDo(RunTask())  # I want ["metric1", "metric2"] to be the input for RunTask.process, but it runs twice, with "metric1" and "metric2" respectively
I understand that you want to join two PCollections so that the result looks like ['element1', 'element2']. To achieve this, you can use CoGroupByKey instead of Flatten. Given your code snippet, the syntax would be:
def run():
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
    p = beam.Pipeline(options=pipeline_options)
    root = p | 'Get source' >> beam.Create([
        "source_name"
    ])
    metric1 = root | "compute1" >> beam.ParDo(RunLongCompute(myarg="1"))  # returns ["metric1"]
    metric2 = root | "compute2" >> beam.ParDo(RunLongCompute(myarg="2"))  # returns ["metric2"]
    metric3 = (
        (metric1, metric2)
        | beam.CoGroupByKey()
        | beam.ParDo(RunTask())
    )
I would like to point out the difference between Flatten() and CoGroupByKey():
1) Flatten() takes two or more PCollections that store the same data type and merges them into one logical PCollection. For example:
import apache_beam as beam

p = beam.Pipeline()
address_list = [
    ('leo', 'George St. 32'),
    ('ralph', 'Pyrmont St. 30'),
    ('mary', '10th Av.'),
    ('carly', 'Marina Bay 1'),
]
city_list = [
    ('leo', 'Sydney'),
    ('ralph', 'Sydney'),
    ('mary', 'NYC'),
    ('carly', 'Brisbane'),
]
street = p | 'CreateAddresses' >> beam.Create(address_list)
city = p | 'CreateCities' >> beam.Create(city_list)
result = (
    (street, city)
    | beam.Flatten()
    | beam.ParDo(print)
)
p.run()
And the output:
('leo', 'George St. 32')
('ralph', 'Pyrmont St. 30')
('mary', '10th Av.')
('carly', 'Marina Bay 1')
('leo', 'Sydney')
('ralph', 'Sydney')
('mary', 'NYC')
('carly', 'Brisbane')
Note that both PCollections appear in the output; one is simply appended to the other.
2) CoGroupByKey() performs a relational join between two or more key-value PCollections that share the same key type. With this method you join by key rather than appending as in Flatten(). Here is an example:
import apache_beam as beam

p = beam.Pipeline()
address_list = [
    ('leo', 'George St. 32'),
    ('ralph', 'Pyrmont St. 30'),
    ('mary', '10th Av.'),
    ('carly', 'Marina Bay 1'),
]
city_list = [
    ('leo', 'Sydney'),
    ('ralph', 'Sydney'),
    ('mary', 'NYC'),
    ('carly', 'Brisbane'),
]
street = p | 'CreateAddresses' >> beam.Create(address_list)
city = p | 'CreateCities' >> beam.Create(city_list)
results = (
    (street, city)
    | beam.CoGroupByKey()
    | beam.ParDo(print)
    # | beam.io.WriteToText('delete.txt')
)
p.run()
And the output:
('leo', (['George St. 32'], ['Sydney']))
('ralph', (['Pyrmont St. 30'], ['Sydney']))
('mary', (['10th Av.'], ['NYC']))
('carly', (['Marina Bay 1'], ['Brisbane']))
Note that you need a common key to join the results, and this output is what you expect in your case. Alternatively, you can use a side input:
metric3 = metric1 | beam.ParDo(RunTask(), metric2=beam.pvalue.AsIter(metric2))
And then in RunTask.process():