Merging two PCollections (Apache Beam) on Google Cloud Platform


I have two files in Cloud Storage. File 1 is in Avro format and contains data from a temperature sensor:

time_stamp     |  Temperature
1000           |  T1
2000           |  T2
3000           |  T3
4000           |  T3
5000           |  T4
6000           |  T5

File 2 is in Avro format and contains data from a wind sensor:

time_stamp     |  wind_speed
500            |  w1
1200           |  w2
1500           |  w3
2200           |  w4
2500           |  w5
3000           |  w6

I want to combine them into the following output:

time_stamp |Temperature|wind_speed
1000       |T1         |w1 (most recent wind sensor reading, taken at 500)
2000       |T2         |w3 (most recent wind sensor reading, taken at 1500)
3000       |T3         |w6 (wind sensor reading at 3000)
4000       |T3         |w6 (most recent wind sensor reading, taken at 3000)
5000       |T4         |w6 (most recent wind sensor reading, taken at 3000)
6000       |T5         |w6 (most recent wind sensor reading, taken at 3000)

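I am looking for a solution in Apache Beam to merge the files above. For now the data is read from files, but in the future it may arrive through Pub/Sub. I want to find a custom way to combine the two PCollections and create another PCollection that holds the temperature data together with the wind speed.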

     // Temperature and WindSpeed stand for the Avro auto-generated classes.
     PCollection<Temperature> tempData = p.apply(AvroIO
         .read(Temperature.class)
         .from("gs://my_bucket/path/to/temp-sensor-data.avro"));

     PCollection<WindSpeed> windData = p.apply(AvroIO
         .read(WindSpeed.class)
         .from("gs://my_bucket/path/to/wind-sensor-data.avro"));

     PCollection<WindSpeed> tempDataWithWindSpeed = ?


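The comment by @jszule is a good answer in general for Dataflow/Beam: the best-supported join is the one where the two PCollections share a common key. For most data, Beam can infer a schema, and you can then use a transform for the join. The design decision you have to make is how to choose the keys, for example by rounding timestamps to the nearest 1000.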
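As a rough illustration of the key choice mentioned above, here is a minimal plain-Java sketch of rounding a timestamp to the nearest 1000. The class `TimeBuckets` and its method `nearest` are hypothetical names introduced only for this example; they are not part of the Beam API, and the sketch assumes timestamps are plain `long` values as in the question's tables.

```java
// Hypothetical helper for illustration only; not part of the Beam API.
final class TimeBuckets {
    private TimeBuckets() {}

    /** Rounds a timestamp to the nearest multiple of bucketSize (ties round up). */
    static long nearest(long timestamp, long bucketSize) {
        return Math.floorDiv(timestamp + bucketSize / 2, bucketSize) * bucketSize;
    }

    public static void main(String[] args) {
        // Wind sensor timestamps taken from the question.
        for (long ts : new long[] {500, 1200, 1500, 2200, 2500, 3000}) {
            System.out.println(ts + " -> " + nearest(ts, 1000));
        }
    }
}
```

In a pipeline, a function like this would typically be applied to each element to produce the join key before grouping. Whether to round to the nearest bucket, down, or up is exactly the design decision the answer refers to.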

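Your use case has a complication: you need to carry values forward in the time series for keys that have no data. The solution is to use state and timers to generate the "missing" values. You still need to choose the keys carefully, because state and timers are scoped per key and window. State and timers also work in batch mode, so this is a unified batch/streaming solution.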
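To make the carry-forward semantics concrete, the following plain-Java sketch looks up, for each temperature timestamp, the latest wind reading taken at or before it. This is not Beam code and does not use state or timers; the `TreeMap` is only a stand-in for whatever "last seen value" a stateful step would keep. The class name `CarryForward` and the method `latestAtOrBefore` are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of carry-forward ("as-of") lookup; not Beam code.
final class CarryForward {
    private CarryForward() {}

    /** Returns the wind reading taken at or before the timestamp, or null if there is none yet. */
    static String latestAtOrBefore(TreeMap<Long, String> windByTime, long timestamp) {
        Map.Entry<Long, String> entry = windByTime.floorEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        // Wind sensor data from the question, keyed by timestamp.
        TreeMap<Long, String> wind = new TreeMap<>();
        wind.put(500L, "w1");
        wind.put(1200L, "w2");
        wind.put(1500L, "w3");
        wind.put(2200L, "w4");
        wind.put(2500L, "w5");
        wind.put(3000L, "w6");

        // Temperature sensor data from the question.
        long[] tempTimes = {1000, 2000, 3000, 4000, 5000, 6000};
        String[] temps = {"T1", "T2", "T3", "T3", "T4", "T5"};

        for (int i = 0; i < tempTimes.length; i++) {
            System.out.println(tempTimes[i] + " | " + temps[i] + " | "
                    + latestAtOrBefore(wind, tempTimes[i]));
        }
    }
}
```

For the sample data in the question this pairs 1000 with w1, 2000 with w3, and 3000 through 6000 with w6, which matches the requested output. A stateful Beam step would have to reproduce the same behaviour per key without assuming that all wind readings fit in memory at once.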

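You may want to read up on this topic; there are several possible solutions.

Can you add more details? For example, are the temperature timestamps as regular as shown in the example? Is this streaming, or always batch? After the merge, will you apply many additional transforms in the pipeline? What kind of transforms?

Here is a good example of how to join them:

@guillaumeblaquiere How do the transforms affect the solution? It is batch for now.

@jszule I saw that example; the join key it uses is the user name. I have no direct join key, so I need some custom solution for the join.

You can still join the sources. Just create a KV value such as

KV

and KV, where in both cases the key is the timestamp bucket that the timestamp belongs to (for example, 2200 belongs to key 2000, so you have to round to thousands). Once the groups are created, you can pick the min, the max, or whichever sensor value you need. Hope this helps :)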
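The bucketing idea in the last comment can be sketched in plain Java as follows. This is only an illustration of the grouping logic, not Beam code: the class `BucketedLatest` and its methods are hypothetical, the bucket rule follows the comment's example (2200 belongs to key 2000), and "pick the reading with the largest timestamp in each bucket" is one assumed choice among the min/max options the comment mentions.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of bucketing readings and keeping one value per bucket.
final class BucketedLatest {
    private BucketedLatest() {}

    /** Maps a timestamp to the start of its bucket, e.g. 2200 -> 2000 for bucketSize 1000. */
    static long bucketOf(long timestamp, long bucketSize) {
        return Math.floorDiv(timestamp, bucketSize) * bucketSize;
    }

    /** Groups readings by bucket and keeps the reading with the largest timestamp in each. */
    static TreeMap<Long, String> latestPerBucket(Map<Long, String> readingsByTime, long bucketSize) {
        TreeMap<Long, String> result = new TreeMap<>();
        // Iterating in ascending timestamp order means later readings overwrite earlier ones.
        for (Map.Entry<Long, String> e : new TreeMap<>(readingsByTime).entrySet()) {
            result.put(bucketOf(e.getKey(), bucketSize), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Wind sensor data from the question.
        Map<Long, String> wind = new TreeMap<>();
        wind.put(500L, "w1");
        wind.put(1200L, "w2");
        wind.put(1500L, "w3");
        wind.put(2200L, "w4");
        wind.put(2500L, "w5");
        wind.put(3000L, "w6");

        System.out.println(latestPerBucket(wind, 1000));
    }
}
```

Note that with this bucketing rule the temperature reading at 1000 shares a key with the wind readings at 1200 and 1500, so it would be paired with w3 rather than the w1 requested in the question. The bucket rule, or the selection within each group, would need adjusting to reproduce the requested output exactly, and buckets with no wind data (4000 and later) still need a carry-forward step.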