Difference between accumulating and discarding windows in Python Apache Beam?
I have an example pipeline here:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.trigger import AfterCount, Repeatedly


def print_windows(element, window=beam.DoFn.WindowParam, pane_info=beam.DoFn.PaneInfoParam, timestamp=beam.DoFn.TimestampParam):
    # Print the window, pane metadata, and timestamp attached to each output.
    print(window)
    print(pane_info)
    print(timestamp)
    print(element)
    print('-----------------')


options = PipelineOptions()
with beam.Pipeline(options=options) as p:
    keyed_elements = [
        ('USA', {'score': 1, 'timestamp': 2}),
        ('USA', {'score': 2, 'timestamp': 4}),
        ('USA', {'score': 3, 'timestamp': 4}),
        ('USA', {'score': 4, 'timestamp': 5}),
        ('USA', {'score': 5, 'timestamp': 14}),
        ('USA', {'score': 6, 'timestamp': 17}),
    ]
    # Attach event-time timestamps, then keep only (key, score).
    elements = (
        p
        | beam.Create(keyed_elements)
        | 'ConvertIntoUserEvents' >> beam.Map(lambda e: beam.window.TimestampedValue(e, e[1]['timestamp']))
        | beam.Map(lambda e: (e[0], e[1]['score']))
    )
    # 10-second fixed windows; fire a pane each time 2 elements have been seen.
    results = (
        elements
        | 'WindowInto' >> beam.WindowInto(
            beam.window.FixedWindows(10),
            trigger=Repeatedly(AfterCount(2)),
            accumulation_mode=beam.transforms.trigger.AccumulationMode.ACCUMULATING
        )
        | beam.CombinePerKey(beam.combiners.ToListCombineFn())
    )
    results | beam.ParDo(print_windows)
The idea is simple: I want to feed in some timestamped scores and combine them into a list. I fire a pane after every 2 elements seen.
If I run it as is, I get:
[0.0, 10.0)
PaneInfo(first: True, last: False, timing: EARLY, index: 0, nonspeculative_index: -1)
Timestamp(9.999000)
('USA', [1, 2, 3, 4])
-----------------
[10.0, 20.0)
PaneInfo(first: True, last: False, timing: EARLY, index: 0, nonspeculative_index: -1)
Timestamp(19.999000)
('USA', [5, 6])
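The split into two groups in this output follows from how fixed windows assign event timestamps. As a rough illustration in plain Python (not Beam code; the helper name `fixed_window` is made up for this sketch, and a window offset of 0 is assumed), each timestamp maps to the 10-second interval that contains it:

```python
def fixed_window(timestamp, size=10.0):
    """Return the [start, end) fixed window containing the timestamp (offset 0 assumed)."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


# (score, event timestamp) pairs taken from the pipeline above.
scores = [(1, 2), (2, 4), (3, 4), (4, 5), (5, 14), (6, 17)]

by_window = {}
for score, ts in scores:
    by_window.setdefault(fixed_window(ts), []).append(score)

print(by_window)  # {(0.0, 10.0): [1, 2, 3, 4], (10.0, 20.0): [5, 6]}
```

Scores 1 to 4 land in [0.0, 10.0) and scores 5 and 6 land in [10.0, 20.0), which matches the two groups printed above.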
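However, if I change the accumulation mode to DISCARDING, the output stays the same. I'm confused, because from my high-level understanding, accumulating should output panes like [1, 2] ... [1, 2, 3, 4] for the first 10-second window and then [5, 6] for the last 10-second window.
Discarding, on the other hand, should give [1, 2] ... [3, 4] and then [5, 6]. Why is the output the same?
According to the Beam concepts, a window can contain 0 to N panes, as controlled by the trigger definition in the application code.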
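When a trigger is defined as accumulating, any values that belong to the window and have already been emitted by a trigger firing are retained and included with the new values when a later pane fires or the window closes.
When a trigger is defined as discarding, any values that belong to the window and have already been emitted by a trigger firing are dropped, and they are not available when a later pane fires or the window closes.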
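The two modes can be illustrated with a toy model in plain Python. This is only a sketch, not Beam's implementation: the function `emit_panes` is hypothetical, it handles a single key and a single window, and it assumes the count trigger is checked once per delivered bundle of elements.

```python
def emit_panes(bundles, accumulating, count=2):
    """Toy model of a repeated count trigger for one key in one window."""
    panes = []
    fired = []    # elements already emitted in earlier panes
    pending = []  # elements buffered since the last firing
    for bundle in bundles:
        pending.extend(bundle)
        # Assumption: the trigger is evaluated once per delivered bundle.
        if len(pending) >= count:
            if accumulating:
                panes.append(fired + pending)  # earlier values are kept
            else:
                panes.append(list(pending))    # earlier values are dropped
            fired.extend(pending)
            pending = []
    return panes


# Elements arrive two at a time, as they might in a stream.
streamed = [[1, 2], [3, 4]]
print(emit_panes(streamed, accumulating=True))   # [[1, 2], [1, 2, 3, 4]]
print(emit_panes(streamed, accumulating=False))  # [[1, 2], [3, 4]]
```

Under these assumptions the model produces the panes the question expected: accumulating repeats earlier values in later panes, while discarding emits only what arrived since the previous firing.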
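In the example above, if the trigger logic is changed, you can observe at least two panes per window, an EARLY pane followed by an ON_TIME pane.
With ACCUMULATING panes, the behavior is: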
INFO:apache_beam.runners.portability.fn_api_runner:Running (CombinePerKey(ToListCombineFn)/GroupByKey/Read)+((CombinePerKey(ToListCombineFn)/Combine)+(ref_AppliedPTransform_ParDo(CallableWrapperDoFn)_26))
INFO:root:2020-05-24 14:10:00
INFO:root:2020-05-24 14:12:00
INFO:root:PaneInfo(first: True, last: False, timing: EARLY, index: 0, nonspeculative_index: -1)
INFO:root:Timestamp(1590329519.999000)
INFO:root:('USA', [{'score': 1, 'ts': 5}, {'score': 2, 'ts': 5}])
INFO:root:-----------------
INFO:root:2020-05-24 14:12:00
INFO:root:2020-05-24 14:14:00
INFO:root:PaneInfo(first: True, last: False, timing: EARLY, index: 0, nonspeculative_index: -1)
INFO:root:Timestamp(1590329639.999000)
INFO:root:('USA', [{'score': 5, 'ts': 105}, {'score': 4, 'ts': 60}, {'score': 6, 'ts': 105}, {'score': 3, 'ts': 60}])
INFO:root:-----------------
INFO:root:2020-05-24 14:10:00
INFO:root:2020-05-24 14:12:00
INFO:root:PaneInfo(first: False, last: True, timing: ON_TIME, index: 1, nonspeculative_index: 0)
INFO:root:Timestamp(1590329519.999000)
INFO:root:('USA', [{'score': 1, 'ts': 5}, {'score': 2, 'ts': 5}])
INFO:root:-----------------
INFO:root:2020-05-24 14:12:00
INFO:root:2020-05-24 14:14:00
INFO:root:PaneInfo(first: False, last: True, timing: ON_TIME, index: 1, nonspeculative_index: 0)
INFO:root:Timestamp(1590329639.999000)
INFO:root:('USA', [{'score': 5, 'ts': 105}, {'score': 4, 'ts': 60}, {'score': 6, 'ts': 105}, {'score': 3, 'ts': 60}])
INFO:root:-----------------
With DISCARDING panes, the behavior is:
INFO:root:2020-05-24 14:12:00
INFO:root:2020-05-24 14:14:00
INFO:root:PaneInfo(first: True, last: False, timing: EARLY, index: 0, nonspeculative_index: -1)
INFO:root:Timestamp(1590329639.999000)
INFO:root:('USA', [{'score': 2, 'ts': 5}, {'score': 4, 'ts': 60}, {'score': 1, 'ts': 5}, {'score': 3, 'ts': 60}])
INFO:root:-----------------
INFO:root:2020-05-24 14:14:00
INFO:root:2020-05-24 14:16:00
INFO:root:PaneInfo(first: True, last: False, timing: EARLY, index: 0, nonspeculative_index: -1)
INFO:root:Timestamp(1590329759.999000)
INFO:root:('USA', [{'score': 5, 'ts': 105}, {'score': 6, 'ts': 105}])
INFO:root:-----------------
INFO:root:2020-05-24 14:12:00
INFO:root:2020-05-24 14:14:00
INFO:root:PaneInfo(first: False, last: True, timing: ON_TIME, index: 1, nonspeculative_index: 0)
INFO:root:Timestamp(1590329639.999000)
INFO:root:('USA', [])
INFO:root:-----------------
INFO:root:2020-05-24 14:14:00
INFO:root:2020-05-24 14:16:00
INFO:root:PaneInfo(first: False, last: True, timing: ON_TIME, index: 1, nonspeculative_index: 0)
INFO:root:Timestamp(1590329759.999000)
INFO:root:('USA', [])
INFO:root:-----------------
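In the ACCUMULATING case, when the watermark is reached and the window closes, the values from the EARLY pane are retained, as the ON_TIME pane shows.
In the DISCARDING case, however, the values from the EARLY pane are discarded and the ON_TIME pane is empty.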
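The pattern in the logs above can be reproduced with a small plain-Python model. It is a sketch under stated assumptions, not Beam code: `window_panes` is a hypothetical helper for one key and one window, EARLY panes fire on a count trigger checked once per delivered bundle, and exactly one ON_TIME pane fires when the window closes.

```python
def window_panes(bundles, accumulating, count=2):
    """Toy model: count-based EARLY panes, then one ON_TIME pane at window close."""
    panes = []
    fired, pending = [], []
    for bundle in bundles:
        pending.extend(bundle)
        # Assumption: the early trigger is evaluated once per delivered bundle.
        if len(pending) >= count:
            contents = fired + pending if accumulating else list(pending)
            panes.append(("EARLY", contents))
            fired.extend(pending)
            pending = []
    # The final pane carries everything (accumulating) or only leftovers (discarding).
    final = fired + pending if accumulating else list(pending)
    panes.append(("ON_TIME", final))
    return panes


# All four values are already present, as in a bounded (batch) test run.
batch = [[1, 2, 3, 4]]
print(window_panes(batch, accumulating=True))
# [('EARLY', [1, 2, 3, 4]), ('ON_TIME', [1, 2, 3, 4])]
print(window_panes(batch, accumulating=False))
# [('EARLY', [1, 2, 3, 4]), ('ON_TIME', [])]
```

With all values delivered at once, the EARLY pane is identical in both modes and only the ON_TIME pane differs. Passing `[[1, 2], [3, 4]]` instead models elements trickling in, and the EARLY panes then differ as well.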
In a real scenario, where elements arrive through a Pub/Sub stream, more than one EARLY pane may fire. In the simulated scenario, all values are already present, so no more than one EARLY pane can fire.
Thank you very much for your help! Yes, the difference in behavior becomes much more apparent when using streaming instead of batch.