Apache Beam Python SDK: session window interval is inaccurate
I'm trying to use the Apache Beam Python SDK to process data with a 60-minute session interval. However, the actual session durations are not 60 minutes. For example, when I run the application I get sessions of 3:00:00, 1:01:00, or 1:50:00.
Could you help me find a way to fix this and process the data in 60-minute sessions?
I built my pipeline as below:
with Pipeline(options=pipeline_options) as pipeline:
    (
        pipeline
        | "Read" >> ReadFromText(known_args.input, skip_header_lines=1)
        | "Convert" >> ParDo(Convert())
        | "Add Timestamp" >> Map(lambda x: window.TimestampedValue(x, get_timestamp_from_element(x).timestamp()))
        | "Use User ID As Key" >> Map(lambda x: (x["user_id"], x))
        | "Apply Session Window" >> WindowInto(window.Sessions(known_args.session_interval))
        | "Group" >> GroupByKey()
        | "Write To CSV" >> ParDo(WriteToCSV(known_args.output))
    )
    result = pipeline.run()
    result.wait_until_finish()
The session interval (60 minutes) is defined as follows:
parser.add_argument(
    "--session_interval",
    help="Interval of each session",
    default=60 * 60)  # 60 mins
The WriteToCSV function processes the data of each session. I logged the session duration there, and it is not what I expected:
class WriteToCSV(DoFn):
    def __init__(self, output_path):
        self.output_path = output_path

    def process(self, element, window=DoFn.WindowParam):
        window_start = window.start.to_utc_datetime()
        window_end = window.end.to_utc_datetime()
        duration = window_end - window_start
        logging.info(">>> new %s record(s) in %s session (start %s end %s)", len(click_records), duration, window_start, window_end)
        ....
Then, when I ran this application locally with the DirectRunner, I got these log messages:
new 5 records in 3:00:00 session (start 2018-10-19 02:00:00 end 2018-10-19 05:00:00)
new 2 records in 1:01:00 session (start 2018-10-19 02:02:00 end 2018-10-19 03:03:00)
new 2 records in 1:50:00 session (start 2018-10-19 03:10:00 end 2018-10-19 05:00:00)
I also deployed the pipeline to Dataflow and got the same result:
new 2 record(s) in 1:50:00 session (start 2018-10-19 11:10:00 end 2018-10-19 13:00:00)
new 2 record(s) in 1:01:00 session (start 2018-10-19 10:02:00 end 2018-10-19 11:03:00)
new 5 record(s) in 3:00:00 session (start 2018-10-19 10:00:00 end 2018-10-19 13:00:00)
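In a Beam pipeline, the value `known_args.session_interval` passed to `window.Sessions` defines the gap duration: the period of inactivity for a particular key after which that key's window closes. It is not the length of the window. Each event extends the session as long as it arrives within the gap of the previous event, so sessions can have different start and end times, and different durations, depending on the events the pipeline processes for a given key.
For example: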
Key 1 - 10:00 AM ----|
Key 1 - 10:45 AM     |
Key 1 - 11:30 AM     |====> One session window for Key 1, duration 4 hours (10:00 AM to 2:00 PM with a 60-minute gap)
Key 1 - 12:15 PM     |
Key 1 - 01:00 PM ----|
Key 1 - 02:30 PM =========> Start of a new session window for Key 1
Key 2 - 10:00 AM ----|
Key 2 - 10:30 AM     |====> One session window for Key 2, duration 2 hours (10:00 AM to 12:00 PM with a 60-minute gap)
Key 2 - 11:00 AM ----|
Key 2 - 12:30 PM =========> Start of a new session window for Key 2
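To make the merging concrete, here is a small pure-Python sketch of how session windows come about, assuming a 60-minute gap. It is a simplified illustration, not Beam's actual implementation: the helper name `session_windows` and the sample timestamps (the Key 1 events from the example) are mine. Each event first gets its own window `[t, t + gap)`, and windows that overlap are merged.

```python
from datetime import datetime, timedelta


def session_windows(timestamps, gap):
    """Merge per-event windows [t, t + gap) into sessions (hypothetical helper)."""
    sessions = []
    for t in sorted(timestamps):
        start, end = t, t + gap
        if sessions and start < sessions[-1][1]:
            # The event falls inside the current session: extend its end.
            prev_start, prev_end = sessions[-1]
            sessions[-1] = (prev_start, max(prev_end, end))
        else:
            # The gap has elapsed: this event starts a new session.
            sessions.append((start, end))
    return sessions


gap = timedelta(minutes=60)
day = datetime(2018, 10, 19)
# Key 1 events: 10:00, 10:45, 11:30, 12:15, 13:00 and 14:30.
key1 = [day + timedelta(hours=h, minutes=m)
        for h, m in [(10, 0), (10, 45), (11, 30), (12, 15), (13, 0), (14, 30)]]

for start, end in session_windows(key1, gap):
    print(start.time(), end.time(), end - start)
# 10:00:00 14:00:00 4:00:00
# 14:30:00 15:30:00 1:00:00
```

The window end is always the last event's timestamp plus the gap, which matches the logs in the question: a session is at least 60 minutes long and grows with every event that arrives within the gap.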
If you want to group and process events in 60-minute intervals, you need to use FixedWindows instead.
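In the pipeline this would mean replacing `window.Sessions(known_args.session_interval)` with `window.FixedWindows(60 * 60)`. The sketch below shows in plain Python how a fixed window is assigned, assuming the default offset of zero so that windows are aligned to the Unix epoch. The helper name `fixed_window` and the sample timestamps are mine and are only for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def fixed_window(ts, size):
    """Return the [start, end) fixed window containing ts (hypothetical helper)."""
    # Round the timestamp down to the nearest multiple of the window size.
    start = ts - (ts - EPOCH) % size
    return start, start + size


size = timedelta(minutes=60)
# Illustrative event times, not taken from the real input data.
events = [
    datetime(2018, 10, 19, 2, 2, tzinfo=timezone.utc),
    datetime(2018, 10, 19, 2, 3, tzinfo=timezone.utc),
    datetime(2018, 10, 19, 3, 10, tzinfo=timezone.utc),
]

for ts in events:
    start, end = fixed_window(ts, size)
    print(ts.time(), "->", start.time(), end.time(), end - start)
```

Every window is exactly 60 minutes long regardless of how many events a key has. The trade-off is that the windows follow the clock rather than user activity, so one burst of activity can be split across two windows.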