Python: Aggregating data within a window in Apache Beam
I am receiving a stream of complex, nested JSON objects as the input to my pipeline.
My goal is to create small batches to feed into another pubsub topic for downstream processing. I am struggling with the beam.GroupByKey() function; from what I have read, this is the right way to try to aggregate.
A simplified example of the input events:
{ data:['a', 'b', 'c'], url: 'websiteA.com' }
{ data:['a', 'b', 'c'], url: 'websiteA.com' }
{ data:['a'], url: 'websiteB.com' }
I am trying to create the following:
{
'websiteA.com': {a:2, b:2, c:2},
'websiteB.com': {a:1},
}
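My problem is that trying to group even the simplest tuple throws a ValueError: too many values to unpack.
I could run this in two steps, but from my reading, beam.GroupByKey() is very expensive, so its use should be minimized.
Edit based on @Cubez's answer:
This is my combine function, which seems to half work :(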
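class MyCustomCombiner(beam.CombineFn):
    def create_accumulator(self):
        logging.info('accum_created')  # Logs OK!
        return {}

    def add_input(self, counts, input):
        counts = {}
        for i in input:
            counts[i] = 1
        logging.info(counts)  # Logs OK!
        return counts

    def merge_accumulators(self, accumulators):
        logging.info('accumcalled')  # never logs anything
        c = collections.Counter()
        for d in accumulators:
            c.update(d)
        logging.info('accum: %s', accumulators)  # never logs anything
        return dict(c)

    def extract_output(self, counts):
        logging.info('Counts2: %s', counts)  # never logs anything
        return counts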
It seems that nothing is called past add_input.
Adding the pipeline code:
with beam.Pipeline(argv=pipeline_args) as p:
    raw_loads_dict = (p
        | 'ReadPubsubLoads' >> ReadFromPubSub(topic=PUBSUB_TOPIC_NAME).with_output_types(bytes)
        | 'JSONParse' >> beam.Map(lambda x: json.loads(x))
    )
    fixed_window_events = (raw_loads_dict
        | 'KeyOnUrl' >> beam.Map(lambda x: (x['client_id'], x['events']))
        | '1MinWindow' >> beam.WindowInto(window.FixedWindows(60))
        | 'CustomCombine' >> beam.CombinePerKey(MyCustomCombiner())
    )
    fixed_window_events | 'LogResults2' >> beam.ParDo(LogResults())
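This is a perfect example of a case where you need a combiner. Combiners are transforms used to aggregate or combine collections across multiple workers. As the documentation states, CombineFns work by reading in your elements (beam.CombineFn.add_input), merging multiple accumulators (beam.CombineFn.merge_accumulators), and finally outputting the combined value (beam.CombineFn.extract_output).
For example, a combiner that outputs the average of a collection of numbers looks like this: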
class AverageFn(beam.CombineFn):
    def create_accumulator(self):
        return (0.0, 0)

    def add_input(self, sum_count, input):
        (sum, count) = sum_count
        return sum + input, count + 1

    def merge_accumulators(self, accumulators):
        sums, counts = zip(*accumulators)
        return sum(sums), sum(counts)

    def extract_output(self, sum_count):
        (sum, count) = sum_count
        return sum / count if count else float('NaN')

pc = ...
average = pc | beam.CombineGlobally(AverageFn())
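To make the order of those calls concrete, the sketch below drives the four methods by hand, roughly the way a runner might: one accumulator per bundle of input, then a merge, then extraction. The class mirrors AverageFn above but is plain Python so that it runs without Beam. The `run_combine` helper and the split into two bundles are assumptions made for illustration only, not a description of how any particular runner schedules work.

```python
class AverageFn:
    """Plain-Python stand-in with the same four methods as a beam.CombineFn."""

    def create_accumulator(self):
        return (0.0, 0)

    def add_input(self, sum_count, value):
        total, count = sum_count
        return total + value, count + 1

    def merge_accumulators(self, accumulators):
        sums, counts = zip(*accumulators)
        return sum(sums), sum(counts)

    def extract_output(self, sum_count):
        total, count = sum_count
        return total / count if count else float('nan')


def run_combine(combine_fn, bundles):
    """Hypothetical helper: fold each bundle into its own accumulator, then merge."""
    partials = []
    for bundle in bundles:
        acc = combine_fn.create_accumulator()
        for element in bundle:
            acc = combine_fn.add_input(acc, element)
        partials.append(acc)
    merged = combine_fn.merge_accumulators(partials)
    return combine_fn.extract_output(merged)


# Two bundles, as if two workers had each seen part of the input.
average = run_combine(AverageFn(), [[1, 2, 3], [4, 5]])
print(average)
```

Note that extract_output only runs once, after all partial accumulators have been merged, which is why nothing past add_input appears in the logs until the runner decides the input for a key is complete.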
For your use case, I would suggest something like the following:
import collections

import apache_beam as beam

values = [
    {'data': ['a', 'b', 'c'], 'url': 'websiteA.com'},
    {'data': ['a', 'b', 'c'], 'url': 'websiteA.com'},
    {'data': ['a'], 'url': 'websiteB.com'}
]

# This counts the number of elements that are the same.
def combine(counts):
    # A Counter is a dictionary from keys to the number of times it has
    # seen that particular key.
    c = collections.Counter()
    for d in counts:
        c.update(d)
    return dict(c)

with beam.Pipeline(options=pipeline_options) as p:
    pc = (p
          # You should replace this step with reading data from your
          # source and transforming it to the proper format for below.
          | 'create' >> beam.Create(values)
          # This step transforms each dictionary into a (key, value) tuple.
          # For this example it returns:
          # [('websiteA.com', ['a', 'b', 'c']),
          #  ('websiteA.com', ['a', 'b', 'c']),
          #  ('websiteB.com', ['a'])]
          | 'url as key' >> beam.Map(lambda x: (x['url'], x['data']))
          # This is the magic that combines all elements with the same
          # URL and outputs a count based on the items in 'data'.
          # This returns the elements:
          # [('websiteA.com', {'a': 2, 'b': 2, 'c': 2}),
          #  ('websiteB.com', {'a': 1})]
          | 'combine' >> beam.CombinePerKey(combine))

    # Do something with pc
    new_pc = pc | ...
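If you would rather keep a CombineFn subclass than pass a plain function, one possible shape is sketched below. The key point is that add_input has to fold the new element into the accumulator it receives rather than start from a fresh dict on every call. The class name `CountItemsFn` is made up for this example, and the fallback base class is an assumption that exists only so the snippet can be exercised without Beam installed. Treat it as a sketch, not a tested Dataflow job.

```python
import collections

try:
    import apache_beam as beam
    _Base = beam.CombineFn
except ImportError:  # assumption: lets the sketch run without Beam installed
    _Base = object


class CountItemsFn(_Base):
    """Hypothetical combiner: counts how often each item appears per key."""

    def create_accumulator(self):
        return collections.Counter()

    def add_input(self, counts, items):
        # Fold the new list of items into the existing accumulator.
        counts.update(items)
        return counts

    def merge_accumulators(self, accumulators):
        merged = collections.Counter()
        for acc in accumulators:
            merged.update(acc)
        return merged

    def extract_output(self, counts):
        return dict(counts)


# Exercising the methods directly, outside of any pipeline.
fn = CountItemsFn()
acc1 = fn.add_input(fn.create_accumulator(), ['a', 'b', 'c'])
acc2 = fn.add_input(fn.create_accumulator(), ['a', 'b', 'c'])
result = fn.extract_output(fn.merge_accumulators([acc1, acc2]))
print(result)
```

In the pipeline it would take the place of the plain function, for example beam.CombinePerKey(CountItemsFn()).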
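Hey, can you edit your question to include your pipeline code?

Hey @Cubez, thanks for the help, but I am still getting a KeyError, which is hard to debug. I am logging what arrives at add_input, but it is missing the key. I thought you had made a typo, but you actually meant to put the CustomCombine inside the Map()? I have tried that now, but I don't understand why it should be done that way. The new error refuses to treat a string as iterable... but the key is a string. Late edit: in add_input, do you not need to initialize counts?

Hey @dendog, I am testing my implementation now and will edit when I am done. FYI, Apache Beam has a Jupyter kernel that you can use to develop in notebooks. It's great, and I am using it to write you a better answer.

Oh nice, that's cool! So my problem is now solved. What I still don't understand is when CombinePerKey() calls merge_accumulators. How does it know when the unbounded pc is ready?

It is called whenever a window closes. For example, if you have fixed windows of 5 seconds, it will merge every 5 seconds. For unbounded PCollections you have to set these windows manually, so that when a window closes the pipeline is notified that it is ready to merge. See the documentation for details.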
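To picture what "one merge per window" means, here is a small plain-Python sketch of fixed-window assignment: each event timestamp is mapped to the start of the window that contains it, and counts are kept per (key, window start). It only mimics the idea behind FixedWindows. The helper names and the sample timestamps are invented for illustration, and real runners also deal with watermarks and triggers, which are left out here.

```python
import collections


def window_start(timestamp, size):
    """Start of the fixed window of `size` seconds that contains `timestamp`."""
    return timestamp - (timestamp % size)


def count_per_window(events, size):
    """Hypothetical helper: events are (timestamp, key, items) tuples."""
    counts = collections.defaultdict(collections.Counter)
    for timestamp, key, items in events:
        counts[(key, window_start(timestamp, size))].update(items)
    return {k: dict(v) for k, v in counts.items()}


events = [
    (3, 'websiteA.com', ['a', 'b']),
    (42, 'websiteA.com', ['a']),
    (61, 'websiteA.com', ['a']),  # falls into the next 60-second window
    (10, 'websiteB.com', ['a']),
]
result = count_per_window(events, size=60)
for (key, start), per_item in sorted(result.items()):
    print(key, start, per_item)
```

Each (key, window) pair ends up with its own combined value, which is why a windowed CombinePerKey emits one result per key per window rather than one result per key overall.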