Python: Aggregating data in a window in Apache Beam


I am receiving a stream of complex, nested JSON objects as the input to my pipeline.

My goal is to create small batches to feed back to another pubsub topic for downstream processing. I am struggling with the beam.GroupByKey() function - from what I have read, this is the right way to try and aggregate.
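(For context, a minimal self-contained sketch of the shape GroupByKey expects - plain (key, value) two-tuples:)

import apache_beam as beam

# GroupByKey consumes (key, value) pairs and emits (key, iterable of values).
with beam.Pipeline() as p:
  grouped = (p
    | beam.Create([('websiteA.com', ['a', 'b', 'c']),
                   ('websiteA.com', ['a', 'b', 'c']),
                   ('websiteB.com', ['a'])])
    | beam.GroupByKey())
  # grouped now holds ('websiteA.com', [['a', 'b', 'c'], ['a', 'b', 'c']])
  # and ('websiteB.com', [['a']])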

A simplified example with input events:

{ data:['a', 'b', 'c'], url: 'websiteA.com' }
{ data:['a', 'b', 'c'], url: 'websiteA.com' }
{ data:['a'], url: 'websiteB.com' }
I am trying to create the following:

{
'websiteA.com': {a:2, b:2, c:2},
'websiteB.com': {a:1},
}
My issue is that trying to group on anything but the simplest of tuples throws a ValueError: too many values to unpack.
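(The error itself is plain tuple unpacking, independent of Beam - each element fed to GroupByKey must unpack into exactly two parts:)

key, value = ('websiteA.com', ['a', 'b', 'c'])       # fine
key, value = ('websiteA.com', ['a', 'b'], 'extra')   # ValueError: too many values to unpack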

I could run this in two steps, but from my reading, using beam.GroupByKey() is expensive and should therefore be minimized.
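(For context on the cost claim: CombinePerKey lets the runner pre-aggregate values on each worker before the shuffle, whereas GroupByKey ships every element across the network. A sketch of the two equivalent shapes, assuming pairs is a PCollection of (key, number) tuples:)

# Both produce per-key sums; only the combiner form can be lifted onto workers.
sums_gbk = pairs | beam.GroupByKey() | beam.MapTuple(lambda k, vs: (k, sum(vs)))
sums_cpk = pairs | beam.CombinePerKey(sum)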

Edit following @Cubez's answer:

This is my combine function, which seems to half work :(

class MyCustomCombiner(beam.CombineFn):
  def create_accumulator(self):
    logging.info('accum_created')  # Logs OK!
    return {}

  def add_input(self, counts, input):
    counts = {}
    for i in input:
      counts[i] = 1
    logging.info(counts)  # Logs OK!
    return counts

  def merge_accumulators(self, accumulators):
    logging.info('accumcalled')  # never logs anything
    c = collections.Counter()
    for d in accumulators:
      c.update(d)
    logging.info('acum: %s', accumulators)  # never logs anything
    return dict(c)

  def extract_output(self, counts):
    logging.info('Counts2: %s', counts)  # never logs anything
    return counts

It seems that past add_input, nothing is being called.
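(One likely bug in the code above, separate from the logging mystery: add_input resets counts = {} on every call, discarding whatever was accumulated so far. A sketch of a version that folds each input into the running accumulator instead:)

def add_input(self, counts, input):
  # Merge this element into the accumulator rather than replacing it.
  for i in input:
    counts[i] = counts.get(i, 0) + 1
  return counts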

Adding pipeline code:

with beam.Pipeline(argv=pipeline_args) as p:
  raw_loads_dict = (p
    | 'ReadPubsubLoads' >> ReadFromPubSub(topic=PUBSUB_TOPIC_NAME).with_output_types(bytes)
    | 'JSONParse' >> beam.Map(lambda x: json.loads(x))
  )
  fixed_window_events = (raw_loads_dict
    | 'KeyOnUrl' >> beam.Map(lambda x: (x['client_id'], x['events']))
    | '1MinWindow' >> beam.WindowInto(window.FixedWindows(60))
    | 'CustomCombine' >> beam.CombinePerKey(MyCustomCombiner())
  )
  fixed_window_events | 'LogResults2' >> beam.ParDo(LogResults())
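(LogResults is not shown in the question; a plausible stand-in, so the snippet can be run end to end, might be:)

class LogResults(beam.DoFn):
  # Hypothetical logging DoFn; the asker's actual implementation is not shown.
  def process(self, element):
    logging.info('Result: %s', element)
    yield element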

This is a perfect example of needing to use a CombineFn. These transforms are used to aggregate or combine collections across multiple workers. As the docs say, CombineFns work by reading in your element (beam.CombineFn.add_input), merging multiple elements (beam.CombineFn.merge_accumulators), and then finally outputting the final combined value (beam.CombineFn.extract_output).

For example, to create a combiner that outputs the mean of a collection of numbers, it would look like this:

class AverageFn(beam.CombineFn):
  def create_accumulator(self):
    return (0.0, 0)

  def add_input(self, sum_count, input):
    (sum, count) = sum_count
    return sum + input, count + 1

  def merge_accumulators(self, accumulators):
    sums, counts = zip(*accumulators)
    return sum(sums), sum(counts)

  def extract_output(self, sum_count):
    (sum, count) = sum_count
    return sum / count if count else float('NaN')

pc = ...
average = pc | beam.CombineGlobally(AverageFn())

For your use case, I suggest the following:

values = [
          {'data':['a', 'b', 'c'], 'url': 'websiteA.com'},
          {'data':['a', 'b', 'c'], 'url': 'websiteA.com'},
          {'data':['a'], 'url': 'websiteB.com'}
]

# This counts the number of elements that are the same.
def combine(counts):
  # A counter is a dictionary from keys to the number of times it has
  # seen that particular key.
  c = collections.Counter()
  for d in counts:
    c.update(d)
  return dict(c)

with beam.Pipeline(options=pipeline_options) as p:
  pc = (p
        # You should replace this step with reading data from your
        # source and transforming it to the proper format for below.
        | 'create' >> beam.Create(values)

        # This step transforms the dictionary to a tuple. For this
        # example it returns:
        # [ ('url': 'websiteA.com', 'data':['a', 'b', 'c']),
        #   ('url': 'websiteA.com', 'data':['a', 'b', 'c']),
        #   ('url': 'websiteB.com', 'data':['a'])]
        | 'url as key' >> beam.Map(lambda x: (x['url'], x['data']))

        # This is the magic that combines all elements with the same
        # URL and outputs a count based on the keys in 'data'.
        # This returns the elements:
        # [ ('url': 'websiteA.com', {'a': 2, 'b': 2, 'c': 2}),
        #   ('url': 'websiteB.com', {'a': 1})]
        | 'combine' >> beam.CombinePerKey(combine))

  # Do something with pc
  new_pc = pc | ...
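(To close the loop on the original goal of feeding batches back to another pubsub topic, a sketch of a possible final step, assuming a streaming pipeline and a hypothetical OUTPUT_TOPIC:)

import json
from apache_beam.io import WriteToPubSub

# Serialize each (url, counts) pair and publish it; WriteToPubSub expects bytes.
_ = (pc
     | 'serialize' >> beam.Map(lambda kv: json.dumps({kv[0]: kv[1]}).encode('utf-8'))
     | 'publish' >> WriteToPubSub(topic=OUTPUT_TOPIC))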

Hey, can you edit in your pipeline code?

Hey @Cubez, thanks for the help, but I am still getting a KeyError and it is hard to debug. I am logging what arrives at add_input, but it is missing keys. I thought you had made a typo, but did you actually mean to put the CustomCombine inside a Map()? I have tried that now, but I do not understand why it would be done that way? The new error is about refusing to allow a string to be treated as iterable… but the key is a string. Late edit: in add_input, don't you need to initialize counts?

Hey @dendog, I am testing my implementation right now and will edit once it is done. FYI, Apache Beam has a Jupyter kernel that you can use to develop in notebooks; it is great, and I am using it to write you a better answer.

Oh nice, that is cool! So my issue is solved now - what I still do not understand is, when does CombinePerKey() call merge_accumulators? How does it know an unbounded PCollection is ready?

This is called whenever a window closes. For example, if you have fixed 5-second windows, it will combine every 5 seconds. For unbounded PCollections, you have to set these windows manually so the pipeline knows when a window has closed and is ready to be combined. For more information, see the Beam docs on windowing.
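(To make that last point concrete, a sketch of the windowing that tells the runner when a key's accumulators can be merged and emitted, assuming unbounded_pc is a streaming PCollection of (key, value) pairs:)

from apache_beam.transforms import window

# Each 60-second window closing is what triggers merge_accumulators and extract_output.
counts = (unbounded_pc
          | beam.WindowInto(window.FixedWindows(60))
          | beam.CombinePerKey(MyCustomCombiner()))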