Beam: is schema-agnostic ingestion/aggregation possible? (Google Cloud Dataflow, Apache Beam)


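I want to ingest a stream of objects (for example JSON) whose schemas vary and are not known in advance, and apply a custom aggregation that is known.

Is this possible in Beam?

Specifically, is it possible to:

Ingest a list of (nested) JSON objects with changing schemas:

msg1 = {"product": "apple", "price": {"currency": "JPY", "amount": 50}}

msg2 = {"product": "apple", "price": {"amount": 70}, "unavailable_field": "foo"}

Apply a custom aggregation over a (global and updating) time window:

res = {"product": "apple", "sales": 120, "currency": "JPY"}

I have been working on this for a while. I am fairly sure you will need to add some tweaks and test it on your side. Things to keep in mind: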

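For simplicity, the logic that finds the key `price` is not in my example, but I designed the code with it in mind. You need to add it.

Since we are going to be in a `GlobalWindow`, I am a bit worried that you may run into memory issues after a while. In theory, combiner lifting should make Dataflow store only the accumulators rather than all the elements. I ran a pipeline with a similar approach for 2 days without issues, but in Java (it should behave the same).

You will need some extra logic to match your exact use case, so take this as an idea rather than a finished solution.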
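The key-lookup logic mentioned in the first point could look something like the sketch below. It is only one possible approach, assuming the records are plain nested dicts or lists and that `price` may appear at any depth. The helper names `find_key` and `get_amount`, and the deeper-nested `msg3`, are hypothetical and not part of the original answer.

```python
def find_key(record, key):
    """Return the first non-None value stored under `key` anywhere in a
    nested dict/list structure, or None if the key is absent."""
    if isinstance(record, dict):
        if key in record:
            return record[key]
        children = record.values()
    elif isinstance(record, list):
        children = record
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def get_amount(record, default=0):
    """Extract the amount from wherever `price` sits; fall back to `default`."""
    price = find_key(record, "price")
    if isinstance(price, dict):
        return price.get("amount", default)
    return default


msg1 = {"product": "apple", "price": {"currency": "JPY", "amount": 50}}
msg2 = {"product": "apple", "price": {"amount": 70}, "unavailable_field": "foo"}
# Hypothetical record where `price` is nested one level deeper
msg3 = {"product": "apple", "details": {"price": {"amount": 5}}}

total = sum(get_amount(m) for m in (msg1, msg2, msg3))
```

A function like `get_amount` could stand in for the body of `_get_price`, so unknown extra fields are ignored and records without a price contribute the default.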


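I am using an advanced combiner to force combiner lifting and to be able to add the key-parsing logic. I used a 5-minute trigger, so the sum for each key is updated every 5 minutes (you can change the trigger if needed).
I will start my code assuming you have already parsed the elements from the stream, as we discussed in the comments. The elements entering my code have the following format: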
        {"product": "apple", "price": {"currency": "JPY", "amount": 50}},
        {"product": "orange", "price": {"amount": 50}},
        {"product": "apple", "price": {"amount": 10}},
        {"product": "orange", "price": {"currency": "EUR", "amount": 50}},
        {"product": "apple", "price": {"currency": "JPY", "amount": 30}}

These are passed to the pipeline:
    import apache_beam as beam
    from apache_beam.transforms import trigger
    from apache_beam.transforms.window import GlobalWindows

    class NestedDictSum(beam.CombineFn):
        def create_accumulator(self):
            # Each accumulator starts at 0
            return 0

        def add_input(self, sum_value, element):
            # Called for every new element: add its price to the accumulator
            return sum_value + self._get_price(element)

        def merge_accumulators(self, accumulators):
            # Called to merge accumulators across workers / bundles
            return sum(accumulators)

        def extract_output(self, total):
            # Build the output value from the merged accumulator
            return {"sales": total, "currency": "JPY"}

        def _get_price(self, dictionary):
            # Add your logic to find the right key here.
            # Must return the parsed price (the number you want to sum).
            return dictionary["price"]["amount"]

    def add_product(element):
        # element is a (product, aggregate) pair; return a new dict that
        # also carries the key, without mutating the input
        product, dictionary = element
        return {**dictionary, "product": product}

    # Pipeline read stream and so on; `parsed` is a placeholder name for
    # the PCollection of already-parsed dictionaries

    (parsed
     | beam.Map(lambda x: (x["product"], x))  # To KV
     | beam.WindowInto(GlobalWindows(),
                       trigger=trigger.Repeatedly(trigger.AfterProcessingTime(5 * 60)),
                       # ACCUMULATING keeps already-fired elements, so the
                       # value is updated as new elements trigger firings
                       accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
     | beam.CombinePerKey(NestedDictSum())
     | beam.Map(add_product))  # From KV to dictionary, adding the key

The output is:
{'sales': 90, 'currency': 'JPY', 'product': 'apple'}
{'sales': 100, 'currency': 'JPY', 'product': 'orange'}
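To see why combiner lifting should keep memory bounded, it can help to trace the four `CombineFn` methods by hand. The sketch below is plain Python, not a Beam runner: the class is a stand-in with the same four methods, and `combine_per_key` and the two-bundle split are hypothetical, chosen only to mimic how a runner might pre-combine inside each bundle before merging the partial accumulators.

```python
from collections import defaultdict


class NestedDictSum:
    """Plain-Python stand-in mirroring the four CombineFn methods."""

    def create_accumulator(self):
        return 0

    def add_input(self, accumulator, element):
        return accumulator + element["price"]["amount"]

    def merge_accumulators(self, accumulators):
        return sum(accumulators)

    def extract_output(self, total):
        return {"sales": total, "currency": "JPY"}


def combine_per_key(bundles, fn):
    """Pre-combine each bundle per key, then merge the partial accumulators."""
    partials = defaultdict(list)
    for bundle in bundles:
        local = {}
        for key, value in bundle:
            if key not in local:
                local[key] = fn.create_accumulator()
            local[key] = fn.add_input(local[key], value)
        # Only one number per key leaves the bundle, not the raw elements
        for key, accumulator in local.items():
            partials[key].append(accumulator)
    return {
        key: fn.extract_output(fn.merge_accumulators(accumulators))
        for key, accumulators in partials.items()
    }


elements = [
    {"product": "apple", "price": {"currency": "JPY", "amount": 50}},
    {"product": "orange", "price": {"amount": 50}},
    {"product": "apple", "price": {"amount": 10}},
    {"product": "orange", "price": {"currency": "EUR", "amount": 50}},
    {"product": "apple", "price": {"currency": "JPY", "amount": 30}},
]
keyed = [(e["product"], e) for e in elements]
bundles = [keyed[:3], keyed[3:]]  # hypothetical split across two workers

result = combine_per_key(bundles, NestedDictSum())
```

Each bundle emits a single integer per key, which is why only accumulators, not raw elements, need to be kept between trigger firings.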

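Note that in a real streaming scenario, this value would be updated every 5 minutes.
Also, I think this particular use case might benefit from using stateful DoFns. That would give you finer-grained control over the elements and may fit better than what I posted.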
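I think the problem is in `Map(print)`; note that the `print` function does not return anything. Try removing the prints, or add your own print function that returns the element.

@Iñigo: Thanks for the hint. The problem persists, but I will keep those prints in the question.

Oh, I just found the problem: in `beam.Map(create_random_record)` you are returning a dictionary rather than a key-value pair. `CombinePerKey` needs a tuple whose first index is the key and whose second index is the value. I don't know what you intend the key to be, so you may need to add some extra logic.

Thanks. I modified the code to return a list of tuples and added a global time window. Same error.

OK, last fix (I'm on my phone and can't see the full code). I see a couple of mistakes. (1) You are using an unbounded source (PubSub), so in order to aggregate you need to split it into bounded data, which means you need a window. `GlobalWindow` is the default window, so you are not actually splitting the data; try adding fixed windows (`FixedWindows`) or session windows (`Sessions`), depending on your use case. (2) In the first `Map` you are returning a list. Note that `Map` is a one-to-one operation; since you want to return multiple elements, use `FlatMap` or `ParDo`. Once you have tried this, I will add an actual answer.

Thank you very much for your effort and the result! I was able to re-run your code on my side. Note: each currency should be aggregated separately. Only values without a currency should default to "JPY". Is there a way to improve your code so that "EUR" is treated as its own total?

Sure, it is as simple as creating another key before the aggregation: you can key by "product + currency", or just aggregate twice. Another approach is to convert to the default currency in `_get_price`.

Thanks! Two extra questions where your guidance on Beam would be appreciated (design patterns, libraries): 1) How would you incorporate aggregation rules that need access to some of the earlier raw messages (e.g. choosing the default currency as the mode, i.e. the most frequently observed currency in the raw incoming messages)? 2) How would you persist the raw/accumulator data on disk (e.g. every 10th trigger firing), read from it, and avoid memory issues? Let me know whether these deserve a new thread or are trivial enough. Thanks.

(1) I am not sure I understand, but the concept of "previous" is vague in Beam. I think stateful DoFns are a good approach. (2) The data should be persisted automatically during the combiner (merge) step; if I am not mistaken about combiner lifting, only the accumulators are persisted. In any case, stateful DoFns can help you handle the data more flexibly here.

Thank you very much, @Iñigo, for your contributions on this.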
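The "product + currency" key suggested in the comments could be sketched as follows. This is a plain-Python illustration rather than pipeline code: the name `to_keyed_amount`, the `DEFAULT_CURRENCY` constant, and the dictionary-based summation are assumptions made for the example, with the loop standing in for what a per-key sum would do.

```python
from collections import defaultdict

DEFAULT_CURRENCY = "JPY"  # applied only when a record carries no currency


def to_keyed_amount(record):
    """Map a record to ((product, currency), amount)."""
    price = record.get("price", {})
    currency = price.get("currency", DEFAULT_CURRENCY)
    return (record["product"], currency), price.get("amount", 0)


elements = [
    {"product": "apple", "price": {"currency": "JPY", "amount": 50}},
    {"product": "orange", "price": {"amount": 50}},
    {"product": "apple", "price": {"amount": 10}},
    {"product": "orange", "price": {"currency": "EUR", "amount": 50}},
    {"product": "apple", "price": {"currency": "JPY", "amount": 30}},
]

# Stand-in for a per-key sum: accumulate amounts per (product, currency)
totals = defaultdict(int)
for key, amount in map(to_keyed_amount, elements):
    totals[key] += amount
```

In a pipeline, a function like this would presumably replace the `Map(lambda x: (x["product"], x))` step, with `beam.CombinePerKey(sum)` doing the aggregation, so EUR sales of oranges are reported separately from JPY ones.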