Beam/Dataflow: large CoGroupByKey results make the pipeline slow

parallel-processing google-cloud-dataflow

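I have two PCollections: one with about 150M elements and a second with about 2B.

What I want to do is count the number of occurrences of each unique pair of values across the two PCollections.

So I ran a CoGroupByKey on the two PCollections. The problem is that some of the CoGbkResults (~5M of them) are very large (Dataflow logs messages saying that a CoGbkResult has more than 10K results), because each key can appear many times in both collections. As a result, the workers that receive those keys run for a very long time.

Ideally, I would like CoGroupByKey to return a PCollection containing all pairs of values from the two PCollections that are co-grouped by key, so that I can count them in a way that parallelizes better.

I have been reading about this problem, but none of the solutions seem to fit my case (most of them involve Combine.withHotKeyFanout), because I need an extra mapping step before the combine, and that step takes a very long time because of the size of the CoGbkResult.

Any suggestions on how to solve this?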

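Are you able to reformat your data so that you can replace CoGroupByKey with CombinePerKey?

CoGroupByKey / GroupByKey builds a list of all the matches, which can get very large, but you only care about the counts, right? So you could use CombinePerKey with a CombineFn that counts the elements as they come in.

Reformat your PCollections from this: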

pcoll_a = [('abc','123'), ('abc', '456'), ...]
pcoll_b = [('abc','123'), ('xyz', '456'), ...]

to something like this:

pcoll_a = [('abc,123', 'A'), ('abc,456', 'A'), ...]
pcoll_b = [('abc,123', 'B'), ('xyz,456', 'B'), ...]

Flatten the two PCollections together:

pcoll_combined = [('abc,123', 'A'), ('abc,456', 'A'), ('abc,123', 'B'), ('xyz,456', 'B'), ...]
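
The reformat-and-flatten steps above can be sketched in plain Python, with lists standing in for PCollections. This is only an illustration: `tag_pair` is a hypothetical helper name, not part of Beam. In an actual pipeline the tagging would presumably be a `beam.Map` over each input and the concatenation a `beam.Flatten`.

```python
from itertools import chain


def tag_pair(element, source):
    """Turn a (key, value) element into ('key,value', source).

    `source` is 'A' or 'B' and records which collection the element came from.
    """
    key, value = element
    return (f"{key},{value}", source)


# Small in-memory stand-ins for the two PCollections.
pcoll_a = [('abc', '123'), ('abc', '456')]
pcoll_b = [('abc', '123'), ('xyz', '456')]

# Reformat each collection so the whole (key, value) pair becomes the new key.
tagged_a = [tag_pair(element, 'A') for element in pcoll_a]
tagged_b = [tag_pair(element, 'B') for element in pcoll_b]

# "Flatten": concatenate the two tagged collections into one.
pcoll_combined = list(chain(tagged_a, tagged_b))
```

One thing to watch: joining key and value with a comma is ambiguous if either can itself contain a comma. Using the tuple `(key, value)` as the new key would likely avoid that.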

Pass this into CombinePerKey with a CombineFn that sums up the counts. Something like this:

import apache_beam as beam


class CountFn(beam.CombineFn):
    """Counts, per key, how many elements came from source 'A' and from 'B'."""

    def create_accumulator(self):
        # One running counter per source collection.
        return {'sum_A': 0, 'sum_B': 0}

    def add_input(self, accumulator, element, *args, **kwargs):
        # `element` is the source tag ('A' or 'B') attached in the reformat step.
        if element == 'A':
            accumulator['sum_A'] += 1
        elif element == 'B':
            accumulator['sum_B'] += 1
        return accumulator

    def merge_accumulators(self, accumulators, *args, **kwargs):
        # Add up partial counts. A single pass is used because `accumulators`
        # may be an iterable that can only be consumed once.
        merged = self.create_accumulator()
        for accumulator in accumulators:
            merged['sum_A'] += accumulator['sum_A']
            merged['sum_B'] += accumulator['sum_B']
        return merged

    def extract_output(self, accumulator, *args, **kwargs):
        return accumulator
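
To see why this tends to parallelize better than grouping, here is a rough plain-Python simulation of how a combiner can be applied: each bundle of input is reduced to small partial accumulators first, and only those accumulators are merged per key. It is a sketch under assumptions, not Beam's actual execution model. The function names and the split into two bundles are made up for illustration.

```python
from collections import defaultdict


def create_accumulator():
    return {'sum_A': 0, 'sum_B': 0}


def add_input(accumulator, tag):
    # Count one element from source 'A' or 'B'.
    if tag == 'A':
        accumulator['sum_A'] += 1
    elif tag == 'B':
        accumulator['sum_B'] += 1
    return accumulator


def merge_accumulators(accumulators):
    merged = create_accumulator()
    for accumulator in accumulators:
        merged['sum_A'] += accumulator['sum_A']
        merged['sum_B'] += accumulator['sum_B']
    return merged


def combine_bundle(bundle):
    """Reduce one bundle to a single small accumulator per key."""
    partial = defaultdict(create_accumulator)
    for key, tag in bundle:
        add_input(partial[key], tag)
    return partial


def combine_per_key(bundles):
    """Merge the per-bundle partial accumulators key by key."""
    per_key = defaultdict(list)
    for bundle in bundles:
        for key, accumulator in combine_bundle(bundle).items():
            per_key[key].append(accumulator)
    return {key: merge_accumulators(accs) for key, accs in per_key.items()}


# Hypothetical split of the flattened input into two bundles.
bundles = [
    [('abc,123', 'A'), ('abc,456', 'A')],
    [('abc,123', 'B'), ('xyz,456', 'B')],
]
counts = combine_per_key(bundles)
```

However many times a key appears, each bundle contributes only one fixed-size accumulator for it, so no worker has to hold a 10K+ element list for a single key.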

Can you explain a bit more about the operation you are doing?