Beam/Dataflow: large CoGroupByKey results make the pipeline slow

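I have two PCollections, one of size ~150M and the other of size ~2B.

What I want to do is count the number of occurrences of each unique pair of values across the two PCollections.

So I applied a CoGroupByKey to the two PCollections. The problem is that some of the CoGbkResults (~5M) are very large (I get log messages in Dataflow saying that a CoGbkResult has more than 10K results), because each key can appear many times in both collections. As a result, the workers that receive those keys run for a very long time.

Ideally, I would like CoGroupByKey to return a PCollection containing all the pairs of values co-grouped by key from the two PCollections, so that I could count them in a way that parallelizes better.

I have been reading about this problem, but none of the solutions seem to fit my case (most of them involve Combine.WithHotKeyFanout): before the combine I need an extra mapping step, and that step takes a very long time because of the size of the CoGbkResult.

Any suggestions on how to solve this?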

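Are you able to reformat your data so that you can replace CoGroupByKey with CombinePerKey?

CoGroupByKey and GroupByKey build a list of all the matches, and those lists can get very large, but you only care about the counts, right? So you could use CombinePerKey with a CombineFn to count the elements as they come in.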
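
To make the difference concrete, here is a rough, Beam-free sketch of what "counting as they come in" means. The function names mirror the CombineFn methods, but `bundles` and `count_per_key` are hypothetical stand-ins for how a runner might split and merge work; a real runner makes those decisions itself.

```python
from collections import defaultdict


def create_accumulator():
    return 0


def add_input(accumulator, _element):
    # Only the running count is kept; the element itself is discarded.
    return accumulator + 1


def merge_accumulators(accumulators):
    return sum(accumulators)


def count_per_key(bundles):
    """Counts values per key across bundles without building per-key lists."""
    partials = []
    for bundle in bundles:
        # Each bundle (think: one worker's slice) keeps one small counter per key.
        local = defaultdict(create_accumulator)
        for key, value in bundle:
            local[key] = add_input(local[key], value)
        partials.append(local)

    # Only the small partial counters are brought together per key.
    by_key = defaultdict(list)
    for local in partials:
        for key, acc in local.items():
            by_key[key].append(acc)
    return {key: merge_accumulators(accs) for key, accs in by_key.items()}
```

The point of the sketch is that a hot key costs one integer per bundle rather than one list entry per element, which is what a grouping step would have to materialize.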

Reformat your PCollections from this:

pcoll_a = [('abc','123'), ('abc', '456'), ...]
pcoll_b = [('abc','123'), ('xyz', '456'), ...]

into this:

pcoll_a = [('abc,123', 'A'), ('abc,456', 'A'), ...]
pcoll_b = [('abc,123', 'B'), ('xyz,456', 'B'), ...]
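
One possible way to do this re-keying is a small mapping function. The name `tag_with_source` and the `source` parameter are made up for illustration; they are not part of the original answer.

```python
def tag_with_source(kv, source):
    """Turns (key, value) into ('key,value', source)."""
    key, value = kv
    return (f'{key},{value}', source)


# Assumed usage inside a pipeline (beam.Map forwards extra keyword arguments):
#   tagged_a = pcoll_a | 'TagA' >> beam.Map(tag_with_source, source='A')
#   tagged_b = pcoll_b | 'TagB' >> beam.Map(tag_with_source, source='B')
```

Note that joining with a comma is ambiguous if keys or values can themselves contain commas; in that case a different separator, or a tuple key, may be safer.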

Flatten the two PCollections together:

pcoll_combined = [('abc,123', 'A'), ('abc,456', 'A'), ('abc,123', 'B'), ('xyz,456', 'B'), ...]

Pass this into CombinePerKey, using a CombineFn to sum up the counts. Something like this:

import apache_beam as beam


class CountFn(beam.CombineFn):
    """Counts, per key, how many elements were tagged 'A' and how many 'B'."""

    def create_accumulator(self):
        return {'sum_A': 0, 'sum_B': 0}

    def add_input(self, accumulator, element, *args, **kwargs):
        # `element` is the source tag attached before flattening.
        if element == 'A':
            accumulator['sum_A'] += 1
        elif element == 'B':
            accumulator['sum_B'] += 1
        return accumulator

    def merge_accumulators(self, accumulators, *args, **kwargs):
        # Iterate only once: `accumulators` may be a one-shot iterable.
        merged = self.create_accumulator()
        for acc in accumulators:
            merged['sum_A'] += acc['sum_A']
            merged['sum_B'] += acc['sum_B']
        return merged

    def extract_output(self, accumulator, *args, **kwargs):
        return accumulator

Can you explain a bit more about the operation you are trying to perform?