Google Cloud Dataflow: Splittable DoFn causes "Shuffle key too large" error


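I am trying to implement a ListFlatten function. I first implemented it as a simple DoFn, which works fine, but in order to parallelize it I am converting the function to a Splittable DoFn. A unit test with 5000 elements runs locally with DirectRunner, but the same code fails on Dataflow with the following error: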

Error Details: 
java.lang.RuntimeException: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: java.io.IOException: INVALID_ARGUMENT: Shuffle key too large:3749653 > 1572864
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output (GroupAlsoByWindowsParDoFn.java:184)
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue (GroupAlsoByWindowFnRunner.java:102)
at org.apache.beam.runners.dataflow.worker.util.BatchGroupAlsoByWindowViaIteratorsFn.processElement (BatchGroupAlsoByWindowViaIteratorsFn.java:126)
at org.apache.beam.runners.dataflow.worker.util.BatchGroupAlsoByWindowViaIteratorsFn.processElement (BatchGroupAlsoByWindowViaIteratorsFn.java:54)
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement (GroupAlsoByWindowFnRunner.java:115)
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement (GroupAlsoByWindowFnRunner.java:73)
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement (GroupAlsoByWindowsParDoFn.java:114)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process (ParDoOperation.java:44)
at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process (OutputReceiver.java:49)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop (ReadOperation.java:201)
Caused by: org.apache.beam.sdk.util.UserCodeException: java.lang.RuntimeException: java.io.IOException: INVALID_ARGUMENT: Shuffle key too large:3749653 > 1572864
at com.abc.common.batch.functions.AbcListFlattenFn.splitRestriction (AbcListFlattenFn.java:68)

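The data differs between the local DirectRunner and the cloud DataflowRunner as follows.

DirectRunner, local:

  • The sample input PCollection element contains 5000 ABCs

DataflowRunner, in the cloud:

  • 600 input PCollection elements, each containing a different number of ABCs
  • A few input elements have 50000 ABCs to flatten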
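Since the error reports a 3749653-byte key against a 1572864-byte limit, it can help to measure locally how large a single element becomes once it is encoded. The following is only a rough sketch: it assumes plain Java serialization is a reasonable approximation of what the pipeline's coder produces (the real coder may differ), the limit constant is copied from the error message rather than from any documentation, and `EncodedSizeCheck`, `encodedSize`, and `exceedsLimit` are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;

public class EncodedSizeCheck {
    // Copied from the error message ("... > 1572864"); an assumption, not a documented constant.
    public static final long SHUFFLE_KEY_LIMIT_BYTES = 1_572_864L;

    /** Returns the number of bytes produced by Java-serializing the value. */
    public static long encodedSize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed (and flushed) before the size is read.
        return bytes.size();
    }

    /** True if the serialized value is larger than the limit from the error message. */
    public static boolean exceedsLimit(Serializable value) {
        return encodedSize(value) > SHUFFLE_KEY_LIMIT_BYTES;
    }

    public static void main(String[] args) {
        // Stand-ins for a small and a very large element; real Abc objects will have other sizes.
        ArrayList<String> small = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            small.add("abc-" + i);
        }
        ArrayList<String> large = new ArrayList<>();
        for (int i = 0; i < 200_000; i++) {
            large.add("abc-" + i);
        }
        System.out.println("small exceeds limit: " + exceedsLimit(small));
        System.out.println("large exceeds limit: " + exceedsLimit(large));
    }
}
```

Running a check like this over the real input elements in a unit test would show which of them are likely to trip the limit before the job is submitted to Dataflow.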

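Can someone tell me what is wrong with the ListFlatten function? Is splitRestriction causing the problem above? How can I resolve this shuffle key size issue?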

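The shuffle key size limit is determined by the proto size. To work around it, you may want to add a Reshuffle before the SDF. The reshuffle will give you a first round of distribution.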

Were you able to resolve this issue?
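Code from the question:

    import java.io.Serializable;
    import java.util.List;
    import org.apache.beam.sdk.io.range.OffsetRange;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.splittabledofn.OffsetRangeTracker;
    import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
    import org.apache.beam.sdk.values.KV;

    public class AbcList implements Serializable {
        private List<Abc> abcs;
        private List<Xyz> xyzs;
        // getAbcs() and getXyzs() accessors are not shown in the question
    }

    public class AbcListFlattenFn extends DoFn<AbcList, KV<Abc, List<Xyz>>> {

        // "log" is a logger field that is not shown in the question.

        @ProcessElement
        public void process(@Element AbcList input,
            ProcessContext context, RestrictionTracker<OffsetRange, Long> tracker) {
            try {
                /* Version without the Splittable DoFn:
                   input.getAbcs().stream().forEach(abc -> {
                       context.output(KV.of(abc, input.getXyzs()));
                   }); */

                // Claim each index of the current restriction and emit one KV per Abc.
                for (long index = tracker.currentRestriction().getFrom();
                    tracker.tryClaim(index); ++index) {
                    context.output(
                        KV.of(input.getAbcs().get(Math.toIntExact(index)), input.getXyzs()));
                }
            } catch (Exception e) {
                log.error("Flattening AbcList has failed ", e);
            }
        }

        // The initial restriction covers every index of the Abc list.
        @GetInitialRestriction
        public OffsetRange getInitialRestriction(AbcList input) {
            return new OffsetRange(0, input.getAbcs().size());
        }

        // Split into ranges of at most 5000 offsets, with a minimum of 2000 per split.
        @SplitRestriction
        public void splitRestriction(final AbcList input,
            final OffsetRange range, final OutputReceiver<OffsetRange> receiver) {
            List<OffsetRange> ranges =
                range.split(Math.min(input.getAbcs().size(), 5000), 2000);
            for (final OffsetRange p : ranges) {
                receiver.output(p);
            }
        }

        @NewTracker
        public OffsetRangeTracker newTracker(OffsetRange range) {
            return new OffsetRangeTracker(range);
        }
    }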
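
To get a feel for how many restrictions `splitRestriction` emits for one large element, the chunking can be approximated in plain Java. This is a simplified stand-in written for illustration, not Beam's `OffsetRange.split` implementation, and Beam may treat the trailing chunk differently. `RangeChunks`, `Range`, and `chunk` are hypothetical names. The rule assumed here is: cut fixed-size chunks of the desired length and fold a trailing remainder shorter than the minimum into the last chunk.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeChunks {
    /** Half-open range [from, to); a stand-in for Beam's OffsetRange. */
    public static final class Range {
        public final long from;
        public final long to;

        public Range(long from, long to) {
            this.from = from;
            this.to = to;
        }

        @Override
        public String toString() {
            return "[" + from + ", " + to + ")";
        }
    }

    /** Splits [from, to) into chunks of about `desired` offsets, none shorter than `min` unless it is the only one. */
    public static List<Range> chunk(long from, long to, long desired, long min) {
        List<Range> out = new ArrayList<>();
        if (from >= to) {
            return out; // empty range, nothing to split
        }
        if (desired <= 0) {
            throw new IllegalArgumentException("desired must be positive");
        }
        long start = from;
        while (start < to) {
            long end = Math.min(start + desired, to);
            // Fold a remainder shorter than `min` into the current chunk.
            if (to - end < min) {
                end = to;
            }
            out.add(new Range(start, end));
            start = end;
        }
        return out;
    }

    public static void main(String[] args) {
        // Same arguments as the question: at most 5000 per split, at least 2000.
        for (Range r : chunk(0, 50_000, 5_000, 2_000)) {
            System.out.println(r);
        }
    }
}
```

Under these assumptions, an element with 50000 ABCs yields ten ranges of 5000 offsets each, while an element with 6000 ABCs stays in a single range because the 1000-offset remainder is below the minimum.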