Increase in Google Cloud Dataflow workers causes the Dataflow job to hang on TextIO.Write - executes quickly with DirectRunner - Apache Beam

Tags: google-cloud-dataflow, apache-beam, apache-beam-io

The program ingests records from a file, parses them, saves the parsed records to a database, and writes the failure records to a Cloud Storage bucket. The test file I am using produces only 3 failure records - when the pipeline runs locally, the last step,
parseResults.get(failedRecords).apply("WriteFailedRecordsToGCS", TextIO.write().to(failureRecordsPath)) executes in milliseconds.

In Dataflow I run the pipeline with 5 workers. Even after the 3 failure records have been written successfully, the job hangs indefinitely on the write step. I can see it stuck at the step
WriteFailedRecordsToGCS/WriteFiles/FinalizeTempFileBundles/Reshuffle.ViaRandomKey/Pair with random key.out0

Can anyone tell me why DirectRunner and Dataflow behave so differently? The full pipeline is below.

        StageUtilizationDataSourceOptions options = PipelineOptionsFactory.fromArgs(args).as(StageUtilizationDataSourceOptions.class);
        final TupleTag<Utilization> parsedRecords = new TupleTag<Utilization>("parsedRecords") {};
        final TupleTag<String> failedRecords = new TupleTag<String>("failedRecords") {};
        DrgAnalysisDbStage drgAnalysisDbStage = new DrgAnalysisDbStage(options);
        HashMap<String, Client> clientKeyMap = drgAnalysisDbStage.getClientKeys();

        Pipeline pipeline = Pipeline.create(options);
        PCollectionTuple parseResults = PCollectionTuple.empty(pipeline);

        PCollection<String> records = pipeline.apply("ReadFromGCS", TextIO.read().from(options.getGcsFilePath()));

        if (FileTypes.utilization.equalsIgnoreCase(options.getFileType())) {
             parseResults = records
                    .apply("ConvertToUtilizationRecord", ParDo.of(new ParseUtilizationFile(parsedRecords, failedRecords, clientKeyMap, options.getGcsFilePath()))
                    .withOutputTags(parsedRecords, TupleTagList.of(failedRecords)));
             parseResults.get(parsedRecords).apply("WriteToUtilizationStagingTable", drgAnalysisDbStage.writeUtilizationRecordsToStagingTable());
        } else {
            logger.error("Unrecognized file type provided: " + options.getFileType());
        }

        String failureRecordsPath = Utilities.getFailureRecordsPath(options.getGcsFilePath(), options.getFileType());
        parseResults.get(failedRecords).apply("WriteFailedRecordsToGCS", TextIO.write().to(failureRecordsPath));

        pipeline.run().waitUntilFinish();

If the Dataflow job is launched with just one worker, the write step behaves exactly as it does with DirectRunner - it completes successfully within a second. Why does adding workers stall the write step so badly?

Do your firewall rules allow the Dataflow workers to communicate with one another? All of the other steps seem to execute fine with multiple workers, but I will check. Yes, but those steps can be fused and executed sequentially within the same worker; at the Reshuffle step, however, the data has to be reshuffled across workers.

The job hung because my firewall rules were not configured to allow the workers to communicate with each other. Configure the firewall correctly by following the instructions at the link below. Following those instructions resolved my problem completely.
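For reference, a minimal sketch of the kind of rule the Dataflow firewall documentation describes: allow TCP traffic on ports 12345-12346 (the ports Dataflow workers use to exchange shuffled data) between VMs carrying the dataflow network tag. The rule name and network name below are placeholders, not values taken from this pipeline:

        # Allow Dataflow workers on YOUR_NETWORK to talk to each other on the shuffle ports
        gcloud compute firewall-rules create allow-dataflow-worker-communication \
            --network=YOUR_NETWORK \
            --action=allow \
            --direction=ingress \
            --source-tags=dataflow \
            --target-tags=dataflow \
            --rules=tcp:12345-12346

If the workers run in a custom or Shared VPC, the rule has to be created in that network; on the default network, an allow-internal rule that covers this traffic is usually already present.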