Increase in Google Cloud Dataflow workers causes the Dataflow job to hang on TextIO.Write - executes quickly with DirectRunner - Apache Beam

Tags: google-cloud-dataflow, apache-beam, apache-beam-io

The program ingests records from a file, parses them, saves the parsed records to a database, and writes the failure records to a Cloud Storage bucket. The test file I am using produces only 3 failure records - when the pipeline runs locally, the last step,
parseResults.get(failedRecords).apply("WriteFailedRecordsToGCS", TextIO.write().to(failureRecordsPath)) executes in milliseconds.

In Dataflow I run the pipeline with 5 workers. Even after the 3 failure records have been written successfully, the job hangs indefinitely on the write step. I can see it stuck at the step
WriteFailedRecordsToGCS/WriteFiles/FinalizeTempFileBundles/Reshuffle.ViaRandomKey/Pair with random key.out0

Can anyone tell me why DirectRunner and Dataflow behave so differently? The full pipeline is below.

        StageUtilizationDataSourceOptions options = PipelineOptionsFactory.fromArgs(args).as(StageUtilizationDataSourceOptions.class);
        final TupleTag<Utilization> parsedRecords = new TupleTag<Utilization>("parsedRecords") {};
        final TupleTag<String> failedRecords = new TupleTag<String>("failedRecords") {};
        DrgAnalysisDbStage drgAnalysisDbStage = new DrgAnalysisDbStage(options);
        HashMap<String, Client> clientKeyMap = drgAnalysisDbStage.getClientKeys();

        Pipeline pipeline = Pipeline.create(options);
        PCollectionTuple parseResults = PCollectionTuple.empty(pipeline);

        PCollection<String> records = pipeline.apply("ReadFromGCS", TextIO.read().from(options.getGcsFilePath()));

        if (FileTypes.utilization.equalsIgnoreCase(options.getFileType())) {
             parseResults = records
                    .apply("ConvertToUtilizationRecord", ParDo.of(new ParseUtilizationFile(parsedRecords, failedRecords, clientKeyMap, options.getGcsFilePath()))
                    .withOutputTags(parsedRecords, TupleTagList.of(failedRecords)));
             parseResults.get(parsedRecords).apply("WriteToUtilizationStagingTable", drgAnalysisDbStage.writeUtilizationRecordsToStagingTable());
        } else {
            logger.error("Unrecognized file type provided: " + options.getFileType());
        }

        String failureRecordsPath = Utilities.getFailureRecordsPath(options.getGcsFilePath(), options.getFileType());
        parseResults.get(failedRecords).apply("WriteFailedRecordsToGCS", TextIO.write().to(failureRecordsPath));

        pipeline.run().waitUntilFinish();

If the Dataflow job is launched with just one worker, the write step behaves exactly as it does with DirectRunner - it completes successfully within a second. Why does adding workers stall the write step so badly?

Do your firewall rules allow the Dataflow workers to communicate with one another? All of the other steps seem to execute fine with multiple workers, but I will check. Yes, but those steps can be fused and executed sequentially within the same worker; at the Reshuffle step, however, the data has to be reshuffled across workers.

The job hung because my firewall rules were not configured to allow the workers to communicate with each other. Configure the firewall correctly by following the instructions at the link below. Following those instructions resolved my problem completely.
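For reference, a minimal sketch of the kind of rule the Dataflow firewall documentation describes: allow TCP traffic on ports 12345-12346 (the ports Dataflow workers use to exchange shuffled data) between VMs carrying the dataflow network tag. The rule name and network name below are placeholders, not values taken from this pipeline:

        # Allow Dataflow workers on YOUR_NETWORK to talk to each other on the shuffle ports
        gcloud compute firewall-rules create allow-dataflow-worker-communication \
            --network=YOUR_NETWORK \
            --action=allow \
            --direction=ingress \
            --source-tags=dataflow \
            --target-tags=dataflow \
            --rules=tcp:12345-12346

If the workers run in a custom or Shared VPC, the rule has to be created in that network; on the default network, an allow-internal rule that covers this traffic is usually already present.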