Google Cloud Platform: How do I count the number of lines in the input file of a Google Dataflow file-processing pipeline?


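I am trying to count the number of lines in the input file, and I am using the Cloud DataflowRunner to create a template. In the code below, I read files from a GCS bucket, process them, and store the output in a Redis instance.

However, I cannot work out how to count the number of lines in the input file.

Main class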

 public static void main(String[] args) {
    /**
     * Constructed StorageToRedisOptions object using the method PipelineOptionsFactory.fromArgs to read options from command-line
     */
    StorageToRedisOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(StorageToRedisOptions.class);

    Pipeline p = Pipeline.create(options);
    p.apply("Reading Lines...", TextIO.read().from(options.getInputFile()))
            .apply("Transforming data...",
                    ParDo.of(new DoFn<String, String[]>() {
                        @ProcessElement
                        public void TransformData(@Element String line, OutputReceiver<String[]> out) {
                            String[] fields = line.split("\\|");
                            out.output(fields);
                        }
                    }))
            .apply("Processing data...",
                    ParDo.of(new DoFn<String[], KV<String, String>>() {
                        @ProcessElement
                        public void ProcessData(@Element String[] fields, OutputReceiver<KV<String, String>> out) {
                            if (fields[RedisIndex.GUID.getValue()] != null) {

                                out.output(KV.of("firstname:"
                                        .concat(fields[RedisIndex.FIRSTNAME.getValue()]), fields[RedisIndex.GUID.getValue()]));

                                out.output(KV.of("lastname:"
                                        .concat(fields[RedisIndex.LASTNAME.getValue()]), fields[RedisIndex.GUID.getValue()]));

                                out.output(KV.of("dob:"
                                        .concat(fields[RedisIndex.DOB.getValue()]), fields[RedisIndex.GUID.getValue()]));

                                out.output(KV.of("postalcode:"
                                        .concat(fields[RedisIndex.POSTAL_CODE.getValue()]), fields[RedisIndex.GUID.getValue()]));

                            }
                        }
                    }))
            .apply("Writing field indexes into redis",
            RedisIO.write().withMethod(RedisIO.Write.Method.SADD)
                    .withEndpoint(options.getRedisHost(), options.getRedisPort()));
    p.run();

}

Sample input file

xxxxxxxxxxxxxxxx|bruce|wayne|31051989|444444444444
yyyyyyyyyyyyyyyy|selina|thomas|01051989|222222222222
aaaaaaaaaaaaaaaa|clark|kent|31051990|666666666666

Command to execute the pipeline

mvn compile exec:java \
  -Dexec.mainClass=com.viveknaskar.DataFlowPipelineForMemStore \
  -Dexec.args="--project=my-project-id \
  --jobName=dataflow-job \
  --inputFile=gs://my-input-bucket/*.txt \
  --redisHost=127.0.0.1 \
  --stagingLocation=gs://pipeline-bucket/stage/ \
  --dataflowJobFile=gs://pipeline-bucket/templates/dataflow-template \
  --runner=DataflowRunner"
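
To make the effect of the two ParDo steps concrete, the following is a minimal plain-Java sketch (no Beam involved) of what they would emit for one sample line. The class name IndexEntryDemo and the column positions are assumptions made for illustration: the RedisIndex enum is not shown in the question, so the order GUID, FIRSTNAME, LASTNAME, DOB, POSTAL_CODE is only inferred from the sample data.

```java
import java.util.ArrayList;
import java.util.List;

class IndexEntryDemo {
    // Assumed column positions; the real RedisIndex enum is not shown in the question.
    static final int GUID = 0;
    static final int FIRSTNAME = 1;
    static final int LASTNAME = 2;
    static final int DOB = 3;
    static final int POSTAL_CODE = 4;

    /** Mirrors the two ParDo steps: split on '|' and build {key, guid} pairs. */
    static List<String[]> indexEntries(String line) {
        String[] fields = line.split("\\|");
        String guid = fields[GUID];
        List<String[]> entries = new ArrayList<>();
        entries.add(new String[] {"firstname:" + fields[FIRSTNAME], guid});
        entries.add(new String[] {"lastname:" + fields[LASTNAME], guid});
        entries.add(new String[] {"dob:" + fields[DOB], guid});
        entries.add(new String[] {"postalcode:" + fields[POSTAL_CODE], guid});
        return entries;
    }

    public static void main(String[] args) {
        String sample = "xxxxxxxxxxxxxxxx|bruce|wayne|31051989|444444444444";
        for (String[] entry : indexEntries(sample)) {
            // Each pair corresponds to one SADD of the GUID into the set named by the key.
            System.out.println(entry[0] + " -> " + entry[1]);
        }
    }
}
```

Under these assumptions every input line yields four key/GUID pairs, so the number of Redis writes is not the same as the number of input lines. This is why the line count has to be taken from the PCollection produced by TextIO, before the transforms.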

I tried using the code below, but it does not work for me:

PipelineOptions options = ...;
DirectPipelineRunner runner = DirectPipelineRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);
PCollection<Long> countPC =
    p.apply(TextIO.Read.from("gs://..."))
     .apply(Count.<String>globally());
DirectPipelineRunner.EvaluationResults results = runner.run(p);
long count = results.getPCollection(countPC).get(0);

I have also read the Apache Beam documentation but did not find anything helpful. Any help with this would be greatly appreciated.

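The correct approach is to write the count to a storage system using a Beam connector (or a Beam ParDo). The pipeline result is not directly available to the main program, because a Beam runner may parallelize the computation and may not execute it on the same machine as the main program.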

For example (pseudocode):

p.apply(TextIO.read().from("gs://..."))
 .apply(Count.globally())
 .apply(ParDo.of(new MyLongToStringParDo()))  // placeholder DoFn<Long, String>
 .apply(TextIO.write().to("gs://..."));

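If you need to process the output directly in the main program, you can read it from GCS with a client library after the Beam program has finished (in that case, make sure to call p.run().waitUntilFinish()). Alternatively, you can move the computation that needs the count into a Beam PTransform and make it part of the pipeline.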
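
The reason the count is not simply available to the main program can be pictured with a small plain-Java sketch. It is not Beam code: the class name PartialCountDemo and the notion of "shards" are illustrative assumptions standing in for the splits of the input that different workers might process. Each shard is counted independently, and the global count exists only after the partial counts are combined, which is the job that Count.globally() performs inside the pipeline.

```java
import java.util.Arrays;
import java.util.List;

class PartialCountDemo {
    /**
     * Counts elements across shards. Each shard is counted on its own
     * (possibly on a different thread), then the partial counts are summed.
     */
    static long globalCount(List<List<String>> shards) {
        return shards.parallelStream()
                .mapToLong(List::size) // partial count for one shard
                .sum();                // combine step
    }

    public static void main(String[] args) {
        // Hypothetical split of the three sample lines into two shards.
        List<List<String>> shards = Arrays.asList(
                Arrays.asList(
                        "xxxxxxxxxxxxxxxx|bruce|wayne|31051989|444444444444",
                        "yyyyyyyyyyyyyyyy|selina|thomas|01051989|222222222222"),
                Arrays.asList(
                        "aaaaaaaaaaaaaaaa|clark|kent|31051990|666666666666"));
        System.out.println("Total lines: " + globalCount(shards));
    }
}
```

In a real Dataflow job the shards live on separate worker machines, so the combined value has to be written somewhere (a file, a log entry, a database) before anything outside the pipeline can see it.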

I solved this by adding Count.globally() and applying it to the PCollection<String> right after the pipeline reads the file.

I added the following code:
PCollection<String> lines = p.apply("Reading Lines...", TextIO.read().from(options.getInputFile()));

 lines.apply(Count.globally()).apply("Count the total records", ParDo.of(new RecordCount()));

Here, I created a new class (RecordCount.java) that extends DoFn<Long, Void> and simply logs the count.

RecordCount.java
import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class RecordCount extends DoFn<Long, Void> {

    private static final Logger LOGGER = LoggerFactory.getLogger(RecordCount.class);

    @ProcessElement
    public void processElement(@Element Long count) {
        // SLF4J substitutes the argument at the {} placeholder; without it the count is not printed.
        LOGGER.info("The total number of records in the input file is: {}", count);
    }
}
