Google Cloud Dataflow: sum and average aggregations with Dataflow

I have sample data of the following form:

s.n., time, user, time_span, user_level
1, 2016-01-04T1:26:13, Hari, 8, admin
2, 2016-01-04T11:6:13, Gita, 2, admin
3, 2016-01-04T11:26:13, Gita, 0, user

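Now I need to find the average time_span per user, the average time_span per user_level, and the total time_span per user.

I can compute each of these values individually, but I cannot compute all of them at the same time. I am new to Dataflow, so please suggest a suitable approach.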

static class ExtractUserAndUserLevelFn extends DoFn<String, KV<String, Long>> {
        @Override
        public void processElement(ProcessContext c) {

            String[] words = c.element().split(",");

            if (words.length == 5) {
                Instant timestamp = Instant.parse(words[1].trim());                    
                KV<String, Long> userTime = KV.of(words[2].trim(), Long.valueOf(words[3].trim()));
                KV<String, Long> userLevelTime = KV.of(words[4].trim(), Long.valueOf(words[3].trim()));                    
                c.outputWithTimestamp(userTime, timestamp);
                c.outputWithTimestamp(userLevelTime, timestamp);

            }
        }
    }


public static void main(String[] args) {
    TestOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
            .as(TestOptions.class);
    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.named("ReadLines").from(options.getInputFile()))
            .apply(ParDo.of(new ExtractUserAndUserLevelFn()))
            .apply(Window.<KV<String, Long>>into(
                    FixedWindows.of(Duration.standardSeconds(options.getMyWindowSize()))))
            .apply(GroupByKey.<String, Long>create())
            .apply(ParDo.of(new DoFn<KV<String, Iterable<Long>>, KV<String, Long>>() {
                public void processElement(ProcessContext c) {
                    String key = c.element().getKey();
                    Iterable<Long> docsWithThatUrl = c.element().getValue();
                    Long sum = 0L;
                    for (Long item : docsWithThatUrl)
                        sum += item;
                    KV<String, Long> userTime = KV.of(key, sum);
                    c.output(userTime);
                }
            }))
            .apply(MapElements.via(new FormatAsTextFn()))
            .apply(TextIO.Write.named("WriteCounts").to(options.getOutput()).
                    withNumShards(options.getShardsNumber()));

    p.run();
}

The Mean and Sum transforms look like a good fit for this use case. Basic usage looks like this:

 PCollection<KV<String, Double>> meanPerKey =
     input.apply(Mean.<String, Integer>perKey());

 PCollection<KV<String, Integer>> sumPerKey = input
     .apply(Sum.<String>integersPerKey());

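One approach is to first parse the lines into a PCollection in which each element is one record, and then create two PCollections of key-value pairs from that collection. Suppose you define a Record class that represents one line, like this: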

static class Record implements Serializable {
  final String user;
  final String role;
  final long duration;

  Record(String user, String role, long duration) {
    this.user = user;
    this.role = role;
    this.duration = duration;
  }
}

Now create a LineToRecordFn that builds a Record from each input line, so that you can do the following:

PCollection<Record> records = p.apply(TextIO.Read.named("ReadLines")
                               .from(options.getInputFile()))
                               .apply(ParDo.of(new LineToRecordFn()));
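
The answer does not show the body of LineToRecordFn. As a minimal sketch of the parsing it would need, the plain Java below turns one input line into a Record. The helper name parseLine and the choice to return null for unparseable lines (such as the header row) are assumptions made for illustration; a LineToRecordFn could call such a helper from processElement and output only the non-null results.

```java
import java.io.Serializable;

public class LineParsing {
    // Mirrors the Record class from the answer, with a constructor added.
    static class Record implements Serializable {
        final String user;
        final String role;
        final long duration;

        Record(String user, String role, long duration) {
            this.user = user;
            this.role = role;
            this.duration = duration;
        }
    }

    // Hypothetical helper: parses "s.n., time, user, time_span, user_level".
    // Returns null when the line does not have exactly five fields or when
    // time_span is not a number (for example, the header row).
    static Record parseLine(String line) {
        String[] fields = line.split(",");
        if (fields.length != 5) {
            return null;
        }
        try {
            long duration = Long.parseLong(fields[3].trim());
            return new Record(fields[2].trim(), fields[4].trim(), duration);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Record r = parseLine("1, 2016-01-04T1:26:13, Hari, 8, admin");
        System.out.println(r.user + " " + r.role + " " + r.duration);
        // prints: Hari admin 8
    }
}
```

Keeping the parsing in a plain static method makes it easy to check locally before wiring it into the pipeline.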

If you want to, you can apply windowing here. Whether or not you window, you can create PCollections keyed by role and keyed by user:

PCollection<KV<String,Long>> role_duration = records.apply(MapElements.via(
    new SimpleFunction<Record,KV<String,Long>>() {
          @Override
          public KV<String,Long> apply(Record r) {
            return KV.of(r.role,r.duration);
          }
        }));

PCollection<KV<String,Long>> user_duration = records.apply(MapElements.via(
    new SimpleFunction<Record,KV<String,Long>>() {
              @Override
              public KV<String,Long> apply(Record r) {
                return KV.of(r.user, r.duration);
              }
            }));

Now you can get the means and sums in just a few lines:

PCollection<KV<String,Double>> mean_by_user = user_duration.apply(
    Mean.<String,Long>perKey());
PCollection<KV<String,Double>> mean_by_role = role_duration.apply(
    Mean.<String,Long>perKey()); 
PCollection<KV<String,Long>> sum_by_role = role_duration.apply(
    Sum.<String>longsPerKey());
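
To cross-check what these transforms should produce, the per-key results for the three sample rows in the question can be computed locally. The sketch below uses plain Java streams rather than the Dataflow SDK, so it only illustrates the expected per-key values; the class and method names are made up for this example. It also computes the sum per user, which is what the question asks for and which would presumably come from applying Sum.<String>longsPerKey() to user_duration in the same way.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class ExpectedAggregates {
    // One parsed sample row: user, user_level (role), time_span (duration).
    static class Row {
        final String user;
        final String role;
        final long duration;

        Row(String user, String role, long duration) {
            this.user = user;
            this.role = role;
            this.duration = duration;
        }
    }

    // The three sample rows from the question.
    static final List<Row> SAMPLE = Arrays.asList(
            new Row("Hari", "admin", 8),
            new Row("Gita", "admin", 2),
            new Row("Gita", "user", 0));

    // Average time_span per user.
    static Map<String, Double> meanByUser() {
        return SAMPLE.stream().collect(Collectors.groupingBy(
                r -> r.user, TreeMap::new, Collectors.averagingLong(r -> r.duration)));
    }

    // Average time_span per user_level.
    static Map<String, Double> meanByRole() {
        return SAMPLE.stream().collect(Collectors.groupingBy(
                r -> r.role, TreeMap::new, Collectors.averagingLong(r -> r.duration)));
    }

    // Total time_span per user.
    static Map<String, Long> sumByUser() {
        return SAMPLE.stream().collect(Collectors.groupingBy(
                r -> r.user, TreeMap::new, Collectors.summingLong(r -> r.duration)));
    }

    public static void main(String[] args) {
        System.out.println(meanByUser()); // {Gita=1.0, Hari=8.0}
        System.out.println(meanByRole()); // {admin=5.0, user=0.0}
        System.out.println(sumByUser());  // {Gita=2, Hari=8}
    }
}
```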

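Note that Dataflow performs some optimizations before running the job. So although it looks as if you are making two passes over the records PCollection, that may not actually be the case.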

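But I need to find the average over different columns and the average for different column values. How can I do that in a single program?

You probably want to process each one as a separate PCollection, branching off from the original PCollection.

I could use.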