Google Cloud Dataflow: Request payload size exceeds the limit: 10485760 bytes

When I try to run a large transform over roughly 800,000 files, I get the error message above as soon as I try to run the pipeline.

Here is the code:

public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());
    GcsUtil u = getUtil(p.getOptions());

    try {
        // The glob is expanded on the client, before the pipeline is submitted.
        List<GcsPath> paths = u.expand(GcsPath.fromUri("gs://tlogdataflow/stage/*.zip"));
        List<String> strPaths = new ArrayList<String>();
        for (GcsPath pa : paths) {
            strPaths.add(pa.toUri().toString());
        }

        // Every expanded path (about 800,000 of them) is handed to Create.of() here.
        p.apply(Create.of(strPaths))
         .apply("Unzip Files", Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")));
        p.run();
    } catch (IOException io) {
        //
    }
}

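I thought this was exactly what Google Dataflow is for: processing large numbers of files and large amounts of data?

Is there a way to split the load so that this works?

Thanks & BR,
Phil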

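Dataflow is good at processing large amounts of data, but it limits the size of the pipeline description. Data passed to Create.of() is currently embedded in the pipeline description, so you cannot pass a large amount of data there. Large data should instead be read from external storage, with the pipeline specifying only where it lives.

Think of it as the difference between the amount of data a program can process and the size of the program's own code.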
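
As a rough back-of-the-envelope check of why the original pipeline hits the limit, the sketch below multiplies the file count from the question by the byte length of one path. The sample file name is hypothetical (the real names are not shown in the question), and the estimate ignores any encoding overhead in the pipeline description, so treat it as a lower bound rather than the exact request size.

```java
import java.nio.charset.StandardCharsets;

public class PayloadEstimate {
    // Limit quoted in the error message, in bytes (10 MiB).
    static final long REQUEST_LIMIT_BYTES = 10_485_760L;

    // Lower bound on the embedded data: file count times the UTF-8 length of one path.
    static long estimateBytes(int fileCount, String samplePath) {
        return (long) fileCount * samplePath.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Hypothetical path; only the bucket and folder come from the question.
        String samplePath = "gs://tlogdataflow/stage/example_0000001.zip";
        long estimate = estimateBytes(800_000, samplePath);
        System.out.println("estimated bytes: " + estimate);
        System.out.println("over limit: " + (estimate > REQUEST_LIMIT_BYTES));
    }
}
```

With paths of a few dozen bytes each, the list alone is several times larger than the limit, whereas the single glob string used in the workaround is a few dozen bytes in total.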

You can work around this by doing the expansion inside a ParDo:

p.apply(Create.of("gs://tlogdataflow/stage/*.zip"))
 .apply(ParDo.of(new ExpandFn()))
 .apply(...fusion break (see below)...)
 .apply(Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")))
where ExpandFn looks like this:

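private static class ExpandFn extends DoFn<String, String> {
  @ProcessElement
  public void process(ProcessContext c) throws IOException {
    // The glob is expanded on a worker, so the file list never enters the pipeline description.
    GcsUtil util = getUtil(c.getPipelineOptions());
    // GcsUtil.expand returns GcsPath objects; emit each one as a URI string.
    for (GcsPath path : util.expand(GcsPath.fromUri(c.element()))) {
      c.output(path.toUri().toString());
    }
  }
}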

By "fusion break" I mean, basically, ParDo(add unique key) + group by key + Values.create() + Flatten.iterables(). This is not very convenient, and adding a built-in transform that does it is under discussion.
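
To make the steps of the fusion break concrete, here is a plain-Java walk-through of what each one does to the shape of the data. It uses in-memory collections, not Dataflow transforms, so it does not break fusion itself; it only shows that the elements come out unchanged after being keyed, grouped, and flattened. The class name, method names, and sample paths are all made up for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FusionBreakShapes {
    // Step 1: pair each element with a unique key (here simply its index).
    static List<Map.Entry<Integer, String>> addUniqueKeys(List<String> paths) {
        List<Map.Entry<Integer, String>> keyed = new ArrayList<>();
        for (int i = 0; i < paths.size(); i++) {
            keyed.add(new AbstractMap.SimpleEntry<>(i, paths.get(i)));
        }
        return keyed;
    }

    // Step 2: group by key, giving one collection of values per key.
    static Map<Integer, List<String>> groupByKey(List<Map.Entry<Integer, String>> keyed) {
        Map<Integer, List<String>> grouped = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> e : keyed) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return grouped;
    }

    // Steps 3 and 4: drop the keys, then flatten the per-key collections.
    static List<String> valuesFlattened(Map<Integer, List<String>> grouped) {
        List<String> out = new ArrayList<>();
        for (List<String> group : grouped.values()) {
            out.addAll(group);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical paths, for illustration only.
        List<String> paths = Arrays.asList("gs://bucket/a.zip", "gs://bucket/b.zip");
        List<String> result = valuesFlattened(groupByKey(addUniqueKeys(paths)));
        System.out.println(result.equals(paths));
    }
}
```

In a real pipeline the grouping step is what matters: it forces the runner to materialize the expanded elements, so the stages after it can be distributed across workers independently of the expansion step.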

Thank you very much! With your input, I solved it as follows:

public class ZipPipeline {
    private static final Logger LOG = LoggerFactory.getLogger(ZipPipeline.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).withValidation().create());

        try {
            // Only the glob string is embedded in the pipeline description.
            p.apply(Create.of("gs://tlogdataflow/stage/*.zip"))
             .apply(ParDo.of(new ExpandFN()))
             // Fusion break: add a key, group by it, then flatten the groups again.
             .apply(ParDo.of(new AddKeyFN()))
             .apply(GroupByKey.<String, String>create())
             .apply(ParDo.of(new FlattenFN()))
             .apply("Unzip Files", Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")));
            p.run();
        } catch (Exception e) {
            // Pass the exception itself so the stack trace is logged too.
            LOG.error(e.getMessage(), e);
        }
    }

    // Emits every path of a grouped KV as its own element.
    private static class FlattenFN extends DoFn<KV<String, Iterable<String>>, String> {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) {
            KV<String, Iterable<String>> kv = c.element();
            for (String s : kv.getValue()) {
                c.output(s);
            }
        }
    }

    // Expands the glob on a worker and emits one URI string per matching file.
    private static class ExpandFN extends DoFn<String, String> {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) throws Exception {
            GcsUtil u = getUtil(c.getPipelineOptions());
            for (GcsPath path : u.expand(GcsPath.fromUri(c.element()))) {
                c.output(path.toUri().toString());
            }
        }
    }

    // Keys each path by the first six characters of its fifth "_"-separated token.
    private static class AddKeyFN extends DoFn<String, KV<String, String>> {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) {
            String path = c.element();
            String monthKey = path.split("_")[4].substring(0, 6);
            c.output(KV.of(monthKey, path));
        }
    }
}