Google Cloud Dataflow: extracting zip contents in Apache Beam using a ValueProvider as the path


I have code that extracts the contents of a .zip file in Google Cloud Storage. It works fine, but I need to use it with a file path provided at runtime ("gs://some_bucket/filename.zip"). When I try to use the runtime value, I get an error like this:

Exception in thread "main" java.lang.IllegalArgumentException: unable to serialize org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource@187bc24
    at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:53)
    at org.apache.beam.sdk.util.SerializableUtils.ensureSerializable(SerializableUtils.java:83)
    at org.apache.beam.sdk.io.Read$Bounded.<init>(Read.java:94)
    at org.apache.beam.sdk.io.Read$Bounded.<init>(Read.java:89)
    at org.apache.beam.sdk.io.Read.from(Read.java:48)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Read.expand(BigQueryIO.java:535)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Read.expand(BigQueryIO.java:292)
    at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:482)
    at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:422)
    at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:44)
    at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:164)
    at BeamTest2.StarterPipeline.main(StarterPipeline.java:180)
Caused by: java.io.NotSerializableException: org.apache.beam.sdk.Pipeline
    at java.io.ObjectOutputStream.writeObject0(Unknown Source)
    at java.io.ObjectOutputStream.defaultWriteFields(Unknown Source)
    at java.io.ObjectOutputStream.writeSerialData(Unknown Source)
    at java.io.ObjectOutputStream.writeOrdinaryObject(Unknown Source)
    at java.io.ObjectOutputStream.writeObject0(Unknown Source)
    at java.io.ObjectOutputStream.defaultWriteFields(Unknown Source)
    at java.io.ObjectOutputStream.writeSerialData(Unknown Source)
    at java.io.ObjectOutputStream.writeOrdinaryObject(Unknown Source)
    at java.io.ObjectOutputStream.writeObject0(Unknown Source)
    at java.io.ObjectOutputStream.defaultWriteFields(Unknown Source)
    at java.io.ObjectOutputStream.writeSerialData(Unknown Source)
    at java.io.ObjectOutputStream.writeOrdinaryObject(Unknown Source)
    at java.io.ObjectOutputStream.writeObject0(Unknown Source)
    at java.io.ObjectOutputStream.writeObject(Unknown Source)
    at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:49)
    ... 11 more
The code I'm using is:

 // Unzip incoming file
 PCollection<TableRow> temp = p.apply(BigQueryIO.read().fromQuery(
     NestedValueProvider.of(
         options.getInputFile(),
         new SerializableFunction<String, String>() {
           private static final long serialVersionUID = 1L;

           @Override
           public String apply(String filepath) {
             try {
               List<GcsPath> gcsPaths = util.expand(GcsPath.fromUri(filepath));
               LOG.info(gcsPaths + "FilesUnzipped");
               List<String> paths = new ArrayList<String>();
               for (GcsPath gcsp : gcsPaths) {
                 paths.add(gcsp.toString());
               }
               p.apply(Create.of(paths))
                   .apply(ParDo.of(new UnzipFN(filepath)));
             } catch (Exception e) {
               LOG.info("Exception caught while extracting ZIP");
             }
             return "";
           }
         })).usingStandardSql().withoutValidation());
The UnzipFN class:

public class UnzipFN extends DoFn<String,Long>{
    private long filesUnzipped=0;
    @ProcessElement
    public void processElement(ProcessContext c){
        String p = c.element();
        GcsUtilFactory factory = new GcsUtilFactory();
        GcsUtil u = factory.create(c.getPipelineOptions());
        byte[] buffer = new byte[100000000];
        try{
            SeekableByteChannel sek = u.open(GcsPath.fromUri(p));
            InputStream is = Channels.newInputStream(sek);
            BufferedInputStream bis = new BufferedInputStream(is);
            ZipInputStream zis = new ZipInputStream(bis);
            ZipEntry ze = zis.getNextEntry();
            while(ze!=null){
                LOG.info("Unzipping File {}",ze.getName());
                WritableByteChannel wri = u.create(GcsPath.fromUri("gs://bucket_location/" + ze.getName()), getType(ze.getName()));
                OutputStream os = Channels.newOutputStream(wri);
                int len;
                while((len=zis.read(buffer))>0){
                    os.write(buffer,0,len);
                }
                os.close();
                filesUnzipped++;
                ze=zis.getNextEntry();
            }
            zis.closeEntry();
            zis.close();

        }
        catch(Exception e){
            e.printStackTrace();
        }
        c.output(filesUnzipped);
        System.out.println(filesUnzipped+"FilesUnzipped");
        LOG.info("FilesUnzipped");
    }

    private String getType(String fName){
        if(fName.endsWith(".zip")){
            return "application/x-zip-compressed";
        }
        else {
            return "text/plain";
        }
    }
}
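Independent of the Beam wiring, the extraction loop in UnzipFN is the standard java.util.zip pattern. Below is a self-contained sketch of the same loop operating on an in-memory zip instead of GCS channels; the class and method names here are illustrative, not part of the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class UnzipSketch {
    // Mirrors the ZipInputStream loop in UnzipFN (minus the GCS plumbing):
    // iterate entries with getNextEntry() until it returns null.
    static List<String> entryNames(byte[] zipBytes) throws Exception {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry ze;
            while ((ze = zis.getNextEntry()) != null) {
                names.add(ze.getName());
                zis.closeEntry();
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny zip with two entries to exercise the loop.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("hello".getBytes());
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("world".getBytes());
            zos.closeEntry();
        }
        System.out.println(entryNames(bos.toByteArray())); // prints [a.txt, b.txt]
    }
}
```

Note that the try-with-resources closes the stream even on failure, which the original UnzipFN does not guarantee.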
How can I handle this situation?

Also, the .zip extraction code has nothing to do with BigQueryIO.read(). I only used it as a hack to be able to read the runtime value. If you have any other suggestions for this, please let me know.


Thanks.

If I understand correctly, you have a ValueProvider containing a filepattern, you are expanding the filepattern with GcsUtil.expand(), and you want to apply a function (UnzipFn) to each of the resulting filenames.

The current code will not work, for several reasons:

  • You are creating a BigQueryIO.read().fromQuery() where the argument of
    fromQuery() is a ValueProvider that always returns an empty string (the
    SerializableFunction inside your NestedValueProvider always returns "").

  • The function passed to NestedValueProvider captures the Pipeline object p
    (it calls p.apply(...) from inside apply()), and Pipeline is not
    serializable; this is exactly what the Caused by:
    java.io.NotSerializableException: org.apache.beam.sdk.Pipeline in the
    stack trace is reporting. You cannot apply transforms to the pipeline
    from inside a ValueProvider function; the expansion has to happen as
    regular pipeline steps at execution time.

Instead, structure the expansion and the unzipping as ordinary transforms, for example:

    p.apply(Create.ofProvider(options.getInputFile(), StringUtf8Coder.of()))
     .apply(ParDo.of(new ExpandFn()))
     .apply(...fusion break...)
     .apply(ParDo.of(new UnzipFn()))

or match the filepattern with FileIO:

    p.apply(FileIO.match().filepattern(options.getInputFile()))
     .apply(...fusion break...)
     .apply(ParDo.of(new UnzipFn()));
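The Caused by: java.io.NotSerializableException: org.apache.beam.sdk.Pipeline in the stack trace can be reproduced with nothing but the JDK: any Serializable function object that captures a reference to a non-serializable object fails when it is serialized. A minimal stand-alone sketch (the class names below are illustrative stand-ins, not Beam classes):

```java
import java.io.ByteArrayOutputStream;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Supplier;

public class CaptureDemo {
    // Stand-in for org.apache.beam.sdk.Pipeline: not Serializable.
    static class NotSerializablePipeline {}

    // A functional interface that is Serializable, like Beam's SerializableFunction.
    interface SerializableSupplier<T> extends Supplier<T>, Serializable {}

    public static void main(String[] args) throws Exception {
        NotSerializablePipeline p = new NotSerializablePipeline();
        // The lambda captures p, just as the question's SerializableFunction
        // captures the Pipeline by calling p.apply(...) inside apply().
        SerializableSupplier<String> fn = () -> { p.toString(); return ""; };
        try (ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(fn); // fails: the captured object is not serializable
        } catch (NotSerializableException e) {
            System.out.println("Serialization failed: " + e.getMessage()); // names the offending class
        }
    }
}
```

The fix in the answer above removes the capture entirely: the filepattern travels through the pipeline as data, so no transform needs to hold a reference to the Pipeline.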