Google Cloud Dataflow: extracting zip contents in Apache Beam using a ValueProvider as the path


I have code that extracts the contents of a .zip file in Google Cloud Storage. It works fine, but I need to use it with a file path provided at runtime ("gs://some_bucket/filename.zip"). When I try to use the runtime value, I get an error like this:

Exception in thread "main" java.lang.IllegalArgumentException: unable to serialize org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource@187bc24
    at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:53)
    at org.apache.beam.sdk.util.SerializableUtils.ensureSerializable(SerializableUtils.java:83)
    at org.apache.beam.sdk.io.Read$Bounded.<init>(Read.java:94)
    at org.apache.beam.sdk.io.Read$Bounded.<init>(Read.java:89)
    at org.apache.beam.sdk.io.Read.from(Read.java:48)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Read.expand(BigQueryIO.java:535)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Read.expand(BigQueryIO.java:292)
    at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:482)
    at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:422)
    at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:44)
    at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:164)
    at BeamTest2.StarterPipeline.main(StarterPipeline.java:180)
Caused by: java.io.NotSerializableException: org.apache.beam.sdk.Pipeline
    at java.io.ObjectOutputStream.writeObject0(Unknown Source)
    at java.io.ObjectOutputStream.defaultWriteFields(Unknown Source)
    at java.io.ObjectOutputStream.writeSerialData(Unknown Source)
    at java.io.ObjectOutputStream.writeOrdinaryObject(Unknown Source)
    at java.io.ObjectOutputStream.writeObject0(Unknown Source)
    at java.io.ObjectOutputStream.defaultWriteFields(Unknown Source)
    at java.io.ObjectOutputStream.writeSerialData(Unknown Source)
    at java.io.ObjectOutputStream.writeOrdinaryObject(Unknown Source)
    at java.io.ObjectOutputStream.writeObject0(Unknown Source)
    at java.io.ObjectOutputStream.defaultWriteFields(Unknown Source)
    at java.io.ObjectOutputStream.writeSerialData(Unknown Source)
    at java.io.ObjectOutputStream.writeOrdinaryObject(Unknown Source)
    at java.io.ObjectOutputStream.writeObject0(Unknown Source)
    at java.io.ObjectOutputStream.writeObject(Unknown Source)
    at org.apache.beam.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:49)
    ... 11 more
The code I'm using is:

 // Unzip incoming file
 PCollection<TableRow> temp = p.apply(BigQueryIO.read().fromQuery(
     NestedValueProvider.of(
         options.getInputFile(),
         new SerializableFunction<String, String>() {
           private static final long serialVersionUID = 1L;

           @Override
           public String apply(String filepath) {
             try {
               List<GcsPath> gcsPaths = util.expand(GcsPath.fromUri(filepath));
               LOG.info(gcsPaths + "FilesUnzipped");
               List<String> paths = new ArrayList<String>();
               for (GcsPath gcsp : gcsPaths) {
                 paths.add(gcsp.toString());
               }
               p.apply(Create.of(paths))
                   .apply(ParDo.of(new UnzipFN(filepath)));
             } catch (Exception e) {
               LOG.info("Exception caught while extracting ZIP");
             }
             return "";
           }
         })).usingStandardSql().withoutValidation());
The UnzipFN class:

public class UnzipFN extends DoFn<String,Long>{
    private long filesUnzipped=0;
    @ProcessElement
    public void processElement(ProcessContext c){
        String p = c.element();
        GcsUtilFactory factory = new GcsUtilFactory();
        GcsUtil u = factory.create(c.getPipelineOptions());
        byte[] buffer = new byte[100000000];
        try{
            SeekableByteChannel sek = u.open(GcsPath.fromUri(p));
            InputStream is = Channels.newInputStream(sek);
            BufferedInputStream bis = new BufferedInputStream(is);
            ZipInputStream zis = new ZipInputStream(bis);
            ZipEntry ze = zis.getNextEntry();
            while(ze!=null){
                LOG.info("Unzipping File {}",ze.getName());
                WritableByteChannel wri = u.create(GcsPath.fromUri("gs://bucket_location/" + ze.getName()), getType(ze.getName()));
                OutputStream os = Channels.newOutputStream(wri);
                int len;
                while((len=zis.read(buffer))>0){
                    os.write(buffer,0,len);
                }
                os.close();
                filesUnzipped++;
                ze=zis.getNextEntry();
            }
            zis.closeEntry();
            zis.close();

        }
        catch(Exception e){
            e.printStackTrace();
        }
        c.output(filesUnzipped);
        System.out.println(filesUnzipped+"FilesUnzipped");
        LOG.info("FilesUnzipped");
    }

    private String getType(String fName){
        if(fName.endsWith(".zip")){
            return "application/x-zip-compressed";
        }
        else {
            return "text/plain";
        }
    }
}
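Independent of the Beam wiring, the extraction loop in UnzipFN is the standard java.util.zip pattern. Below is a self-contained sketch of the same loop operating on an in-memory zip instead of GCS channels; the class and method names here are illustrative, not part of the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class UnzipSketch {
    // Mirrors the ZipInputStream loop in UnzipFN (minus the GCS plumbing):
    // iterate entries with getNextEntry() until it returns null.
    static List<String> entryNames(byte[] zipBytes) throws Exception {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry ze;
            while ((ze = zis.getNextEntry()) != null) {
                names.add(ze.getName());
                zis.closeEntry();
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny zip with two entries to exercise the loop.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("hello".getBytes());
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("world".getBytes());
            zos.closeEntry();
        }
        System.out.println(entryNames(bos.toByteArray())); // prints [a.txt, b.txt]
    }
}
```

Note that the try-with-resources closes the stream even on failure, which the original UnzipFN does not guarantee.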
How can I handle this situation?

Also, the .zip extraction code has nothing to do with BigQueryIO.read(). I only used it as a hack to be able to read the runtime value. If you have any other suggestions for this, please let me know.


Thanks.

If I understand correctly, you have a ValueProvider containing a filepattern, you are expanding the filepattern with GcsUtil.expand(), and you want to apply a function (UnzipFn) to each of the resulting filenames.

The current code will not work, for several reasons:

  • You are creating a BigQueryIO.read().fromQuery() where the argument of
    fromQuery() is a ValueProvider that always returns an empty string (the
    SerializableFunction inside your NestedValueProvider always returns "").

  • The function passed to NestedValueProvider captures the Pipeline object p
    (it calls p.apply(...) from inside apply()), and Pipeline is not
    serializable; this is exactly what the Caused by:
    java.io.NotSerializableException: org.apache.beam.sdk.Pipeline in the
    stack trace is reporting. You cannot apply transforms to the pipeline
    from inside a ValueProvider function; the expansion has to happen as
    regular pipeline steps at execution time.

Instead, structure the expansion and the unzipping as ordinary transforms, for example:

    p.apply(Create.ofProvider(options.getInputFile(), StringUtf8Coder.of()))
     .apply(ParDo.of(new ExpandFn()))
     .apply(...fusion break...)
     .apply(ParDo.of(new UnzipFn()))

or match the filepattern with FileIO:

    p.apply(FileIO.match().filepattern(options.getInputFile()))
     .apply(...fusion break...)
     .apply(ParDo.of(new UnzipFn()));
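The Caused by: java.io.NotSerializableException: org.apache.beam.sdk.Pipeline in the stack trace can be reproduced with nothing but the JDK: any Serializable function object that captures a reference to a non-serializable object fails when it is serialized. A minimal stand-alone sketch (the class names below are illustrative stand-ins, not Beam classes):

```java
import java.io.ByteArrayOutputStream;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Supplier;

public class CaptureDemo {
    // Stand-in for org.apache.beam.sdk.Pipeline: not Serializable.
    static class NotSerializablePipeline {}

    // A functional interface that is Serializable, like Beam's SerializableFunction.
    interface SerializableSupplier<T> extends Supplier<T>, Serializable {}

    public static void main(String[] args) throws Exception {
        NotSerializablePipeline p = new NotSerializablePipeline();
        // The lambda captures p, just as the question's SerializableFunction
        // captures the Pipeline by calling p.apply(...) inside apply().
        SerializableSupplier<String> fn = () -> { p.toString(); return ""; };
        try (ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(fn); // fails: the captured object is not serializable
        } catch (NotSerializableException e) {
            System.out.println("Serialization failed: " + e.getMessage()); // names the offending class
        }
    }
}
```

The fix in the answer above removes the capture entirely: the filepattern travels through the pipeline as data, so no transform needs to hold a reference to the Pipeline.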