Decompressing .tar files in Dataflow?

compression google-cloud-dataflow apache-beam tar

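I have a lot of .tar files in my GCP Cloud Storage bucket. Each .tar file has multiple layers. I want to decompress these .tar files with GCP Dataflow and put the results back into another GCP storage bucket.

I found the Google-provided utility template for bulk decompressing Cloud Storage files, but it does not support the .tar file extension.

Maybe I should try to unpack the files before uploading them to the cloud, or is there something else in Beam for this?

Each tar file is about 15 TB uncompressed.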

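This snippet borrows from …. It also borrows from ….

As you noticed, TAR is not supported, but in general, compression/decompression in Beam seems to rely on …

You can write a pipeline that does something like the following: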

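import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Create the pipeline
Pipeline pipeline = Pipeline.create(options);

// Match the input files, then extract each matched archive in a DoFn.
// The DoFn emits one String per extracted file, so the result is a PCollection<String>.
PCollection<String> decompressOut =
    pipeline
        .apply("MatchFile(s)",
            FileIO.match().filepattern(options.getInputFilePattern()))
        .apply(
            "DecompressFile(s)",
            ParDo.of(new Decompress(options.getOutputDirectory())));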

Your Decompress DoFn would look like this:

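import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import com.google.common.io.ByteStreams;
import com.google.common.io.Files;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.util.MimeTypes;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

class Decompress extends DoFn<MatchResult.Metadata, String> {
  private final String outputDirectory;

  // Assumes options.getOutputDirectory() returns a plain String such as "gs://bucket/out/".
  Decompress(String outputDirectory) { this.outputDirectory = outputDirectory; }

  @ProcessElement
  public void process(ProcessContext context) throws IOException {
    ResourceId inputFile = context.element().resourceId();
    String outputFilename = Files.getNameWithoutExtension(inputFile.toString());
    // One output directory per archive, named after the archive itself.
    ResourceId tempFileDir = FileSystems.matchNewResource(outputDirectory, true)
        .resolve(outputFilename, StandardResolveOptions.RESOLVE_DIRECTORY);
    try (TarArchiveInputStream tarInput = new TarArchiveInputStream(
        Channels.newInputStream(FileSystems.open(inputFile)))) {
      TarArchiveEntry currentEntry = tarInput.getNextTarEntry();
      while (currentEntry != null) {
        if (!currentEntry.isDirectory()) { // only file entries are written out
          ResourceId outputFile = tempFileDir.resolve(currentEntry.getName(),
              StandardResolveOptions.RESOLVE_FILE);
          try (WritableByteChannel writerChannel = FileSystems.create(outputFile, MimeTypes.TEXT)) {
            // tarInput reports end-of-stream at the end of the current entry.
            ByteStreams.copy(tarInput, Channels.newOutputStream(writerChannel));
          }
          context.output(outputFile.toString());
        }
        currentEntry = tarInput.getNextTarEntry(); // Iterate to the next file
      }
    }
  }
}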

This is a very rough, untested code snippet, but it should get you started in the right direction. Let me know if anything needs further clarification.
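To make it clearer what the entry loop in the DoFn is iterating over, here is a minimal JDK-only sketch of walking a tar stream front to back. It is an illustration under simplifying assumptions, not how Commons Compress is implemented: `TarWalkSketch`, `readEntries` and `entry` are hypothetical names, checksums are not verified, long-name and large-size extensions are ignored, and entry bodies are buffered in memory (which would not work for a 15 TB archive). A real pipeline should keep using `TarArchiveInputStream` and stream each entry straight to its output channel.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Beam or Commons Compress.
final class TarWalkSketch {
  private static final int BLOCK = 512; // tar headers and padding use 512-byte blocks

  // Walks the stream once, front to back, collecting regular-file entries.
  static Map<String, byte[]> readEntries(InputStream in) throws IOException {
    DataInputStream data = new DataInputStream(in);
    Map<String, byte[]> entries = new LinkedHashMap<>();
    byte[] header = new byte[BLOCK];
    while (true) {
      try {
        data.readFully(header);
      } catch (EOFException e) {
        break; // stream ended without the usual zero-filled trailer
      }
      if (header[0] == 0) {
        break; // empty name field: treated here as the end-of-archive marker
      }
      String name = field(header, 0, 100);
      int size = Integer.parseInt(field(header, 124, 12).trim(), 8); // octal ASCII
      byte[] body = new byte[size];
      data.readFully(body);
      data.readFully(new byte[(BLOCK - size % BLOCK) % BLOCK]); // skip block padding
      byte type = header[156];
      if (type == '0' || type == 0) { // regular file; '5' would be a directory
        entries.put(name, body);
      }
    }
    return entries;
  }

  // Reads a NUL-terminated ASCII field from the header.
  private static String field(byte[] header, int offset, int length) {
    int end = offset;
    while (end < offset + length && header[end] != 0) {
      end++;
    }
    return new String(header, offset, end - offset, StandardCharsets.US_ASCII);
  }

  private static void put(byte[] target, int offset, String text) {
    byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
    System.arraycopy(bytes, 0, target, offset, bytes.length);
  }

  // Builds one ustar-style entry (header + body + padding) for the demo.
  static byte[] entry(String name, byte[] content) throws IOException {
    byte[] header = new byte[BLOCK];
    put(header, 0, name);
    put(header, 100, "0000644");                              // mode
    put(header, 124, String.format("%011o", content.length)); // size in octal
    put(header, 136, "00000000000");                          // mtime
    put(header, 148, "        ");                             // checksum placeholder
    header[156] = '0';                                        // regular file
    put(header, 257, "ustar");
    put(header, 263, "00");
    int sum = 0;
    for (byte b : header) {
      sum += b & 0xff;
    }
    put(header, 148, String.format("%06o", sum));
    header[154] = 0;
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(header);
    out.write(content);
    out.write(new byte[(BLOCK - content.length % BLOCK) % BLOCK]);
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream tar = new ByteArrayOutputStream();
    tar.write(entry("a.txt", "hello".getBytes(StandardCharsets.UTF_8)));
    tar.write(entry("dir/b.txt", "tar".getBytes(StandardCharsets.UTF_8)));
    tar.write(new byte[2 * BLOCK]); // end-of-archive trailer
    Map<String, byte[]> found = readEntries(new ByteArrayInputStream(tar.toByteArray()));
    for (Map.Entry<String, byte[]> e : found.entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue().length + " bytes");
    }
  }
}
```

The point of the sketch is that a tar archive can only be read sequentially: each header says how many bytes follow, so the entries inside one archive are extracted one after another by a single DoFn call rather than in parallel.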

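+1. Dataflow templates are not only useful to run directly; they also provide a large number of well-tested pipelines that you can modify as needed.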