Java AWS Lambda: How to extract a tgz file in an S3 bucket and put it into another S3 bucket

java amazon-web-services amazon-s3 aws-lambda

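I have an S3 bucket named "Source". Many ".tgz" files are pushed into that bucket in real time. I wrote Java code that extracts each ".tgz" file and pushes its contents into a "Destination" bucket, and I deployed that code as a Lambda function. In the Java code I get the ".tgz" file as an InputStream. How do I extract it in Lambda? I cannot create a file in Lambda; Java throws "FileNotFound (Permission Denied)".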

AmazonS3 s3Client = new AmazonS3Client();
S3Object s3Object = s3Client.getObject(new GetObjectRequest(srcBucket, srcKey));
InputStream objectData = s3Object.getObjectContent();
File file = new File(s3Object.getKey());
OutputStream writer = new BufferedOutputStream(new FileOutputStream(file));

Don't use File or FileOutputStream; use s3Client.putObject() instead. To read the tgz file, you can use Apache Commons Compress. For example:
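import java.io.ByteArrayInputStream;
import java.util.zip.GZIPInputStream;
import org.apache.commons.compress.archivers.ArchiveEntry;
import org.apache.commons.compress.archivers.ArchiveInputStream;
import org.apache.commons.compress.archivers.ArchiveStreamFactory;
import com.amazonaws.services.s3.model.ObjectMetadata;

// Unwrap the gzip layer, then read the tar entries straight from the S3 stream.
ArchiveInputStream tar = new ArchiveStreamFactory()
    .createArchiveInputStream("tar", new GZIPInputStream(objectData));
ArchiveEntry entry;
while ((entry = tar.getNextEntry()) != null) {
    if (!entry.isDirectory()) {
        // Buffer one entry in memory; a single read() may return fewer bytes than requested, so loop.
        byte[] objectBytes = new byte[(int) entry.getSize()];
        int offset = 0;
        while (offset < objectBytes.length) {
            int n = tar.read(objectBytes, offset, objectBytes.length - offset);
            if (n < 0) break;
            offset += n;
        }
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(objectBytes.length);
        metadata.setContentType("application/octet-stream");
        s3Client.putObject(destBucket, entry.getName(),
            new ByteArrayInputStream(objectBytes), metadata);
    }
}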

Use Python 3.6 and trigger the function on the ObjectCreated (All) event for objects with the suffix ".tgz". Hope this helps.
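
The trigger described above can also be written down as an S3 notification configuration. The following is only a minimal sketch, assuming the ARN of the Lambda function is already known; the helper name tgz_trigger_config and the ARN used to call it are hypothetical and do not come from the original answer.

```python
def tgz_trigger_config(lambda_arn):
    """Build an S3 notification configuration for new '.tgz' objects.

    `lambda_arn` is the ARN of the function to invoke (supplied by the caller).
    """
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                # "ObjectCreated (All)" in the console corresponds to this event name.
                "Events": ["s3:ObjectCreated:*"],
                # Only keys ending in ".tgz" fire the trigger.
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "suffix", "Value": ".tgz"}]
                    }
                },
            }
        ]
    }
```

The returned dictionary could be passed as the NotificationConfiguration argument of the boto3 S3 client's put_bucket_notification_configuration call for the source bucket. S3 also needs permission to invoke the function, which is configured separately.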
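Since one of the answers is written in Python, I am providing another solution in that language.
The problem with solutions that use the /tmp file system is that AWS only allows 512 MB of storage there. To untar or unzip larger files, it is better to use the io package and its BytesIO class and to process the file contents in memory. AWS allows allocating up to 3 GB of RAM to a Lambda function, which greatly extends the maximum file size. I successfully tested untarring a 1 GB S3 file.
In my case, untarring ~2000 files from a 1 GB tar file to another S3 bucket took 140 seconds. It could be optimized further by using multiple threads to upload the untarred files to the target S3 bucket.
The sample code below provides a single-threaded solution: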
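import tarfile
from io import BytesIO
from urllib.parse import unquote_plus

import boto3

s3_client = boto3.client('s3')


def untar_s3_file(event, context):
    # Bucket and key of the object that triggered the function.
    bucket = event['Records'][0]['s3']['bucket']['name']
    # Keys in S3 event notifications are URL-encoded.
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])

    # Read the whole archive into memory instead of writing it to /tmp.
    input_tar_file = s3_client.get_object(Bucket=bucket, Key=key)
    input_tar_content = input_tar_file['Body'].read()

    # In the default read mode, tarfile detects gzip compression automatically.
    with tarfile.open(fileobj=BytesIO(input_tar_content)) as tar:
        for tar_resource in tar:
            if tar_resource.isfile():
                inner_file_bytes = tar.extractfile(tar_resource).read()
                # Uploads into the same bucket; pass a different Bucket to write elsewhere.
                s3_client.upload_fileobj(BytesIO(inner_file_bytes), Bucket=bucket, Key=tar_resource.name)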
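
The multi-threaded optimization mentioned above could be sketched roughly as follows. This is an illustration under stated assumptions, not the answer author's code: the function untar_and_upload and its upload parameter are hypothetical, and upload is expected to be any callable taking (key, data), so the sketch can be tried without AWS access. Reading the archive stays sequential; only the uploads run in parallel.

```python
import io
import tarfile
from concurrent.futures import ThreadPoolExecutor


def untar_and_upload(tar_bytes, upload, max_workers=8):
    """Extract regular files from an in-memory tar archive and upload them concurrently.

    `upload` is a caller-supplied callable taking (key, data).
    Returns the list of member names that were submitted, in archive order.
    """
    keys = []
    futures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
            for member in tar:
                if not member.isfile():
                    continue  # skip directories, links and other special entries
                # Members are read one at a time in the calling thread.
                data = tar.extractfile(member).read()
                futures.append(pool.submit(upload, member.name, data))
                keys.append(member.name)
        # Surface the first upload error, if any, instead of silently dropping it.
        for future in futures:
            future.result()
    return keys
```

In a Lambda handler, upload might be something like lambda key, data: s3_client.put_object(Bucket=dest_bucket, Key=key, Body=data), where dest_bucket is an assumed name for the target bucket. Note that extracted members waiting for a free worker are held in memory, so the RAM limit discussed above still applies.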

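Your suggestion covers writing back to the destination bucket, but my question is how to extract the tgz inside the Lambda function.
There is nothing special about AWS or Lambda when it comes to extracting a tgz. I have updated my answer to use the standard Java library and Apache Commons Compress.
By the way, calling tar.read(objectBytes) just once can leave null bytes at the end of the file: it reads whatever is in the buffer but is not guaranteed to read the whole entry, so objectBytes can end up with a run of nulls at the end.
I was able to use this. That said, I would caution everyone that the 512 MB of /tmp storage eventually pushed me to another solution. Also keep in mind that upload_file uploads only a single file, so if you untar a folder containing multiple files you have to upload each file separately.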
import tarfile
import tempfile
from urllib.parse import unquote_plus

import boto3

s3_client = boto3.client('s3')


def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    # Keys in S3 event notifications are URL-encoded.
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])
    new_bucket = 'uncompressed-data'  # new bucket name
    new_key = key[:-4]  # strip the ".tgz" suffix; used as a prefix for the extracted files
    try:
        # Default binary mode: download_fileobj writes bytes, so text mode ('w+t') fails.
        with tempfile.SpooledTemporaryFile() as temp:
            s3_client.download_fileobj(bucket, key, temp)
            temp.seek(0)
            with tarfile.open(mode="r:gz", fileobj=temp) as tar:
                for member in tar:
                    if not member.isfile():
                        continue  # extractfile() returns None for directories
                    file_save = tar.extractfile(member)
                    # One key per member so the files do not overwrite each other.
                    s3_client.upload_fileobj(file_save, new_bucket, new_key + '/' + member.name)
    except Exception as e:
        print(e)
        raise
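
Given the /tmp and memory limits discussed in this thread, another option is to read the archive sequentially with tarfile's stream mode ("r|gz"), which only needs a read() method on the input and never holds the whole archive at once. The sketch below is an assumption-laden illustration rather than anything from the answers: iter_tar_stream is a hypothetical helper, and it assumes the object passed in behaves like the streaming Body returned by get_object.

```python
import tarfile


def iter_tar_stream(stream):
    """Yield (name, data) for each regular file in a gzip-compressed tar stream.

    `stream` only needs a read(size) method; it does not have to be seekable.
    """
    with tarfile.open(fileobj=stream, mode="r|gz") as tar:
        for member in tar:
            if member.isfile():
                # In stream mode each member must be read before moving to the next one.
                yield member.name, tar.extractfile(member).read()
```

A handler could then loop over iter_tar_stream(s3_client.get_object(Bucket=bucket, Key=key)['Body']) and upload each pair as it arrives. Each individual member is still buffered fully in memory, so very large single files inside the archive would need further chunking.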
