Python 3.x: Efficiently find all .zip files on an S3 bucket

python-3.x amazon-web-services amazon-s3

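I have a bunch of S3 buckets stuffed with old files and archives in .zip format. I want to efficiently query a bucket, get a list of all zip files larger than (say) 200MB, and then delete them.

So I wrote some code. It gets the job done, but it is slow. The more files there are on S3, the more API calls it makes and the longer the wait. For a bucket with 70+ files, it takes about 50 seconds to identify (in this case) 3 zip files.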

#!/usr/bin/env python3.6
import boto3
from botocore.exceptions import ClientError


def find_all_zips(bucket: str) -> iter:
    print(f"Looking for .zip files on S3: {bucket} ...")
    b = boto3.resource("s3").Bucket(bucket)
    return (obj.key for obj in b.objects.all()
            if get_info(bucket=bucket, key=obj.key) is not None)


def get_info(bucket: str, key: str) -> str:
    s3 = boto3.client('s3')
    try:
        response = s3.head_object(Bucket=bucket, Key=key)
        has_size = response['ContentLength'] >= 209715200 # ~= 200MB in bytes
        if len(response['ContentType']) == 0:
            is_zip = False
        else:
            is_zip = response['ContentType'].split("/")[1] == 'zip'

        if has_size and is_zip:
            return key
    except ClientError as error:
        raise Exception(f"Failed to fetch file info for {key}: {error}")


if __name__ == "__main__":
    print(list(find_all_zips(bucket='MYBUCKET')))

The output I get is what I expect:

Looking for .zip files on S3: MYBUCKET ...
['avocado-prices.zip', 'notepad.zip', 'spacerace.zip']

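Question: Is there any way to speed this up? Or should I spin up a database to keep track of my S3 files and their types?

If you are willing to use the file name to identify zip files, you do not need the extra call to head_object: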

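import boto3

s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket('my_bucket')

# Size threshold in bytes; the listing already reports each object's size
max_size = 2 * 1024 * 1024

print([obj.key for obj in bucket.objects.all()
       if obj.size >= max_size and obj.key.endswith('.zip')])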
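Since the question also asks about deleting the matches, here is a minimal sketch of that follow-up step. It assumes the same name-and-size test as the answer, uses the 200MB threshold from the question, and relies on `delete_objects` accepting at most 1000 keys per request. The helper names (`select_large_zips`, `batched`, `delete_large_zips`) are made up for illustration and are not part of boto3.

```python
from itertools import islice


def select_large_zips(objects, min_size):
    """Yield keys of listed objects that end in .zip and are at least min_size bytes.

    `objects` is any iterable of items with `.key` and `.size` attributes,
    such as the ObjectSummary items produced by bucket.objects.all().
    """
    for obj in objects:
        if obj.key.endswith(".zip") and obj.size >= min_size:
            yield obj.key


def batched(keys, batch_size=1000):
    """Split keys into lists of at most batch_size items."""
    iterator = iter(keys)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def delete_large_zips(bucket_name, min_size=200 * 1024 * 1024):
    """List the bucket once, then delete matching keys in batches.

    Returns (requested_keys, errors), where errors holds whatever S3
    reported as failed deletions.
    """
    import boto3  # imported here so the helpers above work without boto3

    bucket = boto3.resource("s3").Bucket(bucket_name)
    requested, errors = [], []
    for batch in batched(select_large_zips(bucket.objects.all(), min_size)):
        response = bucket.delete_objects(
            Delete={"Objects": [{"Key": key} for key in batch], "Quiet": True}
        )
        requested.extend(batch)
        errors.extend(response.get("Errors", []))
    return requested, errors
```

Before deleting anything, it is probably worth printing `list(select_large_zips(bucket.objects.all(), min_size))` first and checking the result by eye.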

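Is the code being run from a Lambda function?

No, I am running it on my local machine.

Consider enabling and querying S3 Inventory reports.

The slowness is due to the network and possibly your Internet connection. There are two things you can do. First, try running the code from EC2 or Lambda. Second, check the key for a .zip suffix before calling head_object to check the size. This reduces the number of API calls you need to make.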
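If the Content-Type check from the question still matters, the last comment's second suggestion could look roughly like the sketch below. It filters on the key and size returned by the listing, and only then issues `head_object` for the few remaining candidates, reusing one client throughout. The function name `find_large_zips` and the choice to pass the client in as a parameter are assumptions made for illustration.

```python
def find_large_zips(client, bucket, min_size=200 * 1024 * 1024):
    """Return keys that pass the name and size test and report a zip Content-Type.

    `client` is expected to behave like boto3.client("s3").
    """
    matches = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # A page of an empty bucket has no "Contents" entry at all.
        for entry in page.get("Contents", []):
            key, size = entry["Key"], entry["Size"]
            # Cheap checks first: both values come from the listing itself.
            if size < min_size or not key.endswith(".zip"):
                continue
            # Only the remaining candidates cost an extra HEAD request.
            head = client.head_object(Bucket=bucket, Key=key)
            content_type = head.get("ContentType", "")
            # Same test as the question: the subtype must be "zip".
            if content_type.split("/")[-1] == "zip":
                matches.append(key)
    return matches
```

It would be called as `find_large_zips(boto3.client("s3"), "MYBUCKET")`. For the bucket in the question this should mean a handful of HEAD requests instead of one per object, although the actual speedup also depends on network latency to S3.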