Python: Download a folder from S3 using Boto3


python amazon-s3 download


Using the Boto3 Python SDK, I was able to download files with bucket.download_file(). Is there a way to download an entire folder?

Quick and dirty, but it works:

import boto3
import os

def downloadDirectoryFroms3(bucketName, remoteDirectoryName):
    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(bucketName)
    for obj in bucket.objects.filter(Prefix=remoteDirectoryName):
        if not os.path.exists(os.path.dirname(obj.key)):
            os.makedirs(os.path.dirname(obj.key))
        bucket.download_file(obj.key, obj.key)  # save to same path

Suppose you want to download the directory foo/bar from S3. The for loop then iterates over every object whose key starts with Prefix=foo/bar.
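Note that S3 matches Prefix as a plain string prefix on the key, not as a directory name, so foo/bar also matches keys such as foo/bar2/c.txt. The following is a minimal sketch of that behaviour in plain Python, with no S3 calls; the helper names as_folder_prefix and keys_under and the sample keys are hypothetical and are not part of boto3.

```python
def as_folder_prefix(prefix):
    """Return the prefix with exactly one trailing slash (empty stays empty)."""
    stripped = prefix.strip("/")
    if not stripped:
        return ""
    return prefix.rstrip("/") + "/"


def keys_under(keys, prefix):
    """Mimic S3 prefix filtering: keep keys that start with the prefix string."""
    return [key for key in keys if key.startswith(prefix)]


# Hypothetical keys, for illustration only.
keys = ["foo/bar/a.txt", "foo/bar/sub/b.txt", "foo/bar2/c.txt", "foo/baz.txt"]

loose = keys_under(keys, "foo/bar")                     # also picks up foo/bar2/
strict = keys_under(keys, as_folder_prefix("foo/bar"))  # only keys inside foo/bar/

print(loose)
print(strict)
```

Passing the prefix with a trailing slash (foo/bar/) is one way to restrict the loop to the folder itself.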

Using boto3, you can set AWS credentials and download a dataset from S3:

import boto3
import os

# set aws credentials
s3r = boto3.resource('s3', aws_access_key_id='XXXXXXXXXXXXXXXXXX',
                     aws_secret_access_key='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
bucket = s3r.Bucket('bucket_name')

# downloading folder
prefix = 'dirname'
for obj in bucket.objects.filter(Prefix=prefix):
    if obj.key.endswith('/'):
        continue  # skip zero-byte "folder" placeholder keys
    # create the local parent directory before downloading into it
    os.makedirs(os.path.dirname(obj.key) or '.', exist_ok=True)
    bucket.download_file(obj.key, obj.key)

If you cannot find your access_key and secret_access_key, refer to this page.
I hope this helps.

Thanks.

A slight modification of Konstantinos Katsantonis's accepted answer:

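import boto3
import os

s3 = boto3.resource('s3')  # assumes credentials and configuration are handled outside Python, in the .aws directory or environment variables

def download_s3_folder(bucket_name, s3_folder, local_dir=None):
    """
    Download the contents of a folder directory
    Args:
        bucket_name: the name of the s3 bucket
        s3_folder: the folder path in the s3 bucket
        local_dir: a relative or absolute directory path in the local file system
    """
    bucket = s3.Bucket(bucket_name)
    for obj in bucket.objects.filter(Prefix=s3_folder):
        target = obj.key if local_dir is None \
            else os.path.join(local_dir, os.path.relpath(obj.key, s3_folder))
        if not os.path.exists(os.path.dirname(target)):
            os.makedirs(os.path.dirname(target))
        if obj.key[-1] == '/':
            continue
        bucket.download_file(obj.key, target)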

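This also downloads nested subdirectories. I was able to download a directory with more than 3,000 files. Other solutions exist, but I don't know whether they are any better.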
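The mapping from S3 keys to local paths can be checked without touching S3. The following is a minimal sketch, assuming keys use / as the separator and all lie inside s3_folder; plan_downloads is a hypothetical helper, not part of boto3. It skips "folder" placeholder keys ending in / and returns (key, local path) pairs that could then be passed to bucket.download_file.

```python
import os
import posixpath


def plan_downloads(keys, s3_folder, local_dir=None):
    """Pair each downloadable key with the local path it would be saved to."""
    pairs = []
    for key in keys:
        if key.endswith("/"):
            continue  # placeholder object that only marks a "folder"
        if local_dir is None:
            target = key  # mirror the key as a relative local path
        else:
            # S3 keys always use "/", so compute the relative part with posixpath
            rel = posixpath.relpath(key, s3_folder)
            target = os.path.join(local_dir, *rel.split("/"))
        pairs.append((key, target))
    return pairs


# Hypothetical keys, for illustration only.
for key, target in plan_downloads(
    ["data/run1/", "data/run1/a.csv", "data/run1/sub/b.csv"], "data/run1", "out"
):
    print(key, "->", target)
```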

Another approach, building on @bjc's answer, that leverages the built-in pathlib library and parses the S3 URI for you:

import boto3
from pathlib import Path
from urllib.parse import urlparse

def download_s3_folder(s3_uri, local_dir=None):
    """
    Download the contents of a folder directory
    Args:
        s3_uri: the s3 uri to the top level of the files you wish to download
        local_dir: a relative or absolute directory path in the local file system
    """
    s3 = boto3.resource("s3")
    bucket = s3.Bucket(urlparse(s3_uri).hostname)
    s3_path = urlparse(s3_uri).path.lstrip('/')
    if local_dir is not None:
        local_dir = Path(local_dir)
    for obj in bucket.objects.filter(Prefix=s3_path):
        # wrap the key in Path so that target.parent works when local_dir is None
        target = Path(obj.key) if local_dir is None else local_dir / Path(obj.key).relative_to(s3_path)
        target.parent.mkdir(parents=True, exist_ok=True)
        if obj.key[-1] == '/':
            continue
        bucket.download_file(obj.key, str(target))
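To see what the URI parsing in the function above produces, here is a minimal standalone sketch using only the standard library. The helper name split_s3_uri and the sample URI and key are hypothetical; the validation step is an addition that the answer above does not perform.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/prefix/' into (bucket, prefix without leading slash)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, prefix = split_s3_uri("s3://bucket/folder/folder2/")

# Path of a key relative to the prefix, as used to build the local target path.
relative = PurePosixPath("folder/folder2/sub/file.txt").relative_to(prefix)

print(bucket_name, prefix, relative)
```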

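The solutions above all work well and rely on the S3 resource.
The solution below achieves the same goal but uses the s3_client instead.
You may find it useful for your purposes (I have tested it and it works well).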

For S3, you can also use cloudpathlib, which wraps boto3. For your use case it is quite simple:

from cloudpathlib import CloudPath

cp = CloudPath("s3://bucket/folder/folder2/")
cp.download_to("local_folder")

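Possible duplicate.

But you did not set credentials.

@Arkady Credentials are set under ~/.aws/credentials or as environment variables.

Credentials can be set in different ways. When creating the S3 resource, you can declare the AWS credentials as follows: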

s3_resource = boto3.resource('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key)

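To make this recursive (for directories inside directories), download a file only when its key does not end with a slash, that is, when obj.key.endswith('/') is false.

It is better to avoid putting keys in code files. At worst, you can put the keys in a separate protected file and import them. It is also possible to use boto3 without any cached credentials, using s3fs instead or relying only on the config file.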

import boto3
from os import path, makedirs
from botocore.exceptions import ClientError
from boto3.exceptions import S3TransferFailedError

def download_s3_folder(s3_folder, local_dir, aws_access_key_id, aws_secret_access_key, aws_bucket, debug_en):
    """ Download the contents of a folder directory into a local area """

    success = True

    print('[INFO] Downloading %s from bucket %s...' % (s3_folder, aws_bucket))

    def get_all_s3_objects(s3, **base_kwargs):
        continuation_token = None
        while True:
            list_kwargs = dict(MaxKeys=1000, **base_kwargs)
            if continuation_token:
                list_kwargs['ContinuationToken'] = continuation_token
            response = s3.list_objects_v2(**list_kwargs)
            yield from response.get('Contents', [])
            if not response.get('IsTruncated'):
                break
            continuation_token = response.get('NextContinuationToken')

    s3_client = boto3.client('s3',
                             aws_access_key_id=aws_access_key_id,
                             aws_secret_access_key=aws_secret_access_key)

    # pass Prefix so S3 lists only the requested folder instead of the whole bucket
    all_s3_objects_gen = get_all_s3_objects(s3_client, Bucket=aws_bucket, Prefix=s3_folder)

    for obj in all_s3_objects_gen:
        source = obj['Key']
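        # keys ending in '/' are folder placeholders and cannot be saved as files
        if source.startswith(s3_folder) and not source.endswith('/'):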
            destination = path.join(local_dir, source)
            if not path.exists(path.dirname(destination)):
                makedirs(path.dirname(destination))
            try:
                s3_client.download_file(aws_bucket, source, destination)
            except (ClientError, S3TransferFailedError) as e:
                print('[ERROR] Could not download file "%s": %s' % (source, e))
                success = False
            if debug_en:
                print('[DEBUG] Downloading: %s --> %s' % (source, destination))

    return success
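The manual continuation-token loop in get_all_s3_objects can also be written with the client's built-in paginator. The following is a minimal sketch, assuming s3_client is a boto3 S3 client created elsewhere (for example with boto3.client('s3')); iter_keys is a hypothetical helper name, not part of boto3.

```python
def iter_keys(s3_client, bucket, prefix=""):
    """Yield every object key under the prefix, following pagination automatically."""
    # get_paginator wraps list_objects_v2 and handles continuation tokens itself
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # a page has no "Contents" entry when nothing matches the prefix
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

Each yielded key can then be passed to s3_client.download_file(bucket, key, destination) in the same way as in the loop above.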