Python: syncing two buckets with boto3


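Is there any way to use boto3 to loop over the contents of two different buckets (source and destination) and, if it finds any key in the source that does not exist in the destination, upload it to the destination bucket? Please note that I do not want to use aws s3 sync. I am currently using the following code to do this: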

import boto3

s3 = boto3.resource('s3')
src = s3.Bucket('sourcenabcap')
dst = s3.Bucket('destinationnabcap')
objs = list(dst.objects.all())
for k in src.objects.all():
    if k.key != objs[0].key:
        # copy the k.key to target
        pass

If you only want to compare by key (ignoring differences in the objects' contents), you could use something like this:

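import boto3

s3 = boto3.resource('s3')
source_bucket = s3.Bucket('source')
destination_bucket = s3.Bucket('destination')
# A set makes each membership test O(1) instead of scanning a list.
destination_keys = {obj.key for obj in destination_bucket.objects.all()}
for obj in source_bucket.objects.all():
    if obj.key not in destination_keys:
        # Server-side copy of the missing object into the destination bucket.
        destination_bucket.copy({'Bucket': source_bucket.name, 'Key': obj.key}, obj.key)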

If you decide not to use boto3: a sync command is still not available in boto3, so you can call the AWS CLI directly.

# python 3

import os

sync_command = f"aws s3 sync s3://source-bucket/ s3://destination-bucket/"
os.system(sync_command)
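
The same CLI call could also be made through the subprocess module, which passes the arguments as a list instead of a shell string. This is only a minimal sketch, assuming the AWS CLI is installed and configured; build_sync_command and run_sync are hypothetical helper names, not part of boto3 or the CLI.

```python
import subprocess


def build_sync_command(source_bucket: str, destination_bucket: str) -> list:
    """Build the argument list for `aws s3 sync` (no shell involved)."""
    return [
        "aws", "s3", "sync",
        f"s3://{source_bucket}/",
        f"s3://{destination_bucket}/",
    ]


def run_sync(source_bucket: str, destination_bucket: str) -> int:
    """Run the sync; check=True raises CalledProcessError on a non-zero exit."""
    command = build_sync_command(source_bucket, destination_bucket)
    completed = subprocess.run(command, check=True)
    return completed.returncode
```

Calling run_sync("source-bucket", "destination-bucket") would then execute the same command as the os.system line.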

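I just implemented a simple class for this (syncing a local folder to a bucket). I'm posting it here in the hope that it helps anyone with the same problem.

You could modify S3Sync.sync to take file size into account.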

from bisect import bisect_left
from pathlib import Path

import boto3


class S3Sync:
    """
    Class that holds the operations needed to synchronize local dirs to a given bucket.
    """

    def __init__(self):
        self._s3 = boto3.client('s3')

    def sync(self, source: str, dest: str) -> None:
        """
        Sync source to dest: every element that exists in source
        but not in dest is copied to dest.

        No element will be deleted.

        :param source: Source folder.
        :param dest: Destination bucket name.

        :return: None
        """

        paths = self.list_source_objects(source_folder=source)
        objects = self.list_bucket_objects(dest)

        # Getting the keys and ordering to perform binary search
        # each time we want to check if any paths is already there.
        object_keys = [obj['Key'] for obj in objects]
        object_keys.sort()
        object_keys_length = len(object_keys)
        
        for path in paths:
            # Binary search.
            index = bisect_left(object_keys, path)
            if index == object_keys_length or object_keys[index] != path:
                # If path not found in object_keys, it has to be sync-ed.
                self._s3.upload_file(str(Path(source).joinpath(path)), Bucket=dest, Key=path)

    def list_bucket_objects(self, bucket: str) -> [dict]:
        """
        List all objects for the given bucket.

        :param bucket: Bucket name.
        :return: A [dict] containing the elements in the bucket.

        Example of a single object.

        {
            'Key': 'example/example.txt',
            'LastModified': datetime.datetime(2019, 7, 4, 13, 50, 34, 893000, tzinfo=tzutc()),
            'ETag': '"b11564415be7f58435013b414a59ae5c"',
            'Size': 115280,
            'StorageClass': 'STANDARD',
            'Owner': {
                'DisplayName': 'webfile',
                'ID': '75aa57f09aa0c8caeab4f8c24e99d10f8e7faeebf76c078efc7c6caea54ba06a'
            }
        }

        """
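        # list_objects returns at most 1000 keys per call, so paginate.
        paginator = self._s3.get_paginator('list_objects')
        contents = []
        for page in paginator.paginate(Bucket=bucket):
            # An empty bucket yields a page without a 'Contents' key.
            contents.extend(page.get('Contents', []))
        return contents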

    @staticmethod
    def list_source_objects(source_folder: str) -> [str]:
        """
        :param source_folder:  Root folder for resources you want to list.
        :return: A [str] containing relative names of the files.

        Example:

            /tmp
                - example
                    - file_1.txt
                    - some_folder
                        - file_2.txt

            >>> sync.list_source_objects("/tmp/example")
            ['file_1.txt', 'some_folder/file_2.txt']

        """

        path = Path(source_folder)

        paths = []

        for file_path in path.rglob("*"):
            if file_path.is_dir():
                continue
            # as_posix() keeps '/' separators so the path works as an S3 key.
            paths.append(file_path.relative_to(path).as_posix())

        return paths


if __name__ == '__main__':
    sync = S3Sync()
    sync.sync("/temp/some_folder", "some_bucket_name")

Also, replacing

if file_path.is_dir():

with

if not file_path.is_file():

lets it skip unresolvable links and other such nonsense. Thanks to @keithpjolley for pointing this out.
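
The suggestion to take file size into account could be sketched roughly as follows. This is not the author's implementation, only one possible approach: index_remote_sizes and needs_upload are hypothetical helpers, and the sketch assumes each listed object is a dict carrying the 'Key' and 'Size' fields shown in the docstring example.

```python
def index_remote_sizes(objects):
    """Map each object key to its size in bytes."""
    return {obj["Key"]: obj["Size"] for obj in objects}


def needs_upload(relative_path, local_size, remote_sizes):
    """Return True if the file is missing remotely or its size differs.

    A missing key yields None from .get(), which never equals an int size,
    so missing files are reported as needing an upload too.
    """
    return remote_sizes.get(relative_path) != local_size
```

Inside sync, the local size could come from Path(source).joinpath(path).stat().st_size. Equal sizes do not guarantee equal contents, so this remains a heuristic.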

Get the destination account ID (DEST_ACCOUNT_ID).

Create the source bucket and add this policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DelegateS3Access",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::DEST_ACCOUNT_ID:root"
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::s3-copy-test/*",
        "arn:aws:s3:::s3-copy-test"
      ]
    }
  ]
}
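
If you would rather attach the policy from Python than from the console, something along these lines might work. It is a sketch under assumptions: build_source_bucket_policy and apply_policy are hypothetical helper names, the bucket name and account ID are placeholders, and s3_client is expected to be a boto3 S3 client (boto3.client('s3')) with permission to call put_bucket_policy.

```python
import json


def build_source_bucket_policy(bucket_name: str, dest_account_id: str) -> dict:
    """Build a policy letting the destination account list and read the bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DelegateS3Access",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{dest_account_id}:root"},
                "Action": ["s3:ListBucket", "s3:GetObject"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}/*",
                    f"arn:aws:s3:::{bucket_name}",
                ],
            }
        ],
    }


def apply_policy(s3_client, bucket_name: str, dest_account_id: str) -> None:
    """Attach the policy to the source bucket; Policy must be a JSON string."""
    policy = build_source_bucket_policy(bucket_name, dest_account_id)
    s3_client.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))
```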

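Yes, this seems fine, but because the objects in the destination are inside a folder (e.g. ABC), the object names differ from the source, so I have to use filter(Prefix='ABC/'). For example, an object in the source is named name1 while in the destination it is named ABC/name. Is there any way to make them comparable?

You could strip the part of the string before the last slash.

os.system() is no longer recommended. Use the subprocess module instead.

Replacing

if file_path.is_dir():

with

if not file_path.is_file():

lets it skip unresolvable links and other such nonsense.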
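
For the question about a destination folder such as ABC/, one way to make the keys comparable is to strip the known prefix from the destination keys before comparing. Note this strips a fixed prefix rather than everything before the last slash, so nested source keys are preserved. It is only a sketch: keys_missing_in_destination is a hypothetical helper that works on plain lists of key strings.

```python
def keys_missing_in_destination(source_keys, destination_keys, prefix):
    """Return source keys that have no `prefix + key` counterpart in the destination."""
    # Drop the prefix from destination keys; keys outside the prefix are ignored.
    stripped = {
        key[len(prefix):] for key in destination_keys if key.startswith(prefix)
    }
    return [key for key in source_keys if key not in stripped]
```

The destination keys could be collected with destination_bucket.objects.filter(Prefix='ABC/'), and each missing key would then be copied to prefix + key.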