Python: Make Boto3 upload calls blocking (single-threaded)


Edit: my original assumption turned out to be partly wrong. I have added a lengthy answer here and invite others to stress-test and correct it.

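I am looking for a way to use the Boto3 S3 API in a single-threaded manner to mimic a thread-safe key-value store. In a nutshell, I want the upload to run on the calling thread rather than on a new thread.
As far as I can tell, the default behavior of the .upload_fileobj() method in Boto3 (or .upload_file()) is to kick the task off to a new thread and return None immediately.
From the docs: "This is a managed transfer which will perform a multipart upload in multiple threads if necessary."
(If my understanding of this is wrong in the first place, a correction would also be helpful. This is in Boto3 1.9.134.)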
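The call being discussed can be sketched as follows. This is only an illustration: it assumes a bucket object that exposes upload_fileobj(fileobj, key) the way a boto3 Bucket resource does, and the helper name upload_small_buffer and the key testobj are made up for the example.

```python
import io


def upload_small_buffer(bucket, key="testobj"):
    """Upload a 4-byte buffer and hand back whatever the call returns.

    `bucket` is assumed to behave like boto3.resource("s3").Bucket(name);
    the key name is a placeholder.
    """
    buf = io.BytesIO(b"test")  # short 4-byte payload
    return bucket.upload_fileobj(buf, key)
```

With a real bucket the return value is None, which on its own says nothing about whether the upload had already finished when the call returned.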
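Now, suppose that buf is not a short 4-byte string but a huge text blob that takes a non-negligible amount of time to upload fully.
I also use this function to check whether an object with a given key exists: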

    import botocore.exceptions

    def key_exists_in_bucket(bucket_obj, key: str) -> bool:
        try:
            # load() issues a HEAD request and raises ClientError if it fails
            bucket_obj.Object(key).load()
        except botocore.exceptions.ClientError:
            return False
        else:
            return True
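My intention is to not rewrite the object if one with that name already exists.
The race condition here is fairly obvious: kick off an upload asynchronously, then do a quick check with key_exists_in_bucket(), get back False because the object is still being written, and then needlessly write it again.
Is there a way to ensure that bucket.upload_fileobj() is executed by the current thread rather than by a new thread created within the scope of that method?
I realize this will slow things down. I am willing to sacrifice speed in this case.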
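The race the question worries about can be modeled without S3 at all. The sketch below uses an in-memory dict in place of the bucket and a sleeping thread in place of the upload; slow_put, key_exists and store are hypothetical names. It shows what would happen if an upload really did return before finishing, not what Boto3 actually does.

```python
import threading
import time

store = {}  # stands in for the S3 bucket


def slow_put(key, value, delay=0.5):
    """Pretend upload: the key only becomes visible once the write finishes."""
    time.sleep(delay)
    store[key] = value


def key_exists(key):
    return key in store


# Fire-and-forget: the existence check runs while the write is still in flight.
writer = threading.Thread(target=slow_put, args=("k", b"v"))
writer.start()
seen_during_upload = key_exists("k")  # the writer is still sleeping here
writer.join()                         # wait for the pretend upload
seen_after_join = key_exists("k")     # visible once the writer has finished
```

Waiting for the writer before checking removes the window in which the key appears to be missing.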
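upload_fileobj takes a Config parameter. This is a boto3.s3.transfer.TransferConfig object, which in turn has a parameter called use_threads (default True): if True, threads will be used when performing S3 transfers; if False, no threads will be used in performing transfers and all logic will run in the main thread.

Hope this works for you.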
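A minimal sketch of that suggestion, assuming boto3 is installed and that bucket is a boto3 Bucket resource; the helper name upload_without_threads is made up for the example.

```python
import io


def upload_without_threads(bucket, data: bytes, key: str) -> None:
    """Upload `data` with the transfer's worker threads disabled.

    `bucket` is assumed to be a boto3 Bucket resource, for example
    boto3.resource("s3").Bucket("my-bucket") (placeholder bucket name).
    """
    # Imported here so this module can be loaded without boto3 installed.
    from boto3.s3.transfer import TransferConfig

    config = TransferConfig(use_threads=False)
    bucket.upload_fileobj(io.BytesIO(data), key, Config=config)
```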
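Testing whether the method is blocking:
I tested this behavior empirically myself. First, I generated a 100 MB file with:

    dd if=/dev/zero of=100mb.txt bs=100M count=1

Then I tried to upload the file the same way you did and measured the time it took:

    import boto3
    import time
    import io

    # Read the whole file into memory, as in the question
    with open('100mb.txt', 'rb') as file:
        buf = io.BytesIO(file.read())

    bucket = boto3.resource('s3').Bucket('testbucket')
    start = time.time()
    print("starting to upload...")
    bucket.upload_fileobj(buf, '100mb')
    print("finished uploading")
    end = time.time()
    print("time: {}".format(end - start))

It took over 8 seconds for upload_fileobj() to finish and for the next Python line to be reached (50 seconds for a 1 GB file), so I assume this method is blocking.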
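Testing with threads:
When using multiple threads, I can verify that the method supports several transfers at the same time, even with the option use_threads=False. I started uploading a 200 MB file and then a 100 MB one, and the 100 MB file finished first. This confirms that the concurrency in TransferConfig relates to multipart transfers.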
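Code: (the listing for this test appears with the other code listings further down in this post)
Output:

    starting to upload file 200mb.txt
    starting to upload file 100mb.txt
    finished uploading file 100mb.txt. time: 46.35254502296448
    finished uploading file 200mb.txt. time: 61.70564889907837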
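That ordering is what one would expect if each upload blocks only its own calling thread. The sketch below models this with time.sleep standing in for the transfer; fake_upload and the durations are assumptions chosen for the illustration, not measurements.

```python
import threading
import time

finished = []            # completion order
lock = threading.Lock()


def fake_upload(name, seconds):
    """Blocks its caller for `seconds`, like a synchronous upload would."""
    time.sleep(seconds)
    with lock:
        finished.append(name)


big = threading.Thread(target=fake_upload, args=("200mb.txt", 0.6))
small = threading.Thread(target=fake_upload, args=("100mb.txt", 0.1))
big.start()    # started first, takes longer
small.start()  # started second, finishes first
big.join()
small.join()
```

Each call is blocking from the point of view of its own thread, yet the two calls overlap because they were issued from different threads.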
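Testing with sessions:
This is what you need if you want the upload methods to complete in the order in which they are called.
Code: (the listing for this test appears with the other code listings further down in this post)
Output:

    starting to upload file 200mb.txt
    starting to upload file 100mb.txt
    finished uploading file 200mb.txt. time: 46.62478971481323
    finished uploading file 100mb.txt. time: 50.5159502941895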
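Some resources I found:
- There is a question here on SO about whether the method is blocking or non-blocking. It is not conclusive, but it may contain relevant information.
- There is an open issue on GitHub to allow asynchronous transfers in boto3.
- There are also tools made specifically to allow asynchronous downloads and uploads from S3 and other AWS services.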
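About my previous answer:
You can read about file transfer configuration in the boto3 documentation. In particular:
"Transfer operations use threads to implement concurrency. Thread use can be disabled by setting the use_threads attribute to False."
Initially I thought this referred to multiple transfers being executed concurrently. However, the comment on the max_concurrency parameter of TransferConfig explains that the concurrency does not refer to multiple transfers but to "the number of threads that will be making requests to perform a transfer". So it is something used to speed up a single transfer. The use_threads attribute is used only to allow concurrency within multipart transfers.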
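The idea that the threads serve the parts of one transfer can be sketched with the standard library alone. This is a simplified model, not s3transfer's implementation: upload_in_parts is a hypothetical helper and the per-part work is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_in_parts(data: bytes, part_size: int, max_concurrency: int = 10):
    """Split `data` into parts and 'upload' them on a thread pool.

    The per-part work is a stand-in (it only reports the part length);
    the point is that the caller gets control back only after every
    part has been processed.
    """
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]

    def upload_part(part):
        return len(part)  # placeholder for a real per-part request

    # Leaving the `with` block waits for all workers, so this call blocks.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(upload_part, parts))


sizes = upload_in_parts(b"x" * 25, part_size=10)  # parts of 10, 10 and 5 bytes
```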
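I figured that, since the answers given to this question seem to be in direct conflict with one another, it would be best to go straight to the source code.
Summary

boto3 uses multiple threads (10) by default.

However, it is not asynchronous: it waits on (joins) these threads before returning, rather than using a "fire-and-forget" technique.

So, in this manner, read/write thread safety is already in place if you are trying to talk to an S3 bucket from multiple clients.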

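Details
One aspect I struggled with here is that multiple (sub-)threads do not imply that the top-level method itself is non-blocking: if the calling thread hands the upload off to several sub-threads but then waits for those threads to finish before returning, I would venture to say that this is still a blocking call. The flip side is a method call that is, in asyncio speak, a "fire-and-forget" call. With threading, it effectively comes down to whether x.join() is ever called.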
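The distinction can be made concrete with plain threading. In this sketch (start_workers is a hypothetical helper, and the sleep stands in for real work) the only difference between the blocking and the fire-and-forget variant is whether the caller joins the workers before returning.

```python
import threading
import time


def start_workers(n, wait):
    """Start n worker threads; join them only when `wait` is true.

    Returns (number of workers already done at return time, the threads).
    """
    done = []

    def work():
        time.sleep(0.3)
        done.append(1)  # list.append is thread-safe in CPython

    threads = [threading.Thread(target=work) for _ in range(n)]
    for t in threads:
        t.start()
    if wait:
        for t in threads:
            t.join()  # caller blocks until every worker has finished
    return len(done), threads


done_blocking, _ = start_workers(3, wait=True)
done_fire_and_forget, pending = start_workers(3, wait=False)
for t in pending:  # tidy up so the demo leaves no running threads
    t.join()
```

Both variants use three sub-threads, but only the joined one guarantees that all work is complete when the call returns.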
Here is some initial code to spin up the debugger, taken from Victor Val:

    import io
    import pdb

    import boto3

    # From dd if=/dev/zero of=100mb.txt bs=50M count=1
    buf = io.BytesIO(open('100mb.txt', 'rb').read())
    bucket = boto3.resource('s3').Bucket('test-threads')
    pdb.run("bucket.upload_fileobj(buf, '100mb')")

This stack frame is from Boto 1.9.134.
Now jumping into pdb:

    (Pdb) s
    --Call--
    > /home/ubuntu/envs/py372/lib/python3.7/site-packages/boto3/s3/inject.py(542)bucket_upload_fileobj()
    -> def bucket_upload_fileobj(self, Fileobj, Key, ExtraArgs=None,
    (Pdb) s
    (Pdb) l
    574
    575         :type Config: boto3.s3.transfer.TransferConfig
    576         :param Config: The transfer configuration to be used when performing the
    577             upload.
    578         """
    579  ->     return self.meta.client.upload_fileobj(
    580             Fileobj=Fileobj, Bucket=self.name, Key=Key, ExtraArgs=ExtraArgs,
    581             Callback=Callback, Config=Config)
    582
    583
    584

    (Pdb) l 531
    526
    527         subscribers = None
    528         if Callback is not None:
    529             subscribers = [ProgressCallbackInvoker(Callback)]
    530
    531         config = Config
    532         if config is None:
    533             config = TransferConfig()
    534
    535         with create_transfer_manager(self, config) as manager:
    536             future = manager.upload(
    (Pdb) unt 534
    > /home/ubuntu/envs/py372/lib/python3.7/site-packages/boto3/s3/inject.py(535)upload_fileobj()
    -> with create_transfer_manager(self, config) as manager:
    (Pdb) config
    <boto3.s3.transfer.TransferConfig object at 0x7f1790dc0cc0>
    (Pdb) config.use_threads
    True
    (Pdb) config.max_concurrency
    10

Excerpt from s3transfer/manager.py:

    # https://github.com/boto/s3transfer/blob/2aead638c8385d8ae0b1756b2de17e8fad45fffa/s3transfer/manager.py#L223

        # The executor responsible for making S3 API transfer requests
        self._request_executor = BoundedExecutor(
            max_size=self._config.max_request_queue_size,
            max_num_threads=self._config.max_request_concurrency,
            tag_semaphores={
                IN_MEMORY_UPLOAD_TAG: TaskSemaphore(
                    self._config.max_in_memory_upload_chunks),
                IN_MEMORY_DOWNLOAD_TAG: SlidingWindowSemaphore(
                    self._config.max_in_memory_download_chunks)
            },
            executor_cls=executor_cls
        )

Back in pdb:

    (Pdb) n
    > /home/ubuntu/envs/py372/lib/python3.7/site-packages/boto3/s3/inject.py(536)upload_fileobj()
    -> future = manager.upload(
    (Pdb) manager
    <s3transfer.manager.TransferManager object at 0x7f178db437f0>
    (Pdb) manager._config
    <boto3.s3.transfer.TransferConfig object at 0x7f1790dc0cc0>
    (Pdb) manager._config.use_threads
    True
    (Pdb) manager._config.max_concurrency
    10

    (Pdb) l 290, 303
    290  ->         if extra_args is None:
    291                 extra_args = {}
    292             if subscribers is None:
    293                 subscribers = []
    294             self._validate_all_known_args(extra_args, self.ALLOWED_UPLOAD_ARGS)
    295             call_args = CallArgs(
    296                 fileobj=fileobj, bucket=bucket, key=key, extra_args=extra_args,
    297                 subscribers=subscribers
    298             )
    299             extra_main_kwargs = {}
    300             if self._bandwidth_limiter:
    301                 extra_main_kwargs['bandwidth_limiter'] = self._bandwidth_limiter
    302             return self._submit_transfer(
    303                 call_args, UploadSubmissionTask, extra_main_kwargs)
    (Pdb) unt 301
    > /home/ubuntu/envs/py372/lib/python3.7/site-packages/s3transfer/manager.py(302)upload()
    -> return self._submit_transfer(
    (Pdb) extra_main_kwargs
    {}
    (Pdb) UploadSubmissionTask
    <class 's3transfer.upload.UploadSubmissionTask'>
    (Pdb) call_args
    <s3transfer.utils.CallArgs object at 0x7f178db5a5f8>
    (Pdb) l 300, 5
    300             if self._bandwidth_limiter:
    301                 extra_main_kwargs['bandwidth_limiter'] = self._bandwidth_limiter
    302  ->         return self._submit_transfer(
    303                 call_args, UploadSubmissionTask, extra_main_kwargs)
    304
    305         def download(self, bucket, key, fileobj, extra_args=None,

    (Pdb) s
    > /home/ubuntu/envs/py372/lib/python3.7/site-packages/s3transfer/manager.py(303)upload()
    -> call_args, UploadSubmissionTask, extra_main_kwargs)
    (Pdb) s
    --Call--
    > /home/ubuntu/envs/py372/lib/python3.7/site-packages/s3transfer/manager.py(438)_submit_transfer()
    -> def _submit_transfer(self, call_args, submission_task_cls,
    (Pdb) s
    > /home/ubuntu/envs/py372/lib/python3.7/site-packages/s3transfer/manager.py(440)_submit_transfer()
    -> if not extra_main_kwargs:
    (Pdb) l 440, 10
    440  ->         if not extra_main_kwargs:
    441                 extra_main_kwargs = {}
    442
    443             # Create a TransferFuture to return back to the user
    444             transfer_future, components = self._get_future_with_components(
    445                 call_args)
    446
    447             # Add any provided done callbacks to the created transfer future
    448             # to be invoked on the transfer future being complete.
    449             for callback in get_callbacks(transfer_future, 'done'):
    450                 components['coordinator'].add_done_callback(callback)

    (Pdb) l
    444             transfer_future, components = self._get_future_with_components(
    445                 call_args)
    446
    447             # Add any provided done callbacks to the created transfer future
    448             # to be invoked on the transfer future being complete.
    449  ->         for callback in get_callbacks(transfer_future, 'done'):
    450                 components['coordinator'].add_done_callback(callback)
    451
    452             # Get the main kwargs needed to instantiate the submission task
    453             main_kwargs = self._get_submission_task_main_kwargs(
    454                 transfer_future, extra_main_kwargs)
    (Pdb) transfer_future
    <s3transfer.futures.TransferFuture object at 0x7f178db5a780>

Excerpts from s3transfer (TransferCoordinator and BoundedExecutor):

    class TransferCoordinator(object):
        """A helper class for managing TransferFuture"""
        def __init__(self, transfer_id=None):
            self.transfer_id = transfer_id
            self._status = 'not-started'
            self._result = None
            self._exception = None
            self._associated_futures = set()
            self._failure_cleanups = []
            self._done_callbacks = []
            self._done_event = threading.Event()  # < ------ !!!!!!

    class BoundedExecutor(object):
        EXECUTOR_CLS = futures.ThreadPoolExecutor
        # ...
        def __init__(self, max_size, max_num_threads, tag_semaphores=None,
                     executor_cls=None):
            self._max_num_threads = max_num_threads
            if executor_cls is None:
                executor_cls = self.EXECUTOR_CLS
            self._executor = executor_cls(max_workers=self._max_num_threads)

With the default configuration this amounts to:

    from concurrent import futures

    _executor = futures.ThreadPoolExecutor(max_workers=10)

Excerpt from s3transfer/futures.py:

    # https://github.com/boto/s3transfer/blob/2aead638c8385d8ae0b1756b2de17e8fad45fffa/s3transfer/futures.py#L249

    def result(self):
        self._done_event.wait(MAXINT)

        # Once done waiting, raise an exception if present or return the
        # final result.
        if self._exception:
            raise self._exception
        return self._result

Timing the upload from the REPL:

    >>> import boto3
    >>> import time
    >>> import io
    >>>
    >>> buf = io.BytesIO(open('100mb.txt', 'rb').read())
    >>>
    >>> bucket = boto3.resource('s3').Bucket('test-threads')
    >>> start = time.time()
    >>> print("starting to upload...")
    starting to upload...
    >>> bucket.upload_fileobj(buf, '100mb')
    >>> print("finished uploading")
    finished uploading
    >>> end = time.time()
    >>> print("time: {}".format(end-start))
    time: 2.6030001640319824

A progress callback for watching the upload:

    def get_bufsize(buf, chunk=1024) -> int:
        # Measure the remaining size of the buffer, then restore its position
        start = buf.tell()
        try:
            size = 0
            while True:
                out = buf.read(chunk)
                if out:
                    size += len(out)  # count the bytes actually read
                else:
                    break
            return size
        finally:
            buf.seek(start)

    import os
    import sys
    import threading
    import time

    class ProgressPercentage(object):
        def __init__(self, filename, buf):
            self._filename = filename
            self._size = float(get_bufsize(buf))
            self._seen_so_far = 0
            self._lock = threading.Lock()
            self.start = None

        def __call__(self, bytes_amount):
            # Called from the transfer's worker threads, hence the lock
            with self._lock:
                if not self.start:
                    self.start = time.monotonic()
                self._seen_so_far += bytes_amount
                percentage = (self._seen_so_far / self._size) * 100
                sys.stdout.write(
                    "\r%s  %s of %s  (%.2f%% done, %.2fs elapsed\n" % (
                        self._filename, self._seen_so_far, self._size,
                        percentage, time.monotonic() - self.start))
                # Use sys.stdout.flush() to update on one line
                # sys.stdout.flush()

    In [19]: import io
        ...:
        ...: from boto3.session import Session
        ...:
        ...: s3 = Session().resource("s3")
        ...: bucket = s3.Bucket("test-threads")
        ...: buf = io.BytesIO(open('100mb.txt', 'rb').read())
        ...:
        ...: bucket.upload_fileobj(buf, 'mykey', Callback=ProgressPercentage("mykey", buf))
    mykey  262144 of 104857600.0  (0.25% done, 0.00s elapsed
    mykey  524288 of 104857600.0  (0.50% done, 0.00s elapsed
    mykey  786432 of 104857600.0  (0.75% done, 0.01s elapsed
    mykey  1048576 of 104857600.0  (1.00% done, 0.01s elapsed
    mykey  1310720 of 104857600.0  (1.25% done, 0.01s elapsed
    mykey  1572864 of 104857600.0  (1.50% done, 0.02s elapsed

Code for the "Testing with threads" section:

    import boto3
    import time
    import io
    from boto3.s3.transfer import TransferConfig
    import threading

    config = TransferConfig(use_threads=False)

    bucket = boto3.resource('s3').Bucket('testbucket')

    def upload(filename):
        with open(filename, 'rb') as file:
            buf = io.BytesIO(file.read())
        start = time.time()
        print("starting to upload file {}".format(filename))
        bucket.upload_fileobj(buf, filename, Config=config)
        end = time.time()
        print("finished uploading file {}. time: {}".format(filename, end - start))

    x1 = threading.Thread(target=upload, args=('200mb.txt',))
    x2 = threading.Thread(target=upload, args=('100mb.txt',))
    x1.start()
    time.sleep(2)
    x2.start()

Code for the "Testing with sessions" section:

    import boto3
    import time
    import io
    from boto3.s3.transfer import TransferConfig
    import threading

    # Note: config is defined here but not passed to upload_fileobj below
    config = TransferConfig(use_threads=False)

    session = boto3.session.Session()
    s3 = session.resource('s3')
    bucket = s3.Bucket('testbucket')

    def upload(filename):
        with open(filename, 'rb') as file:
            buf = io.BytesIO(file.read())
        start = time.time()
        print("starting to upload file {}".format(filename))
        bucket.upload_fileobj(buf, filename)
        end = time.time()
        print("finished uploading file {}. time: {}".format(filename, end - start))

    x1 = threading.Thread(target=upload, args=('200mb.txt',))
    x2 = threading.Thread(target=upload, args=('100mb.txt',))
    x1.start()
    time.sleep(2)
    x2.start()
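The TransferCoordinator and result() excerpts in this post suggest a simple pattern: work runs on a thread pool, and the caller blocks on a threading.Event until that work signals completion. The toy version below illustrates that pattern only; MiniCoordinator and submit are hypothetical names, and this is not s3transfer's actual code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class MiniCoordinator:
    """Toy model of the pattern: an Event gates access to the result."""

    def __init__(self):
        self._done_event = threading.Event()
        self._result = None
        self._exception = None

    def set_result(self, value):
        self._result = value
        self._done_event.set()

    def set_exception(self, exc):
        self._exception = exc
        self._done_event.set()

    def result(self):
        self._done_event.wait()  # blocks the caller until the work is done
        if self._exception:
            raise self._exception
        return self._result


def submit(executor, fn, *args):
    """Run fn(*args) on the executor and return a coordinator for it."""
    coordinator = MiniCoordinator()

    def run():
        try:
            coordinator.set_result(fn(*args))
        except Exception as exc:  # surfaced later by result()
            coordinator.set_exception(exc)

    executor.submit(run)
    return coordinator


with ThreadPoolExecutor(max_workers=10) as pool:
    total = submit(pool, sum, [1, 2, 3]).result()
```

A caller that returns submit(...).result() is blocking even though the work itself happens on a worker thread, which matches the behavior described in the summary.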