Python: Copying one file to multiple remote hosts in parallel over SFTP


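I want to copy a local file to multiple remote hosts in parallel using Python. I'm trying to do that with asyncio and Paramiko, since I already use both libraries for other purposes in my program.

I'm using loop.run_in_executor() with the default ThreadPoolExecutor, which is effectively a newer interface to the old threading library, along with Paramiko's SFTP feature to do the copying.

Here is a simplified example: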

import sys
import asyncio
import paramiko
import functools


def copy_file_node(
        *,
        user: str,
        host: str,
        identity_file: str,
        local_path: str,
        remote_path: str):
    ssh_client = paramiko.client.SSHClient()
    ssh_client.load_system_host_keys()
    ssh_client.set_missing_host_key_policy(paramiko.client.AutoAddPolicy())

    ssh_client.connect(
        username=user,
        hostname=host,
        key_filename=identity_file,
        timeout=3)

    with ssh_client:
        with ssh_client.open_sftp() as sftp:
            print("[{h}] Copying file...".format(h=host))
            sftp.put(localpath=local_path, remotepath=remote_path)
            print("[{h}] Copy complete.".format(h=host))


loop = asyncio.get_event_loop()

tasks = []

# NOTE: You'll have to update the values being passed in to
#      `functools.partial(copy_file_node, ...)`
#       to get this working on your machine.
for host in ['10.0.0.1', '10.0.0.2']:
    task = loop.run_in_executor(
        None,
        functools.partial(
            copy_file_node,
            user='user',
            host=host,
            identity_file='/path/to/identity_file',
            local_path='/path/to/local/file',
            remote_path='/path/to/remote/file'))
    tasks.append(task)

try:
    loop.run_until_complete(asyncio.gather(*tasks))
except Exception as e:
    print("At least one node raised an error:", e, file=sys.stderr)
    sys.exit(1)

loop.close()

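The problem I'm seeing is that the file is copied to the hosts serially rather than in parallel. So if the copy to a single host takes 5 seconds, the copy to two hosts takes 10 seconds, and so on.

I've tried various other approaches, including dropping SFTP and piping the file to dd on each remote host, but the copies always happen serially.

I'm probably misunderstanding something basic here. What is keeping the different threads from copying the file in parallel?

From my testing, the blocking appears to happen on the remote write, not on reading the local file. But why would that be, given that we are attempting network I/O against independent remote hosts?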

I'm not sure this is the best approach, but it works for me:

#start
from multiprocessing import Process

#omitted

tasks = []
for host in hosts:
    p = Process(
        None,
        functools.partial(
          copy_file_node,
          user=user,
          host=host,
          identity_file=identity_file,
          local_path=local_path,
          remote_path=remote_path))

    tasks.append(p)

# Start every process first, then wait for all of them to finish.
for t in tasks:
    t.start()
for t in tasks:
    t.join()

Following the comments, I added a timestamp and captured the output of the multiprocessing version, with the following result:

2015-10-24 03:06:08.749683[vagrant1] Copying file...
2015-10-24 03:06:08.751826[basement] Copying file...
2015-10-24 03:06:08.757040[upstairs] Copying file...
2015-10-24 03:06:16.222416[vagrant1] Copy complete.
2015-10-24 03:06:18.094373[upstairs] Copy complete.
2015-10-24 03:06:22.478711[basement] Copy complete.
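The answer does not show how the timestamp was added. A minimal sketch of one way to do it, assuming a hypothetical `log` helper (not part of the original code) that would replace the two `print` calls in `copy_file_node`:

```python
import datetime


def log(host, msg):
    # Hypothetical helper: prefix each line with the current wall-clock
    # time and the host name, mirroring the "<timestamp>[host] message"
    # layout of the output above.
    line = "{ts}[{h}] {m}".format(ts=datetime.datetime.now(), h=host, m=msg)
    print(line)
    return line


if __name__ == "__main__":
    log("vagrant1", "Copying file...")
    log("vagrant1", "Copy complete.")
```

If the "Copying file..." lines for all hosts carry nearly identical timestamps, the transfers started together, whatever order the lines happen to be printed in.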

There is nothing wrong with your use of asyncio.

To prove it, let's try a simplified version of your script: no Paramiko, just pure Python.

import asyncio, functools, sys, time

START_TIME = time.monotonic()

def log(msg):
    print('{:>7.3f} {}'.format(time.monotonic() - START_TIME, msg))

def dummy(thread_id):
    log('Thread {} started'.format(thread_id))
    time.sleep(1)
    log('Thread {} finished'.format(thread_id))

loop = asyncio.get_event_loop()
tasks = []
for i in range(0, int(sys.argv[1])):
    task = loop.run_in_executor(None, functools.partial(dummy, thread_id=i))
    tasks.append(task)
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

With two threads, it prints:

$ python3 async.py 2
  0.001 Thread 0 started
  0.002 Thread 1 started       <-- 2 tasks are executed concurrently
  1.003 Thread 0 finished
  1.003 Thread 1 finished      <-- Total time is 1 second

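With five threads, all five tasks still run concurrently:

$ python3 async.py 5
  0.001 Thread 0 started
  ...
  0.003 Thread 4 started       <-- 5 tasks are executed concurrently
  1.002 Thread 0 finished
  ...
  1.005 Thread 4 finished      <-- Total time is still 1 second

With six threads, the sixth task does not start until one of the first five has finished:

$ python3 async.py 6
  0.001 Thread 0 started
  0.001 Thread 1 started
  0.002 Thread 2 started
  0.003 Thread 3 started
  0.003 Thread 4 started       <-- 5 tasks are executed concurrently
  1.002 Thread 0 finished
  1.003 Thread 5 started       <-- 6th task is executed after 1 second
  1.003 Thread 1 finished
  1.004 Thread 2 finished
  1.004 Thread 3 finished
  1.004 Thread 4 finished      <-- 5 tasks are completed after 1 second
  2.005 Thread 5 finished      <-- 6th task is completed after 2 seconds

Comments:

Maybe paramiko is using some locks internally. Have you tried ProcessPoolExecutor?

I replaced copy_file_node() with some dummy code and it worked fine, so I think paramiko is what prevents the concurrency. If that is the case, ProcessPoolExecutor should solve the problem. Can you post the ProcessPoolExecutor version of your code?

@NickChammas Are you sure network bandwidth isn't the bottleneck?

@NickChammas Try manually copying the file to the two hosts at the same time via scp and see how long it takes.

@AlexanderLukani13 - Actually, you may be right about bandwidth. If I try two separate scp processes, one always finishes in about 23 seconds while the other takes about 38 seconds. Wow! So maybe nothing is wrong except my assumptions about my own environment... :)

I'll give it a try. However, shouldn't this be functionally equivalent to using ProcessPoolExecutor with asyncio?

I would think so, since the basic API is the same, but without reading through the full source of both I would also hazard something like "they implement things slightly differently". Since you are using futures, that raises the question of exactly which versions of Python and the various modules you are using. The question is tagged 3.x, but futures suggests you are using 2.x with a 3.x backport, or an early 3.x, or some other old module. It is conceivable that you have hit a strange interaction between module versions, or an unhandled edge case in the backport. I used Python 3.4.3 and recent versions of everything else.

I'm using Python 3.5.0 and Paramiko 1.15.3. concurrent.futures is what the asyncio documentation refers to when explaining which executors can be passed to run_in_executor(). Anyway, let me try it and see whether it makes a difference. By the way, when you say it works for you, did you copy a large enough file to two remote hosts to notice whether they upload serially or in parallel?

Try adding a timestamp to the output, or profiling the threads. Because of the way threaded code returns, output is not necessarily printed in the same order in which the threads produced it. You may actually be getting simultaneous transfers without knowing it!

If it is bandwidth, then since your remote machines are in EC2 you could always copy the file to one of the EC2 hosts and fan out from there to the others, or send the file to S3 and pull the data onto the hosts from there.

And, as you helped me see in the comments on the question, the reason behind the apparently serial uploads I was seeing was simply my upload bandwidth!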
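In the six-task run shown earlier, the sixth task only starts once a worker thread frees up, which suggests the default pool in that environment was limited to five workers. A minimal sketch of one way around that is to pass an explicit `ThreadPoolExecutor` to `run_in_executor()` instead of `None`. The names `run_all` and `timed_run` are hypothetical, `dummy` stands in for a blocking call such as an SFTP upload, and the sketch assumes a Python version that provides `asyncio.run()`:

```python
import asyncio
import concurrent.futures
import functools
import time


def dummy(task_id, delay):
    # Stand-in for a blocking call such as an SFTP upload.
    time.sleep(delay)
    return task_id


async def run_all(count, delay):
    loop = asyncio.get_running_loop()
    # One worker per task, so no task has to wait for a free thread.
    with concurrent.futures.ThreadPoolExecutor(max_workers=count) as pool:
        futures = [
            loop.run_in_executor(pool, functools.partial(dummy, i, delay))
            for i in range(count)
        ]
        # gather() returns the results in the order the futures were passed.
        return await asyncio.gather(*futures)


def timed_run(count, delay):
    # Run all tasks and report how long the whole batch took.
    start = time.monotonic()
    results = asyncio.run(run_all(count, delay))
    return results, time.monotonic() - start


if __name__ == "__main__":
    results, elapsed = timed_run(6, 1.0)
    print("{} tasks finished in {:.3f}s".format(len(results), elapsed))
```

Sizing the pool to the number of hosts only removes the queueing seen with six tasks. It does not help when upload bandwidth is the real limit, as turned out to be the case in this question. `run_in_executor()` also accepts other `concurrent.futures` executors, such as `ProcessPoolExecutor`, in the same position.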