How can I efficiently chain parallel tasks in Python and pass intermediate results to the engines?


I am trying to chain multiple tasks together in iPyParallel, e.g.:

import ipyparallel
client = ipyparallel.Client()
view = client.load_balanced_view()
def task1(x):
    ## Do some work.
    return x * 2
def task2(x):
    ## Do some work.
    return x * 3
def task3(x):
    ## Do some work.
    return x * 4
results1 = view.map_async(task1, [1, 2, 3])
results2 = view.map_async(task2, results1.get())
results3 = view.map_async(task3, results2.get())
However, this code does not submit any task2 until task1 is finished; it is essentially blocking. My tasks can take very different amounts of time, which makes this inefficient. Is there an easy way to chain these steps efficiently, so that engines can pick up the results of earlier steps? Something like:

def task2(x):
    ## Do some work.
    return x.get() * 3 ## Get AsyncResult out.
def task3(x):
    ## Do some work.
    return x.get() * 4 ## Get AsyncResult out.
results1 = [view.apply_async(task1, x) for x in [1, 2, 3]]
results2 = []
for x in results1:
    view.set_flags(after=x.msg_ids)
    results2.append(view.apply_async(task2, x))
results3 = []
for x in results2:
    view.set_flags(after=x.msg_ids)
    results3.append(view.apply_async(task3, x))
Obviously, this will fail, because an AsyncResult is not picklable.

I have been considering a few solutions:

  • Use view.map_async(ordered=False)

    But this still has to wait for all of the task1 calls to finish before any task3 can be submitted. It is still blocking.
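
    For reference, a sketch of what this would look like (this loop is not in the original); the client-side iteration is exactly what keeps it blocking:

    results1 = view.map_async(task1, [1, 2, 3], ordered=False)
    results2 = []
    # ordered=False hands back task1 results as they complete...
    for r in results1:
        results2.append(view.apply_async(task2, r))
    # ...but this loop holds the client until every task1 is done,
    # so no task3 can be submitted before that point.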

  • Use asyncio

    @asyncio.coroutine
    def submitter(x):
        result1 = yield from asyncio.wrap_future(view.apply_async(task1, x))
        result2 = yield from asyncio.wrap_future(view.apply_async(task2, result1))
        result3 = yield from asyncio.wrap_future(view.apply_async(task3, result2))
        return result3

    @asyncio.coroutine
    def submit_all(ls):
        jobs = [submitter(x) for x in ls]
        results = []
        for async_r in asyncio.as_completed(jobs):
            r = yield from async_r
            results.append(r)
        ## Do some work, like analysing results.
    
    This works, but the code quickly becomes messy and unintuitive once more complicated tasks are introduced.
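
    For completeness, here is one way the coroutines above could be driven; a minimal sketch, assuming the Python 3.4-style event loop API used in the snippet (the loop setup is not part of the original):

    import asyncio

    loop = asyncio.get_event_loop()
    # submit_all schedules one submitter() chain per input item
    loop.run_until_complete(submit_all([1, 2, 3]))
    loop.close()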

Any help is appreciated.

    Option 1: Chaining futures

    IPython parallel is not the best at this, because the chaining has to be done at the client level: you have to wait for results to finish and return to the client before submitting the tasks that depend on them. Essentially, your asyncio submit_all is already the right way to do this with IPython parallel. You can get something a bit more generic by writing a chain function that uses add_done_callback to submit the new task when the previous one finishes:

    from concurrent.futures import Future
    from functools import partial
    
    
    def chain_apply(view, func, future):
        """Chain a call to view.apply(func, future.result()) when future is ready.
    
        Returns a Future for the subsequent result.
        """
        f2 = Future()
        # when f1 is ready, submit a new task for func on its result
        def apply_func(f):
            if f.exception():
                f2.set_exception(f.exception())
                return
            print('submitting %s(%s)' % (func.__name__, f.result()))
            ar = view.apply_async(func, f.result())
            # when ar is done, pass through the result to f2
            ar.add_done_callback(lambda ar: f2.set_result(ar.get()))
    
        future.add_done_callback(apply_func)
        return f2
    
    
    def chain_map(view, func, list_of_futures):
        """Chain a new callback on a list of futures."""
        return [ chain_apply(view, func, f) for f in list_of_futures ]
    
    # use builtin map with apply, since we want one Future per item
    results1 = map(partial(view.apply, task1), [1, 2, 3])
    results2 = chain_map(view, task2, results1)
    results3 = chain_map(view, task3, results2)
    print("Waiting for results")
    [ r.result() for r in results3 ]
    
    As with any add_done_callback example, this could also be written with coroutines, but I find the callbacks fine in this case. This should at least be a fairly generic utility that you can use to compose your pipeline.
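
    As a usage note (not in the original): each chained result is a plain concurrent.futures.Future, so it can also be awaited from asyncio via wrap_future, which is one way to combine this utility with the coroutine style from the question:

    import asyncio

    @asyncio.coroutine
    def collect(futures):
        # wrap each concurrent.futures.Future so the event loop can await it
        wrapped = [asyncio.wrap_future(f) for f in futures]
        return (yield from asyncio.gather(*wrapped))

    loop = asyncio.get_event_loop()
    print(loop.run_until_complete(collect(results3)))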

    Option 2: dask.distributed

    Full disclosure: I am the primary author of IPython Parallel, and I am going to suggest that you use a different tool.

    It is possible to pass results from one task to another using IPython parallel's engine namespaces and DAG dependencies, but honestly, if your workflow looks like this, you should consider using dask.distributed, which is designed precisely for this kind of computation graph. If you are already up and running with IPython parallel, adopting dask should not be too much of a burden.

    IPython 5.1 provides a convenient method for turning an IPython parallel cluster into a dask.distributed cluster:

    import ipyparallel as ipp
    client = ipp.Client()
    executor = client.become_distributed(ncores=1)
    
    The key relevant feature of dask here is that you can submit futures as arguments to subsequent map calls; the scheduler handles the hand-off as soon as the results are ready, without you having to do it explicitly in the client:

    results1 = executor.map(task1, [1, 2, 3])
    results2 = executor.map(task2, results1)
    results3 = executor.map(task3, results2)
    executor.gather(results3)
    
    So basically, dask.distributed works the way you wish IPython parallel's load balancing would work when you need this kind of chaining.
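
    The same pattern also works without going through an IPython parallel cluster at all; a minimal sketch, assuming a standalone local dask.distributed client (the Client() setup is not in the original answer):

    from distributed import Client

    # with no arguments, Client() starts a local scheduler and workers
    executor = Client()

    results1 = executor.map(task1, [1, 2, 3])   # returns futures immediately
    results2 = executor.map(task2, results1)    # futures are accepted as arguments
    results3 = executor.map(task3, results2)
    print(executor.gather(results3))            # [24, 48, 72] for inputs 1, 2, 3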


    This notebook demonstrates both examples.

    Having played with these options, I think I will still go with asyncio plus a wrapper class to keep the code tidy. I am not very familiar with dask, and since the parallel part is already handled by IPyParallel, dragging dask into the project just to monitor results feels like overkill.
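
    A minimal sketch of what such a wrapper might look like, written in the newer async/await style (the TaskChain name and its methods are hypothetical, not from the original); it assumes, as in the question's snippet, that asyncio.wrap_future accepts the AsyncResult returned by apply_async:

    import asyncio

    class TaskChain:
        """Hypothetical helper: run a sequence of tasks over a view,
        moving each item on to the next task as soon as it is ready."""

        def __init__(self, view, *tasks):
            self.view = view
            self.tasks = tasks  # functions applied in order

        async def _run_one(self, x):
            for task in self.tasks:
                # awaiting yields control to the loop instead of blocking the client
                x = await asyncio.wrap_future(self.view.apply_async(task, x))
            return x

        async def run(self, items):
            return await asyncio.gather(*(self._run_one(x) for x in items))

    # usage:
    # loop = asyncio.get_event_loop()
    # print(loop.run_until_complete(TaskChain(view, task1, task2, task3).run([1, 2, 3])))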