How to use IPython.parallel map() with generators as input to a function
I am trying to use IPython.parallel map. The inputs to the function I want to parallelize are generators. Because of their size and memory use, I cannot convert the generators to lists. See the code below:
from itertools import product
from IPython.parallel import Client

c = Client()
v = c[:]
c.ids

def stringcount(longstring, substrings):
    scount = [longstring.count(s) for s in substrings]
    return scount

substrings = product('abc', repeat=2)
longstring = product('abc', repeat=3)

# This is what I want to do in parallel
# It should be 'for longs in longstring'; I use range() because it can get long.
for num in range(10):
    longs = next(longstring)
    subs = next(substrings)
    print(subs, longs)
    count = stringcount(longs, subs)
    print(count)

# This does not work, and I understand why.
# I don't know how to fix it while keeping longstring and substrings as
# generators
v.map(stringcount, longstring, substrings)
for r in v:
    print(r.get())
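You can't use View.map with a generator without walking through the entire generator first. But you can write your own custom function that submits batches of tasks from the generator and waits for them incrementally. I don't have a more interesting example, but I can illustrate with a terrible implementation of a prime search.
Start with our token "data generator":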
from math import sqrt

def generate_possible_factors(N):
    """generator for iterating through possible factors for N

    yields 2, every odd integer <= sqrt(N)
    """
    if N <= 3:
        return
    yield 2
    f = 3
    last = int(sqrt(N))
    while f <= last:
        yield f
        f += 2
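The snippets that follow call `is_factor(f, N)`, but its definition does not appear in this copy of the answer. A minimal version, assuming it does nothing more than test divisibility (the body is an assumption, not the answer's original code):

```python
def is_factor(f, N):
    """Return True if f divides N evenly (assumed definition)."""
    return N % f == 0
```

Because it is a plain top-level function that takes only integers, it is simple to hand to `apply_async` the way the parallel version does.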
And a complete implementation of a prime check using the generator and our factor function:
def dumb_prime(N):
    """dumb implementation of is N prime?"""
    for f in generate_possible_factors(N):
        if is_factor(f, N):
            return False
    return True
A parallel version that only submits a limited number of tasks at a time:
def parallel_dumb_prime(N, v, max_outstanding=10, dt=0.1):
    """dumb_prime where each factor is checked remotely

    Up to `max_outstanding` factors will be checked in parallel.
    Submission will halt as soon as we know that N is not prime.
    """
    tasks = set()
    # factors is a generator
    factors = generate_possible_factors(N)
    while True:
        try:
            # submit a batch of tasks, with a maximum of `max_outstanding`
            for i in range(max_outstanding - len(tasks)):
                f = next(factors)
                tasks.add(v.apply_async(is_factor, f, N))
        except StopIteration:
            # no more factors to test, stop submitting
            break
        # get the tasks that are done
        ready = set(task for task in tasks if task.ready())
        while not ready:
            # wait a little bit for some tasks to finish
            v.wait(tasks, timeout=dt)
            ready = set(task for task in tasks if task.ready())
        for t in ready:
            # get the result - if True, N is not prime, we are done
            if t.get():
                return False
        # update tasks to only those that are still pending,
        # and submit the next batch
        tasks.difference_update(ready)
    # check the last few outstanding tasks
    for task in tasks:
        if task.get():
            return False
    # checked all candidates, none are factors, so N is prime
    return True
This submits a limited number of tasks at a time, and as soon as we know that N is not prime, we stop consuming the generator.
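To try the same bounded-submission idea without a running cluster, here is a sketch that swaps IPython.parallel for the standard library's `concurrent.futures` (Python 3). The helper name `any_bounded` and the thread pool are assumptions made for illustration; with a real view you would keep `apply_async` as in the answer.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from itertools import islice


def any_bounded(func, items, executor, max_outstanding=10):
    """Return True as soon as func(item) is truthy for some item.

    At most `max_outstanding` calls are in flight at once, so `items`
    (any iterator) is only consumed as fast as results come back.
    """
    items = iter(items)
    pending = set()
    while True:
        # Top up the in-flight set from the generator.
        for item in islice(items, max_outstanding - len(pending)):
            pending.add(executor.submit(func, item))
        if not pending:
            # Generator exhausted and every result was falsy.
            return False
        # Block until at least one submitted call has finished.
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        if any(f.result() for f in done):
            for f in pending:
                f.cancel()  # best effort; calls already running still finish
            return True


with ThreadPoolExecutor(max_workers=4) as pool:
    # Far too many candidates to hold in a list, but only a few are ever pulled.
    candidates = (f for f in range(2, 10**9))
    found = any_bounded(lambda f: 91 % f == 0, candidates, pool, max_outstanding=8)
    print(found)  # True, since 7 divides 91
```

Waiting for the first completed future before topping up is what keeps memory bounded: the generator is never advanced more than `max_outstanding` items ahead of the results.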
To use parallel_dumb_prime:
from IPython import parallel

rc = parallel.Client()
view = rc.load_balanced_view()

for N in range(900, 1000):
    if parallel_dumb_prime(N, view, 10):
        print(N)
A more complete illustration was linked from the original answer.
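I took a slightly different approach to your problem, which may be useful to others. Below, I try to mimic the behavior of the multiprocessing.pool.Pool.imap method by wrapping IPython.parallel's map. This required rewriting your function slightly.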
from itertools import product
from IPython.parallel import Client

def stringcount(pair):
    # pair is a (longstring, substrings) tuple
    longstring, substrings = pair
    scount = [longstring.count(s) for s in substrings]
    return (longstring, substrings, scount)

def gen_pairs(long_string, sub_strings):
    for l in long_string:
        try:
            s = next(sub_strings)
        except StopIteration:
            # stop pairing once sub_strings runs out
            return
        yield (l, s)

def imap(function, generator, view, preprocessor=iter, chunksize=256):
    num_cores = len(view.client.ids)
    queue = []
    for n in preprocessor(generator):
        queue.append(n)
        # flush once a full batch has accumulated
        if len(queue) >= chunksize * num_cores:
            for result in view.map(function, queue):
                yield result
            queue = []
    # flush whatever is left over
    if queue:
        for result in view.map(function, queue):
            yield result

client = Client()
lbview = client.load_balanced_view()

longstring = product('abc', repeat=3)
substrings = product('abc', repeat=2)

for result in imap(stringcount, gen_pairs(longstring, substrings), lbview):
    print(result)
The output I see is in a notebook linked from the original answer.
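The chunking logic in this approach can be checked on its own, without any engines. The sketch below (Python 3) assumes a hypothetical helper `chunked` built on `itertools.islice` and takes the map function as a parameter, so the built-in `map` stands in for `view.map`; the names `chunked`, `lazy_map`, and `count_chars` are illustrative, not part of IPython.parallel.

```python
from itertools import islice, product


def chunked(iterable, size):
    """Yield lists of at most `size` items, pulling from `iterable` lazily."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def lazy_map(map_func, function, iterable, chunksize=256):
    """Apply `function` chunk by chunk; only one chunk of inputs is held at a time."""
    for chunk in chunked(iterable, chunksize):
        for result in map_func(function, chunk):
            yield result


def count_chars(pair):
    # pair is a (longstring, substrings) tuple, as in the answer above
    longstring, substrings = pair
    return [longstring.count(s) for s in substrings]


# zip is lazy in Python 3, so neither product is expanded up front
pairs = zip(product('abc', repeat=3), product('abc', repeat=2))
for counts in lazy_map(map, count_chars, pairs, chunksize=4):
    print(counts)
```

Passing `view.map` as `map_func` is how this would attach to a load-balanced view. A smaller `chunksize` lowers peak memory at the cost of more round trips, and breaking out of the loop early leaves the rest of the generator unconsumed.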
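Can you be more specific about your requirements for how many items may be in memory at once? Since execution is asynchronous, if you walk through a generator you will probably end up with nearly all of the inputs in memory, unless you start waiting for results before submitting new tasks.
Since I'm running 64-bit, I suppose my limit is system memory, which is 8 GB, or 32 GB on a machine I can use. For example, product('abcd', repeat=10) gets very large. Basically, I can stop as soon as I find a result that meets my requirement based on the counts. I assumed/hoped that map would pull from the generator as needed. Waiting for results is fine.
Thanks for the answer, I'm trying to find time to look at it. Other things have become more important.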