Why does this Python 0MQ script for distributed computing hang at a fixed input size?


I recently started learning about this. Earlier today, I stumbled upon a blog post; the 0MQ guide I was reading talked about it, so I decided to try it out.

Instead of just having the workers compute products of numbers as in the original code, I decided to have the ventilator send large arrays to the workers via 0MQ messages. Below is the code I used in my "experiments".

As one of the comments below points out, the code hangs whenever I try to increase the variable string_length to a number larger than 3MB.

Typical symptom: say we set string_length to 4MB (i.e. 4194304). The result manager may get the result from one worker, and then the code just pauses. htop shows that the 2 cores are not doing much, and the Etherape network traffic monitor shows no traffic on the lo interface either.

So far, after hours of looking around, I haven't been able to figure out what is causing this, and I'd appreciate a hint or two on the cause and any way to resolve the issue. Thanks.

I am running Ubuntu 11.04 64-bit on a Dell laptop with an Intel Core Duo CPU, 8GB of RAM, an 80GB Intel X25MG2 SSD, Python 2.7.1+, libzmq1 2.1.10-1chl1~natty1, and python-pyzmq 2.1.10-1chl1~natty1.

import time
import zmq
from multiprocessing import Process, cpu_count

np = cpu_count() 
pool_size = np
number_of_elements = 128
# Odd, why once the slen is bumped to 3MB or above, the code hangs?
string_length = 1024 * 1024 * 3

def create_inputs(nelem, slen, pb=True):
    '''
    Generates an array that contains nelem fix-sized (of slen bytes)
    random strings and an accompanying array of hexdigests of the 
    former's elements.  Both are returned in a tuple.

    :type nelem: int
    :param nelem: The desired number of elements in the to be generated
                  array.
    :type slen: int
    :param slen: The desired number of bytes of each array element.
    :type pb: bool
    :param pb: If True, displays a text progress bar during input array
               generation.
    '''
    from os import urandom
    import sys
    import hashlib

    if pb:
        if nelem <= 64:
            toolbar_width = nelem
            chunk_size = 1
        else:
            toolbar_width = 64
            chunk_size = nelem // toolbar_width
        description = '%d random strings of %d bytes. ' % (nelem, slen) 
        s = ''.join(('Generating an array of ', description, '...\n'))
        sys.stdout.write(s)
        # create an ASCII progress bar
        sys.stdout.write("[%s]" % (" " * toolbar_width))
        sys.stdout.flush()
        sys.stdout.write("\b" * (toolbar_width+1)) 
    array   = list()
    hash4a  = list()
    try:
        for i in range(nelem):
            e = urandom(int(slen))
            array.append(e)
            h = hashlib.md5()
            h.update(e)
            he = h.hexdigest()
            hash4a.append(he)
            i += 1
            if pb and i and i % chunk_size == 0:
                sys.stdout.write("-")
                sys.stdout.flush()
        if pb:
            sys.stdout.write("\n")
    except MemoryError:
        print('Memory Error: discarding existing arrays')
        array  = list()
        hash4a = list()
    finally:
        return array, hash4a

# The "ventilator" function generates an array of nelem fix-sized (of slen
# bytes long) random strings, and sends the array down a zeromq "PUSH"
# connection to be processed by listening workers, in a round robin load
# balanced fashion.

def ventilator():
    # Initialize a zeromq context
    context = zmq.Context()

    # Set up a channel to send work
    ventilator_send = context.socket(zmq.PUSH)
    ventilator_send.bind("tcp://127.0.0.1:5557")

    # Give everything a second to spin up and connect
    time.sleep(1)

    # Create the input array
    nelem = number_of_elements
    slen = string_length
    payloads = create_inputs(nelem, slen)

    # Send an array to each worker
    for num in range(np):
        work_message = { 'num' : payloads }
        ventilator_send.send_pyobj(work_message)

    time.sleep(1)

# The "worker" functions listen on a zeromq PULL connection for "work"
# (array to be processed) from the ventilator, get the length of the array
# and send the results down another zeromq PUSH connection to the results
# manager.

def worker(wrk_num):
    # Initialize a zeromq context
    context = zmq.Context()

    # Set up a channel to receive work from the ventilator
    work_receiver = context.socket(zmq.PULL)
    work_receiver.connect("tcp://127.0.0.1:5557")

    # Set up a channel to send result of work to the results reporter
    results_sender = context.socket(zmq.PUSH)
    results_sender.connect("tcp://127.0.0.1:5558")

    # Set up a channel to receive control messages over
    control_receiver = context.socket(zmq.SUB)
    control_receiver.connect("tcp://127.0.0.1:5559")
    control_receiver.setsockopt(zmq.SUBSCRIBE, "")

    # Set up a poller to multiplex the work receiver and control receiver channels
    poller = zmq.Poller()
    poller.register(work_receiver, zmq.POLLIN)
    poller.register(control_receiver, zmq.POLLIN)

    # Loop and accept messages from both channels, acting accordingly
    while True:
        socks = dict(poller.poll())

        # If the message came from work_receiver channel, get the length
        # of the array and send the answer to the results reporter
        if socks.get(work_receiver) == zmq.POLLIN:
            #work_message = work_receiver.recv_json()
            work_message = work_receiver.recv_pyobj()
            length = len(work_message['num'][0])
            answer_message = { 'worker' : wrk_num, 'result' : length }
            results_sender.send_json(answer_message)

        # If the message came over the control channel, shut down the worker.
        if socks.get(control_receiver) == zmq.POLLIN:
            control_message = control_receiver.recv()
            if control_message == "FINISHED":
                print("Worker %i received FINISHED, quitting!" % wrk_num)
                break

# The "results_manager" function receives each result from multiple workers,
# and prints those results.  When all results have been received, it signals
# the worker processes to shut down.

def result_manager():
    # Initialize a zeromq context
    context = zmq.Context()

    # Set up a channel to receive results
    results_receiver = context.socket(zmq.PULL)
    results_receiver.bind("tcp://127.0.0.1:5558")

    # Set up a channel to send control commands
    control_sender = context.socket(zmq.PUB)
    control_sender.bind("tcp://127.0.0.1:5559")

    for task_nbr in range(np):
        result_message = results_receiver.recv_json()
        print("Worker %i answered: %i" % (result_message['worker'], result_message['result']))

    # Signal to all workers that we are finished
    control_sender.send("FINISHED")
    time.sleep(5)

if __name__ == "__main__":

    # Create a pool of workers to distribute work to
    for wrk_num in range(pool_size):
        Process(target=worker, args=(wrk_num,)).start()

    # Fire up our result manager...
    result_manager = Process(target=result_manager, args=())
    result_manager.start()

    # Start the ventilator!
    ventilator = Process(target=ventilator, args=())
    ventilator.start()
The problem is that the ventilator (PUSH) socket is closing before it has finished sending. At the end of the ventilator function you sleep for 1s, which is not long enough to send 384MB of messages. That is why you see the threshold you do; with a shorter sleep, the threshold would be lower.
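The 384MB figure follows directly from the script's constants: each work message pickles the whole payload of number_of_elements strings of string_length bytes each, and the ventilator sends one such message per worker. A quick sanity check of that arithmetic (pure stdlib, no ZeroMQ needed):

```python
number_of_elements = 128
string_length = 1024 * 1024 * 3  # 3 MiB per random string

# Raw payload per work message, before pickling overhead.
payload_bytes = number_of_elements * string_length
payload_mib = payload_bytes / (1024 * 1024)
print(payload_mib)  # 384.0
```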

That said, LINGER is supposed to prevent this sort of thing, so I would raise this with zeromq: PUSH does not appear to respect LINGER.

The fix for your particular example (short of adding an indeterminately long sleep) would be to terminate the ventilator with the same FINISHED signal the workers use. That way, you guarantee that the ventilator survives for as long as it needs to.
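The shutdown pattern described here, a sender that blocks on the FINISHED control signal instead of sleeping for a fixed time, can be sketched with stdlib queues and threads standing in for the ZeroMQ sockets and processes (all names below are hypothetical, not from the original script):

```python
import queue
import threading

def sender(work_q, control_q, items):
    # Push all work, then block until FINISHED arrives, so the sender
    # cannot exit (and tear down its channel) before delivery completes.
    for item in items:
        work_q.put(item)
    assert control_q.get() == "FINISHED"

def worker(work_q, results_q, n_items):
    # Report the length of each received work item.
    for _ in range(n_items):
        results_q.put(len(work_q.get()))

work_q, control_q, results_q = queue.Queue(), queue.Queue(), queue.Queue()
items = ["a" * 10, "b" * 20]
threads = [
    threading.Thread(target=sender, args=(work_q, control_q, items)),
    threading.Thread(target=worker, args=(work_q, results_q, len(items))),
]
for t in threads:
    t.start()
results = [results_q.get() for _ in items]  # the "results manager" side
control_q.put("FINISHED")  # only now is it safe for the sender to quit
for t in threads:
    t.join()
print(sorted(results))  # [10, 20]
```

Unlike a queue, a real PUSH socket buffers in the background, which is exactly why the sender exiting early can drop data; the blocking-on-FINISHED structure is the same in both settings.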

Revised ventilator:

def ventilator():
    # Initialize a zeromq context
    context = zmq.Context()

    # Set up a channel to send work
    ventilator_send = context.socket(zmq.PUSH)
    ventilator_send.bind("tcp://127.0.0.1:5557")

    # Set up a channel to receive control messages
    control_receiver = context.socket(zmq.SUB)
    control_receiver.connect("tcp://127.0.0.1:5559")
    control_receiver.setsockopt(zmq.SUBSCRIBE, "")

    # Give everything a second to spin up and connect
    time.sleep(1)

    # Create the input array
    nelem = number_of_elements
    slen = string_length
    payloads = create_inputs(nelem, slen)

    # Send an array to each worker
    for num in range(np):
        work_message = { 'num' : payloads }
        ventilator_send.send_pyobj(work_message)

    # Poll for FINISH message, so we don't shutdown too early
    poller = zmq.Poller()
    poller.register(control_receiver, zmq.POLLIN)

    while True:
        socks = dict(poller.poll())

        if socks.get(control_receiver) == zmq.POLLIN:
            control_message = control_receiver.recv()
            if control_message == "FINISHED":
                print("Ventilator received FINISHED, quitting!")
                break
            # else: unhandled message

Comments:

I did some more experimenting: I reduced number_of_elements to 64 and increased string_length to 6MB, and the code still ran fine; beyond that, the same symptoms showed up. This leads me to believe that there may be a total message size limit in the pyzmq bindings. The 0MQ C API has a zmq_msg_init_size(3) function that I cannot find in pyzmq's documentation. Could that be the cause?

Can you trace where it hangs? That might give you a hint.

I tried your code on my Mac laptop with string_length = 1024 * 1024 * 4 and it ran fine, so I guessed it must have to do with some kind of memory contention... then I ran it again and it froze... From 'top', free memory bounced around near 0, so it looks like 0mq is not optimized to handle messages of this size.

@Aaron Watters: I came to a conclusion similar to yours. But before pointing my finger at 0MQ itself, I will find some time to redo the above in C++. While quickly skimming the source, I noticed that even though pyzmq uses zmq_msg_init_size(), it does not expose it. I wonder whether the results would differ with that function available?

minrk, many thanks for the insightful answer. Very helpful! I did not suspect the ZMQ_LINGER value set via zmq_setsockopt(3), since, as you said, the default is -1 (infinite). Awesome! I will definitely raise this with the pyzmq folks first and mention it on the zeromq mailing list. I tested your fix with string_length set to 1024 * 1024 * 10, maxing out the laptop's physical memory, and still got the expected results. Thanks again!

Probably not worth raising it with "the pyzmq folks", since at this point that's basically me. I have pinged libzmq and written a simpler test case in C: