Python multiprocessing: losing counts after pool.join()?
I'm trying to solve a problem where I store the positions and counts of substrings of a given length. Since the strings can be long (genomic sequences), I'm trying to use multiple processes to speed things up. While the program runs, the variables storing the objects appear to lose all their information once the worker processes finish.
import numpy
import multiprocessing
from multiprocessing.managers import BaseManager, DictProxy
from collections import defaultdict, namedtuple, Counter
from functools import partial
import ctypes as c
class MyManager(BaseManager):
    pass

MyManager.register('defaultdict', defaultdict, DictProxy)

def gc_count(seq):
    return int(100 * ((seq.upper().count('G') + seq.upper().count('C') + 0.0) / len(seq)))

def getreads(length, table, counts, genome):
    genome_len = len(genome)
    for start in range(0, genome_len):
        gc = gc_count(genome[start:start+length])
        table[(length, gc)].append(start)
        counts[length, gc] += 1

if __name__ == "__main__":
    g = 'ACTACGACTACGACTACGCATCAGCACATACGCATACGCATCAACGACTACGCATACGACCATCAGATCACGACATCAGCATCAGCATCACAGCATCAGCATCAGCACTACAGCATCAGCATCAGCATCAG'
    genome_len = len(g)

    mgr = MyManager()
    mgr.start()
    m = mgr.defaultdict(list)

    mp_arr = multiprocessing.Array(c.c_double, 10*101)
    arr = numpy.frombuffer(mp_arr.get_obj())
    count = arr.reshape(10, 101)

    pool = multiprocessing.Pool(9)
    partial_getreads = partial(getreads, table=m, counts=count, genome=g)
    pool.map(partial_getreads, range(1, 10))
    pool.close()
    pool.join()

    for i in range(1, 10):
        for j in range(0, 101):
            print count[i, j]

    for i in range(1, 10):
        for j in range(0, 101):
            print len(m[(i, j)])
The loops at the end only print 0.0 for every element of count and 0 for every list in m, so somehow I'm losing all the counts. If I print the counts inside the getreads(...) function, I can see the values increasing. Conversely, printing len(table[(length, gc)]) in the body of getreads(...), or len(m[(i,j)]) in the main body, only gives 0.
You could also formulate the problem as a map-reduce problem, which would let you avoid sharing data between the processes at all (and I suspect it would speed up the computation). You would just need to return the resulting table and counts from the worker function (map) and merge the results from all the workers (reduce).
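A minimal sketch of that shape, reusing gc_count and the imports from the question (the name getreads_mr and the merge loop are illustrative, not from the original program):

def getreads_mr(length, genome):
    # map step: build purely process-local results, nothing shared
    table = defaultdict(list)
    counts = Counter()
    for start in range(len(genome)):
        gc = gc_count(genome[start:start+length])
        table[(length, gc)].append(start)
        counts[(length, gc)] += 1
    return table, counts

if __name__ == "__main__":
    # g is the genome string, as in the question
    pool = multiprocessing.Pool(9)
    results = pool.map(partial(getreads_mr, genome=g), range(1, 10))
    pool.close()
    pool.join()

    # reduce step: merge the per-process tables and counters
    table, counts = defaultdict(list), Counter()
    for t, cnt in results:
        for key, positions in t.items():
            table[key].extend(positions)
        counts.update(cnt)

Because nothing is shared, there are no proxies or locks to worry about; each worker just returns plain picklable objects.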
Back to your original question:
At the bottom of the managers section of the multiprocessing documentation there is a note about modifying mutable values or items in dict and list proxies. Basically, you need to reassign the modified object to the container proxy:
l = table[(length, gc)]
l.append(start)
table[(length, gc)] = l
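As a self-contained illustration of that behavior (a minimal, hypothetical example, not from the original post):

from multiprocessing.managers import BaseManager, DictProxy
from collections import defaultdict

class MyManager(BaseManager):
    pass

MyManager.register('defaultdict', defaultdict, DictProxy)

if __name__ == "__main__":
    mgr = MyManager()
    mgr.start()
    d = mgr.defaultdict(list)

    d['k'].append(1)      # appends to a local copy; the managed dict never sees it
    print(len(d['k']))    # 0

    l = d['k']
    l.append(1)
    d['k'] = l            # reassignment sends the update back through the proxy
    print(len(d['k']))    # 1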
There is also a related Stack Overflow post on this.
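The other problem is that the numpy view passed through partial(...) gets pickled into each worker, so every process increments its own private copy of count; that is why the version below rebuilds the view from the shared mp_arr buffer inside the worker. A minimal sketch of the copy behavior (bump and a are hypothetical names):

import multiprocessing
from functools import partial
import numpy

def bump(i, arr):
    arr[i] += 1          # mutates a pickled copy inside the worker

if __name__ == "__main__":
    a = numpy.zeros(4)
    pool = multiprocessing.Pool(2)
    pool.map(partial(bump, arr=a), range(4))
    pool.close()
    pool.join()
    print(a)             # still all zeros: each worker got its own copy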
Taking both of these into account, you could do the following:
def getreads(length, table, genome):
    genome_len = len(genome)
    # rebuild the numpy view from the shared buffer inside the worker;
    # mp_arr is the module-level shared Array, inherited by the forked workers
    arr = numpy.frombuffer(mp_arr.get_obj())
    counts = arr.reshape(10, 101)
    for start in range(0, genome_len):
        gc = gc_count(genome[start:start+length])
        l = table[(length, gc)]
        l.append(start)
        table[(length, gc)] = l
        counts[length, gc] += 1

if __name__ == "__main__":
    g = 'ACTACGACTACGACTACGCATCAGCACATACGCATACGCATCAACGACTACGCATACGACCATCAGATCACGACATCAGCATCAGCATCACAGCATCAGCATCAGCACTACAGCATCAGCATCAGCATCAG'
    genome_len = len(g)

    mgr = MyManager()
    mgr.start()
    m = mgr.defaultdict(list)

    mp_arr = multiprocessing.Array(c.c_double, 10*101)
    arr = numpy.frombuffer(mp_arr.get_obj())
    count = arr.reshape(10, 101)

    pool = multiprocessing.Pool(9)
    partial_getreads = partial(getreads, table=m, genome=g)
    pool.map(partial_getreads, range(1, 10))
    pool.close()
    pool.join()

    arr = numpy.frombuffer(mp_arr.get_obj())
    count = arr.reshape(10, 101)
Could you post the BaseManager class? Also, c isn't defined as of this line: mp_arr = multiprocessing.Array(c.c_double, 10*101)
Whoops, BaseManager and c are defined in the import statements. I definitely wouldn't do this, though: c = arr.reshape(10,101)
Yikes, agreed. Thanks, this worked! I ended up modifying the final program to be more map-reduce-y (and to not share data between processes), but this definitely answers my question. Thanks!