Is this the most I can get out of Python multiprocessing?

So I have data in a text file. Each line is a computation to be done. The file has about 100,000 lines.

First I load everything into RAM, then I have a method that performs the computation and returns the result:

def process(data_line):
    #do computation
    return result
Then I call it with packets of 2000 lines like this, and save the results to disk:

from multiprocessing import Pool

POOL_SIZE = 15 #nbcore - 1
PACKET_SIZE = 2000
pool = Pool(processes=POOL_SIZE)

data_lines = util.load_data_lines(to_be_computed_filename)
number_of_lines = len(data_lines)
number_of_packets = int(number_of_lines/ PACKET_SIZE)
for i in range(number_of_packets):
    lines_packet = data_lines[:PACKET_SIZE]
    data_lines = data_lines[PACKET_SIZE:]
    results = pool.map(process, lines_packet)
    save_computed_data_to_disk(to_be_computed_filename, results)

# process the last packet, which is smaller
results.extend(pool.map(process, data_lines))
save_computed_data_to_disk(to_be_computed_filename, results)
print("Done")
The problem is that while I'm writing to disk, my CPUs compute nothing, even though I have 8 cores. Looking at the task manager, it seems a lot of CPU time is being lost.

I have to write to disk after the computation finishes because the results are about 1000 times bigger than the input. Either way, I will have to write to disk at some point; if the time isn't lost here, it will be lost later.

What can I do to let one core write to disk while the other cores keep computing? Switch to C?

At this rate I can process 100 million lines in 75 hours, but I have 12 billion lines to process, so any improvement is welcome.

Timing example:

Processing packet 2/15 953 of C:/processing/drop_zone\to_be_processed_txt_files\t_to_compute_303620.txt
Lauching task and waiting for it to finish...
Task completed, Continuing
Packet was processed in 11.534576654434204 seconds
We are currently going at a rate of 0.002306915330886841 sec/words
Wich is 433.47928145051293 words per seconds
Saving in temporary file
Printing writing 5000 computed line to disk took 0.04400920867919922 seconds
saving word to resume from : 06 20 25 00 00
Estimated time for processing the remaining packets is : 51:19:25
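
A minimal sketch of one way to get that overlap, assuming the process, util.load_data_lines and save_computed_data_to_disk shown in the question: Pool.imap_unordered hands each result back as soon as a worker finishes it, so the main process can buffer and write packets to disk while the remaining workers keep computing.

from multiprocessing import Pool

PACKET_SIZE = 2000

data_lines = util.load_data_lines(to_be_computed_filename)

with Pool() as pool:  # no size given: defaults to one worker per core
    buffer = []
    # imap_unordered yields results in completion order while the pool works ahead,
    # so the disk writes below overlap with the ongoing computation
    for result in pool.imap_unordered(process, data_lines, chunksize=100):
        buffer.append(result)
        if len(buffer) >= PACKET_SIZE:
            save_computed_data_to_disk(to_be_computed_filename, buffer)
            buffer = []
    if buffer:
        save_computed_data_to_disk(to_be_computed_filename, buffer)
print("Done")

The chunksize=100 here is arbitrary; it batches the dispatch so that task-queue overhead stays small relative to the per-line work.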

Note: the SharedMemory used here requires Python >= 3.8, since that is when it first appeared.

Start 3 kinds of processes: Reader, Processors, Writer.

Have the Reader process read the file incrementally, sharing what it reads via shared_memory plus a Queue.

Have the Processors consume the queue, use the shared memory, and return the results through another queue, again as shared memory.

Have the Writer process consume the second queue and write to the destination file.

Have them all communicate with the main process through some Events or a DictProxy, with the main process acting as the orchestrator.


Example:
import time
import random
import hashlib
import multiprocessing as MP

from queue import Queue, Empty

# noinspection PyUnresolvedReferences
from multiprocessing.shared_memory import SharedMemory

from typing import Dict, List


def readerfunc(
    shm_arr: List[SharedMemory], q_out: Queue, procr_ready: Dict[str, bool]
):
    numshm = len(shm_arr)
    for batch in range(1, 6):
        print(f"Reading batch #{batch}")
        for shm in shm_arr:
            #### Simulated Reading ####
            for j in range(0, shm.size):
                shm.buf[j] = random.randint(0, 255)
            #### ####
            q_out.put((batch, shm))
        # Need to sync here because we're reusing the same SharedMemory,
        # so we have to wait until all processors are done before sending
        # the next batch
        while not q_out.empty() or not all(procr_ready.values()):
            time.sleep(1.0)


def processorfunc(
    q_in: Queue, q_out: Queue, suicide: type(MP.Event()), procr_ready: Dict[str, bool]
):
    pname = MP.current_process().name
    procr_ready[pname] = False
    while True:
        time.sleep(1.0)
        procr_ready[pname] = True
        if q_in.empty() and suicide.is_set():
            break
        try:
            batch, shm = q_in.get_nowait()
        except Empty:
            continue
        print(pname, "got batch", batch)
        procr_ready[pname] = False
        #### Simulated Processing ####
        h = hashlib.blake2b(shm.buf, digest_size=4, person=b"processor")
        time.sleep(random.uniform(5.0, 7.0))
        #### ####
        q_out.put((pname, h.hexdigest()))


def writerfunc(q_in: Queue, suicide: type(MP.Event())):
    while True:
        time.sleep(1.0)
        if q_in.empty() and suicide.is_set():
            break
        try:
            pname, digest = q_in.get_nowait()
        except Empty:
            continue
        print("Writing", pname, digest)
        #### Simulated Writing ####
        time.sleep(random.uniform(3.0, 6.0))
        #### ####
        print("Writing", pname, digest, "done")


def main():
    shm_arr = [
        SharedMemory(create=True, size=1024)
        for _ in range(0, 5)
    ]
    q_read = MP.Queue()
    q_write = MP.Queue()
    procr_ready = MP.Manager().dict()
    poison = MP.Event()  # poison pill: once set, processors and writer exit when their queues are empty
    poison.clear()

    reader = MP.Process(target=readerfunc, args=(shm_arr, q_read, procr_ready))
    procrs = []
    for n in range(0, 3):
        p = MP.Process(
            target=processorfunc, name=f"Proc{n}", args=(q_read, q_write, poison, procr_ready)
        )
        procrs.append(p)
    writer = MP.Process(target=writerfunc, args=(q_write, poison))

    reader.start()
    [p.start() for p in procrs]
    writer.start()

    reader.join()
    print("Reader has ended")

    while not all(procr_ready.values()):
        time.sleep(5.0)
    poison.set()
    [p.join() for p in procrs]
    print("Processors have ended")

    writer.join()
    print("Writer has ended")

    [shm.close() for shm in shm_arr]
    [shm.unlink() for shm in shm_arr]


if __name__ == "__main__":
    main()

The first thing that comes to mind with this code is to run the saving function in a thread, so that we remove the bottleneck of waiting for the disk write. Like this:

import concurrent.futures
from concurrent.futures import ThreadPoolExecutor, ALL_COMPLETED

executor = ThreadPoolExecutor(max_workers=2)
saving_futures = []
...
# inside the packet loop, instead of saving synchronously:
future = executor.submit(save_computed_data_to_disk, to_be_computed_filename, results)
saving_futures.append(future)
...
concurrent.futures.wait(saving_futures, return_when=ALL_COMPLETED)  # wait until everything is saved to disk after processing
print("Done")


You say you have 8 cores, but you have:

POOL_SIZE = 15 #nbcore - 1
Assuming you wanted to leave one processor free (presumably for the main process), why isn't that number 7? But why do you even want to keep a processor free? You are making successive calls to map, and while the main process waits for those calls to return it needs hardly any CPU. That's why, when you instantiate a pool without specifying its size, it defaults to the number of CPUs you have, not that number minus 1. I'll have more to say about this further down.
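
A minimal sketch of that claim (the burn function and the numbers are made up for illustration): time.process_time counts only the CPU time of the main process, so it stays near zero while map blocks, even though a default-sized pool is keeping every core busy.

import os
import time
from multiprocessing import Pool

def burn(n):
    # busy work so the pool workers actually load their cores
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    print("cores:", os.cpu_count())
    with Pool() as pool:                 # no size given: defaults to os.cpu_count() workers
        wall = time.perf_counter()
        cpu = time.process_time()        # CPU time of the main process only
        pool.map(burn, [2_000_000] * 32)
        print("wall time:", round(time.perf_counter() - wall, 2), "s")
        print("main-process CPU time:", round(time.process_time() - cpu, 2), "s")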

Since you have a very large list sitting in memory, could you be wasting cycles on every iteration of the loop rewriting that list? Instead, you could just take a slice of the list and pass it as the iterable argument to map:

from multiprocessing import Pool

POOL_SIZE = 15 # ????
PACKET_SIZE = 2000
data_lines = util.load_data_lines(to_be_computed_filename)
number_of_lines = len(data_lines)
number_of_packets, remainder = divmod(number_of_lines, PACKET_SIZE)
with Pool(processes=POOL_SIZE) as pool:
    offset = 0
    for i in range(number_of_packets):
        results = pool.map(process, data_lines[offset:offset+PACKET_SIZE])
        offset += PACKET_SIZE
        save_computed_data_to_disk(to_be_computed_filename, results)
    if remainder:
        results = pool.map(process, data_lines[offset:offset+remainder])
        save_computed_data_to_disk(to_be_computed_filename, results)
print("Done")
Between each call to map, the main process is writing the results to to_be_computed_filename. Meanwhile, the processes in the pool sit idle. The writing could instead be handed off to a separate thread:
import multiprocessing
from multiprocessing import Pool
import queue
import threading

POOL_SIZE = 15 # ????
PACKET_SIZE = 2000
data_lines = util.load_data_lines(to_be_computed_filename)
number_of_lines = len(data_lines)
number_of_packets, remainder = divmod(number_of_lines, PACKET_SIZE)

def save_data(q):
    while True:
        results = q.get()
        if results is None:
            return # signal to terminate
        save_computed_data_to_disk(to_be_computed_filename, results)

q = queue.Queue()
t = threading.Thread(target=save_data, args=(q,))
t.start()

with Pool(processes=POOL_SIZE) as pool:
    offset = 0
    for i in range(number_of_packets):
        results = pool.map(process, data_lines[offset:offset+PACKET_SIZE])
        offset += PACKET_SIZE
        q.put(results)
    if remainder:
        results = pool.map(process, data_lines[offset:offset+remainder])
        q.put(results)
q.put(None)
t.join() # wait for thread to terminate
print("Done")
import multiprocessing
from multiprocessing import Pool
import queue
import threading

POOL_SIZE = 15 # ????
PACKET_SIZE = 2000


def save_data(q):
    while True:
        results = q.get()
        if results is None:
            return # signal to terminate
        save_computed_data_to_disk(to_be_computed_filename, results)


def read_data():
    """
    yield lists of PACKET_SIZE
    """
    lines = []
    with open(some_file, 'r') as f:
        for line in iter(f.readline(), ''):
            lines.append(line)
            if len(lines) == PACKET_SIZE:
                yield lines
                lines = []
        if lines:
            yield lines

q = queue.Queue()
t = threading.Thread(target=save_data, args=(q,))
t.start()

with Pool(processes=POOL_SIZE) as pool:
    for l in read_data():
        results = pool.map(process, l)
        q.put(results)
q.put(None)
t.join() # wait for thread to terminate
print("Done")
import multiprocessing as mp

data_lines = [0]*10000 # read it from file
size = 2000

# Split the list into a list of lists (chunks of `size` lines each)
work = [data_lines[i:i + size] for i in range(0, len(data_lines), size)]

def process(data):
    result = len(data) # do something fancy
    return result

with mp.Pool() as p:
    result = p.map(process, work)

save_computed_data_to_disk(file_name, result)