Is this the most I can get out of Python multiprocessing?

So I have data in a text file. Each line is a computation to be done. The file has about 100,000 lines.

First I load everything into RAM, then I have a method that performs the computation and returns the result:

def process(data_line):
    #do computation
    return result
Then I call it with packets of 2000 lines like this, and save the results to disk:

from multiprocessing import Pool

POOL_SIZE = 15 #nbcore - 1
PACKET_SIZE = 2000
pool = Pool(processes=POOL_SIZE)

data_lines = util.load_data_lines(to_be_computed_filename)
number_of_lines = len(data_lines)
number_of_packets = int(number_of_lines/ PACKET_SIZE)
for i in range(number_of_packets):
    lines_packet = data_lines[:PACKET_SIZE]
    data_lines = data_lines[PACKET_SIZE:]
    results = pool.map(process, lines_packet)
    save_computed_data_to_disk(to_be_computed_filename, results)

# process the last packet, which is smaller
results.extend(pool.map(process, data_lines))
save_computed_data_to_disk(to_be_computed_filename, results)
print("Done")
The problem is that while I'm writing to disk, my CPUs compute nothing, even though I have 8 cores. Looking at the task manager, it seems a lot of CPU time is being lost.

I have to write to disk after the computation finishes because the results are about 1000 times bigger than the input. Either way, I will have to write to disk at some point; if the time isn't lost here, it will be lost later.

What can I do to let one core write to disk while the other cores keep computing? Switch to C?

At this rate I can process 100 million lines in 75 hours, but I have 12 billion lines to process, so any improvement is welcome.

Timing example:

Processing packet 2/15 953 of C:/processing/drop_zone\to_be_processed_txt_files\t_to_compute_303620.txt
Lauching task and waiting for it to finish...
Task completed, Continuing
Packet was processed in 11.534576654434204 seconds
We are currently going at a rate of 0.002306915330886841 sec/words
Wich is 433.47928145051293 words per seconds
Saving in temporary file
Printing writing 5000 computed line to disk took 0.04400920867919922 seconds
saving word to resume from : 06 20 25 00 00
Estimated time for processing the remaining packets is : 51:19:25
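
A minimal sketch of one way to get that overlap, assuming the process, util.load_data_lines and save_computed_data_to_disk shown in the question: Pool.imap_unordered hands each result back as soon as a worker finishes it, so the main process can buffer and write packets to disk while the remaining workers keep computing.

from multiprocessing import Pool

PACKET_SIZE = 2000

data_lines = util.load_data_lines(to_be_computed_filename)

with Pool() as pool:  # no size given: defaults to one worker per core
    buffer = []
    # imap_unordered yields results in completion order while the pool works ahead,
    # so the disk writes below overlap with the ongoing computation
    for result in pool.imap_unordered(process, data_lines, chunksize=100):
        buffer.append(result)
        if len(buffer) >= PACKET_SIZE:
            save_computed_data_to_disk(to_be_computed_filename, buffer)
            buffer = []
    if buffer:
        save_computed_data_to_disk(to_be_computed_filename, buffer)
print("Done")

The chunksize=100 here is arbitrary; it batches the dispatch so that task-queue overhead stays small relative to the per-line work.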

Note: the SharedMemory used here requires Python >= 3.8, since that is when it first appeared.

Start 3 kinds of processes: Reader, Processors, Writer.

Have the Reader process read the file incrementally, sharing what it reads via shared_memory plus a Queue.

Have the Processors consume the queue, use the shared memory, and return the results through another queue, again as shared memory.

Have the Writer process consume the second queue and write to the destination file.

Have them all communicate with the main process through some Events or a DictProxy, with the main process acting as the orchestrator.


Example:
import time
import random
import hashlib
import multiprocessing as MP

from queue import Queue, Empty

# noinspection PyUnresolvedReferences
from multiprocessing.shared_memory import SharedMemory

from typing import Dict, List


def readerfunc(
    shm_arr: List[SharedMemory], q_out: Queue, procr_ready: Dict[str, bool]
):
    numshm = len(shm_arr)
    for batch in range(1, 6):
        print(f"Reading batch #{batch}")
        for shm in shm_arr:
            #### Simulated Reading ####
            for j in range(0, shm.size):
                shm.buf[j] = random.randint(0, 255)
            #### ####
            q_out.put((batch, shm))
        # Need to sync here because we're reusing the same SharedMemory,
        # so we have to wait until all processors are done before sending
        # the next batch
        while not q_out.empty() or not all(procr_ready.values()):
            time.sleep(1.0)


def processorfunc(
    q_in: Queue, q_out: Queue, suicide: type(MP.Event()), procr_ready: Dict[str, bool]
):
    pname = MP.current_process().name
    procr_ready[pname] = False
    while True:
        time.sleep(1.0)
        procr_ready[pname] = True
        if q_in.empty() and suicide.is_set():
            break
        try:
            batch, shm = q_in.get_nowait()
        except Empty:
            continue
        print(pname, "got batch", batch)
        procr_ready[pname] = False
        #### Simulated Processing ####
        h = hashlib.blake2b(shm.buf, digest_size=4, person=b"processor")
        time.sleep(random.uniform(5.0, 7.0))
        #### ####
        q_out.put((pname, h.hexdigest()))


def writerfunc(q_in: Queue, suicide: type(MP.Event())):
    while True:
        time.sleep(1.0)
        if q_in.empty() and suicide.is_set():
            break
        try:
            pname, digest = q_in.get_nowait()
        except Empty:
            continue
        print("Writing", pname, digest)
        #### Simulated Writing ####
        time.sleep(random.uniform(3.0, 6.0))
        #### ####
        print("Writing", pname, digest, "done")


def main():
    shm_arr = [
        SharedMemory(create=True, size=1024)
        for _ in range(0, 5)
    ]
    q_read = MP.Queue()
    q_write = MP.Queue()
    procr_ready = MP.Manager().dict()
    poison = MP.Event()  # poison pill: once set, processors and writer exit when their queues are empty
    poison.clear()

    reader = MP.Process(target=readerfunc, args=(shm_arr, q_read, procr_ready))
    procrs = []
    for n in range(0, 3):
        p = MP.Process(
            target=processorfunc, name=f"Proc{n}", args=(q_read, q_write, poison, procr_ready)
        )
        procrs.append(p)
    writer = MP.Process(target=writerfunc, args=(q_write, poison))

    reader.start()
    [p.start() for p in procrs]
    writer.start()

    reader.join()
    print("Reader has ended")

    while not all(procr_ready.values()):
        time.sleep(5.0)
    poison.set()
    [p.join() for p in procrs]
    print("Processors have ended")

    writer.join()
    print("Writer has ended")

    [shm.close() for shm in shm_arr]
    [shm.unlink() for shm in shm_arr]


if __name__ == "__main__":
    main()

The first thing that comes to mind with this code is to run the saving function in a thread, so that we remove the bottleneck of waiting for the disk write. Like this:

import concurrent.futures
from concurrent.futures import ThreadPoolExecutor, ALL_COMPLETED

executor = ThreadPoolExecutor(max_workers=2)
saving_futures = []
...
# inside the packet loop, instead of saving synchronously:
future = executor.submit(save_computed_data_to_disk, to_be_computed_filename, results)
saving_futures.append(future)
...
concurrent.futures.wait(saving_futures, return_when=ALL_COMPLETED)  # wait until everything is saved to disk after processing
print("Done")


You say you have 8 cores, but you have:

POOL_SIZE = 15 #nbcore - 1
Assuming you wanted to leave one processor free (presumably for the main process), why isn't that number 7? But why do you even want to keep a processor free? You are making successive calls to map, and while the main process waits for those calls to return it needs hardly any CPU. That's why, when you instantiate a pool without specifying its size, it defaults to the number of CPUs you have, not that number minus 1. I'll have more to say about this further down.
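
A minimal sketch of that claim (the burn function and the numbers are made up for illustration): time.process_time counts only the CPU time of the main process, so it stays near zero while map blocks, even though a default-sized pool is keeping every core busy.

import os
import time
from multiprocessing import Pool

def burn(n):
    # busy work so the pool workers actually load their cores
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    print("cores:", os.cpu_count())
    with Pool() as pool:                 # no size given: defaults to os.cpu_count() workers
        wall = time.perf_counter()
        cpu = time.process_time()        # CPU time of the main process only
        pool.map(burn, [2_000_000] * 32)
        print("wall time:", round(time.perf_counter() - wall, 2), "s")
        print("main-process CPU time:", round(time.process_time() - cpu, 2), "s")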

Since you have a very large list sitting in memory, could you be wasting cycles on every iteration of the loop rewriting that list? Instead, you could just take a slice of the list and pass it as the iterable argument to map:

from multiprocessing import Pool

POOL_SIZE = 15 # ????
PACKET_SIZE = 2000
data_lines = util.load_data_lines(to_be_computed_filename)
number_of_lines = len(data_lines)
number_of_packets, remainder = divmod(number_of_lines, PACKET_SIZE)
with Pool(processes=POOL_SIZE) as pool:
    offset = 0
    for i in range(number_of_packets):
        results = pool.map(process, data_lines[offset:offset+PACKET_SIZE])
        offset += PACKET_SIZE
        save_computed_data_to_disk(to_be_computed_filename, results)
    if remainder:
        results = pool.map(process, data_lines[offset:offset+remainder])
        save_computed_data_to_disk(to_be_computed_filename, results)
print("Done")
Between each call to map, the main process is writing the results to to_be_computed_filename. Meanwhile, the processes in the pool sit idle. The writing could instead be handed off to a separate thread:
import multiprocessing
from multiprocessing import Pool
import queue
import threading

POOL_SIZE = 15 # ????
PACKET_SIZE = 2000
data_lines = util.load_data_lines(to_be_computed_filename)
number_of_lines = len(data_lines)
number_of_packets, remainder = divmod(number_of_lines, PACKET_SIZE)

def save_data(q):
    while True:
        results = q.get()
        if results is None:
            return # signal to terminate
        save_computed_data_to_disk(to_be_computed_filename, results)

q = queue.Queue()
t = threading.Thread(target=save_data, args=(q,))
t.start()

with Pool(processes=POOL_SIZE) as pool:
    offset = 0
    for i in range(number_of_packets):
        results = pool.map(process, data_lines[offset:offset+PACKET_SIZE])
        offset += PACKET_SIZE
        q.put(results)
    if remainder:
        results = pool.map(process, data_lines[offset:offset+remainder])
        q.put(results)
q.put(None)
t.join() # wait for thread to terminate
print("Done")
import multiprocessing
from multiprocessing import Pool
import queue
import threading

POOL_SIZE = 15 # ????
PACKET_SIZE = 2000


def save_data(q):
    while True:
        results = q.get()
        if results is None:
            return # signal to terminate
        save_computed_data_to_disk(to_be_computed_filename, results)


def read_data():
    """
    yield lists of PACKET_SIZE
    """
    lines = []
    with open(some_file, 'r') as f:
        for line in iter(f.readline(), ''):
            lines.append(line)
            if len(lines) == PACKET_SIZE:
                yield lines
                lines = []
        if lines:
            yield lines

q = queue.Queue()
t = threading.Thread(target=save_data, args=(q,))
t.start()

with Pool(processes=POOL_SIZE) as pool:
    for l in read_data():
        results = pool.map(process, l)
        q.put(results)
q.put(None)
t.join() # wait for thread to terminate
print("Done")
import multiprocessing as mp

data_lines = [0]*10000 # read it from file
size = 2000

# Split the list into a list of lists (chunks of `size` lines each)
work = [data_lines[i:i + size] for i in range(0, len(data_lines), size)]

def process(data):
    result = len(data) # do something fancy
    return result

with mp.Pool() as p:
    result = p.map(process, work)

save_computed_data_to_disk(file_name, result)