Python multiprocessing and MPI multi-node
Here is my problem. I am adapting an existing Python multiprocessing code (which runs on a single node), with the goal of running it across multiple nodes using MPI for Python (mpi4py).
MPI communication between the two nodes is done only by the main thread of each MPI process (the thread that called MPI.Init(); you can find out which thread that is by calling MPI.Is_thread_main()). Unfortunately, this no longer works: once a Python multiprocessing process has been started, MPI communication stops working.
To illustrate the problem, I wrote a short code that reproduces exactly the same issue:
import os
import multiprocessing
import numpy as np
import time

from mpi4py import rc
rc.initialize = False  # if True, MPI.Init() is done when "from mpi4py import MPI" is called
rc.thread_level = 'funneled'
from mpi4py import MPI


def infiniteloop(arg):
    while True:
        print(arg)
        time.sleep(1)
        # Check whether the worker thinks it is the thread that called MPI.Init()
        print("Worker is Main Thread %s" % MPI.Is_thread_main())
        print("Rank %d on %s, Process PID for worker = %d"
              % (MPI.COMM_WORLD.Get_rank(), MPI.Get_processor_name(), os.getpid()))


if __name__ == '__main__':
    MPI.Init()
    # In the code I'm working on, MPI.Init() has to be done before the multiprocessing initialization
    proc = multiprocessing.Process(target=infiniteloop, args=('RunningWorker',))
    proc.start()
    print("MultiProcess Started")

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size_mpi = comm.Get_size()

    print("Running Main Thread")
    print("Main Thread is Main Thread %s" % MPI.Is_thread_main())
    print("Rank %d on %s, Process PID for main = %s"
          % (MPI.COMM_WORLD.Get_rank(), MPI.Get_processor_name(), os.getpid()))
    print("Rank %d on %s, rc.thread_level = %s"
          % (MPI.COMM_WORLD.Get_rank(), MPI.Get_processor_name(), rc.thread_level))
    time.sleep(1)

    # Start MPI communication: an example of 2D array communication which I know works
    print("Start MPI Transfert")
    # *************** Multiple SEND AND RECEIVE for a 2D array filled randomly
    SumPsfmean = None
    TransPsfmean = None
    TABSIZE = 100
    # Create a 100x100 array of random np.float64 values
    # (because it is really close to the case I'm interested in),
    # communicated row by row
    psfmean = np.random.rand(TABSIZE, TABSIZE)
    if rank == 0:
        print(psfmean.dtype)
    psfmean_shape = psfmean.shape
    if rank == 0:
        SumPsfmean = np.empty((size_mpi - 1, psfmean_shape[0], psfmean_shape[1]), dtype=np.float64)
        TransPsfmean = np.empty(psfmean[0].size, dtype=np.float64)
    for i in range(psfmean_shape[1]):
        print("Rank %d : Send&Receive nb %d" % (rank, i))
        if rank == 0:
            comm.Recv(TransPsfmean, source=1, tag=i)
        elif rank == 1:
            comm.Send(psfmean[i], dest=0, tag=i)
        print("End Send&Receive %d" % i)
        if rank == 0:
            for k in range(size_mpi):
                if k != 0:
                    SumPsfmean[k - 1][i] = TransPsfmean
    proc.join()
In this example, only 2 MPI processes are created, on 2 different nodes.
So, after initializing MPI, the main function creates and starts a Python process, and then the communication between the 2 MPI processes begins: MPI_rank_1 sends a 2D array row by row (100 rows), and MPI_rank_0 waits to receive each row.
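The row-by-row protocol and the receive-side buffers can be rehearsed without MPI at all. This is a minimal, MPI-free sketch of the same indexing (the local copy stands in for the matched Send/Recv of row i; TABSIZE is shrunk for the example):

```python
import numpy as np

def simulate_row_transfer(tabsize=4, size_mpi=2):
    """MPI-free rehearsal of the protocol above: rank 1's psfmean is
    copied one row at a time into slot 0 of rank 0's SumPsfmean."""
    psfmean = np.random.rand(tabsize, tabsize)            # "rank 1" data
    sum_psfmean = np.empty((size_mpi - 1, tabsize, tabsize))
    trans = np.empty(tabsize)                             # one-row buffer
    for i in range(tabsize):
        trans[:] = psfmean[i]     # stands in for Send/Recv of row i, tag=i
        sum_psfmean[0][i] = trans # SumPsfmean[k-1][i] with k == 1
    return bool(np.allclose(sum_psfmean[0], psfmean))
```

After the loop, slot 0 of the gathered buffer holds an exact copy of the sender's array.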
The output is:
[0] MultiProcess Stared
[0] Running Main Thread
[0] Main Thread is Main Thread : True
[0] Rank 0 on genji271, Process PID for main = 227040
[0] Rank 0 on genji271, rc.thread_level = funneled
[1] MultiProcess Stared
[1] Running Main Thread
[1] Main Thread is Main Thread : True
[1] Rank 1 on genji272, Process PID for main = 211028
[1] Rank 1 on genji272, rc.thread_level = funneled
[0] RunningWorker
[1] RunningWorker
[0] Start MPI Transfert
[0] float64
[1] Start MPI Transfert
[1] Rank 1 : Send&Receive nb 0
[1] End Send&Receive 0
[1] Rank 1 : Send&Receive nb 1
[1] End Send&Receive 1
[1] Rank 1 : Send&Receive nb 2
[1] End Send&Receive 2
[1] Rank 1 : Send&Receive nb 3
[1] End Send&Receive 3
[1] Rank 1 : Send&Receive nb 4
[1] End Send&Receive 4
[1] Rank 1 : Send&Receive nb 5
[1] End Send&Receive 5
[1] Rank 1 : Send&Receive nb 6
[1] End Send&Receive 6
[1] Rank 1 : Send&Receive nb 7
[1] End Send&Receive 7
[1] Rank 1 : Send&Receive nb 8
[1] End Send&Receive 8
[1] Rank 1 : Send&Receive nb 9
[1] End Send&Receive 9
[1] Rank 1 : Send&Receive nb 10
[1] End Send&Receive 10
[1] Rank 1 : Send&Receive nb 11
[1] End Send&Receive 11
[1] Rank 1 : Send&Receive nb 12
[1] End Send&Receive 12
[1] Rank 1 : Send&Receive nb 13
[1] End Send&Receive 13
[1] Rank 1 : Send&Receive nb 14
[1] End Send&Receive 14
[1] Rank 1 : Send&Receive nb 15
[0] Rank 0 : Send&Receive nb 0
[0] Worker is Main Thread : True
[0] Rank 0 on genji271, Process PID for worker = 227046
[0] RunningWorker
[1] Worker is Main Thread : True
[1] Rank 1 on genji272, Process PID for worker = 211033
[1] RunningWorker
[0] Worker is Main Thread : True
[0] Rank 0 on genji271, Process PID for worker = 227046
[0] RunningWorker
[1] Worker is Main Thread : True
[1] Rank 1 on genji272, Process PID for worker = 211033
[1] RunningWorker
[0] Worker is Main Thread : True
[0] Rank 0 on genji271, Process PID for worker = 227046
...
As you can see, both the worker and the main thread believe they are the thread that called MPI.Init().
Moreover, the MPI communication between the 2 MPI processes stalls (it works perfectly without the Python process, or when MPI.Init() is done after the process is created!!). In fact MPI_rank_0, which should receive the rows, gets stuck on the first iteration and never receives the first row.
I (think I) understand that a Python process is a kind of clone of the main thread (or at least shares/copies the main thread's memory when the process is created). So is it possible that MPI cannot see the difference between the main thread and its clone (even though they have different PIDs!!)? Or maybe I'm doing something wrong.
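That "clone" intuition can be checked with plain multiprocessing. In this minimal, MPI-free sketch, a module-level STATE dict merely stands in for the library's internal "I am the thread that called MPI.Init()" record; with the fork start method (the Linux default) the child inherits a copy of the parent's memory, including that record:

```python
import multiprocessing

# Module-level state set in the parent before forking; stands in for the
# internal flag that MPI.Is_thread_main() consults.
STATE = {"initialized_in_parent": False}

def report(q):
    # The forked child sees the parent's pre-fork memory, so this is True
    # even though the child never ran the initialization itself.
    q.put(STATE["initialized_in_parent"])

def demo():
    STATE["initialized_in_parent"] = True       # stands in for MPI.Init()
    ctx = multiprocessing.get_context("fork")   # explicit: Linux default
    q = ctx.Queue()
    p = ctx.Process(target=report, args=(q,))
    p.start()
    inherited = q.get()
    p.join()
    return inherited
```

demo() returns True: the child "remembers" an initialization it never performed, which is consistent with both processes reporting Is_thread_main() == True.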
Can anyone help me? I would be very grateful, and can share more information about my problem.

From the documentation, it seems that
multiprocessing
does indeed start a new process (as opposed to a new thread), which may well confuse Is_thread_main(). That said, if you fork a new process after MPI_Init(),
bad things can happen, so you probably don't want to do that in the first place.
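Following that reasoning, one workaround is to start the worker process before MPI.Init(), so the fork happens while the MPI library is still uninitialized, and to keep all MPI calls out of the worker. This is only a sketch, not a tested fix: the import guard and the finite worker loop are there solely so the sketch can be dry-run on a machine without mpi4py installed; on the cluster the guarded import succeeds and the MPI branch runs.

```python
import time
import multiprocessing

# Guarded import so the sketch can be dry-run without mpi4py; on the
# cluster this import succeeds and MPI is used normally.
try:
    from mpi4py import rc
    rc.initialize = False        # defer MPI.Init() until after the fork
    rc.thread_level = 'funneled'
    from mpi4py import MPI
except ImportError:
    MPI = None

def worker(arg):
    # No MPI calls in here: the child process must not touch the library.
    for _ in range(3):           # finite loop, only for the sketch
        print(arg)
        time.sleep(0.1)

def main():
    # Key change vs. the question: fork the worker BEFORE MPI.Init(),
    # while the MPI library is still uninitialized in this process.
    proc = multiprocessing.Process(target=worker, args=('RunningWorker',))
    proc.start()
    if MPI is None:              # dry-run path without mpi4py
        proc.join()
        return 'no-mpi'
    MPI.Init()
    comm = MPI.COMM_WORLD
    # ... the Send/Recv communication from the question goes here,
    # performed only by this main thread ...
    proc.join()
    MPI.Finalize()
    return 'done'
```

The worker cannot call MPI.Is_thread_main() or any other MPI function in this layout, so whatever state the worker needs must be passed through multiprocessing primitives (Queue, Pipe) instead.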