C++ HPX minimal two-node example setup?


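The HPX getting-started guide assumes that you are using PBS or slurm. These may be quite common in the HPC community, but as a developer I am more used to the scenario of "here are a couple of machines you can install stuff on".

It is not immediately clear whether a scheduler like slurm is required to take advantage of multiple physical machines, or whether it is just a convenient way to manage a cluster.

I know you can simulate multiple localities using the -l flag when running an HPX application (the original post links an example). What I want is to run the same application on two nodes and have them communicate with each other.

What is the minimum required to tell HPX:
"Here is another machine with this IP address that you can send tasks to"?

Alternatively, what is the minimum slurm configuration needed to reach this stage?

Installing slurm was easy; finding a simple 2-node example was less so, though the page linked in the original post may help.

I am also assuming that the HPX parcelport will just work over TCP without installing anything extra (e.g. MPI). Is this correct?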


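Update

I think I am getting closer, but I am still missing something. First, I am using the hello_world example. Could it be that it is too simple for a 2-node test? I am hoping for output similar to running 2 localities on the same node: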

APP=$HPX/bin/hello_world
$APP --hpx:node 0 --hpx:threads 4 -l2 &
$APP --hpx:node 1 --hpx:threads 4 
Sample output:

hello world from OS-thread 2 on locality 0
hello world from OS-thread 0 on locality 0
hello world from OS-thread 1 on locality 1
hello world from OS-thread 3 on locality 1
hello world from OS-thread 2 on locality 1
hello world from OS-thread 1 on locality 0
hello world from OS-thread 0 on locality 1
hello world from OS-thread 3 on locality 0

I have opened port 7910 on both machines. The path to $APP is the same on both nodes. I am not sure how to test whether the second process is talking to the agas server.
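One way to check, independently of HPX, whether node1 can reach port 7910 on node0 at all is a plain TCP probe. The sketch below assumes that bash (for its /dev/tcp redirection) and the coreutils timeout command are installed; the function name port_open is made up for illustration. It only shows that something accepts connections on that port, not that the listener is the agas server.

```shell
# Hypothetical helper: succeed if a TCP connection to HOST:PORT can be opened
# within 2 seconds. Relies on bash's /dev/tcp pseudo-device and on timeout.
# usage: port_open HOST PORT
port_open() {
    timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

Run on node1 while the first process is up on node0, something like `port_open "$NODE0" 7910 && echo reachable` would separate a network or firewall problem from an HPX configuration problem.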

If I use "--hpx:debug-agas-log=agas.log" and "--hpx:debug-hpx-log=hpx.log", I get:

I had no luck either with what another (now deleted?) answer suggested.


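Update 2

I thought this might be a firewall issue, but even with the firewall disabled nothing seems to happen. I tried tracing the system calls, but nothing obvious shows up: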

echo "start server on agas master: node0=$NODE0"
strace -o node0.strace $APP \
 --hpx:localities=2 --hpx:agas=$NODE0:7910 --hpx:hpx=$NODE0:7910 --hpx:threads 4 &
cat agas.log hpx.log
echo "start worker on slave: node1=$NODE1"
ssh $NODE1 \
strace -o node1.strace $APP \
--hpx:worker --hpx:agas=$NODE0:7910 --hpx.hpx=$NODE1:7910 
echo "done"
exit 0
Tail of node0.strace:

15:13:31 bind(7, {sa_family=AF_INET, sin_port=htons(7910), sin_addr=inet_addr("172.29.0.160")}, 16) = 0
15:13:31 listen(7, 128) = 0
15:13:31 ioctl(7, FIONBIO, [1]) = 0
15:13:31 accept(7, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
...
15:13:32 mprotect(0x7f12b2bff000, 4096, PROT_NONE) = 0
15:13:32 clone(child_stack=0x7f12b33feef0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f12b33ff9d0, tls=0x7f12b33ff700, child_tidptr=0x7f12b33ff9d0) = 22394
15:13:32 futex(0x7ffe2c5df60c, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
15:13:32 futex(0x7ffe2c5df5e0, FUTEX_WAKE_PRIVATE, 1) = 0
15:13:32 futex(0x7ffe2c5df4b4, FUTEX_WAIT_PRIVATE, 1, NULL

Tail of node1.strace:

16829 15:13:32 bind(7, {sa_family=AF_INET, sin_port=htons(7910), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
16829 15:13:32 listen(7, 128) = 0
16829 15:13:32 ioctl(7, FIONBIO, [1]) = 0
16829 15:13:32 accept(7, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
16829 15:13:32 uname({sys="Linux", node="kmlwg-tddamstest3.grpitsrv.com", ...}) = 0
16829 15:13:32 eventfd2(0, O_NONBLOCK|O_CLOEXEC) = 8
16829 15:13:32 epoll_create1(EPOLL_CLOEXEC) = 9
16829 15:13:32 timerfd_create(CLOCK_MONOTONIC, 0x80000 /* TFD_??? */) = 10
16829 15:13:32 epoll_ctl(9, EPOLL_CTL_ADD, 8, {EPOLLIN|EPOLLERR|EPOLLET, {u32=124005464, u64=140359655238744}}) = 0
16829 15:13:32 write(8, "\1\0\0\0\0\0\0\0", 8) = 8
16829 15:13:32 epoll_ctl(9, EPOLL_CTL_ADD, 10, {EPOLLIN|EPOLLERR, {u32=124005476, u64=140359655238756}}) = 0
16829 15:13:32 futex(0x7fa8006f2d24, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fa8006f2d20, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
16830 15:13:32 ) = 0
16829 15:13:32 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP
16830 15:13:32 futex(0x7fa8076432f0, FUTEX_WAKE_PRIVATE, 1) = 0
16829 15:13:32 ) = 11
16830 15:13:32 epoll_wait(9,
16829 15:13:32 epoll_ctl(9, EPOLL_CTL_ADD, 11, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP|EPOLLET, {u32=124362176, u64=140359655595456}}
16830 15:13:32 {{EPOLLIN, {u32=124005464, u64=140359655238744}}}, 128, -1) = 1
16829 15:13:32 ) = 0
16830 15:13:32 epoll_wait(9,
16829 15:13:32 connect(11, {sa_family=AF_INET, sin_port=htons(7910), sin_addr=inet_addr("172.29.0.160")}, 16
16830 15:13:32 {{EPOLLHUP, {u32=124362176, u64=140359655595456}}}, 128, -1) = 1
16830 15:13:32 epoll_wait(9,

If I run strace -f on the master process, its child process loops doing the following:

22050 15:12:46 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 12
22050 15:12:46 epoll_ctl(5, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP|EPOLLET, {u32=2395115776, u64=140516545171712}}) = 0
22041 15:12:46 {{EPOLLHUP, {u32=2395115776, u64=140516545171712}}}, 128, -1) = 1
22050 15:12:46 connect(12, {sa_family=AF_INET, sin_port=htons(7910), sin_addr=inet_addr("127.0.0.1")}, 16
22041 15:12:46 epoll_wait(5,
22050 15:12:46 ) = -1 ECONNREFUSED (Connection refused)
22041 15:12:46 {{EPOLLHUP, {u32=2395115776, u64=140516545171712}}}, 128, -1) = 1
22050 15:12:46 futex(0x7fcc9cc20504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1703, {1455808366, 471644000}, ffffffff
22041 15:12:46 epoll_wait(5,
22050 15:12:46 ) = -1 ETIMEDOUT (Connection timed out)
22050 15:12:46 futex(0x7fcc9cc204d8, FUTEX_WAKE_PRIVATE, 1) = 0
22050 15:12:46 close(12) = 0
22050 15:12:46 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 12
22050 15:12:46 epoll_ctl(5, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP|EPOLLET, {u32=2395115776, u64=140516545171712}}) = 0
22050 15:12:46 connect(12, {sa_family=AF_INET, sin_port=htons(7910), sin_addr=inet_addr("127.0.0.1")}, 16
22041 15:12:46 {{EPOLLHUP, {u32=2395115776, u64=140516545171712}}}, 128, -1) = 1
22050 15:12:46 ) = -1 ECONNREFUSED (Connection refused)
22041 15:12:46 epoll_wait(5,
22050 15:12:46 futex(0x7fcc9cc20504, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1705, {1455808366, 572608000}, ffffffff
22041 15:12:46 {{EPOLLHUP, {u32=2395115776, u64=140516545171712}}}, 128, -1) = 1
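Traces like these are easier to compare once they are reduced to the addresses each process binds to and connects to. The following is a rough sketch using sed only; the function name strace_endpoints is hypothetical, and the pattern assumes the AF_INET line layout shown in the traces above (other strace versions may format these lines differently).

```shell
# Hypothetical helper: print "bind IP:PORT" or "connect IP:PORT" for every
# AF_INET bind()/connect() call found in an strace log. Unfinished calls
# (lines cut off after the sockaddr argument) are matched as well.
# usage: strace_endpoints FILE
strace_endpoints() {
    sed -nE 's/.*(bind|connect)\([0-9]+, [{]sa_family=AF_INET, sin_port=htons\(([0-9]+)\), sin_addr=inet_addr\("([^"]+)"\).*/\1 \3:\2/p' "$1"
}
```

Running `strace_endpoints node0.strace` and `strace_endpoints node1.strace` side by side shows at a glance whether each process listens on the address the other one tries to reach.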
locality 0:
./yourapp --hpx:localities=2 --hpx:agas=node0:7910 --hpx:hpx=node0:7910

locality 1:
./yourapp --hpx:agas=node0:7910 --hpx:hpx=node1:7910 --hpx:worker

./yourapp -l2 -0 &
./yourapp -1

mpirun -N1 -np2 ./yourapp

`bin/hello_world -l2 --hpx:agas=xx.xx.xx.AA:7910 --hpx:hpx=xx.xx.xx.AA:7910`
`bin/hello_world --hpx:agas=xx.xx.xx.AA:7910 --hpx:hpx=xx.xx.xx.BB:7910 --hpx:worker`
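The two explicit command lines above differ only in host names, so they can be generated rather than typed, which avoids small spelling slips. Note that the worker line in the question's script writes `--hpx.hpx` where the commands here write `--hpx:hpx`; whether that explains the 127.0.0.1 bind in the node1 trace is only a guess. The sketch below uses hypothetical function names (root_cmd, worker_cmd) and only the options that appear in the commands above.

```shell
# Hypothetical helpers that print the command line for each locality.
# They only assemble strings; nothing is executed.

# usage: root_cmd APP ROOT_HOST PORT NUM_LOCALITIES
root_cmd() {
    printf '%s --hpx:localities=%s --hpx:agas=%s:%s --hpx:hpx=%s:%s\n' \
        "$1" "$4" "$2" "$3" "$2" "$3"
}

# usage: worker_cmd APP ROOT_HOST WORKER_HOST PORT
worker_cmd() {
    printf '%s --hpx:agas=%s:%s --hpx:hpx=%s:%s --hpx:worker\n' \
        "$1" "$2" "$4" "$3" "$4"
}
```

For example, `root_cmd ./yourapp node0 7910 2` prints the locality 0 line shown above, and `worker_cmd ./yourapp node0 node1 7910` prints the locality 1 line.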