Windows MS MPI permission errors


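I have two machines with MSMPI 7.1 installed, one called SERVER and the other called COMPUTE. The machines are set up in a simple Windows workgroup on a LAN (no DA), and both have an account with the same name and password.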

Both are running the MSMPILaunchSvc service. Both machines can execute MPI jobs locally, which I verified by testing with the hostname command:

SERVER> mpiexec -hosts 1 SERVER 1 hostname
SERVER
or
COMPUTE> mpiexec -hosts 1 COMPUTE 1 hostname
COMPUTE

Each command was run on the terminal of the machine itself.

I have disabled the firewall on both machines to keep things simple.

My problem is that I cannot get MPI to run jobs from SERVER on a remote host:

1: SERVER with MSMPILaunchSvc -> COMPUTE with MSMPILaunchSvc

SERVER> mpiexec -hosts 1 COMPUTE 1 hostname -pwd
ERROR: Failed RpcCliCreateContext error 1722

Aborting: mpiexec on SERVER is unable to connect to the smpd service on COMPUTE:8677
Other MPI error, error stack:
connect failed - The RPC server is unavailable.  (errno 1722)
COMPUTE> mpiexec -hosts 1 SERVER 1 hostname -pwd
ERROR: Failed RpcCliCreateContext error 5

Aborting: mpiexec on COMPUTE is unable to connect to the smpd service on SERVER:8677
Other MPI error, error stack:
connect failed - Access is denied.  (errno 5)
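
Even more frustrating, I am only prompted for a password some of the time. The prompt suggests SERVER\Maarten as the user for COMPUTE. That is the account I am already logged in as on SERVER, and it should not exist on COMPUTE (shouldn't it be COMPUTE\Maarten?). Either way, it also fails: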

SERVER>mpiexec -hosts 1 COMPUTE 1 hostname.exe -pwd
Enter Password for SERVER\Maarten:
Save Credentials[y|n]? n
ERROR: Failed to connect to SMPD Manager Instance error 1726

Aborting: mpiexec on SERVER is unable to connect to the 
smpd manager on COMPUTE:50915 error 1726

2: COMPUTE with MSMPILaunchSvc -> SERVER with MSMPILaunchSvc

SERVER> mpiexec -hosts 1 COMPUTE 1 hostname -pwd
ERROR: Failed RpcCliCreateContext error 1722

Aborting: mpiexec on SERVER is unable to connect to the smpd service on COMPUTE:8677
Other MPI error, error stack:
connect failed - The RPC server is unavailable.  (errno 1722)
COMPUTE> mpiexec -hosts 1 SERVER 1 hostname -pwd
ERROR: Failed RpcCliCreateContext error 5

Aborting: mpiexec on COMPUTE is unable to connect to the smpd service on SERVER:8677
Other MPI error, error stack:
connect failed - Access is denied.  (errno 5)

3: COMPUTE with MSMPILaunchSvc -> SERVER with smpd daemon

Aborting: mpiexec on COMPUTE is unable to connect to the smpd service on  SERVER:8677
Other MPI error, error stack:
connect failed - Access is denied.  (errno 5)
ERROR: Failed to connect to SMPD Manager Instance error 1726

Aborting: mpiexec on SERVER is unable to connect to the smpd manager on 
COMPUTE:51022 error 1726

4: SERVER with MSMPILaunchSvc -> COMPUTE with smpd daemon

Aborting: mpiexec on COMPUTE is unable to connect to the smpd service on  SERVER:8677
Other MPI error, error stack:
connect failed - Access is denied.  (errno 5)
ERROR: Failed to connect to SMPD Manager Instance error 1726

Aborting: mpiexec on SERVER is unable to connect to the smpd manager on 
COMPUTE:51022 error 1726

Update:

When I try to use the smpd daemon on both nodes, I get the following errors:

[-1:9796] Authentication completed. Successfully obtained Context for Client.
[-1:9796] version check complete, using PMP version 3.
[-1:9796] create manager process (using smpd daemon credentials)
[-1:9796] smpd reading the port string from the manager
[-1:9848] Launching smpd manager instance.
[-1:9848] created set for manager listener, 376
[-1:9848] smpd manager listening on port 51149
[-1:9796] closing the pipe to the manager
[-1:9848] Authentication completed. Successfully obtained Context for Client.
[-1:9848] Authorization completed.
[-1:9848] version check complete, using PMP version 3.
[-1:9848] Received session header from parent id=1, parent=0, level=0
[01:9848] Connecting back to parent using host SERVER and endpoint 17979
[01:9848] Previous attempt failed with error 5, trying to authenticate without Kerberos
[01:9848] Failed to connect back to parent error 5.
[01:9848] ERROR: Failed to connect back to parent 'ncacn_ip_tcp:SERVER:17979' error 5
[01:9848] smpd manager successfully stopped listening.
[01:9848] SMPD exiting with error code 4294967293.

On the host:

[-1:12264] Launching SMPD service.
[-1:12264] smpd listening on port 8677
[-1:12264] Authentication completed. Successfully obtained Context for Client.
[-1:12264] version check complete, using PMP version 3.
[-1:12264] create manager process (using smpd daemon credentials)
[-1:12264] smpd reading the port string from the manager
[-1:16668] Launching smpd manager instance.
[-1:16668] created set for manager listener, 364
[-1:16668] smpd manager listening on port 18033
[-1:12264] closing the pipe to the manager
[-1:16668] Authentication completed. Successfully obtained Context for Client.
[-1:16668] Authorization completed.
[-1:16668] version check complete, using PMP version 3.
[-1:16668] Received session header from parent id=1, parent=0, level=0
[01:16668] Connecting back to parent using host SERVER and endpoint 18031
[01:16668] Authentication completed. Successfully obtained Context for Client.
[01:16668] Authorization completed.
[01:16668] handling command SMPD_CONNECT src=0
[01:16668] now connecting to COMPUTE
[01:16668] 1 -> 2 : returning SMPD_CONTEXT_LEFT_CHILD
[01:16668] using spn msmpi/COMPUTE to contact server
[01:16668] SERVER posting a re-connect to COMPUTE:51161 in left child context.
[01:16668] ERROR: Failed to connect to SMPD Manager Instance error 1726
[01:16668] sending abort command to parent context.
[01:16668] posting command SMPD_ABORT to parent, src=1, dest=0.
[01:16668] ERROR: smpd running on SERVER is unable to connect to smpd service on COMPUTE:8677
[01:16668] Handling cmd=SMPD_ABORT result
[01:16668] cmd=SMPD_ABORT result will be handled locally
[01:16668] parent terminated unexpectedly - initiating cleaning up.
[01:16668] no child processes to kill - exiting with error code -1
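
The numeric codes in the output above are standard Windows status codes: 5 is ERROR_ACCESS_DENIED, 1722 is RPC_S_SERVER_UNAVAILABLE and 1726 is RPC_S_CALL_FAILED. The exit code 4294967293 looks like a negative value printed as an unsigned 32-bit number. Below is a minimal Python sketch for pulling these codes out of a saved log; the helper names `error_codes` and `as_signed32` are made up for illustration and are not part of MS-MPI.

```python
import re

# Windows status codes that appear in the mpiexec/smpd output above.
KNOWN_CODES = {
    5: "ERROR_ACCESS_DENIED (Access is denied)",
    1722: "RPC_S_SERVER_UNAVAILABLE (The RPC server is unavailable)",
    1726: "RPC_S_CALL_FAILED (The remote procedure call failed)",
}

# Matches "errno 1722", "error code 4294967293" and "error 5" (case-sensitive,
# so the "ERROR:" prefix of a log line is not treated as a code).
_CODE_RE = re.compile(r"(?:errno|error code|error)\s+(-?\d+)")


def as_signed32(value):
    """Reinterpret an unsigned 32-bit exit code as a signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def error_codes(log_text):
    """Return the distinct numeric error codes found in log_text, sorted."""
    return sorted({int(code) for code in _CODE_RE.findall(log_text)})


def describe(code):
    """Return a readable description, falling back to the signed value."""
    return KNOWN_CODES.get(code, "unknown code, signed value %d" % as_signed32(code))
```

Running `error_codes` over the smpd debug output and passing each result to `describe` separates the authentication failures (error 5) from the connection failures (1722, 1726), which makes the different test cases easier to compare.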

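After some trial and error, I found that these and other non-specific errors show up when running MS MPI across different configurations (in my case, a mix of HPC Cluster 2008 and HPC Cluster 2012 with MSMPI).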

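The solution was to downgrade all nodes to Windows Server 2008 R2 with HPC Cluster 2008. Because I do not use AD, I had to fall back to the SMPD daemon and add firewall rules for it (while skipping the cluster management tools).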
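
Once firewall rules are in place, it can help to confirm that the smpd port is reachable from the other node before launching a job. The following is a minimal Python sketch, assuming smpd listens on TCP port 8677 as shown in the logs above; `port_open` is a hypothetical helper, not an MS-MPI tool.

```python
import socket


def port_open(host, port=8677, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, `port_open("COMPUTE")` run on SERVER checks the path that failed with error 1722. A successful check on 8677 is not sufficient on its own: the logs also show the smpd manager listening on other ports (for example 50915 and 51022), so the firewall rules have to allow those connections as well.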