Configuring slurmd.service fails &amp; no PID file /var/run/slurmd.pid

I am trying to start slurmd.service with the commands below, but it never stays up for long. I would appreciate any help solving this problem.
systemctl start slurmd
scontrol update nodename=fwb-lab-tesla1 state=idle
Here is the slurmd service unit file:
cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
[Install]
WantedBy=multi-user.target
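With `Type=forking` and `PIDFile=`, systemd starts slurmd, expects it to daemonize, and then waits for /var/run/slurmd.pid to appear; if the file is never written before the start timeout expires (90 s by default, which matches the 18:09:55 to 18:11:25 window in the status output further down), the unit is terminated and marked failed. A minimal sketch of that wait loop, with a hypothetical path and a stand-in daemon rather than slurmd itself:

```shell
# Sketch of systemd's Type=forking + PIDFile= behaviour: start the service,
# then poll for the PID file until it appears or a timeout budget runs out.
pidfile=/tmp/demo-slurmd.pid            # hypothetical path for illustration
rm -f "$pidfile"

# Stand-in for the forking daemon: it writes its PID file shortly after start.
( sleep 1; echo 4242 > "$pidfile" ) &

found=no
for _ in $(seq 1 50); do                # ~10 s budget here; systemd's default is 90 s
    if [ -f "$pidfile" ]; then found=yes; break; fi
    sleep 0.2
done
echo "$found"
```

If the daemon never reaches the point of writing the file (as in this question, where slurmd exits after a bind error), the loop times out and systemd reports exactly the "Can't open PID file ... (yet?)" followed by "start operation timed out" seen below.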
Here is the state of the node:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpucompute* up infinite 1 drain fwb-lab-tesla1
$ sinfo -R
REASON USER TIMESTAMP NODELIST
Low RealMemory root 2020-09-28T16:46:28 fwb-lab-tesla1
$ sinfo -Nl
Thu Oct 1 14:00:10 2020
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
fwb-lab-tesla1 1 gpucompute* drained 32 32:1:1 64000 0 1 (null) Low RealMemory
Here is the content of slurm.conf:
$ cat /etc/slurm/slurm.conf
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=FWB-Lab-Tesla
#ControlAddr=137.72.38.102
#
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
#SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurm/StateSave
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# Prevent very long time waits for mix serial/parallel in multi node environment
SchedulerParameters=pack_serial_at_end
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/filetxt
# Need slurmdbd for gres functionality
#AccountingStorageTRES=CPU,Mem,gres/gpu,gres/gpu:Titan
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
#
#
# COMPUTE NODES
GresTypes=gpu
#NodeName=fwb-lab-tesla[1-32] Gres=gpu:4 RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
#PartitionName=compute Nodes=fwb-lab-tesla[1-32] Default=YES MaxTime=INFINITE State=UP
#NodeName=fwb-lab-tesla1 NodeAddr=137.73.38.102 Gres=gpu:4 RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
NodeName=fwb-lab-tesla1 NodeAddr=137.73.38.102 Gres=gpu:4 RealMemory=64000 CPUs=32 State=UNKNOWN
PartitionName=gpucompute Nodes=fwb-lab-tesla1 Default=YES MaxTime=INFINITE State=UP
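The node definition above claims `RealMemory=64000`. Slurm drains a node with the reason "Low RealMemory" (as `sinfo -R` shows here) when the memory slurmd actually detects on the host is less than the configured value; `slurmd -C` prints the detected hardware line slurmd would register. A minimal sketch of that comparison, with hypothetical numbers standing in for the real detected value:

```shell
# Hypothetical values for illustration: "configured" is RealMemory= from
# slurm.conf; "detected" is what `slurmd -C` would report on this node.
configured=64000
detected=63800

# Slurm drains the node when the detected memory falls short of the config.
if [ "$detected" -lt "$configured" ]; then
    reason="Low RealMemory"
else
    reason=""
fi
echo "${reason:-ok}"
```

The usual fix is to lower `RealMemory` in slurm.conf to at most the detected value, push the change out (`scontrol reconfigure`), and then clear the drain with `scontrol update nodename=fwb-lab-tesla1 state=resume`.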
There is no slurmd.pid in the path below. Right after the system starts it does appear here, but a few minutes later it is gone again:
$ ls /var/run/
abrt cryptsetup gdm lvm openvpn-server slurmctld.pid tuned
alsactl.pid cups gssproxy.pid lvmetad.pid plymouth sm-notify.pid udev
atd.pid dbus gssproxy.sock mariadb ppp spice-vdagentd user
auditd.pid dhclient-eno2.pid httpd mdadm rpcbind sshd.pid utmp
avahi-daemon dhclient.pid initramfs media rpcbind.sock sudo vpnc
certmonger dmeventd-client ipmievd.pid mount samba svnserve xl2tpd
chrony dmeventd-server lightdm munge screen sysconfig xrdp
console ebtables.lock lock netreport sepermit syslogd.pid xtables.lock
crond.pid faillock log NetworkManager setrans systemd
cron.reboot firewalld lsm openvpn-client setroubleshoot tmpfiles.d
[shirin@FWB-Lab-Tesla Seq2KMR33]$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2020-09-28 15:41:25 BST; 2 days ago
Main PID: 1492 (slurmctld)
CGroup: /system.slice/slurmctld.service
└─1492 /usr/sbin/slurmctld
Sep 28 15:41:25 FWB-Lab-Tesla systemd[1]: Starting Slurm controller daemon...
Sep 28 15:41:25 FWB-Lab-Tesla systemd[1]: Started Slurm controller daemon.
I try to start slurmd.service, but after a few minutes it goes back to the failed state:
$ systemctl status slurmd
● slurmd.service - Slurm node daemon
Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
Active: failed (Result: timeout) since Tue 2020-09-29 18:11:25 BST; 1 day 19h ago
Process: 25650 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
CGroup: /system.slice/slurmd.service
└─2986 /usr/sbin/slurmd
Sep 29 18:09:55 FWB-Lab-Tesla systemd[1]: Starting Slurm node daemon...
Sep 29 18:09:55 FWB-Lab-Tesla systemd[1]: Can't open PID file /var/run/slurmd.pid (yet?) after start: No ...ctory
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: slurmd.service start operation timed out. Terminating.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: Failed to start Slurm node daemon.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: Unit slurmd.service entered failed state.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: slurmd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
Log output from starting slurmd:
[2020-09-29T18:09:55.074] Message aggregation disabled
[2020-09-29T18:09:55.075] gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2020-09-29T18:09:55.075] gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2020-09-29T18:09:55.075] gpu device number 2(/dev/nvidia2):c 195:2 rwm
[2020-09-29T18:09:55.075] gpu device number 3(/dev/nvidia3):c 195:3 rwm
[2020-09-29T18:09:55.095] slurmd version 17.11.7 started
[2020-09-29T18:09:55.096] error: Error binding slurm stream socket: Address already in use
[2020-09-29T18:09:55.096] error: Unable to bind listen port (*:6818): Address already in use
The log file states that it cannot bind to the standard slurmd port 6818 because something else is already using that address.
Is another slurmd running on this node, or is something else listening there? Try
netstat -tulpen | grep 6818
to see what is using the address. Could you also share the slurmd log file (/var/log/slurm/slurmd.log)?

Thanks for your reply. I have attached the log file as the next reply, since comments here are limited in length; I managed to provide a link to the original file. Here is the link to /var/log/slurm/slurmd.log. Thanks for your help.

[root@FWB-Lab-Tesla shirin]# netstat -tulpen | grep 6818
tcp   0   0 0.0.0.0:6818   0.0.0.0:*   LISTEN   0   17955   2986/slurmd
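The netstat output above shows that an old slurmd instance (PID 2986, the same process already listed in the failed unit's cgroup) is still holding port 6818, so every fresh start fails to bind and never writes its PID file. A minimal sketch of pulling the owning PID out of such a line, using the sample line from the output above, so the stale process can be inspected or stopped:

```shell
# Sample netstat line, copied from the output above:
line='tcp   0   0 0.0.0.0:6818   0.0.0.0:*   LISTEN   0   17955   2986/slurmd'

# The last whitespace-separated field is "PID/program"; strip it down to the PID.
last=${line##* }   # "2986/slurmd"
pid=${last%%/*}    # "2986"
echo "$pid"
```

With the PID in hand, stopping that stray process (e.g. `kill 2986`) frees port 6818, after which `systemctl start slurmd` can bind, write /var/run/slurmd.pid, and stay up.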