slurmd.service fails to start & there is no PID file /var/run/slurmd.pid


I am trying to start slurmd.service with the commands below, but I cannot get it to stay running. I would appreciate any help with this problem.

systemctl start slurmd
scontrol update nodename=fwb-lab-tesla1 state=idle
Here is the slurmd service unit file:

 cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity


[Install]
WantedBy=multi-user.target
Here is the state of the node:

$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpucompute*    up   infinite      1  drain fwb-lab-tesla1

$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Low RealMemory       root      2020-09-28T16:46:28 fwb-lab-tesla1

$ sinfo -Nl
Thu Oct  1 14:00:10 2020
NODELIST        NODES   PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
fwb-lab-tesla1      1 gpucompute*     drained   32   32:1:1  64000        0      1   (null) Low RealMemory  
Here is the content of slurm.conf:

$ cat /etc/slurm/slurm.conf
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=FWB-Lab-Tesla
#ControlAddr=137.72.38.102
#
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
#SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurm/StateSave
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# Prevent very long time waits for mix serial/parallel in multi node environment 
SchedulerParameters=pack_serial_at_end
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/filetxt
# Need slurmdbd for gres functionality
#AccountingStorageTRES=CPU,Mem,gres/gpu,gres/gpu:Titan
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
#
#
# COMPUTE NODES
GresTypes=gpu
#NodeName=fwb-lab-tesla[1-32] Gres=gpu:4 RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
#PartitionName=compute Nodes=fwb-lab-tesla[1-32] Default=YES MaxTime=INFINITE State=UP
#NodeName=fwb-lab-tesla1 NodeAddr=137.73.38.102 Gres=gpu:4 RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
NodeName=fwb-lab-tesla1 NodeAddr=137.73.38.102 Gres=gpu:4 RealMemory=64000 CPUs=32 State=UNKNOWN
PartitionName=gpucompute Nodes=fwb-lab-tesla1 Default=YES MaxTime=INFINITE State=UP
There is no slurmd.pid in the directory below. It appears there briefly after I start the service, but it disappears again a few minutes later:

$ ls /var/run/
abrt          cryptsetup         gdm            lvm             openvpn-server  slurmctld.pid   tuned
alsactl.pid   cups               gssproxy.pid   lvmetad.pid     plymouth        sm-notify.pid   udev
atd.pid       dbus               gssproxy.sock  mariadb         ppp             spice-vdagentd  user
auditd.pid    dhclient-eno2.pid  httpd          mdadm           rpcbind         sshd.pid        utmp
avahi-daemon  dhclient.pid       initramfs      media           rpcbind.sock    sudo            vpnc
certmonger    dmeventd-client    ipmievd.pid    mount           samba           svnserve        xl2tpd
chrony        dmeventd-server    lightdm        munge           screen          sysconfig       xrdp
console       ebtables.lock      lock           netreport       sepermit        syslogd.pid     xtables.lock
crond.pid     faillock           log            NetworkManager  setrans         systemd
cron.reboot   firewalld          lsm            openvpn-client  setroubleshoot  tmpfiles.d
[shirin@FWB-Lab-Tesla Seq2KMR33]$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2020-09-28 15:41:25 BST; 2 days ago
 Main PID: 1492 (slurmctld)
   CGroup: /system.slice/slurmctld.service
           └─1492 /usr/sbin/slurmctld

Sep 28 15:41:25 FWB-Lab-Tesla systemd[1]: Starting Slurm controller daemon...
Sep 28 15:41:25 FWB-Lab-Tesla systemd[1]: Started Slurm controller daemon.
I try to start the slurmd service, but after a few minutes it goes back to the failed state:

$ systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Tue 2020-09-29 18:11:25 BST; 1 day 19h ago
  Process: 25650 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/slurmd.service
           └─2986 /usr/sbin/slurmd

Sep 29 18:09:55 FWB-Lab-Tesla systemd[1]: Starting Slurm node daemon...
Sep 29 18:09:55 FWB-Lab-Tesla systemd[1]: Can't open PID file /var/run/slurmd.pid (yet?) after start: No ...ctory
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: slurmd.service start operation timed out. Terminating.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: Failed to start Slurm node daemon.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: Unit slurmd.service entered failed state.
Sep 29 18:11:25 FWB-Lab-Tesla systemd[1]: slurmd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
Log output from starting slurmd:

[2020-09-29T18:09:55.074] Message aggregation disabled
[2020-09-29T18:09:55.075] gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2020-09-29T18:09:55.075] gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2020-09-29T18:09:55.075] gpu device number 2(/dev/nvidia2):c 195:2 rwm
[2020-09-29T18:09:55.075] gpu device number 3(/dev/nvidia3):c 195:3 rwm
[2020-09-29T18:09:55.095] slurmd version 17.11.7 started
[2020-09-29T18:09:55.096] error: Error binding slurm stream socket: Address already in use
[2020-09-29T18:09:55.096] error: Unable to bind listen port (*:6818): Address already in use

The log file says that slurmd cannot bind to the standard port 6818 because something else is already using that address.


Is another slurmd already running on this node, or is something else listening on that port? Try
netstat -tulpen | grep 6818
to see what is using the address.
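If netstat is not installed on the node, ss from iproute2 reports the same information; a quick check (just a sketch, using the standard slurmd port 6818 from your slurm.conf) would be:

ss -tulpen | grep 6818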

Could you share the slurmd log file? (/var/log/slurm/slurmd.log)

Thank you for your reply. I have attached the log file as the next reply, because comments here have a length limit. I managed to provide a link to the original file; here is the link to /var/log/slurm/slurmd.log. Thanks for your help.

[root@FWB-Lab-Tesla shirin]# netstat -tulpen | grep 6818
tcp      0.0.0.0:6818      0.0.0.0:*      LISTEN      0      17955      2986/slurmd
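
The netstat output shows that an slurmd process (PID 2986) is already bound to port 6818, so the new daemon that systemd starts can never bind and never writes /var/run/slurmd.pid, which is why the unit times out. A minimal cleanup sketch, assuming PID 2986 is a stale slurmd left over from an earlier start and can safely be terminated:

systemctl stop slurmd                                  # let systemd release anything it still tracks
pkill -x slurmd                                        # kill the leftover daemon (or: kill 2986)
ss -tulpen | grep 6818                                 # confirm that port 6818 is now free
systemctl start slurmd                                 # start a fresh slurmd under systemd
scontrol update nodename=fwb-lab-tesla1 state=resume   # clear the drain state once slurmd is up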