Linux zombie processes with PPID 1 and error: Failed to get properties: Activation of org.freedesktop.systemd1 timed out

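I have about 8000 zombie processes with PPID 1. About 90% of them are defunct nrpe processes. The original nrpe process is running fine, and its PID is in fact the PGID and SID of all the defunct nrpe processes. The remaining 10% are defunct salt-minion and sshd processes, which also have PPID 1.

Three days ago there were 0 zombies. Then, starting on July 21, 2000 zombies spawned (1900 of them nrpe), and they are multiplying at an accelerating rate.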

# top
top - 08:18:34 up 296 days, 22:07,  1 user,  load average: 2.07, 2.04, 1.85
Tasks: 7659 total,   1 running, 173 sleeping,   0 stopped, 7485 zombie
%Cpu(s): 23.5 us,  1.4 sy,  0.0 ni, 74.6 id,  0.2 wa,  0.0 hi,  0.1 si,  0.1 st
KiB Mem : 32779896 total,   220536 free, 29965856 used,  2593504 buff/cache
KiB Swap:        0 total,        0 free,        0 used.   684308 avail Mem
Sample output of ps -ajx | grep defunct:

    1   302  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   304  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   311  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   315   309   309 ?           -1 Z       74   0:00 [sshd] <defunct>
    1   323  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   325  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   351  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   358  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   370  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   372  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   375  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   388  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   389  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   392   386   386 ?           -1 Z       74   0:00 [sshd] <defunct>
    1   395  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   409  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   411  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   412  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   414  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   426  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   428  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   440  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   460  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   462  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
    1   464  2360  2360 ?           -1 Z      994   0:00 [nrpe] <defunct>
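
To check the rough 90% / 10% split described above without reading thousands of ps lines, a small helper along these lines might be useful. It is only a sketch: the function name `summarize_zombies` is made up for illustration, and it assumes the procps `ps` shipped with CentOS 7, where `stat=` and `comm=` print the process state and command name without a header line.

```shell
# Hypothetical helper (not from the original post): count zombie
# processes per command name.
# Input: lines of "<STAT> <COMMAND ...>", e.g. from `ps -eo stat=,comm=`.
summarize_zombies() {
    # A state starting with "Z" marks a zombie; $2 is the command name.
    awk '$1 ~ /^Z/ { count[$2]++ }
         END { for (c in count) print count[c], c }' | sort -rn
}

# Assumed usage on the affected host:
#   ps -eo stat=,comm= | summarize_zombies
```

The output is one line per command, with the zombie count first and the largest count at the top.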
journalctl -xe shows similar errors as well (log excerpt further below).

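This has been a recurring problem for several months: random production hosts end up with a large number of nrpe zombies and a systemd that no longer works properly. As a remedy I reboot the server (it is an AWS EC2 instance), but I would really like to understand what is going on here. Any pointers or ideas would be greatly appreciated.

OS: CentOS Linux release 7.1.1503 (Core)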

# ps -ef | grep defunct | grep Jul22 | wc -l
4063
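
The count above covers a single day. To see whether the zombies really are appearing faster over time, the same idea can be extended to a per-day breakdown. This is a sketch under assumptions: `zombies_per_day` is a hypothetical name, and it relies on the procps `lstart` output format (`Wed Jul 22 09:36:59 2020`), which may differ on other `ps` implementations.

```shell
# Hypothetical helper: count zombies by start date.
# Input: lines of "<STAT> <Dow> <Mon> <DD> <HH:MM:SS> <YYYY>",
# e.g. from `ps -eo stat=,lstart=`.
zombies_per_day() {
    awk '$1 ~ /^Z/ {
             day = $3 " " $4 " " $6   # month, day of month, year
             count[day]++
         }
         END { for (d in count) print count[d], d }' | sort -rn
}

# Assumed usage on the affected host:
#   ps -eo stat=,lstart= | zombies_per_day
```

Days are listed by descending zombie count rather than chronologically, so the worst day appears first.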
journalctl -xe output:

Jul 23 09:36:59 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 23 09:36:59 xxxxxhostnamexxxxx dbus[610]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 23 09:36:59 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 23 09:37:14 xxxxxhostnamexxxxx sshd[8908]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Jul 23 09:37:24 xxxxxhostnamexxxxx dbus[610]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 23 09:37:24 xxxxxhostnamexxxxx dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 23 09:37:24 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 23 09:37:24 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 23 09:37:24 xxxxxhostnamexxxxx newrelic-infra: time="2020-07-23T09:37:24Z" level=error msg="unable to get systemd service status" error="exit status 1"
Jul 23 09:37:25 xxxxxhostnamexxxxx dbus[610]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 23 09:37:25 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 23 09:37:50 xxxxxhostnamexxxxx dbus[610]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 23 09:37:50 xxxxxhostnamexxxxx dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 23 09:37:50 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 23 09:37:50 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
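
To get a feel for which D-Bus services are timing out and how often, the log lines above can be tallied per service name. The following is a rough sketch, not a definitive tool: `count_activation_timeouts` is an illustrative name, and it assumes the message format shown in this excerpt. Because each message here is logged twice (once tagged `dbus[610]` and once tagged `dbus-daemon`), the counts will be roughly double the number of actual timeouts.

```shell
# Hypothetical helper: count "Failed to activate service ... timed out"
# messages per D-Bus service name. Reads log lines on stdin.
count_activation_timeouts() {
    # Splitting on single quotes puts the service name in field 2.
    awk -F"'" '/Failed to activate service .*: timed out/ { count[$2]++ }
               END { for (s in count) print count[s], s }' | sort -rn
}

# Assumed usage on the affected host:
#   journalctl -xe | count_activation_timeouts
```

Lines such as the "Activating systemd to hand-off" messages and the sshd host key error are ignored, since they do not match the timeout pattern.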
# systemctl --version
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN