Linux zombie processes with PPID 1 and error "Failed to get properties: Activation of org.freedesktop.systemd1 timed out"
I have about 8000 zombie processes with PPID 1. 90% of them are defunct nrpe processes. The original nrpe process is running fine, and its PID is in fact the PGID and SID of all the defunct nrpe processes. The remaining 10% of the defunct processes are salt-minion and sshd, which also have PPID 1.
Three days ago there were 0 zombies. Then, starting July 21, about 2000 zombies spawned (1900 of them nrpe), and they are multiplying at an accelerating rate.
# top
top - 08:18:34 up 296 days, 22:07, 1 user, load average: 2.07, 2.04, 1.85
Tasks: 7659 total, 1 running, 173 sleeping, 0 stopped, 7485 zombie
%Cpu(s): 23.5 us, 1.4 sy, 0.0 ni, 74.6 id, 0.2 wa, 0.0 hi, 0.1 si, 0.1 st
KiB Mem : 32779896 total, 220536 free, 29965856 used, 2593504 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 684308 avail Mem
Sample output of ps -ajx | grep defunct:
1 302 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 304 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 311 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 315 309 309 ? -1 Z 74 0:00 [sshd] <defunct>
1 323 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 325 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 351 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 358 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 370 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 372 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 375 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 388 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 389 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 392 386 386 ? -1 Z 74 0:00 [sshd] <defunct>
1 395 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 409 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 411 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 412 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 414 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 426 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 428 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 440 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 460 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 462 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
1 464 2360 2360 ? -1 Z 994 0:00 [nrpe] <defunct>
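With thousands of rows like these, a grouped count is easier to read than the raw listing. The following is a minimal sketch, not part of the original post: `count_zombies` is a hypothetical helper name, and it assumes the ten-column layout shown above (PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND).

```shell
# Hypothetical helper: summarize zombie processes from `ps ajx`-style
# input on stdin, grouped by command name, session ID and parent PID.
# Assumed column order: PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
count_zombies() {
  awk '$7 ~ /^Z/ { n[$10 " sid=" $4 " ppid=" $1]++ }
       END { for (k in n) print n[k], k }' | sort -rn
}
```

It could be fed with `ps ajx | count_zombies`. Grouping by SID should make it visible at a glance whether the nrpe zombies all belong to the session of the one running nrpe daemon, as described above.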
journalctl -xe shows similar errors (log excerpt further below).
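This has been a recurring problem for several months: random production hosts end up with a large number of nrpe zombies and a systemd that no longer works properly. As a remedy, I reboot the server (it is an AWS EC2 instance), but I would really like to understand what is going on here. Any pointers or ideas would be greatly appreciated.
OS: CentOS Linux release 7.1.1503 (Core)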
# ps -ef | grep defunct | grep Jul22 | wc -l
4063
Jul 23 09:36:59 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 23 09:36:59 xxxxxhostnamexxxxx dbus[610]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 23 09:36:59 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 23 09:37:14 xxxxxhostnamexxxxx sshd[8908]: error: Could not load host key: /etc/ssh/ssh_host_dsa_key
Jul 23 09:37:24 xxxxxhostnamexxxxx dbus[610]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 23 09:37:24 xxxxxhostnamexxxxx dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 23 09:37:24 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 23 09:37:24 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 23 09:37:24 xxxxxhostnamexxxxx newrelic-infra: time="2020-07-23T09:37:24Z" level=error msg="unable to get systemd service status" error="exit status 1"
Jul 23 09:37:25 xxxxxhostnamexxxxx dbus[610]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 23 09:37:25 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Activating systemd to hand-off: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 23 09:37:50 xxxxxhostnamexxxxx dbus[610]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 23 09:37:50 xxxxxhostnamexxxxx dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jul 23 09:37:50 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.login1': timed out
Jul 23 09:37:50 xxxxxhostnamexxxxx dbus-daemon: dbus[610]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
# systemctl --version
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ -LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN