How do I use the Kubernetes node problem detector?

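The node problem detector is mentioned in the K8S docs. If the cluster is not on GCE, how do we use it? Does it feed information to the dashboard, or does it provide API metrics?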

Do you mean how to install it? If that is the question, and you have a Kubernetes cluster you want to use it with, run this against your own cluster:

kubectl create -f https://github.com/kubernetes/node-problem-detector.yaml

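This tool aims to make various node problems visible to the upstream layers in the cluster management stack. It is a daemon that runs on each node, detects node problems and reports them to the apiserver.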

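OK, but... what does that actually mean? How can I tell whether it reached the API server?
What does before and after look like? Knowing that would help me understand what it is doing.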

Before installing the node problem detector, I see:

Bash# kubectl describe node ip-10-40-22-166.ec2.internal | grep -i condition -A 20 | grep Ready -B 20
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Thu, 20 Jun 2019 12:30:05 -0400   Thu, 20 Jun 2019 12:30:05 -0400   WeaveIsUp                    Weave pod has set this
  OutOfDisk            False   Thu, 20 Jun 2019 18:27:39 -0400   Thu, 20 Jun 2019 12:29:44 -0400   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure       False   Thu, 20 Jun 2019 18:27:39 -0400   Thu, 20 Jun 2019 12:29:44 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Thu, 20 Jun 2019 18:27:39 -0400   Thu, 20 Jun 2019 12:29:44 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Thu, 20 Jun 2019 18:27:39 -0400   Thu, 20 Jun 2019 12:29:44 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Thu, 20 Jun 2019 18:27:39 -0400   Thu, 20 Jun 2019 12:30:14 -0400   KubeletReady                 kubelet is posting ready status

After installing the node problem detector, I see:

Bash# helm upgrade --install npd stable/node-problem-detector -f node-problem-detector.values.yaml 
Bash# kubectl rollout status daemonset npd-node-problem-detector #(wait for up) 
Bash# kubectl describe node ip-10-40-22-166.ec2.internal | grep -i condition -A 20 | grep Ready -B 20 
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  DockerDaemon         False   Thu, 20 Jun 2019 22:06:17 -0400   Thu, 20 Jun 2019 22:04:14 -0400   DockerDaemonHealthy          Docker daemon is healthy
  EBSHealth            False   Thu, 20 Jun 2019 22:06:17 -0400   Thu, 20 Jun 2019 22:04:14 -0400   NoVolumeErrors               Volumes are attaching successfully
  KernelDeadlock       False   Thu, 20 Jun 2019 22:06:17 -0400   Thu, 20 Jun 2019 22:04:14 -0400   KernelHasNoDeadlock          kernel has no deadlock
  ReadonlyFilesystem   False   Thu, 20 Jun 2019 22:06:17 -0400   Thu, 20 Jun 2019 22:04:14 -0400   FilesystemIsNotReadOnly      Filesystem is not read-only
  NetworkUnavailable   False   Thu, 20 Jun 2019 12:30:05 -0400   Thu, 20 Jun 2019 12:30:05 -0400   WeaveIsUp                    Weave pod has set this
  OutOfDisk            False   Thu, 20 Jun 2019 22:07:10 -0400   Thu, 20 Jun 2019 12:29:44 -0400   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure       False   Thu, 20 Jun 2019 22:07:10 -0400   Thu, 20 Jun 2019 12:29:44 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Thu, 20 Jun 2019 22:07:10 -0400   Thu, 20 Jun 2019 12:29:44 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Thu, 20 Jun 2019 22:07:10 -0400   Thu, 20 Jun 2019 12:29:44 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Thu, 20 Jun 2019 22:07:10 -0400   Thu, 20 Jun 2019 12:30:14 -0400   KubeletReady                 kubelet is posting ready status

Note: I asked around for help with a way to see this for all nodes, and Kenna Ofegbu came up with this super useful and readable gem:

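Bash# for node in $(kubectl get nodes | sed '1d' | awk '{print $1}'); do kubectl describe node "$node" | sed -n '/Conditions/,/Ready/p'; done
# Runs in both zsh and Bash. Writing "do;" instead of "do" is accepted by zsh but is a syntax error in Bash, and "$node" has to be passed to "kubectl describe node" so each pass prints one node.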
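
Because these conditions are stored in the status of the Node object itself, they can also be read from the API server in a machine-readable form instead of grepping the describe output. The following is a rough sketch, not something taken from the NPD docs: the function names node_conditions and problem_conditions are made up for illustration, and it assumes (as in the output above) that Ready is healthy when True while every other condition is healthy when False.

```shell
# Print one "type<TAB>status<TAB>reason" line per condition of node $1.
# Needs kubectl and access to the cluster.
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
}

# Read those lines on stdin and keep only the ones that signal trouble.
# Assumption: "Ready" is healthy when True, every other condition when False.
problem_conditions() {
  awk -F'\t' '($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 != "False")'
}
```

Usage would look like: node_conditions ip-10-40-22-166.ec2.internal | problem_conditions. Empty output means nothing is currently flagged on that node.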

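OK, now I know what the node problem detector does, but... what good is adding a condition to a node, and how do I use that condition to do something useful?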

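Question: How do I use the Kubernetes node problem detector?
Use case #1: Auto-heal impaired nodes
Step 1.) Install the node problem detector, so it can attach new condition metadata to nodes.
Step 2.) Leverage Planetlabs/draino to cordon and drain nodes with bad conditions.
Step 3.) Leverage auto-healing. (When the node is cordoned and drained it is marked unschedulable, which triggers a new node to be provisioned; the bad node's resource utilization then becomes very low, which causes the bad node to be deprovisioned.)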
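
To make step 2 concrete, here is a minimal sketch of the cordon-and-drain idea using plain kubectl. It is not how Draino is implemented; the function names and the "node<TAB>type<TAB>status" line format are assumptions made up for this example, and it only prints the commands so the plan can be reviewed first.

```shell
# Emit "node<TAB>type<TAB>status" for every condition on every node.
# Needs kubectl and access to the cluster.
all_conditions() {
  for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
    kubectl get node "$node" \
      -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}' |
      awk -v n="$node" '{ print n "\t" $0 }'
  done
}

# Print the kubectl commands that would cordon and drain node $1.
# Printing instead of running keeps this a dry run.
drain_plan() {
  printf 'kubectl cordon %s\n' "$1"
  printf 'kubectl drain %s --ignore-daemonsets\n' "$1"
}

# Read "node<TAB>type<TAB>status" lines on stdin and print a plan for each
# node whose condition $1 (for example KernelDeadlock) is True.
plan_for_condition() {
  awk -F'\t' -v want="$1" '$2 == want && $3 == "True" { print $1 }' | sort -u |
    while read -r node; do drain_plan "$node"; done
}
```

For example, all_conditions | plan_for_condition KernelDeadlock prints the plan, and piping that into sh would execute it. Depending on the workloads, kubectl drain may need extra flags before it agrees to evict everything.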

Source:

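Use case #2: Surface unhealthy node events so that Kubernetes can detect them, then feed them into your monitoring stack so you have an auditable history of what happened and when.
These unhealthy node events are logged somewhere on the host node, but host nodes usually generate so much noisy or useless log data that these events are typically not collected by default.
The node problem detector knows where to look for these events on the host node, filters out the noise, and when it sees a signal of a negative outcome it posts it to its pod log, which is not noisy.
The pod logs are most likely already being ingested into an ELK stack and a Prometheus Operator stack, where they can be detected, alerted on, stored, and graphed.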
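
Besides its pod log, NPD reports problems it treats as temporary as Events on the node, so a quick look at what it has surfaced does not require the full monitoring stack. A small sketch follows; the function names node_events and count_reasons are invented for this example, and events only stay in the API server for a limited time.

```shell
# Print "node<TAB>reason" for every Event whose involved object is a Node.
# Needs kubectl and access to the cluster.
node_events() {
  kubectl get events --all-namespaces --field-selector involvedObject.kind=Node \
    -o jsonpath='{range .items[*]}{.involvedObject.name}{"\t"}{.reason}{"\n"}{end}'
}

# Read "node<TAB>reason" lines on stdin and print "node<TAB>reason<TAB>count",
# sorted so the output is stable.
count_reasons() {
  awk -F'\t' '{ c[$1 "\t" $2]++ } END { for (k in c) print k "\t" c[k] }' |
    LC_ALL=C sort
}
```

Running node_events | count_reasons gives a per-node tally of event reasons, which is a cheap way to see whether the detector is reporting anything before wiring up alerting.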

Also, note that nothing stops you from implementing both use cases.

Update: adding a snippet of my node-problem-detector.helm-values.yaml file, per request in the comments:

  log_monitors:
#https://github.com/kubernetes/node-problem-detector/tree/master/config contains the full list, you can exec into the pod and ls /config/ to see these as well.
    - /config/abrt-adaptor.json #Adds ABRT Node Events (ABRT: automatic bug reporting tool), exceptions will show up under "kubectl describe node $NODENAME | grep Events -A 20"
    - /config/kernel-monitor.json #Adds 2 new Node Health Condition Checks "KernelDeadlock" and "ReadonlyFilesystem"
    - /config/docker-monitor.json  #Adds new Node Health Condition Check "DockerDaemon" (Checks if Docker is unhealthy as a result of corrupt image)
#    - /config/docker-monitor-filelog.json #Error: "/var/log/docker.log: no such file or directory", doesn't exist on pod, I think you'd have to mount node hostpath to get it to work, gain doesn't sound worth effort.
#    - /config/kernel-monitor-filelog.json #Should add to existing Node Health Check "KernelDeadlock", more thorough detection, but silently fails in NPD pod logs for me.   

  custom_plugin_monitors: #[]
# Someone said all *-counter plugins are custom plugins, if you put them under log_monitors, you'll get #Error: "Failed to unmarshal configuration file "/config/kernel-monitor-counter.json""
    - /config/kernel-monitor-counter.json #Adds new Node Health Condition Check "FrequentUnregisteredNetDevice"
    - /config/docker-monitor-counter.json #Adds new Node Health Condition Check "CorruptDockerOverlay2"
    - /config/systemd-monitor-counter.json #Adds 3 new Node Health Condition Checks "FrequentKubeletRestart", "FrequentDockerRestart", and "FrequentContainerdRestart"

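Hi, this is not about how to install it, but about how to use the detector's data/information. I can't find any detailed documentation on how we actually act on the information the detector provides.

Thanks for the follow-up, but I'd like to know how I can actually use this "detector" and what actions can be taken. Can I see some metrics or alerts in the K8S dashboard, or does it provide an endpoint that serves some data?

@mon I think so, as shown in [link] and [link]. See also [link].

I have the same question. I'd also like to know how to verify after installing that it is working / doing something useful / how to test it, but testing it requires knowing how to use it and understanding how it works.

The NetworkUnavailable condition comes from the Weave Net CNI, not from NPD. I'll edit the answer to include a snippet of what should be in the node-problem-detector.helm-values.yaml file.

For me there is no difference in the node conditions before and after installing NPD.

If you don't see a before-and-after change, check the NPD pod's logs and make sure it isn't complaining about RBAC permissions; if it is, you can temporarily give its service account admin RBAC rights. Otherwise I'd try adding more config, since the default config might not be enough to trigger a change. Try some of the config snippets I posted and see whether that produces a before/after difference. Otherwise don't worry: I found this tool hard to install and configure, without much payoff.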
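
For the RBAC suspicion in the last comment, it is possible to ask the API server directly before handing out admin rights. This is only a sketch: the function names are invented, the service account name in the usage note is a guess based on the npd Helm release used above, and the two permissions checked (patching node status and creating events) are an assumption about what NPD needs.

```shell
# Build the user name Kubernetes assigns to a service account:
# namespace $1, service account name $2.
sa_user() {
  printf 'system:serviceaccount:%s:%s' "$1" "$2"
}

# Ask the API server whether that service account may patch node status
# and create events. Needs kubectl and permission to impersonate.
check_npd_rbac() {
  user=$(sa_user "$1" "$2")
  kubectl auth can-i patch nodes --subresource=status --as="$user"
  kubectl auth can-i create events --as="$user"
}
```

For example: check_npd_rbac default npd-node-problem-detector. Each check prints yes or no, and a no is a hint that the pod log complaints are indeed permission related.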