Kubernetes: unknown host when looking up a pod by name, resolved by restarting the pod


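I have an installer that spins up two pods in my CI flow; let's call them web and activemq. When the web pod starts, it tries to communicate with the activemq pod using the k8s-assigned pod name amq-deployment-0.activemq.

Randomly, web gets an unknown host exception when trying to access amq-deployment1.activemq. If I restart the web pod in this situation, it has no problem communicating with the activemq pod.

When this happens, I have logged into the web pod, and the /etc/resolv.conf and /etc/hosts files look fine. The host's /etc/resolv.conf and /etc/hosts are sparse, with nothing that looks questionable.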
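
For reference when judging whether /etc/resolv.conf looks fine: with the default ClusterFirst DNS policy, a pod's file typically lists the cluster DNS service IP as nameserver, a search path that starts with <namespace>.svc.<cluster domain>, and options ndots:5. A short name such as amq-deployment-0.activemq contains fewer than five dots, so it only resolves through that search path. The sketch below prints those fields one per line so a failing pod can be compared with a healthy one. The function name summarize_resolv_conf is hypothetical, and awk is assumed to be present in the image.

```shell
# Print the nameserver, search domains and resolver options of a resolv.conf.
# Usage: summarize_resolv_conf [file]   (defaults to /etc/resolv.conf)
summarize_resolv_conf() {
  local file="${1:-/etc/resolv.conf}"
  awk '
    $1 == "nameserver" { print "nameserver: " $2 }
    $1 == "search"     { for (i = 2; i <= NF; i++) print "search: " $i }
    $1 == "options"    { for (i = 2; i <= NF; i++) print "option: " $i }
  ' "$file"
}
```

Running it inside the web pod (for example through kubectl exec) during a failure and again after a restart would show whether the resolver configuration differs between the two states.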

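Info: there is only one worker node.

kubectl --version: Kubernetes v1.8.3+icp+ee

Any ideas on how to debug this? I can't think of a good reason for it to happen randomly, nor for it to resolve itself on a pod restart.

If other information would be useful, I can get it. Thanks in advance.

For ActiveMQ we have this Service file: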

apiVersion: v1
kind: Service
metadata:
    name: activemq
    labels:
            app: myapp
            env: dev
spec:
    ports:
        - port: 8161
          protocol: TCP
          targetPort: 8161
          name: http
        - port: 61616
          protocol: TCP
          targetPort: 61616
          name: amq
    selector:
        component: analytics-amq
        app: myapp
        environment: dev
        type: fa-core
    clusterIP: None

The ActiveMQ StatefulSet (this is the template):

Web StatefulSet:

apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
    name: pa-web-deployment
spec:
    replicas: 1
    updateStrategy:
        type: RollingUpdate
    serviceName: "pa-web"
    template:
        metadata:
            labels:
                component: analytics-web
                app: myapp
                environment: dev
                type: fa-core
        spec:
            affinity:
              podAntiAffinity:
                preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  podAffinityTerm:
                    labelSelector:
                      matchExpressions:
                      - key: component
                        operator: In
                        values:
                        - analytics-web
                    topologyKey: kubernetes.io/hostname
            containers:
                - name: pa-web
                  image: default/myco/web:latest
                  imagePullPolicy: Always
                  resources:
                        limits:
                            cpu: 1
                            memory: 2Gi
                  readinessProbe:
                      httpGet:
                          path: /versions
                          port: 8080
                      initialDelaySeconds: 30
                      periodSeconds: 15
                      failureThreshold: 76
                  livenessProbe:
                      httpGet:
                          path: /versions
                          port: 8080
                      initialDelaySeconds: 30
                      periodSeconds: 15
                      failureThreshold: 80
                  securityContext:
                      privileged: true
                  ports:
                      - containerPort: 8080
                        name: http
                        protocol: TCP
                  envFrom:
                      - configMapRef:
                         name: pa-web-conf-all
                      - secretRef:
                         name: pa-web-secret
                  volumeMounts:
                      - name: shared-volume
                        mountPath: /MySharedPath
                      - name: timezone
                        mountPath: /etc/localtime
            volumes:
                - nfs:
                    server: 10.100.10.23
                    path: /MySharedPath
                  name: shared-volume
                - name: timezone
                  hostPath:
                    path: /usr/share/zoneinfo/UTC
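
This web pod has a similar "unknown host" problem when looking up the external database we have configured. That problem is likewise resolved by restarting the pod. Below is the configuration of that external service. Perhaps it is easier to approach the problem from this angle? ActiveMQ has no problem looking up the database by its service name and starting.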

apiVersion: v1
kind: Service
metadata:
  name: dbhost
  labels:
    app: myapp
    env: dev
spec:
  type: ExternalName
  externalName: mydb.host.com

Could this be a matter of which pod, and the application in its container, starts first and which starts later?

In any case, because the pod name assigned by Kubernetes changes between pod restarts, it is better to connect using a Service rather than the pod name.
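
To make the two kinds of names concrete: a Service is resolvable as <service>.<namespace>.svc.<cluster domain>, and a StatefulSet pod behind a headless Service (clusterIP: None) as <statefulset>-<ordinal>.<service>.<namespace>.svc.<cluster domain>. The per-pod record only exists when the StatefulSet's serviceName matches the headless Service name. A small sketch that spells the names out; the helper names are hypothetical, and the namespace default and the cluster domain cluster.local are assumptions that may not match this cluster:

```shell
# Fully qualified DNS name of a Service.
# Usage: service_fqdn <service> <namespace> [cluster-domain]
service_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# Fully qualified DNS name of one StatefulSet pod behind a headless Service.
# Usage: statefulset_pod_fqdn <statefulset> <ordinal> <service> <namespace> [cluster-domain]
statefulset_pod_fqdn() {
  printf '%s-%s.%s\n' "$1" "$2" "$(service_fqdn "$3" "$4" "${5:-cluster.local}")"
}

# Example with assumed values (namespace "default"):
#   service_fqdn activemq default
#   statefulset_pod_fqdn amq-deployment 0 activemq default
```

Using the fully qualified form in the web configuration removes the dependency on the pod's search path, which can help separate a search-path problem from a missing DNS record.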

One way to test connectivity is to use telnet (or another client, for the protocols it supports), if it is available in the image:

telnet <host/pod/Service> <port>
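
If telnet is not in the image, bash can open a TCP connection itself through its /dev/tcp pseudo-path. This is a minimal sketch, assuming bash was built with /dev/tcp support and that timeout (coreutils) is available; check_tcp is a hypothetical helper name.

```shell
# Try to open a TCP connection to host:port within a time limit.
# Usage: check_tcp <host> <port> [timeout-seconds]
check_tcp() {
  local host="$1" port="$2" secs="${3:-3}"
  # bash treats /dev/tcp/<host>/<port> as a request to open a socket;
  # host and port are passed as positional arguments to avoid quoting issues.
  if timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
    echo "${host}:${port} is reachable"
  else
    echo "${host}:${port} is NOT reachable (name lookup failed, refused, or timed out)"
    return 1
  fi
}
```

For example, check_tcp amq-deployment-0.activemq 61616 from inside the web pod. A failure here does not distinguish a DNS failure from a closed port, so it complements a name lookup rather than replacing it.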

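Unable to find a solution, I created a workaround. I set up entrypoint.sh in my image to look up the domains it needs to reach, write the result to the log, and exit on error: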

#!/bin/bash

# disable command echo (+x) and exit on error (+e)
set +ex

#####################################
# verify that the db service can be found or exit container
#####################################
# we do not want to install nslookup to determine if the db_host_name is a valid name
# we have ping available though
# 0-success, 1-error pinging but lookup worked (services can not be pinged), 2-unreachable host
ping -W 2 -c 1 "${db_host_name}" &> /dev/null
if [ $? -le 1 ]
then
  echo "service ${db_host_name} is known"
else
  echo "${db_host_name} service is NOT recognized. Exiting container..."
  exit 1
fi
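
Because the lookup sometimes succeeds a little later, a variant of this check could retry a few times before giving up instead of exiting on the first failure. This is a sketch under assumptions, not the script used above: it swaps ping for getent hosts (assumed to exist in the image), and the names resolve_host and wait_for_dns are hypothetical. The final message is kept identical so a log check for "service is NOT recognized" still matches.

```shell
# Single lookup, split out so another tool can be substituted if getent is missing.
resolve_host() {
  getent hosts "$1" > /dev/null 2>&1
}

# Retry the lookup up to <attempts> times, sleeping <delay> seconds in between.
# Usage: wait_for_dns <host> [attempts] [delay-seconds]
wait_for_dns() {
  local host="$1" attempts="${2:-5}" delay="${3:-2}" i=1
  while [ "$i" -le "$attempts" ]; do
    if resolve_host "$host"; then
      echo "service ${host} is known (attempt ${i})"
      return 0
    fi
    echo "attempt ${i}/${attempts}: ${host} not resolvable yet" >&2
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "${host} service is NOT recognized. Exiting container..."
  return 1
}
```

In entrypoint.sh this could be called as wait_for_dns "${db_host_name}" 10 3 || exit 1, so the container only exits after every attempt has failed.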
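
Next, since only a pod restart fixes the problem, my Ansible deployment runs a rollout check that queries the log to see whether the pod needs to be restarted. For example: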

rollout-check.yml

- name: "Rollout status for {{rollout_item.statefulset}}"
  shell: timeout 4m kubectl rollout status -n {{fa_namespace}} -f {{ rollout_item.statefulset }}
  ignore_errors: yes

# assuming that the first pod will be the one that would have an issue
- name: "Get {{rollout_item.pod_name}} log to check for issue with dns lookup"
  shell: kubectl logs {{rollout_item.pod_name}} --tail=1 -n {{fa_namespace}}
  register: log_line

# the entrypoint will write dbhost service is NOT recognized. Exiting container... to the log
# if there is a problem getting to the dbhost
- name: "Try removing {{rollout_item.component}} pod if unable to deploy"
  shell: kubectl delete pods -l component={{rollout_item.component}} --force --grace-period=0 --ignore-not-found=true -n {{fa_namespace}}
  when: "'service is NOT recognized' in log_line.stdout"
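
I repeat this rollout check 6 times, because sometimes the service cannot be found even after a pod restart. Once the pod has started successfully, the additional checks complete immediately.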

- name: "Web rollout"
  include_tasks: rollout-check.yml
  loop:
  - { c: 1, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  - { c: 2, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  - { c: 3, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  - { c: 4, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  - { c: 5, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  - { c: 6, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  loop_control:
    loop_var: rollout_item
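
The same check-and-restart cycle can also be written as a plain retry loop. This is a minimal shell sketch of that idea, not the playbook above: it assumes kubectl and timeout are on the PATH, reuses the kubectl commands from rollout-check.yml, and the function name and argument order are hypothetical. Unlike the six-item loop, it stops at the first try whose last log line does not contain the DNS failure message.

```shell
# Usage: rollout_with_retries <namespace> <manifest> <pod-name> <component> [max-tries]
rollout_with_retries() {
  local ns="$1" manifest="$2" pod="$3" component="$4" max="${5:-6}"
  local try=1 last_line
  while [ "$try" -le "$max" ]; do
    # A failed or timed-out rollout is tolerated, like ignore_errors in the playbook.
    timeout 4m kubectl rollout status -n "$ns" -f "$manifest" || true
    last_line=$(kubectl logs "$pod" --tail=1 -n "$ns" 2>/dev/null || true)
    case "$last_line" in
      *"service is NOT recognized"*)
        # The entrypoint reported a failed lookup: delete the pod so it is recreated.
        kubectl delete pods -l "component=${component}" --force --grace-period=0 \
          --ignore-not-found=true -n "$ns"
        ;;
      *)
        echo "no DNS failure in the log on try ${try}"
        return 0
        ;;
    esac
    try=$((try + 1))
  done
  return 1
}
```

A call matching the loop items above would be rollout_with_retries "$fa_namespace" web.statefulset.yml pa-web-deployment-0 analytics-web 6, where fa_namespace is assumed to be set in the calling shell.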

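Can you share the configuration of the pods and their services?

Done. Let me know if there is anything else that would help. Thanks.

Is it necessary to use a StatefulSet rather than a Deployment? Also, I found an error in the configuration: in the Service you have the selector label app:flexible analytics, but in the StatefulSet you have the label app:myapp. This could cause such errors.

We use StatefulSets because we want better control over scaling. I have corrected the app name, but the problem persists. Thanks. We run a "kubectl rollout status" check on the activeMQ StatefulSet to make sure activeMQ is up and running before continuing with the web pod rollout.

Shouldn't kube-dns handle the names assigned to pods? In my case the pod name is always: pa-amq-deployment-0

I should say that we have other containers that also connect to the same activemq instance, and they are able to connect just fine.

In Kubernetes, pods communicate with each other using the names of Services. A Service points to pods according to its selector. So in kube-dns you have the names of Services, not of pods. For a StatefulSet, it is ...

@chrisaah - StatefulSet, got it... I see a livenessProbe, but no readinessProbe defined for the pa-amq container. Could this affect the "kubectl rollout status" check?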