Kubernetes: unknown host when looking up a pod by name, resolved by restarting the pod

I have a setup that spins up two pods in my CI flow; let's call them web and activemq. When the web pod starts, it tries to communicate with the activemq pod using the k8s-assigned pod name amq-deployment-0.activemq.

Randomly, web gets an unknown host exception when trying to reach amq-deployment1.activemq. If I restart the web pod in this situation, the web pod communicates with the activemq pod without any problem.

I have logged into the web pod when this happens, and the /etc/resolv.conf and /etc/hosts files look fine. The host's /etc/resolv.conf and /etc/hosts are sparse, with nothing suspicious in them.

Info:
There is only one worker node.
kubectl --version
Kubernetes v1.8.3+icp+ee

Any ideas on how to debug this? I can't think of a good reason why it would happen randomly, nor why it would resolve itself on a pod restart.

If other information would be useful, I can get it. Thanks in advance.

For ActiveMQ we have this Service file:
apiVersion: v1
kind: Service
metadata:
  name: activemq
  labels:
    app: myapp
    env: dev
spec:
  ports:
  - port: 8161
    protocol: TCP
    targetPort: 8161
    name: http
  - port: 61616
    protocol: TCP
    targetPort: 61616
    name: amq
  selector:
    component: analytics-amq
    app: myapp
    environment: dev
    type: fa-core
  clusterIP: None
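
For reference, here is a minimal sketch of how the DNS name of a StatefulSet pod behind a headless Service (clusterIP: None) is composed: pod name, then Service name, then namespace, then svc and the cluster domain. The function name statefulset_pod_fqdn and the namespace "dev" in the usage line are assumptions for illustration; cluster.local is only the default cluster domain and may differ on a given cluster. The record is expected to exist only when the StatefulSet's serviceName matches the headless Service's name, which is worth checking in the ActiveMQ template.

```shell
#!/bin/bash
# Sketch only: statefulset_pod_fqdn is a hypothetical helper name.
# Builds <pod>.<service>.<namespace>.svc.<cluster-domain>.
statefulset_pod_fqdn() {
  local pod="$1" service="$2"
  local namespace="${3:-default}"      # assumed default namespace
  local domain="${4:-cluster.local}"   # assumed default cluster domain
  printf '%s.%s.%s.svc.%s\n' "$pod" "$service" "$namespace" "$domain"
}

# Usage (the namespace "dev" is an assumption):
#   statefulset_pod_fqdn amq-deployment-0 activemq dev
```

Looking up the fully qualified name from inside the web pod, rather than the short form amq-deployment-0.activemq, removes the search-domain list in /etc/resolv.conf as a variable while debugging.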
This is the ActiveMQ StatefulSet (this is the template)
Web StatefulSet:
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: pa-web-deployment
spec:
  replicas: 1
  updateStrategy:
    type: RollingUpdate
  serviceName: "pa-web"
  template:
    metadata:
      labels:
        component: analytics-web
        app: myapp
        environment: dev
        type: fa-core
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: component
                  operator: In
                  values:
                  - analytics-web
              topologyKey: kubernetes.io/hostname
      containers:
        - name: pa-web
          image: default/myco/web:latest
          imagePullPolicy: Always
          resources:
            limits:
              cpu: 1
              memory: 2Gi
          readinessProbe:
            httpGet:
              path: /versions
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 15
            failureThreshold: 76
          livenessProbe:
            httpGet:
              path: /versions
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 15
            failureThreshold: 80
          securityContext:
            privileged: true
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          envFrom:
            - configMapRef:
                name: pa-web-conf-all
            - secretRef:
                name: pa-web-secret
          volumeMounts:
            - name: shared-volume
              mountPath: /MySharedPath
            - name: timezone
              mountPath: /etc/localtime
      volumes:
        - nfs:
            server: 10.100.10.23
            path: /MySharedPath
          name: shared-volume
        - name: timezone
          hostPath:
            path: /usr/share/zoneinfo/UTC
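This web pod also hits a similar "unknown host" problem when looking up the external database we have configured. That problem is likewise resolved by restarting the pod. Below is the configuration of that external service. Maybe the problem is easier to tackle from this angle? ActiveMQ has no problem looking up the database by the database service name and starting up.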
apiVersion: v1
kind: Service
metadata:
  name: dbhost
  labels:
    app: myapp
    env: dev
spec:
  type: ExternalName
  externalName: mydb.host.com
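Could it be a matter of which pod, and the application in its container, starts first and which starts later?

In any case, since Kubernetes-assigned pod names change between pod restarts, it is recommended to connect using a Service rather than the pod name.

One way to test connectivity is to use telnet (or whatever protocol it supports), if it is found in the image:

    telnet <host/pod/Service> <port>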
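
When telnet is not in the image, the same check can be split into two steps so that a DNS failure is not confused with a closed port. This is a minimal sketch that assumes getent and nc (netcat) are available in the container; check_endpoint and the labels it prints are names made up for illustration.

```shell
#!/bin/bash
# Sketch only: check_endpoint is a hypothetical helper.
# Assumes getent and nc exist in the image.
check_endpoint() {
  local host="$1" port="$2"
  # Step 1: does the name resolve at all?
  if ! getent hosts "$host" > /dev/null 2>&1; then
    echo "dns-failure"
    return 2
  fi
  # Step 2: the name resolves; is the TCP port reachable within 2 seconds?
  if ! nc -z -w 2 "$host" "$port" > /dev/null 2>&1; then
    echo "port-unreachable"
    return 1
  fi
  echo "ok"
}

# Usage, with the names and port from the question:
#   check_endpoint amq-deployment-0.activemq 61616
```

A "dns-failure" result from the web pod while another pod gets "ok" for the same name would point at the lookup path of that one pod rather than at ActiveMQ itself.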
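Unable to find a solution, I created a workaround. I set up entrypoint.sh in the image to look up the domain it needs to reach, write the result to the log, and exit on error: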
#!/bin/bash
# disable command echo (-x) and exit-on-error (-e)
set +ex
#####################################
# verify that the db service can be found or exit container
#####################################
# we do not want to install nslookup to determine if the db_host_name is a valid name
# we have ping available though
# 0-success, 1-error pinging but lookup worked (services can not be pinged), 2-unreachable host
ping -W 2 -c 1 "${db_host_name}" &> /dev/null
if [ $? -le 1 ]
then
    echo "service ${db_host_name} is known"
else
    echo "${db_host_name} service is NOT recognized. Exiting container..."
    exit 1
fi
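
A possible variation on the script above is to retry the lookup a few times inside the container before giving up, which may ride out a short DNS delay without a pod restart. This is only a sketch under assumptions: wait_for_host and RESOLVER are hypothetical names, and the default resolver command "getent hosts" is assumed to exist in the image (the original script uses ping precisely because nslookup was not installed).

```shell
#!/bin/bash
# Sketch only: wait_for_host and RESOLVER are hypothetical names.
# RESOLVER defaults to "getent hosts"; swap in another lookup command
# if getent is not available in the image.
RESOLVER="${RESOLVER:-getent hosts}"

wait_for_host() {
  local host="$1" attempts="${2:-10}" delay="${3:-3}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # Intentionally unquoted so "getent hosts" splits into command + argument.
    if $RESOLVER "$host" > /dev/null 2>&1; then
      echo "resolved ${host} on attempt ${i}"
      return 0
    fi
    # Pause between attempts, but not after the final one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  echo "${host} did not resolve after ${attempts} attempts" >&2
  return 1
}

# Usage in an entrypoint (10 attempts, 3 seconds apart):
#   wait_for_host "${db_host_name}" 10 3 || exit 1
```

If the name never resolves, the function still returns non-zero, so the container exits as before and an external restart check keeps working.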
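Next, since only a pod restart fixes the problem, my Ansible deployment runs a rollout check that queries the log to see whether the pod needs to be restarted. For example:

rollout-check.yml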
- name: "Rollout status for {{rollout_item.statefulset}}"
  shell: timeout 4m kubectl rollout status -n {{fa_namespace}} -f {{ rollout_item.statefulset }}
  ignore_errors: yes

# assuming that the first pod will be the one that would have an issue
- name: "Get {{rollout_item.pod_name}} log to check for issue with dns lookup"
  shell: kubectl logs {{rollout_item.pod_name}} --tail=1 -n {{fa_namespace}}
  register: log_line

# the entrypoint will write "dbhost service is NOT recognized. Exiting container..." to the log
# if there is a problem getting to the dbhost
- name: "Try removing {{rollout_item.component}} pod if unable to deploy"
  shell: kubectl delete pods -l component={{rollout_item.component}} --force --grace-period=0 --ignore-not-found=true -n {{fa_namespace}}
  # substring test; unlike find(...) > 0 it also matches when the marker starts the line
  when: "'service is NOT recognized' in log_line.stdout"
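I repeat this rollout check 6 times, because sometimes the service cannot be found even after the pod has been restarted. Once the pod has started successfully, the remaining checks pass immediately.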
- name: "Web rollout"
  include_tasks: rollout-check.yml
  loop:
    - { c: 1, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
    - { c: 2, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
    - { c: 3, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
    - { c: 4, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
    - { c: 5, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
    - { c: 6, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  loop_control:
    loop_var: rollout_item
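
For anyone who wants the same check-and-restart cycle outside Ansible, here is a rough shell sketch of the loop above. It reuses the kubectl invocations from rollout-check.yml unchanged; the function names needs_restart and rollout_with_retries are made up for illustration. One deliberate difference, which is my assumption about what is wanted: it stops as soon as the rollout succeeds instead of always running all six passes.

```shell
#!/bin/bash
# Sketch only: needs_restart and rollout_with_retries are hypothetical names.

# Returns 0 when the log line carries the entrypoint's failure marker.
needs_restart() {
  case "$1" in
    *"service is NOT recognized"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Args: namespace, StatefulSet manifest, pod name, component label, max attempts.
rollout_with_retries() {
  local namespace="$1" manifest="$2" pod="$3" component="$4" max="${5:-6}"
  local attempt=1 last_line
  while [ "$attempt" -le "$max" ]; do
    # Stop as soon as the rollout reports success.
    timeout 4m kubectl rollout status -n "$namespace" -f "$manifest" && return 0
    # Otherwise inspect the last log line of the pod.
    last_line=$(kubectl logs "$pod" --tail=1 -n "$namespace" 2> /dev/null)
    if needs_restart "$last_line"; then
      kubectl delete pods -l "component=${component}" --force --grace-period=0 \
        --ignore-not-found=true -n "$namespace"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Usage (the namespace and path variables are placeholders):
#   rollout_with_retries "$fa_namespace" "$dest_deploy/web.statefulset.yml" \
#     pa-web-deployment-0 analytics-web 6
```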
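Comments:

Can you share the configuration of the pods and their services?

Done. Let me know if there is anything else that would help. Thanks.

Is it necessary to use a StatefulSet rather than a Deployment? Also, I found an error in the configuration: in the Service you have the selector label app:flexible analytics, but in the StatefulSet you have the label app:myapp. This could cause such an error.

We use StatefulSets because we want better control over scaling. I have corrected the app name, but the problem persists. Thanks.

We run a "kubectl rollout status" check on the ActiveMQ StatefulSet to make sure ActiveMQ is up and running before continuing with the web pod rollout.

Shouldn't kube-dns handle the names assigned to pods? In my case the pod name is always pa-amq-deployment-0.

I should say that we have other containers that also connect to the same activemq instance, and they connect just fine.

In Kubernetes, pods communicate with each other using the name of a Service. A Service points to pods according to its selector. So kube-dns holds the names of Services, not of pods. For a StatefulSet, it is ...

@chrisaah - StatefulSet, got it... I see a livenessProbe, but no readinessProbe defined for the pa-amq container. Could that affect the "kubectl rollout status" check?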