Amazon EC2: CoreOS cluster not working after auto scaling
I have a CoreOS cluster of 3 AWS EC2 instances. The cluster was set up using the CoreOS CloudFormation stack. After the cluster was up and running, I needed to update the auto scaling policy to pick up an EC2 instance profile. I copied the existing auto scaling configuration and updated the IAM role for the EC2 instances. Then I terminated the EC2 instances in the fleet and let auto scaling launch new ones. The new instances did assume their new roles; however, the cluster seems to have lost its cluster machine information:
ip-10-214-156-29 ~ # systemctl -l status etcd.service
● etcd.service - etcd
Loaded: loaded (/usr/lib64/systemd/system/etcd.service; disabled)
Drop-In: /run/systemd/system/etcd.service.d
└─10-oem.conf, 20-cloudinit.conf
Active: activating (auto-restart) (Result: exit-code) since Wed 2014-09-24 18:28:58 UTC; 9s ago
Process: 14124 ExecStart=/usr/bin/etcd (code=exited, status=1/FAILURE)
Main PID: 14124 (code=exited, status=1/FAILURE)
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: Unit etcd.service entered failed state.
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 INFO | d9a7cb8df4a049689de452b6858399e9 attempted to join via 10.252.78.43:7001 failed: fail checking join version: Client Internal Error (Get http://10.252.78.43:7001/version: dial tcp 10.252.78.43:7001: connection refused)
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 WARNING | d9a7cb8df4a049689de452b6858399e9 cannot connect to existing peers [10.214.135.35:7001 10.16.142.108:7001 10.248.7.66:7001 10.35.142.159:7001 10.252.78.43:7001]: fail joining the cluster via given peers after 3 retries
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 CRITICAL | fail joining the cluster via given peers after 3 retries
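The same token from cloud-init was used. https://discovery.etcd.io/ shows 6 machines: 3 dead ones and 3 new ones. So it looks like the 3 new instances did register with the discovery service. On the new instances, etcd times out when contacting the dead instances and gets connection refused from the other new ones.
journalctl -u etcd.service shows: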
...
Sep 24 06:01:11 ip-10-35-142-159.us-west-2.compute.internal etcd[574]: [etcd] Sep 24 06:01:11.198 INFO | 5c4531d885df4d06ae2d369c94f4de11 attempted to join via 10.214.156.29:7001 failed: fail checking join version: Client Internal Error (Get http://10.214.156.29:7001/version: dial tcp 10.214.156.29:7001: connection refused)
etcdctl --debug ls
Cluster-Peers: http://127.0.0.1:4001 http://10.35.142.159:4001
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
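To check which of the peers named in the etcd WARNING line earlier actually accept connections on the peer port, a small script along these lines may help. This is only a sketch: `extract_peers` and `is_reachable` are hypothetical helper names, not part of etcd or etcdctl, and the regular expression assumes the log format shown above.

```python
import re
import socket

# Matches the bracketed, space-separated peer list in etcd's
# "cannot connect to existing peers [...]" warning (format assumed from the log above).
PEERS_RE = re.compile(r"cannot connect to existing peers \[([^\]]*)\]")

# Shortened copy of the WARNING line from the systemctl output.
SAMPLE_LINE = (
    "WARNING | d9a7cb8df4a049689de452b6858399e9 cannot connect to existing peers "
    "[10.214.135.35:7001 10.16.142.108:7001 10.248.7.66:7001 "
    "10.35.142.159:7001 10.252.78.43:7001]: fail joining the cluster"
)


def extract_peers(log_line):
    """Return a list of (host, port) tuples found in the warning line."""
    match = PEERS_RE.search(log_line)
    if not match:
        return []
    peers = []
    for entry in match.group(1).split():
        host, _, port = entry.rpartition(":")
        peers.append((host, int(port)))
    return peers


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_reachable(host, port)` for each tuple returned by `extract_peers(SAMPLE_LINE)` from one of the new instances would separate terminated machines (timeouts) from new machines whose etcd is not listening yet (connection refused), although both simply show up as `False` here.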
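Maybe this is not the right procedure for updating the cluster configuration, but if the cluster does need to auto scale for any reason (e.g. triggered by load), will it still be able to function with a mix of dead and new instances in the pool?
How do I recover from this situation without tearing down and rebuilding?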
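In this scheme, etcd is not left running on enough machines to keep a quorum, so it cannot operate successfully. The best approach to autoscaling is to set up two groups of machines: a fixed group that always runs etcd, and an autoscaling group that does not run etcd itself but points fleet at the fixed etcd machines. The cloud-config at the end of this answer is an example for the autoscaling group; it also sets fleet metadata so the autoscaled machines can be told apart.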
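As a rough illustration of why terminating all three instances at once is fatal, here is the majority rule that Raft-based stores such as etcd rely on. The function names are made up for this sketch, and it only does the arithmetic; it does not talk to etcd.

```python
def quorum(cluster_size):
    """Smallest number of members that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size, live_members):
    """True if enough original members are still alive to form a majority."""
    return live_members >= quorum(cluster_size)


# A 3-member cluster needs 2 live members, so it survives losing one
# machine, but not all three being terminated together.
print(quorum(3))         # majority needed for 3 members
print(has_quorum(3, 2))  # one member lost
print(has_quorum(3, 0))  # every original member terminated
```

Replacing machines one at a time, and waiting for each replacement to join, keeps a majority alive; replacing them all at once does not.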
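The validator wouldn't let me put any 10.x IP addresses in my answer (wtf!?), so be sure to fill those addresses in.
You must always have at least one machine running with the discovery token. Once all of the machines have gone down, the heartbeat fails, no new hosts are able to join, and you need a new token for machines to join the cluster.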
#cloud-config
coreos:
  fleet:
    metadata: "role=autoscale"
    etcd_servers: "http://:4001,http://:4001,http://:4001,http://:4001,http://:4001,http://:4001"
  units:
    - name: fleet.service
      command: start
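Since the host parts of `etcd_servers` have to be filled in by hand, one option is to generate the cloud-config from a list of the fixed etcd machines. The sketch below assumes the same keys as the cloud-config above; the function names and the 192.0.2.x addresses are placeholders for illustration only, so substitute the private IPs of your own etcd machines.

```python
def etcd_servers_value(hosts, port=4001):
    """Build the comma-separated etcd_servers string for fleet."""
    if not hosts:
        raise ValueError("at least one etcd host is required")
    return ",".join("http://{}:{}".format(host, port) for host in hosts)


def autoscale_cloud_config(hosts, role="autoscale"):
    """Render a cloud-config for autoscaled machines that run fleet but not etcd."""
    lines = [
        "#cloud-config",
        "coreos:",
        "  fleet:",
        '    metadata: "role={}"'.format(role),
        '    etcd_servers: "{}"'.format(etcd_servers_value(hosts)),
        "  units:",
        "    - name: fleet.service",
        "      command: start",
    ]
    return "\n".join(lines) + "\n"


# Placeholder addresses from the documentation range, not real cluster members.
FIXED_ETCD_HOSTS = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
print(autoscale_cloud_config(FIXED_ETCD_HOSTS))
```

The rendered text can then be pasted into the user data of the auto scaling configuration.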