Amazon EC2 CoreOS fleet not working after autoscaling


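I have a CoreOS cluster of 3 AWS EC2 instances. The cluster was set up using the CoreOS stack CloudFormation template. Once the cluster was up and running, I needed to update the auto scaling policy to pick up an EC2 instance profile. I copied the existing auto scaling configuration and updated the IAM role for the EC2 instances. Then I terminated the EC2 instances in the fleet and let auto scaling launch new ones. The new instances did assume their new role; however, the cluster seems to have lost its cluster machine information: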

ip-10-214-156-29 ~ # systemctl -l status etcd.service
● etcd.service - etcd
   Loaded: loaded (/usr/lib64/systemd/system/etcd.service; disabled)
  Drop-In: /run/systemd/system/etcd.service.d
       └─10-oem.conf, 20-cloudinit.conf
   Active: activating (auto-restart) (Result: exit-code) since Wed 2014-09-24 18:28:58 UTC; 9s ago
  Process: 14124 ExecStart=/usr/bin/etcd (code=exited, status=1/FAILURE)
 Main PID: 14124 (code=exited, status=1/FAILURE)

Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: etcd.service: main process  exited, code=exited, status=1/FAILURE
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: Unit etcd.service entered failed state.
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 INFO      | d9a7cb8df4a049689de452b6858399e9 attempted to join via 10.252.78.43:7001 failed: fail checking join version: Client Internal Error (Get http://10.252.78.43:7001/version: dial tcp 10.252.78.43:7001: connection refused)
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 WARNING   | d9a7cb8df4a049689de452b6858399e9 cannot connect to existing peers [10.214.135.35:7001 10.16.142.108:7001 10.248.7.66:7001 10.35.142.159:7001 10.252.78.43:7001]: fail joining the cluster via given peers after 3 retries
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 CRITICAL  | fail joining the cluster via given peers after 3 retries
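The same token from cloud-init was used. https://discovery.etcd.io/ shows 6 machines: 3 dead and 3 new. So it looks like the 3 new instances did join the cluster. The journalctl -u etcd.service log shows etcd timing out on the dead instances and getting connection refused from the new ones.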

journalctl -u etcd.service shows:
...

Sep 24 06:01:11 ip-10-35-142-159.us-west-2.compute.internal etcd[574]: [etcd] Sep 24 06:01:11.198 INFO      | 5c4531d885df4d06ae2d369c94f4de11 attempted to join via 10.214.156.29:7001 failed: fail checking join version: Client Internal Error (Get http://10.214.156.29:7001/version: dial tcp 10.214.156.29:7001: connection refused)

etcdctl --debug  ls
Cluster-Peers: http://127.0.0.1:4001 http://10.35.142.159:4001
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Error:  501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
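Maybe this is not the right procedure for updating a cluster's configuration, but if the cluster does need to autoscale for any reason (for example, triggered by load), will it still be able to function with a mix of dead and new instances in the pool?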

How can I recover from this situation without tearing the cluster down and rebuilding it?



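In this scheme, etcd does not stay running on enough machines to keep a quorum, so it cannot operate successfully. The best approach for autoscaling is to set up two groups of machines: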

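  • A fixed number (1-9) of etcd machines that are always up. These are set up with a discovery token or static networking, as in a normal cluster.
  • Your autoscaling group, which does not start etcd but instead configures fleet (and any other tools) to use the fixed etcd cluster. You can do this in cloud-config. The example below also sets some fleet metadata so that you can schedule jobs specifically onto the autoscaled machines if desired: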

  • The validator won't let me put any 10.x IP addresses in my answer (wtf!?), so be sure to fill in your own addresses.

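    You must have at least one machine always running with the discovery token. Once all of the machines go down, the heartbeat fails, no new hosts can join, and you need a new token before machines can join the cluster.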

    #cloud-config
    coreos:
      fleet:
        metadata: "role=autoscale"
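        # The hosts are blank because the IPs were stripped from this answer;
        # insert the address of one fixed etcd machine into each URL.
        etcd_servers: "http://:4001,http://:4001,http://:4001,http://:4001,http://:4001,http://:4001"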
      units:
        - name: fleet.service
          command: start
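
For the first group, a minimal cloud-config sketch for the fixed etcd machines might look like the following. It assumes the etcd 0.4-era `coreos.etcd` keys and the `$private_ipv4` substitution that coreos-cloudinit provides on EC2. `<token>` is a placeholder for your own discovery token, and the `role=etcd` metadata label is hypothetical, chosen only to mirror `role=autoscale` above.

```yaml
#cloud-config
coreos:
  etcd:
    # Placeholder: generate a token at https://discovery.etcd.io/new
    discovery: https://discovery.etcd.io/<token>
    # Client (4001) and peer (7001) ports, matching the logs in the question
    addr: $private_ipv4:4001
    peer-addr: $private_ipv4:7001
  fleet:
    # Hypothetical label to tell these machines apart from the autoscaled ones
    metadata: "role=etcd"
  units:
    - name: etcd.service
      command: start
    - name: fleet.service
      command: start
```

Keeping these machines out of the autoscaling group means that terminating or replacing workers never changes etcd membership, which is what broke the cluster in the question.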