Amazon EC2: CoreOS fleet not working after autoscaling (tags: amazon-ec2, autoscaling, coreos)


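I have a CoreOS cluster of 3 AWS EC2 instances. The cluster was set up using the CoreOS stack CloudFormation template. Once the cluster was up and running, I needed to update the Auto Scaling configuration to pick up an EC2 instance profile. I copied the existing Auto Scaling launch configuration and updated the IAM role for the EC2 instances. Then I terminated the EC2 instances in the fleet and let Auto Scaling launch new ones. The new instances did assume their new role, but the cluster seems to have lost its cluster machine information: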

ip-10-214-156-29 ~ # systemctl -l status etcd.service
● etcd.service - etcd
   Loaded: loaded (/usr/lib64/systemd/system/etcd.service; disabled)
  Drop-In: /run/systemd/system/etcd.service.d
       └─10-oem.conf, 20-cloudinit.conf
   Active: activating (auto-restart) (Result: exit-code) since Wed 2014-09-24 18:28:58 UTC; 9s ago
  Process: 14124 ExecStart=/usr/bin/etcd (code=exited, status=1/FAILURE)
 Main PID: 14124 (code=exited, status=1/FAILURE)

Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: etcd.service: main process  exited, code=exited, status=1/FAILURE
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: Unit etcd.service entered failed state.
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 INFO      | d9a7cb8df4a049689de452b6858399e9 attempted to join via 10.252.78.43:7001 failed: fail checking join version: Client Internal Error (Get http://10.252.78.43:7001/version: dial tcp 10.252.78.43:7001: connection refused)
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 WARNING   | d9a7cb8df4a049689de452b6858399e9 cannot connect to existing peers [10.214.135.35:7001 10.16.142.108:7001 10.248.7.66:7001 10.35.142.159:7001 10.252.78.43:7001]: fail joining the cluster via given peers after 3 retries
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 CRITICAL  | fail joining the cluster via given peers after 3 retries

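The same discovery token from cloud-init was used. https://discovery.etcd.io/ shows 6 machines: 3 dead and 3 new. So it looks like the 3 new instances did register with the cluster. The etcd journal shows that etcd times out on the dead instances and gets connection refused from the new ones.

journalctl -u etcd.service shows: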
...

Sep 24 06:01:11 ip-10-35-142-159.us-west-2.compute.internal etcd[574]: [etcd] Sep 24 06:01:11.198 INFO      | 5c4531d885df4d06ae2d369c94f4de11 attempted to join via 10.214.156.29:7001 failed: fail checking join version: Client Internal Error (Get http://10.214.156.29:7001/version: dial tcp 10.214.156.29:7001: connection refused)
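
The logs above show each new node walking the peer list from the discovery service and failing on every entry. As a minimal sketch (not part of the original post), the following Python helper sorts a peer list into reachable and unreachable entries by attempting a TCP connection to the etcd peer port. The function names `is_port_open` and `split_peers` are hypothetical; the peer addresses are the ones printed in the WARNING line of the question.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def split_peers(peers, probe=is_port_open):
    """Split 'host:port' strings into (reachable, unreachable) lists.

    `probe` is injectable so the logic can be exercised without a network.
    """
    reachable, unreachable = [], []
    for peer in peers:
        host, _, port = peer.rpartition(":")
        if probe(host, int(port)):
            reachable.append(peer)
        else:
            unreachable.append(peer)
    return reachable, unreachable


if __name__ == "__main__":
    # Peer list copied from the etcd WARNING line in the question.
    peers = [
        "10.214.135.35:7001",
        "10.16.142.108:7001",
        "10.248.7.66:7001",
        "10.35.142.159:7001",
        "10.252.78.43:7001",
    ]
    up, down = split_peers(peers)
    print("reachable:  ", up)
    print("unreachable:", down)
```

Run from one of the new instances, an empty "reachable" list would match what the journal reports: no registered peer is actually serving on port 7001, so nobody can be joined.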

etcdctl --debug ls
Cluster-Peers: http://127.0.0.1:4001 http://10.35.142.159:4001
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Error:  501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

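Perhaps this is not the right procedure for updating a cluster's configuration. But if the cluster does need to auto-scale for any reason (triggered by load, for example), will it still be able to function with a mix of dead and new instances in the pool?

How do I recover from this situation without tearing the cluster down and rebuilding it?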


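In this scheme, etcd does not keep a quorum of machines and cannot operate successfully. The best scheme for autoscaling is to set up two groups of machines.

The first is a fixed number (1-9) of etcd machines that will always be up. These are set up with a discovery token or static networking, as usual.

The second is your autoscaling group, which does not start etcd but instead configures fleet (and any other tool) to use the fixed etcd cluster. You can do this in cloud-config. The example below also sets some fleet metadata so that you can schedule jobs specifically onto the autoscaled machines if desired: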

The validator would not let me put any 10.x IP addresses in my answer (wtf!?), so be sure to substitute them in.

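You must have at least one machine always running with the discovery token. Once all of the machines have gone down, the heartbeat fails, no new host can join, and you need a new token before machines can join the cluster.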
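
The quorum argument can be checked with simple arithmetic. The sketch below assumes the standard majority rule used by Raft, the consensus protocol etcd is built on: a cluster of N members needs more than half of them alive. The helper names `quorum` and `has_quorum` are illustrative, not etcd APIs.

```python
def quorum(cluster_size):
    """Smallest number of live members that forms a majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def has_quorum(cluster_size, alive):
    """True if `alive` members out of `cluster_size` still form a majority."""
    return alive >= quorum(cluster_size)


if __name__ == "__main__":
    # The original cluster had 3 members and all 3 were terminated at once.
    print(quorum(3))          # members required for a majority
    print(has_quorum(3, 0))   # none of the original members survived
    print(has_quorum(3, 2))   # replacing one member at a time keeps a majority
```

Under this rule, terminating all three original members at once leaves nothing for the replacements to join, whereas replacing one instance at a time would have kept two of three members alive throughout.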

#cloud-config
coreos:
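  fleet:
    # Tag autoscaled machines so that units can be scheduled onto (or away from) them.
    metadata: "role=autoscale"
    # The hosts are blank because the 10.x addresses could not be posted;
    # put the address of each fixed etcd machine between "//" and ":4001".
    etcd_servers: "http://:4001,http://:4001,http://:4001,http://:4001,http://:4001,http://:4001"
  units:
    # etcd is deliberately not started here; only fleet runs on these machines.
    - name: fleet.service
      command: start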
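
Since the addresses have to be filled in by hand, one way to avoid typos is to generate the `etcd_servers` value from a list of hosts. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the 10.0.0.x addresses are placeholders rather than values from the original answer, and port 4001 is the client port used throughout this thread.

```python
def etcd_servers(hosts, port=4001, scheme="http"):
    """Build the comma-separated etcd_servers string for fleet."""
    if not hosts:
        raise ValueError("at least one etcd host is required")
    return ",".join("{}://{}:{}".format(scheme, host, port) for host in hosts)


def render_cloud_config(hosts, role="autoscale"):
    """Render a cloud-config for an autoscaled worker that runs fleet but not etcd."""
    lines = [
        "#cloud-config",
        "coreos:",
        "  fleet:",
        '    metadata: "role={}"'.format(role),
        '    etcd_servers: "{}"'.format(etcd_servers(hosts)),
        "  units:",
        "    - name: fleet.service",
        "      command: start",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder addresses; replace with the fixed etcd machines' private IPs.
    print(render_cloud_config(["10.0.0.11", "10.0.0.12", "10.0.0.13"]))
```

The rendered text can be pasted into the user data of the Auto Scaling launch configuration. Keeping the host list in one place also makes it easier to update if the fixed etcd group changes.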