ArangoDB on DCOS: coordinator dies when writing data


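I have set up a DCOS cluster and installed the arangodb mesos framework (I did not change the initial configuration). I can reach the web interface on port 8529 through the arangodb-proxy, and I can create databases, collections, and documents there.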

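Now I am trying to import some data with the Java driver (3.1.4). After 2-3 calls the coordinator goes down. Mesos restarts it, but as soon as I send data again it dies within a few requests (I also lose the connection to the web interface for a few seconds).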

My insert is basically just a create statement:

arangoDriver.graphCreateVertex(GRAPH_NAME, VERTEX_COLLECTION,
                            getId(), this, true);
The ArangoDB proxy also complains:

I0109 11:26:45.046947 113285 exec.cpp:161] Version: 1.0.1
I0109 11:26:45.051712 113291 exec.cpp:236] Executor registered on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2
I0109 11:26:45.052942 113293 docker.cpp:815] Running docker -H unix:///var/run/docker.sock run --cpu-shares 1024 --memory 134217728 -e MARATHON_APP_VERSION=2017-01-09T10:26:29.819Z -e HOST=172.16.100.99 -e MARATHON_APP_RESOURCE_CPUS=1.0 -e MARATHON_APP_RESOURCE_GPUS=0 -e MARATHON_APP_DOCKER_IMAGE=arangodb/arangodb-mesos-haproxy -e MESOS_TASK_ID=arangodb-proxy.16604c72-d656-11e6-80d4-70b3d5800001 -e PORT=8529 -e MARATHON_APP_RESOURCE_MEM=128.0 -e PORTS=8529 -e MARATHON_APP_RESOURCE_DISK=0.0 -e PORT_80=8529 -e MARATHON_APP_LABELS= -e MARATHON_APP_ID=/arangodb-proxy -e PORT0=8529 -e LIBPROCESS_IP=172.16.100.99 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_CONTAINER_NAME=mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.e0db1925-ff85-4454-bd7e-e0f46e502631 -v /var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0000/executors/arangodb-proxy.16604c72-d656-11e6-80d4-70b3d5800001/runs/e0db1925-ff85-4454-bd7e-e0f46e502631:/mnt/mesos/sandbox --net bridge -p 8529:80/tcp --entrypoint /bin/sh --name mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.e0db1925-ff85-4454-bd7e-e0f46e502631 arangodb/arangodb-mesos-haproxy -c nodejs /configurator.js arangodb3
{ [Error: connect ECONNREFUSED 172.16.100.98:1891]
  code: 'ECONNREFUSED',
  errno: 'ECONNREFUSED',
  syscall: 'connect',
  address: '172.16.100.98',
  port: 1891 }
{ [Error: connect ECONNREFUSED 172.16.100.99:10413]
  code: 'ECONNREFUSED',
  errno: 'ECONNREFUSED',
  syscall: 'connect',
  address: '172.16.100.99',
  port: 10413 }
I can see the failed tasks in the arangodb service's list of completed tasks, but the stderr log does not seem to say anything useful:


I0109 16:28:31.792980 126177 exec.cpp:161] Version: 1.0.1
I0109 16:28:31.797145 126182 exec.cpp:236] Executor registered on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2
I0109 16:28:31.798338 126183 docker.cpp:815] Running docker -H unix:///var/run/docker.sock run --cpu-shares 1024 --memory 4294967296 -e CLUSTER_ROLE=coordinator -e CLUSTER_ID=Coordinator002 -e ADDITIONAL_ARGS= -e AGENCY_ENDPOINTS=tcp://172.16.100.97:1025 tcp://172.16.100.99:1025 tcp://172.16.100.98:1025 -e HOST=172.16.100.99 -e PORT0=1027 -e LIBPROCESS_IP=172.16.100.99 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_CONTAINER_NAME=mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.61fb92b9-62e0-48b2-b2a3-3dc0b95f7818 -v /var/lib/mesos/slave/slaves/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2/frameworks/b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049/executors/0c81c1f2-0404-4943-ab81-0abc78763140/runs/61fb92b9-62e0-48b2-b2a3-3dc0b95f7818:/mnt/mesos/sandbox --net bridge -p 1027:8529/tcp --name mesos-b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2.61fb92b9-62e0-48b2-b2a3-3dc0b95f7818 arangodb/arangodb-mesos:3.1
The Mesos log shows that the task failed:

I0109 16:55:44.821689 13431 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2@172.16.100.99:2273
I0109 16:55:45.313108 13431 master.cpp:5466] Performing explicit task state reconciliation for 1 tasks of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0003 (marathon-user) at scheduler-f4e239f5-3249-4b48-9bae-24c1e3d3152c@172.16.100.98:42099
I0109 16:55:45.560523 13428 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141655 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2@172.16.100.99:2273
I0109 16:55:45.676347 13431 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.106:42540 with User-Agent='python-requests/2.10.0'
I0109 16:55:45.823482 13425 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at scheduler-07a8a68b-8942-46e5-af63-d404785f3f39@172.16.100.107:44838
I0109 16:55:45.823698 13425 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.105:34374 with User-Agent='python-requests/2.10.0'
I0109 16:55:45.824986 13425 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141656 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at scheduler-07a8a68b-8942-46e5-af63-d404785f3f39@172.16.100.107:44838
I0109 16:55:45.826448 13425 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.107:41364 with User-Agent='python-requests/2.10.0'
I0109 16:55:46.694202 13425 master.cpp:5140] Status update TASK_FAILED (UUID: 2abcbe87-e1d6-4968-965d-33429573dfd9) for task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 from agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2 at slave(1)@172.16.100.99:5051 (172.16.100.99)
I0109 16:55:46.694247 13425 master.cpp:5202] Forwarding status update TASK_FAILED (UUID: 2abcbe87-e1d6-4968-965d-33429573dfd9) for task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049
I0109 16:55:46.694344 13425 master.cpp:6844] Updating the state of task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (latest state: TASK_FAILED, status update state: TASK_FAILED)
I0109 16:55:46.695953 13425 master.cpp:4265] Processing ACKNOWLEDGE call 2abcbe87-e1d6-4968-965d-33429573dfd9 for task 32014e7f-7f5b-4fea-b757-cca0faa3deac of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2@172.16.100.99:2273 on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2
I0109 16:55:46.695989 13425 master.cpp:6910] Removing task 32014e7f-7f5b-4fea-b757-cca0faa3deac with resources mem(*):4096; cpus(*):1; disk(*):1024; ports(*):[1027-1027] of framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 on agent b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S2 at slave(1)@172.16.100.99:5051 (172.16.100.99)
I0109 16:55:46.824192 13430 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2@172.16.100.99:2273
I0109 16:55:46.824347 13430 master.cpp:5728] Sending 1 offers to framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at scheduler-07a8a68b-8942-46e5-af63-d404785f3f39@172.16.100.107:44838
I0109 16:55:46.825814 13425 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141658 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0001 (metronome) at scheduler-07a8a68b-8942-46e5-af63-d404785f3f39@172.16.100.107:44838
I0109 16:55:47.567651 13426 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O141657 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-a39fa06a-627c-48bc-bd59-89e55d2ebaa2@172.16.100.99:2273
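I ran the same import task against a single arangodb instance without any problems, so I assume the problem is not in the Java code. However, I cannot find any further logs that point to where the problem might be (though I am still fairly new to Mesos).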

My cluster:

4 slaves (126 GB RAM, 8 CPUs each), 3 masters (32 GB RAM, 4 CPUs)

Can anyone tell me what I am doing wrong, or where I can find more log information?

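Update: stdout only shows the startup log messages (note that the timestamps of the two entries in stderr (above) and stdout (below) are from when the task started, not from when it failed):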

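I noticed some output in the stderr of the arangodb3 task, but I am not sure whether it is just request logging or part of the problem. It repeats every few seconds: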

com.arangodb.ArangoException: org.apache.http.NoHttpResponseException: 172.16.100.99:8529 failed to respond
I0110 08:50:25.981262    23 HttpServer.cpp:456] handling http request 'GET /v1/endpoints.json'
I0110 08:50:26.000558    22 CaretakerCluster.cpp:470] And here the offer:
{"id":{"value":"b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O173464"},"framework_id":{"value":"b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049"},"slave_id":{"value":"b8bd75d7-b013-4a60-8644-94cf7d8e53bb-S0"},"hostname":"172.16.100.97","url":{"scheme":"http","address":{"hostname":"172.16.100.97","ip":"172.16.100.97","port":5051},"path":"/slave(1)","query":[]},"resources":[{"name":"ports","type":1,"ranges":{"range":[{"begin":1028,"end":2180},{"begin":2182,"end":3887},{"begin":3889,"end":5049},{"begin":5052,"end":8079},{"begin":8082,"end":8180},{"begin":8182,"end":8528},{"begin":8530,"end":32000}]},"role":"*"},{"name":"disk","type":0,"scalar":{"value":291730},"role":"*"},{"name":"cpus","type":0,"scalar":{"value":4.75},"role":"*"},{"name":"mem","type":0,"scalar":{"value":117367},"role":"*"}],"attributes":[],"executor_ids":[]}
Update 2: Also, the log shown under /mesos gives me this. Does it mean the cluster did not start up correctly?

I0110 09:55:30.857897 13427 http.cpp:381] HTTP GET for /master/state-summary from 172.16.100.107:60624 with User-Agent='python-requests/2.10.0'
I0110 09:55:31.111609 13426 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O173624 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-b97d05bc-d50c-49e6-8ef1-a9ff324fd2ec@172.16.100.98:9546
I0110 09:55:31.111747 13426 master.cpp:3954] Processing DECLINE call for offers: [ b8bd75d7-b013-4a60-8644-94cf7d8e53bb-O173623 ] for framework b8bd75d7-b013-4a60-8644-94cf7d8e53bb-0049 (arangodb3) at scheduler-b97d05bc-d50c-49e6-8ef1-a9ff324fd2ec@172.16.100.98:9546
I

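Can you check whether there are some stdout logs for this task? There should be something.

@mop I have added the stdout output, but I don't think it tells us much. It is just the startup log.

Is the docker container still accessible? I.e., can you get onto the mesos agent and find the exit code of the task? Also, could you try running docker inspect on the container? My initial hope was that arangodb did some logging. Maybe the docker state gives us some hints, such as the process being killed because of OOM or something similar.

I am voting to close this question because it is a GitHub issue rather than a Stack Overflow question.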
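The exit-code check suggested in the comments could be scripted roughly as follows. This is only a minimal sketch, assuming the coordinator's Docker container still exists on the Mesos agent and that the `docker` CLI is on the PATH. The class `ContainerExitCheck` and its helper methods are hypothetical names, and the container name is whatever was passed to `--name` in the task's stderr log. It reads the `OOMKilled` and `ExitCode` fields of the container state through `docker inspect --format`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of ArangoDB, Mesos, or the Java driver.
public class ContainerExitCheck {

    /** The two fields of the container's State block that we care about. */
    static final class ExitState {
        final boolean oomKilled;
        final int exitCode;

        ExitState(boolean oomKilled, int exitCode) {
            this.oomKilled = oomKilled;
            this.exitCode = exitCode;
        }
    }

    /** Builds the docker command that prints "<OOMKilled> <ExitCode>" on one line. */
    static List<String> inspectCommand(String containerName) {
        return Arrays.asList("docker", "inspect", "--format",
                "{{.State.OOMKilled}} {{.State.ExitCode}}", containerName);
    }

    /** Parses a line such as "true 137" as produced by inspectCommand. */
    static ExitState parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("unexpected docker inspect output: " + line);
        }
        return new ExitState(Boolean.parseBoolean(parts[0]), Integer.parseInt(parts[1]));
    }

    /** Runs docker inspect on the agent; fails if the container is already gone. */
    static ExitState inspect(String containerName) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(inspectCommand(containerName))
                .redirectErrorStream(true)
                .start();
        String firstLine;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            firstLine = reader.readLine();
        }
        int rc = process.waitFor();
        if (rc != 0 || firstLine == null) {
            throw new IOException("docker inspect failed: " + firstLine);
        }
        return parse(firstLine);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: ContainerExitCheck <container-name>");
            return;
        }
        ExitState state = inspect(args[0]);
        System.out.println("OOMKilled=" + state.oomKilled + " exitCode=" + state.exitCode);
    }
}
```

An exit code of 137 (128 + SIGKILL) together with `OOMKilled=true` would suggest that the container hit its memory limit; the coordinator in the log above was started with `--memory 4294967296`.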