Python: How to use Ray in Docker Swarm
I am trying to use Docker Swarm to set up a cluster with one Ray head and two Ray workers. I have three machines: one runs the Ray head and the other two each run a Ray worker. The cluster comes up fine, but whenever I exec into a container and run:
import ray
ray.init(redis_address='ray-head:6379')
I get:
WARNING worker.py:1274 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
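The container logs look fine.
I have also tried with IP addresses, both the machine's IP and the ray-head container's IP:
ray.init(redis_address='192.168.30.193:6379')
When I run: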
telnet 192.168.30.193 6379
I get a response.
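The telnet check above can also be done from Python. The following is a minimal sketch, not part of the original post; the helper name port_is_open is made up for illustration. It only shows that something accepts TCP connections on the port. It does not show that Ray's other processes have registered with Redis under the address the driver expects, which is what the warning is about.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper: a plain TCP probe, equivalent in spirit to
    `telnet host port`. It says nothing about what Ray has registered.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Refused, unreachable, unresolvable, or timed out.
        return False
    sock.close()
    return True
```

Inside a container this could be called as port_is_open('ray-head', 6379) or with the machine IP used above.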
The Dockerfile for the containers:
FROM python:2.7-slim
RUN apt-get -y update
RUN apt-get install -y --fix-missing \
libxml2 \
gcc \
vim \
iputils-ping \
telnet \
procps \
&& apt-get clean && rm -rf /tmp/* /var/tmp/*
RUN pip install ray
CMD ["echo", "Base Image Ready"]
docker-compose.yml
version: "3.5"
services:
  ray-head:
    image: simpled:0.1
    shm_size: '2gb'
    entrypoint: [ '/usr/local/bin/ray']
    command: ['start', '--head', '--redis-port', '6379', '--redis-shard-ports','6380,6381', '--object-manager-port','12345', '--node-manager-port','12346', '--node-ip-address', 'ray-head', '--block']
    ports:
      - target: 6379
        published: 6379
        protocol: tcp
        mode: host
      - target: 6380
        published: 6380
        protocol: tcp
        mode: host
      - target: 6381
        published: 6381
        protocol: tcp
        mode: host
      - target: 12345
        published: 12345
        protocol: tcp
        mode: host
      - target: 12346
        published: 12346
        protocol: tcp
        mode: host
    deploy:
      replicas: 1
      placement:
        constraints: [node.labels.Head == true ]
  ray-worker:
    image: simpled:0.1
    shm_size: '2gb'
    entrypoint: [ '/usr/local/bin/ray']
    command: ['start', '--node-ip-address', 'ray-worker', '--redis-address', 'ray-head:6379', '--object-manager-port', '12345', '--node-manager-port', '12346', '--block']
    ports:
      - target: 12345
        published: 12345
        protocol: tcp
        mode: host
      - target: 12346
        published: 12346
        protocol: tcp
        mode: host
    depends_on:
      - "ray-head"
    deploy:
      mode: global
      placement:
        constraints: [node.labels.Head != true]
Am I doing something wrong? Has anyone got this working in swarm mode?
Edit 2019-04-14
Head logs:
[root@ray-node-1 bd-migratie-core]# docker service logs qaudt0j3clfv
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se | 2019-04-14 17:49:34,187 INFO scripts.py:288 -- Using IP address 10.0.30.2 for this node.
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se | 2019-04-14 17:49:34,190 INFO node.py:423 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-14_17-49-34_1/logs.
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se | 2019-04-14 17:49:34,323 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:6379 to respond...
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se | 2019-04-14 17:49:34,529 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:6380 to respond...
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se | 2019-04-14 17:49:34,538 INFO services.py:760 -- Starting Redis shard with 0.74 GB max memory.
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se | 2019-04-14 17:49:34,704 INFO services.py:363 -- Waiting for redis server at 127.0.0.1:6381 to respond...
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se | 2019-04-14 17:49:34,714 INFO services.py:760 -- Starting Redis shard with 0.74 GB max memory.
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se | 2019-04-14 17:49:34,859 WARNING services.py:1261 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se | 2019-04-14 17:49:34,862 INFO services.py:1384 -- Starting the Plasma object store with 1.11 GB memory using /tmp.
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se | 2019-04-14 17:49:34,997 INFO scripts.py:319 --
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se | Started Ray on this node. You can add additional nodes to the cluster by calling
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se |
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se | ray start --redis-address 10.0.30.2:6379
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se |
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se | from the node you wish to add. You can connect a driver to the cluster from Python by running
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se |
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se | import ray
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se | ray.init(redis_address="10.0.30.2:6379")
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se |
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se | If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se |
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se | ray stop
ps aux inside the head container:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.2 1.9 289800 70860 ? Ss 17:49 0:01 /usr/local/bin/python /usr/local/bin/ray start --head --redis-port 6379 --redis-shard-ports 6380,6381 --object-manager-port 12345 --node-manager-port 12346 --node-ip-addres
root 9 0.9 1.4 182352 50920 ? Rl 17:49 0:05 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6379
root 14 0.8 1.3 182352 48828 ? Rl 17:49 0:04 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6380
root 18 0.5 1.4 188496 52320 ? Sl 17:49 0:03 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6381
root 22 3.1 1.9 283144 70132 ? S 17:49 0:17 /usr/local/bin/python -u /usr/local/lib/python2.7/site-packages/ray/monitor.py --redis-address=10.0.30.2:6379
root 23 0.7 0.0 15736 1852 ? S 17:49 0:04 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/raylet/raylet_monitor 10.0.30.2 6379
root 25 0.0 0.0 1098804 1528 ? S 17:49 0:00 /usr/local/lib/python2.7/site-packages/ray/core/src/plasma/plasma_store_server -s /tmp/ray/session_2019-04-14_17-49-34_1/sockets/plasma_store -m 1111605657 -d /tmp
root 26 0.5 0.0 32944 2524 ? Sl 17:49 0:03 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/raylet/raylet /tmp/ray/session_2019-04-14_17-49-34_1/sockets/raylet /tmp/ray/session_2019-04-14_17-49-34_1/sockets/p
root 27 1.1 0.9 246340 35192 ? S 17:49 0:06 /usr/local/bin/python -u /usr/local/lib/python2.7/site-packages/ray/log_monitor.py --redis-address=10.0.30.2:6379 --logs-dir=/tmp/ray/session_2019-04-14_17-49-34_1/logs
root 31 2.7 0.9 385800 35368 ? Sl 17:49 0:15 /usr/local/bin/python /usr/local/lib/python2.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.30.2 --object-store-name=/tmp/ray/session_2019-04-14_17-49
root 32 2.7 0.9 385800 35364 ? Sl 17:49 0:15 /usr/local/bin/python /usr/local/lib/python2.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.30.2 --object-store-name=/tmp/ray/session_2019-04-14_17-49
root 48 2.2 0.0 19944 2232 pts/0 Ss 17:59 0:00 bash
root 53 0.0 0.0 38376 1644 pts/0 R+ 17:59 0:00 ps aux
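The address the Ray processes were actually started with is visible in the --redis-address arguments of the listing above. As a rough sketch (the function name redis_addresses is hypothetical and not part of Ray), those values can be extracted from ps output so they are easy to compare with the address passed to ray.init:

```python
import re


def redis_addresses(ps_output):
    """Return the distinct --redis-address values found in `ps aux` text.

    Handles both forms that appear in the listings in this post:
    `--redis-address=HOST:PORT` and `--redis-address HOST:PORT`.
    """
    matches = re.findall(r"--redis-address[= ](\S+)", ps_output)
    return sorted(set(matches))
```

The text passed in would be the captured output of ps aux from inside a container. If the result contains an IP address that differs from the one the driver connects to, that is a hint that name resolution and the address Ray picked do not agree.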
Worker logs:
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se | 2019-04-14 17:49:35,716 INFO services.py:363 -- Waiting for redis server at 10.0.30.2:6379 to respond...
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se | 2019-04-14 17:49:35,733 INFO scripts.py:363 -- Using IP address 10.0.30.5 for this node.
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se | 2019-04-14 17:49:35,748 INFO node.py:423 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-14_17-49-35_1/logs.
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se | 2019-04-14 17:49:35,794 WARNING services.py:1261 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se | 2019-04-14 17:49:35,796 INFO services.py:1384 -- Starting the Plasma object store with 1.11 GB memory using /tmp.
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se | 2019-04-14 17:49:35,894 INFO scripts.py:371 --
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se | Started Ray on this node. If you wish to terminate the processes that have been started, run
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se |
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se | ray stop
ps aux inside the worker container:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.1 1.9 292524 70900 ? Ss 17:49 0:01 /usr/local/bin/python /usr/local/bin/ray start --node-ip-address ray-worker --redis-address ray-head:6379 --object-manager-port 12345 --node-manager-port 12346 --block
root 10 0.0 0.0 1098804 1532 ? S 17:49 0:00 /usr/local/lib/python2.7/site-packages/ray/core/src/plasma/plasma_store_server -s /tmp/ray/session_2019-04-14_17-49-35_1/sockets/plasma_store -m 1111605657 -d /tmp
root 11 0.5 0.0 32944 2520 ? Sl 17:49 0:04 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/raylet/raylet /tmp/ray/session_2019-04-14_17-49-35_1/sockets/raylet /tmp/ray/session_2019-04-14_17-49-35_1/sockets/p
root 12 0.8 0.9 246320 35192 ? S 17:49 0:06 /usr/local/bin/python -u /usr/local/lib/python2.7/site-packages/ray/log_monitor.py --redis-address=10.0.30.2:6379 --logs-dir=/tmp/ray/session_2019-04-14_17-49-35_1/logs
root 15 2.7 0.9 385800 35368 ? Sl 17:49 0:19 /usr/local/bin/python /usr/local/lib/python2.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.30.5 --object-store-name=/tmp/ray/session_2019-04-14_17-49
root 16 2.7 0.9 385800 35360 ? Sl 17:49 0:19 /usr/local/bin/python /usr/local/lib/python2.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.30.5 --object-store-name=/tmp/ray/session_2019-04-14_17-49
root 39 4.5 0.0 19944 2236 pts/0 Ss 18:01 0:00 bash
root 44 0.0 0.0 38376 1648 pts/0 R+ 18:01 0:00 ps aux
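Edit 2019-04-17
I now know why it does not work, but not how to fix it.
If I log in to the head container and check the IP address the Ray processes are running with: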
ray/monitor.py --redis-address=10.0.30.5:6379
This matches:
/# ping ray-head
PING ray-head (10.0.30.5) 56(84) bytes of data.
64 bytes from 10.0.30.5 (10.0.30.5): icmp_seq=1 ttl=64 time=0.105 ms
But it does not match:
/# hostname -i
10.0.30.6
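The same comparison can be made from Python. This is a minimal sketch under the assumption that the container's own address is already known (for example from hostname -i); the helper names are hypothetical. In swarm mode a service name normally resolves to a virtual IP rather than to the container itself, which would explain a mismatch like the one above.

```python
import socket


def resolve_ipv4(name):
    """Return the sorted, de-duplicated IPv4 addresses that `name` resolves to."""
    infos = socket.getaddrinfo(name, None, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def name_points_at(name, ip):
    """Return True if `ip` is among the addresses `name` resolves to."""
    return ip in resolve_ipv4(name)
```

Inside the head container, name_points_at('ray-head', '10.0.30.6') would be the check that corresponds to the ping and hostname -i output above.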
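If I change the Ray processes to start with --redis-address=10.0.30.6:6379, it works.
I found a way to fix it: the hostname of the ray-head container is not "ray-head" but "tasks.ray-head". To make it work, I needed to change the hostnames in the docker-compose file as follows.
For ray-head: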
command: ['start', '--head', '--redis-port', '6379', '--redis-shard-ports','6380,6381', '--object-manager-port','12345', '--node-manager-port','12346', '--node-ip-address', 'tasks.ray-head', '--block']
For ray-worker:
command: ['start', '--redis-address', 'tasks.ray-head:6379', '--object-manager-port', '12345', '--node-manager-port', '12346', '--block']
Now I can run this from any host:
ray.init('tasks.ray-head:6379')
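For a driver script, the address can be built in one place so the service name is not repeated. This is only a sketch: swarm_task_address and connect_driver are hypothetical helpers, and the redis_address keyword is the one printed in the head node's log output earlier in this post. As far as I know, later Ray releases renamed that argument, so this applies to the Ray version used here.

```python
def swarm_task_address(service, port=6379):
    """Build 'tasks.<service>:<port>', the per-task DNS name used in the fix above."""
    return "tasks.{0}:{1}".format(service, port)


def connect_driver(service="ray-head", port=6379):
    """Connect a Ray driver to the head node through its task DNS name."""
    # Imported lazily so this module can be loaded where Ray is not installed.
    import ray
    # Assumption: ray.init accepts redis_address, as in the log output above.
    return ray.init(redis_address=swarm_task_address(service, port))
```

Calling connect_driver() from a container attached to the same overlay network should be equivalent to the ray.init call above.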
I hope this helps others in the same situation.
Comments:
Well, the error mentions running ray start. Has that been done?
Can you confirm that Ray processes are running on each node? For example, if you connect to a node and run something like ps aux | grep ray, do you see a set of active Ray processes? Can you connect to Redis directly from each node (head and workers)? For example, in Python try import redis; r = redis.StrictRedis(host='ray-head', port=6379); r.set('key', 'value'). Does that work? One thing that could be going wrong is that Ray is confusing the pod IP address with the physical machine's IP address.
@samthegolden Yes, "ray start" is run by the "docker stack deploy" command. I have edited the original post with the ps and log output.
@RobertNishihara Thanks for your answer. I have attached the output of docker service logs and of ps aux inside the containers. I also tried your redis test on a worker: >>> import redis >>> r = redis.StrictRedis(host='ray-head', port=6379) >>> r.set('key', 'value') True >>> r.get('key') 'value'. It looks like calling Redis directly works.
Two questions (for debugging): 1) Does ray.init(redis_address='10.0.30.2:6379') work (on a worker node or on the head node)? 2) Does ray.init(redis_address='localhost:6379') work on the head node? Judging by the commands used to start the processes (as shown by ps), Ray wants to use the address 10.0.30.2. This may be because Ray is not using the right IP address inside the container, or something similar.