Concurrency: Killing the wrong process on another node?


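I wrote a simple program ("controller") that runs some computation on a separate node ("worker"). The reason is that if the worker node runs out of memory, the controller keeps working: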

-module(controller).
-compile(export_all).

p(Msg,Args) -> io:format("~p " ++ Msg, [time() | Args]).

progress_monitor(P,N) ->
    timer:sleep(5*60*1000),
    p("killing the worker which was using strategy #~p~n", [N]),
    exit(P, took_to_long).

start() ->
    start(1).
start(Strat) ->
    P = spawn('worker@localhost', worker, start, [Strat,self(),60000000000]),
    p("starting worker using strategy #~p~n", [Strat]),
    spawn(controller,progress_monitor,[P,Strat]),
    monitor(process, P),
    receive
        {'DOWN', _, _, P, Info} ->
            p("worker using strategy #~p died. reason: ~p~n", [Strat, Info]);
        X ->
            p("got result: ~p~n", [X])
    end,
    case Strat of
        4 -> p("out of strategies. giving up~n", []);
        _ -> timer:sleep(5000), % wait for node to come back
             start(Strat + 1)
    end.
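
To test it, I deliberately wrote three factorial implementations that eat a lot of memory and crash, plus a fourth that uses tail recursion to avoid taking too much space: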

-module(worker).
-compile(export_all).

start(1,P,N) -> P ! factorial1(N);
start(2,P,N) -> P ! factorial2(N);
start(3,P,N) -> P ! factorial3(N);
start(4,P,N) -> P ! factorial4(N,1).

factorial1(0) -> 1;
factorial1(N) -> N*factorial1(N-1).

factorial2(N) ->
    case N of
        0 -> 1;
        _ -> N*factorial2(N-1)
    end.

factorial3(N) -> lists:foldl(fun(X,Y) -> X*Y end, 1, lists:seq(1,N)).

factorial4(0, A) -> A;
factorial4(N, A) -> factorial4(N-1, A*N).
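
Note that even the tail-recursive version is called with 60000000000, which will probably take days on my machine even with factorial4. Here is the output of running the controller: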

$ erl -sname 'controller@localhost'
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(controller@localhost)1> c(worker).
{ok,worker}
(controller@localhost)2> c(controller).
{ok,controller}
(controller@localhost)3> controller:start().
{23,24,28} starting worker using strategy #1
{23,25,13} worker using strategy #1 died. reason: noconnection
{23,25,18} starting worker using strategy #2
{23,26,2} worker using strategy #2 died. reason: noconnection
{23,26,7} starting worker using strategy #3
{23,26,40} worker using strategy #3 died. reason: noconnection
{23,26,45} starting worker using strategy #4
{23,29,28} killing the worker which was using strategy #1
{23,29,29} worker using strategy #4 died. reason: took_to_long
{23,29,29} out of strategies. giving up
ok
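
This almost works, but worker #4 was killed too early (it should have been killed close to 23:31:45, not 23:29:29). Looking more closely, there was only one attempt to kill a worker, aimed at worker #1, and no others. So worker #4 should not have died, yet it did. Why? We can even see that the reason was took_to_long, and that progress_monitor #1 started at 23:24:28, five minutes before 23:29:29. So it looks like progress_monitor #1 killed worker #4 instead of worker #1. Why did it kill the wrong process?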

Here is the output of the worker node while the controller was running:

$ while true; do erl -sname 'worker@localhost'; done
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "old_heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 

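Several things combine here, and together they end in a creation number wrap-around:

Since you never cancel the progress_monitor process, it always sends its exit signal after 5 minutes, whether or not its worker is still around.

The computation is long and/or the VM is slow, so process 4 is still running 5 minutes after the progress monitor for process 1 was started.

The 4 worker nodes are started one after another under the same node name, worker@localhost, and the first and the fourth node get the same creation number.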
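
The first point can be addressed by tying the timeout to the `receive` itself instead of running a separate, never-cancelled monitor process. The following is only a minimal sketch under stated assumptions: the module name `timeout_sketch`, the function `await/2`, the tagged `{result, Pid, Value}` message and the reason `took_too_long` are all hypothetical and are not part of the original program, whose worker sends a bare result.

```erlang
-module(timeout_sketch).
-export([await/2, demo/0]).

%% Wait at most Timeout milliseconds for Pid to report a result or to die.
%% The timeout belongs to this receive, so it cannot fire later against
%% some other worker.
await(Pid, Timeout) ->
    Ref = monitor(process, Pid),
    receive
        {result, Pid, Value} ->
            demonitor(Ref, [flush]),
            {ok, Value};
        {'DOWN', Ref, process, Pid, Reason} ->
            {died, Reason}
    after Timeout ->
        %% Only the worker we are currently waiting for is killed.
        exit(Pid, took_too_long),
        demonitor(Ref, [flush]),
        timeout
    end.

%% Local demonstration with hypothetical workers (no second node needed).
demo() ->
    Parent = self(),
    Fast = spawn(fun() -> Parent ! {result, self(), 42} end),
    {ok, 42} = await(Fast, 1000),
    Slow = spawn(fun() -> timer:sleep(infinity) end),
    timeout = await(Slow, 100),
    ok.
```

With this shape, no process outlives the worker it was watching, so a stale pid is never the target of a late exit signal.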

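Creation numbers (the creation field in references and pids) are a mechanism that prevents pids and references created by a crashed node from being interpreted by a new node with the same name. This is exactly what you would expect when you try to kill worker 1 long after its node is gone: you do not intend to kill a process in a restarted node.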

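When a node sends a pid or a reference, it includes its creation number. When it receives a pid or a reference back from another node, it checks that the creation number in the pid matches its own. Creation numbers are assigned by epmd.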

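Unfortunately, in this case the creation number matches when the 4th node receives the exit signal, because the sequence has wrapped around. Since each node spawned the process after doing exactly the same things as before (initializing Erlang), the pid of the worker process on node 4 matches the pid of the worker process on node 1.

As a result, the controller ends up killing worker 4 while believing it is worker 1.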
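
The wrap-around can be illustrated with a tiny model. This is an illustrative sketch only, assuming creation numbers cycle through 1, 2, 3 for successive nodes with the same name, as discussed in the comments; it is not the epmd source, and the module and function names are made up for the example.

```erlang
-module(creation_model).
-export([next/1, creations/2, demo/0]).

%% Assumed model: the creation number cycles 1 -> 2 -> 3 -> 1.
next(C) -> (C rem 3) + 1.

%% Creation numbers of Count successive nodes with the same name,
%% starting from creation number C.
creations(_C, 0) -> [];
creations(C, Count) when Count > 0 ->
    [C | creations(next(C), Count - 1)].

demo() ->
    %% Four restarts of worker@localhost: the 4th repeats the 1st.
    [First, _, _, Fourth] = creations(1, 4),
    true = (First =:= Fourth),
    ok.
```

Under this model, a pid kept from the first node carries the same node name and the same creation number as the fourth node, so the fourth node accepts it as one of its own.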


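To avoid this, you need something more robust than the creation number if there can be 4 workers within the lifetime of a pid or a reference held by the controller.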
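
One possible approach, sketched here as an assumption rather than as the answer's recommendation, is to enforce the timeout on the worker node itself. The guard below only ever signals a pid that it spawned locally, so the controller never sends an exit signal to a pid that may have outlived its node. The module name `worker_guard`, the `{M, F, A}` calling convention and the `{result, _}` / `{failed, _}` messages are hypothetical.

```erlang
-module(worker_guard).
-export([start/3, demo/0]).

%% Meant to run on the worker node. Runs M:F(A...) in a local child,
%% enforces Timeout (ms) locally, and reports the outcome to Parent.
start({M, F, A}, Parent, Timeout) ->
    Self = self(),
    {Pid, Ref} = spawn_monitor(fun() ->
                                   Self ! {done, self(), apply(M, F, A)}
                               end),
    receive
        {done, Pid, Result} ->
            demonitor(Ref, [flush]),
            Parent ! {result, Result};
        {'DOWN', Ref, process, Pid, Reason} ->
            Parent ! {failed, Reason}
    after Timeout ->
        %% Pid is local and was created by this guard, so it cannot be stale.
        exit(Pid, kill),
        demonitor(Ref, [flush]),
        Parent ! {failed, took_too_long}
    end.

%% Local demonstration; on a real setup the guard would be spawned
%% on the worker node instead.
demo() ->
    start({erlang, '+', [1, 2]}, self(), 1000),
    receive {result, 3} -> ok end,
    start({timer, sleep, [infinity]}, self(), 100),
    receive {failed, took_too_long} -> ok end.
```

On the controller side this could be started with something like `spawn('worker@localhost', worker_guard, start, [{worker, factorial1, [N]}, self(), Timeout])`, assuming the `worker_guard` module is loaded on the worker node; the controller would still monitor the spawned guard to detect a node crash.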

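Comments:

1. Googling "creation number" turns up nothing; what is it? 2. Sending the exit signal every time is fine, because sending it to a dead pid does no harm. 3. I deliberately made the computation slow in order to test the timeout. 4. What do you mean by the workers having the same name? In my program their only "names" are the strategy numbers, i.e. 1, 2, 3 and 4. I do not understand your paragraphs about "creation numbers" and "names" at all.

I updated the answer to clarify. The name is the name of the node. I added a link to the documentation, which mentions creation. However, I could not find documentation about the 1, 2, 3 sequence of creation numbers, so I linked to the source code.

See section 4.1 ("Pid reuse") in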