Haskell application freezes in futex() call

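I wrote a microservice in Haskell. It uses Scotty and is built against LTS 13.20. OS: Linux 3.10.0-957.el7.x86_64, running under Kubernetes. The service ran for about half a year without any problems, but I have now hit several mysterious freezes. I don't think this is a regression, because the code has not been modified, but the load on the service has increased.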

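Symptoms:

  • CPU consumption: normal
  • Memory consumption: normal
  • strace reports that the process is stuck in a futex(...) call:
    futex(0x349c9c4, FUTEX_WAIT_PRIVATE, 83, NULL
  • Many threads look like this after attaching to the PID with gdb: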

[New LWP 4487]
[New LWP 4486]
[New LWP 4485]
[New LWP 4484]
[New LWP 4483]
[New LWP 4482]
[New LWP 4481]
[New LWP 4480]
[New LWP 4479]
[New LWP 4478]
[New LWP 4477]
[New LWP 4476]
[New LWP 4475]
[New LWP 4474]
[New LWP 4473]
[New LWP 4472]
[New LWP 4471]
[New LWP 4470]
[New LWP 4469]
[New LWP 4468]
[New LWP 4467]
[New LWP 4466]
[New LWP 4465]
....
(gdb) bt full
#0  0x00007fc03ec23965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00000000015185e9 in waitCondition (pCond=pCond@entry=0x2f029c0, pMut=pMut@entry=0x2f029f0) at rts/posix/OSThreads.c:117
No locals.
#2  0x000000000150713b in waitForWorkerCapability (task=<optimized out>) at rts/Capability.c:651
        cap = <optimized out>
#3  yieldCapability (pCap=pCap@entry=0x7fffc6ae0a78, task=task@entry=0x2f029b0, gcAllowed=gcAllowed@entry=true) at rts/Capability.c:888
        cap = <optimized out>
#4  0x0000000001504d85 in scheduleYield (task=0x2f029b0, pcap=0x7fffc6ae0a70) at rts/Schedule.c:672
        cap = 0x2e7cff0
        didGcLast = <optimized out>
#5  schedule (initialCapability=initialCapability@entry=0x2edf1b0, task=task@entry=0x2f029b0) at rts/Schedule.c:292
        t = <optimized out>
        cap = 0x2e7cff0
        ret = <optimized out>
        prev_what_next = <optimized out>
        ready_to_gc = <optimized out>
#6  0x0000000001505bee in scheduleWaitThread (tso=0x4200823388, ret=ret@entry=0x0, pcap=pcap@entry=0x7fffc6ae0b08) at rts/Schedule.c:2533
        task = 0x2f029b0
        cap = 0x2edf1b0
#7  0x0000000001500584 in rts_evalLazyIO (cap=cap@entry=0x7fffc6ae0b08, p=p@entry=0x15a00d0, ret=ret@entry=0x0) at rts/RtsAPI.c:530
        tso = <optimized out>
#8  0x00000000015102be in hs_main (argc=2, argv=0x7fffc6ae0cf8, main_closure=0x15a00d0, rts_config=...) at rts/RtsMain.c:72
        cap = 0x2edf1b0
        exit_status = <optimized out>
        status = <optimized out>
#9  0x00000000004311b0 in main ()
No symbol table info available.

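So my question is: how can I fix or investigate this, and what is worth trying or checking? I am considering switching to a newer LTS, but I am not sure that the LTS is the cause of the problem (I found similar issues reported for older LTS/GHC versions on web forums)... To me this looks like a bug in the RTS.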

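Normally, when there is no work to do, workers wait on a condition in waitForWorkerCapability. For example, if all Haskell threads are blocked on IO, there is nothing to run. The condition is signaled from a few places in the same file (rts/Capability.c in the backtrace).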
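
The idle state described above can be observed from the Haskell side as well. The following is a minimal sketch, not taken from the service in question: the name blockedWorkerStatus and the 100 ms delay are assumptions made for illustration. It forks a thread that blocks on an empty MVar and asks the RTS for its status with threadStatus from GHC.Conc.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import GHC.Conc (BlockReason (..), ThreadStatus (..), threadStatus)

-- | Hypothetical helper: fork a worker that blocks on an empty MVar,
-- wait briefly, and return what the RTS reports about that worker.
blockedWorkerStatus :: IO ThreadStatus
blockedWorkerStatus = do
  gate <- newEmptyMVar :: IO (MVar ())
  tid <- forkIO (takeMVar gate)
  -- Assumed delay: give the worker time to reach takeMVar.
  threadDelay 100000
  st <- threadStatus tid
  -- Release the worker. Using the MVar here also keeps it reachable from
  -- this thread, so the RTS has no reason to raise
  -- BlockedIndefinitelyOnMVar in the worker before the status is read.
  putMVar gate ()
  pure st

main :: IO ()
main = blockedWorkerStatus >>= print
```

Run as written, this should print ThreadBlocked BlockedOnMVar. A process in which every Haskell thread is in a blocked state like this gives the scheduler nothing to run, which is consistent with OS threads parked in waitForWorkerCapability.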

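If you are sure there should be work to do, then you may have found a bug in the RTS. Try to reproduce the problem with a minimal example. (I know, that is often not possible at all.)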

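However, the bug may be in your own code or in one of your dependencies. You can try inspecting the RTS capabilities in gdb. The running_task, run_queue_hd, suspended_ccalls, spare_workers, and returning_tasks_hd fields are likely to be of interest. If the process really has nothing to do, I would expect every capability to have no running task and an empty run queue, and every worker to be on the spare_workers list of some capability.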
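
If the capabilities do turn out to be idle, suspicion shifts to application code or dependencies. One possible way to narrow down where a request gets stuck is to put a deadline around suspect blocking actions and log when it expires. This is only a sketch: withDeadline, the labels, and the durations are hypothetical and not part of Scotty or of the original service. Only timeout from System.Timeout is a real library function.

```haskell
import Control.Concurrent (threadDelay)
import System.IO (hPutStrLn, stderr)
import System.Timeout (timeout)

-- | Hypothetical helper: run an action with a deadline in microseconds.
-- Logs to stderr and returns Nothing if the deadline is exceeded.
withDeadline :: String -> Int -> IO a -> IO (Maybe a)
withDeadline label micros action = do
  result <- timeout micros action
  case result of
    Nothing -> do
      hPutStrLn stderr ("deadline exceeded: " ++ label)
      pure Nothing
    Just _ -> pure result

main :: IO ()
main = do
  -- Finishes immediately, well inside its one-second deadline.
  fast <- withDeadline "fast" 1000000 (pure (42 :: Int))
  -- Sleeps for two seconds but is only allowed 50 ms.
  slow <- withDeadline "slow" 50000 (threadDelay 2000000 >> pure (0 :: Int))
  print (fast, slow)
```

This prints (Just 42,Nothing) and writes one "deadline exceeded: slow" line to stderr. Note that timeout works by throwing an asynchronous exception, so it cannot interrupt a thread that is stuck inside a non-interruptible foreign call. A deadline that never fires is therefore itself a hint about where the thread is blocked.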

(This is just my understanding of the problem. I am not an expert on the GHC RTS and may be talking nonsense.)
