Apache Storm topology does not start with parallelism hint of 1200


I have a Storm topology with three bolts (A, B, C), where the middle bolt takes around 450 ms mean time and the other two bolts take less than 1 ms.

I am able to run the topology with the following parallelism hint values:

A: 4 
B: 700
C: 10

Version Info: 
   "org.apache.storm" % "storm-core" % "1.2.1" 
   "org.apache.storm" % "storm-kafka-client" % "1.2.1" 

But when I increase the parallelism hint of B to 1200, the topology does not start. Meanwhile the worker process keeps getting restarted, and I don't see any errors in the topology logs or the Storm logs. In the topology logs, I see the executor for B being loaded multiple times, like below:

2018-05-18 18:56:37.462 o.a.s.d.executor main [INFO] Loading executor B:[111 111]
2018-05-18 18:56:37.463 o.a.s.d.executor main [INFO] Loaded executor tasks B:[111 111]
2018-05-18 18:56:37.465 o.a.s.d.executor main [INFO] Finished loading executor B:[111 111]
2018-05-18 18:56:37.528 o.a.s.d.executor main [INFO] Loading executor B:[355 355]
2018-05-18 18:56:37.529 o.a.s.d.executor main [INFO] Loaded executor tasks B:[355 355]
2018-05-18 18:56:37.530 o.a.s.d.executor main [INFO] Finished loading executor B:[355 355]
2018-05-18 18:56:37.666 o.a.s.d.executor main [INFO] Loading executor B:[993 993]
2018-05-18 18:56:37.667 o.a.s.d.executor main [INFO] Loaded executor tasks B:[993 993]
2018-05-18 18:56:37.669 o.a.s.d.executor main [INFO] Finished loading executor B:[993 993]
2018-05-18 18:56:37.713 o.a.s.d.executor main [INFO] Loading executor B:[765 765]
2018-05-18 18:56:37.714 o.a.s.d.executor main [INFO] Loaded executor tasks B:[765 765]
This keeps happening and the topology never comes up; it starts perfectly when the parallelism hint for bolt B is 700, with no other change.
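For scale: with Storm's default of one task per executor, the parallelism hints above translate directly into executor threads the workers must spawn, so raising B from 700 to 1200 adds 500 threads. A rough illustration using the numbers from this question:

```java
public class ExecutorCount {
    public static void main(String[] args) {
        // Parallelism hints from the question; by default Storm creates
        // one executor thread (running one task) per unit of parallelism hint.
        int a = 4, c = 10;

        int workingTotal = a + 700 + c;   // B at 700: topology starts fine
        int failingTotal = a + 1200 + c;  // B at 1200: topology never comes up

        System.out.println(workingTotal); // 714 executor threads
        System.out.println(failingTotal); // 1214 executor threads
    }
}
```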

One interesting log I see here, though I am not yet sure what it means:

Worker Process 766258fe-a604-4385-8eeb-e85cad38b674 exited with code: 143

Any suggestions?

Edit:

Storm logs from when the worker restarts:

2018-05-18 18:51:46.755 o.a.s.d.s.Container SLOT_6700 [INFO] Killing eaf4d8ce-e758-4912-a15d-6dab8cda96d0:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.204 o.a.s.d.s.BasicContainer Thread-7 [INFO] Worker Process 766258fe-a604-4385-8eeb-e85cad38b674 exited with code: 143
2018-05-18 18:51:47.766 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE RUNNING msInState: 109081 topo:myTopology-1-1526649581 worker:766258fe-a604-4385-8eeb-e85cad38b674 -> KILL msInState: 0 topo:myTopology-1-1526649581 worker:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.766 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.774 o.a.s.d.s.Slot SLOT_6700 [WARN] SLOT 6700 all processes are dead...
2018-05-18 18:51:47.775 o.a.s.d.s.Container SLOT_6700 [INFO] Cleaning up eaf4d8ce-e758-4912-a15d-6dab8cda96d0:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.775 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.775 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/pids/27798
2018-05-18 18:51:47.775 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/heartbeats
2018-05-18 18:51:47.780 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/pids
2018-05-18 18:51:47.780 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/tmp
2018-05-18 18:51:47.781 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.782 o.a.s.d.s.Container SLOT_6700 [INFO] REMOVE worker-user 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.782 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers-users/766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.783 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Removed Worker ID 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.783 o.a.s.l.AsyncLocalizer SLOT_6700 [INFO] Released blob reference myTopology-1-1526649581 6700 Cleaning up BLOB references...
2018-05-18 18:51:47.784 o.a.s.l.AsyncLocalizer SLOT_6700 [INFO] Released blob reference myTopology-1-1526649581 6700 Cleaning up basic files...
2018-05-18 18:51:47.785 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/supervisor/stormdist/myTopology-1-1526649581
2018-05-18 18:51:47.808 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE KILL msInState: 42 topo:myTopology-1-1526649581 worker:null -> EMPTY msInState: 0
Edit:

Logs from strace -fp PID -e trace=read,write,network,signal,ipc:

I can't fully make sense of these yet; some possibly relevant lines:

[pid 3362] open("/usr/lib/locale/UTF-8/LC_CTYPE", O_RDONLY) = -1 ENOENT (No such file or directory)

[pid 3362] kill(1487, SIGTERM) = 0

[pid 3362] close(1)


A quick google suggests that 143 is the exit code when the JVM receives a SIGTERM. You may be running out of memory, or the OS may be killing the process for some other reason. Keep in mind that setting the parallelism hint to 1200 means you get 1200 tasks (copies) of bolt B, where previously you only had 700.
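As a sanity check on that reading of the exit code (assuming the common POSIX shell convention that a process killed by signal N reports exit status 128 + N):

```java
public class ExitCode143 {
    public static void main(String[] args) {
        // Common convention: exit status of a signal-killed process is 128 + signal number
        final int SIGTERM = 15;
        int exitStatus = 128 + SIGTERM;
        System.out.println(exitStatus); // 143, matching the worker log above
    }
}
```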

I was able to get this running by tweaking the configs below. It seems the topology was timing out due to nimbus.task.launch.secs, which was set to 120: the worker gets restarted if it has not started within 120 seconds.

Updated values for some of these configs:

topology.worker.childopts: -Xms1g -Xmx16g
topology.worker.logwriter.childopts: -Xmx1024m
topology.worker.max.heap.size.mb: 3072.0
worker.childopts: -Xms1g -Xmx16g -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=1%ID% -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:+UseG1GC -XX:+AggressiveOpts -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/home/saurabh.mimani/apache-storm-1.2.1/logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -Dorg.newsclub.net.unix.library.path=/usr/share/specter/uds-lib/
worker.gc.childopts:
worker.heap.memory.mb: 8192
supervisor.childopts: -Xms1g -Xmx16g
Regarding nimbus.task.launch.secs:

A special timeout used when a task is initially launched. During launch, this is the timeout used until the first heartbeat, overriding nimbus.task.timeout.secs. A separate timeout exists for launch because there can be quite a bit of overhead to starting new JVMs and configuring them.
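A sketch of how I read that description (the method and names here are illustrative, not Storm's actual implementation): until the first heartbeat arrives, the launch timeout applies; afterwards the regular task timeout takes over.

```java
public class LaunchTimeout {
    // Illustrative only: models which timeout is in effect per the description above.
    static int effectiveTimeoutSecs(boolean firstHeartbeatSeen,
                                    int taskTimeoutSecs, int taskLaunchSecs) {
        return firstHeartbeatSeen ? taskTimeoutSecs : taskLaunchSecs;
    }

    public static void main(String[] args) {
        int taskTimeoutSecs = 30;    // nimbus.task.timeout.secs (Storm's default)
        int taskLaunchSecs = 1200;   // nimbus.task.launch.secs after this fix

        // Before the first heartbeat, the (longer) launch timeout is in effect.
        System.out.println(effectiveTimeoutSecs(false, taskTimeoutSecs, taskLaunchSecs)); // 1200
        // Once the worker has heartbeated, the regular task timeout takes over.
        System.out.println(effectiveTimeoutSecs(true, taskTimeoutSecs, taskLaunchSecs));  // 30
    }
}
```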


I have plenty of memory on that machine, I can see it showing as available, and I am not getting any out-of-memory errors. I have given the worker a maximum of 16 GB of memory, which should be enough to spawn 1200 threads. I have added some of the relevant configs to the question. I may try strace in this case to figure out where the SIGTERM is coming from. I have also updated the question with the strace logs, and am still trying to make sense of them.
Updated timeout-related configs:

drpc.request.timeout.secs: 1600
supervisor.worker.start.timeout.secs: 1200
nimbus.supervisor.timeout.secs: 1200
nimbus.task.launch.secs: 1200