
Memory management: ElasticSearch can't recover on its own after high heap utilization


I'd like to discuss a problem we are facing with our ES cluster.

Today our cluster has four machines, each with 16GB of RAM (8GB for the heap and 8GB for the OS). In total we have 73,975,578 documents, 998 shards, and 127 indices. To index our documents we use the bulk API; currently each bulk request contains at most 300 items in total. We put the documents into a queue so that the requests are issued in the background. The log below shows the number of documents that have been sent to ES for indexing:

[2014-12-03 11:19:32 -0200] execute Event Create with 77 items in app 20
[2014-12-03 11:19:32 -0200] execute User Create with 1 items in app 67
[2014-12-03 11:19:40 -0200] execute User Create with 1 items in app 61
[2014-12-03 11:19:49 -0200] execute User Create with 1 items in app 62
[2014-12-03 11:19:50 -0200] execute User Create with 1 items in app 27
[2014-12-03 11:19:50 -0200] execute User Create with 2 items in app 20
[2014-12-03 11:19:54 -0200] execute User Create with 5 items in app 61
[2014-12-03 11:19:58 -0200] execute User Update with 61 items in app 20
[2014-12-03 11:20:02 -0200] execute User Create with 2 items in app 61
[2014-12-03 11:20:02 -0200] execute User Create with 1 items in app 27
[2014-12-03 11:20:10 -0200] execute User Create with 2 items in app 20
[2014-12-03 11:20:19 -0200] execute User Create with 5 items in app 61
[2014-12-03 11:20:20 -0200] execute User Create with 3 items in app 20
[2014-12-03 11:20:20 -0200] execute User Create with 1 items in app 24
[2014-12-03 11:20:25 -0200] execute User Create with 1 items in app 61
[2014-12-03 11:20:28 -0200] execute User Create with 1 items in app 20
[2014-12-03 11:20:37 -0200] execute Event Create with 91 items in app 20
[2014-12-03 11:20:42 -0200] execute User Create with 1 items in app 76
[2014-12-03 11:20:42 -0200] execute Event Create with 300 items in app 61
[2014-12-03 11:20:50 -0200] execute User Create with 4 items in app 61
[2014-12-03 11:20:51 -0200] execute User Create with 1 items in app 62
[2014-12-03 11:20:51 -0200] execute User Create with 2 items in app 20
[2014-12-03 11:20:55 -0200] execute User Create with 3 items in app 61
Sometimes a bulk request contains only a single item. Another interesting point is that we send data very frequently; in other words, the pressure we put on the cluster is quite high.
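To make the indexing flow above concrete, here is a minimal sketch (not our production code) of a background worker that drains a queue and issues bulk requests of at most 300 items, using the Python elasticsearch client's helpers.bulk. The index name, type name, and byte cap are hypothetical; the byte cap only illustrates the point raised in the comments below that bulk size should be bounded by payload size, not just item count.

import json
import queue
import threading

from elasticsearch import Elasticsearch, helpers

MAX_ITEMS = 300               # cap on items per bulk request, as in our setup
MAX_BYTES = 5 * 1024 * 1024   # hypothetical cap on payload size per request

es = Elasticsearch(["http://localhost:9200"])
doc_queue = queue.Queue()

def indexer_loop():
    # Drain the queue in the background and send one bulk request per batch.
    while True:
        batch, size = [], 0
        doc = doc_queue.get()                    # block until a document arrives
        while True:
            batch.append({"_index": "events", "_type": "event", "_source": doc})
            size += len(json.dumps(doc))
            if len(batch) >= MAX_ITEMS or size >= MAX_BYTES:
                break
            try:
                doc = doc_queue.get_nowait()     # keep filling while docs are queued
            except queue.Empty:
                break
        helpers.bulk(es, batch)                  # one bulk call per batch

threading.Thread(target=indexer_loop, daemon=True).start()

# Producers elsewhere in the application just enqueue documents:
doc_queue.put({"app": 20, "kind": "Event Create", "payload": "..."})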

The biggest problem is that once the ES heap approaches 75% utilization, GC does not bring it back down to normal levels.

These log entries show some of the GC activity:

[2014-12-02 21:28:04,766][WARN ][monitor.jvm              ] [es-node-2] [gc][old][43249][56] duration [48s], collections [2]/[48.2s], total [48s]/[17.9m], memory [8.2gb]->[8.3gb]/[8.3gb], all_pools {[young] [199.6mb]->[199.6mb]/[199.6mb]}{[survivor] [14.1mb]->[18.9mb]/[24.9mb]}{[old] [8gb]->[8gb]/[8gb]}
[2014-12-02 21:28:33,120][WARN ][monitor.jvm              ] [es-node-2] [gc][old][43250][57] duration [28.3s], collections [1]/[28.3s], total [28.3s]/[18.4m], memory [8.3gb]->[8.3gb]/[8.3gb], all_pools {[young] [199.6mb]->[199.6mb]/[199.6mb]}{[survivor] [18.9mb]->[17.5mb]/[24.9mb]}{[old] [8gb]->[8gb]/[8gb]}
[2014-12-02 21:29:21,222][WARN ][monitor.jvm              ] [es-node-2] [gc][old][43251][59] duration [47.9s], collections [2]/[48.1s], total [47.9s]/[19.2m], memory [8.3gb]->[8.3gb]/[8.3gb], all_pools {[young] [199.6mb]->[199.6mb]/[199.6mb]}{[survivor] [17.5mb]->[21.2mb]/[24.9mb]}{[old] [8gb]->[8gb]/[8gb]}
[2014-12-02 21:30:08,916][WARN ][monitor.jvm              ] [es-node-2] [gc][old][43252][61] duration [47.5s], collections [2]/[47.6s], total [47.5s]/[20m], memory [8.3gb]->[8.3gb]/[8.3gb], all_pools {[young] [199.6mb]->[199.6mb]/[199.6mb]}{[survivor] [21.2mb]->[20.8mb]/[24.9mb]}{[old] [8gb]->[8gb]/[8gb]}
[2014-12-02 21:30:56,208][WARN ][monitor.jvm              ] [es-node-2] [gc][old][43253][63] duration [47.1s], collections [2]/[47.2s], total [47.1s]/[20.7m], memory [8.3gb]->[8.3gb]/[8.3gb], all_pools {[young] [199.6mb]->[199.6mb]/[199.6mb]}{[survivor] [20.8mb]->[24.8mb]/[24.9mb]}{[old] [8gb]->[8gb]/[8gb]}
[2014-12-02 21:32:07,013][WARN ][transport                ] [es-node-2] Received response for a request that has timed out, sent [165744ms] ago, timed out [8ms] ago, action [discovery/zen/fd/ping], node [[es-node-1][sXwCdIhSRZKq7xZ6TAQiBg][localhost][inet[xxx.xxx.xxx.xxx/xxx.xxx.xxx.xxx:9300]]], id [3002106]
[2014-12-02 21:36:41,880][WARN ][monitor.jvm              ] [es-node-2] [gc][old][43254][78] duration [5.7m], collections [15]/[5.7m], total [5.7m]/[26.5m], memory [8.3gb]->[8.3gb]/[8.3gb], all_pools {[young] [199.6mb]->[199.6mb]/[199.6mb]}{[survivor] [24.8mb]->[24.4mb]/[24.9mb]}{[old] [8gb]->[8gb]/[8gb]}
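The entries above show the old generation pinned at 8gb while back-to-back collections run for tens of seconds or minutes. A rough way to spot this condition before a node drops off the cluster is to poll the nodes stats API; below is a minimal sketch using Python's requests library (the host and the 75% threshold are just illustrative):

import requests

STATS_URL = "http://localhost:9200/_nodes/stats/jvm"

resp = requests.get(STATS_URL).json()
for node_id, node in resp["nodes"].items():
    jvm = node["jvm"]
    heap_pct = jvm["mem"]["heap_used_percent"]
    old_gc = jvm["gc"]["collectors"]["old"]
    print("%s heap=%d%% old_gc_count=%d old_gc_time_ms=%d" % (
        node["name"], heap_pct,
        old_gc["collection_count"], old_gc["collection_time_in_millis"]))
    if heap_pct >= 75:
        # Sustained values above ~75% are the symptom described here:
        # the old generation fills up and GC never reclaims it.
        print("WARNING: %s is above the 75%% heap threshold" % node["name"])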
Another feature we use heavily is ES search. These lines show some of the log entries generated when a search completes:

[2014-12-03 11:43:22 -0200] buscou pagina 1 de 111235 (10 por pagina) do app 61
[2014-12-03 11:44:12 -0200] buscou pagina 1 de 30628 (10 por pagina) do app 5
[2014-12-03 11:44:13 -0200] buscou pagina 1 de 30628 (10 por pagina) do app 5
[2014-12-03 11:44:24 -0200] buscou pagina 1 de 63013 (10 por pagina) do app 20
[2014-12-03 11:44:24 -0200] buscou pagina 1 de 63013 (10 por pagina) do app 20
[2014-12-03 11:44:24 -0200] buscou pagina 1 de 63013 (10 por pagina) do app 20
These links show screenshots with some of the cluster information:

We have already tuned a few things; our machines are configured as follows:

threadpool.index.type: fixed
threadpool.index.size: 30
threadpool.index.queue_size: 1000
threadpool.bulk.type: fixed
threadpool.bulk.size: 30
threadpool.bulk.queue_size: 1000
threadpool.search.type: fixed
threadpool.search.size: 100
threadpool.search.queue_size: 200
threadpool.get.type: fixed
threadpool.get.size: 100
threadpool.get.queue_size: 200
index.merge.policy.max_merged_segment: 2g
index.merge.policy.segments_per_tier: 5
index.merge.policy.max_merge_at_once: 5
index.cache.field.type: soft
index.cache.field.expire: 1m
index.refresh_interval: 60s
bootstrap.mlockall: true
indices.memory.index_buffer_size: '15%'
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ['xxx.xxx.xxx.xxx', 'xxx.xxx.xxx.xxx', 'xxx.xxx.xxx.xxx']
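Given the soft field cache and the fairly large bulk/index queues configured above, two numbers worth watching when the heap climbs are field data memory and thread pool rejections. Here is a minimal sketch that pulls both from the nodes stats API (the host is illustrative):

import requests

resp = requests.get(
    "http://localhost:9200/_nodes/stats/indices,thread_pool").json()

for node in resp["nodes"].values():
    # Field data is a common heap consumer; rejections mean the queues
    # configured above are overflowing under load.
    fielddata = node["indices"]["fielddata"]["memory_size_in_bytes"]
    pools = node["thread_pool"]
    print("%s fielddata=%.1f MB bulk_rejected=%d search_rejected=%d" % (
        node["name"],
        fielddata / (1024.0 * 1024.0),
        pools["bulk"]["rejected"],
        pools["search"]["rejected"]))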
Our indexing process initially ran without any problems; the issues only started to appear after a few days of usage. Sometimes the cluster stays healthy for 4 or 5 days and then starts showing problems with heap utilization. Are we missing any configuration or optimization?

I don't understand the title of your post; it doesn't seem to match your description. What exactly is the question? Also, keep in mind that bulk requests do consume memory, and the bulk size needs to be set according to your system's limits. It's not about the number of documents you send per bulk, but about their actual size.

Thanks for the answer, Andrei! What the title means is that when one machine in the cluster cannot recover from high heap utilization (GC cannot run in time), the whole cluster is affected. And yes, I know the physical size matters. I have reduced the number of threads on each machine, and the cluster now seems more stable. I will keep monitoring it and let you know if the problem happens again. Here are the ES team's recommendations, for reference: