Apache Kafka broker takes a very long time in index recovery, then shuts down

Tags: apache-kafka, kubernetes-helm, azure-aks, confluent-platform

I have a 3-broker Kafka setup with no replication, running on Azure K8s.

At some point (unfortunately I don't have the logs), one Kafka broker crashed, and when it restarted it entered an endless, painful restart loop. It appears to try to recover some corrupted log entries, takes a very long time doing so, and then gets hit by SIGTERM. Worse, I can no longer reliably consume from / produce to the affected topic. Logs are attached below, along with monitoring screenshots showing Kafka slowly churning through the log files and filling the disk cache.

Right now I have log.retention.bytes set to 180 GiB, and I'd like to keep it that way without Kafka falling into this endless loop. I suspected this might be an issue in older versions, so I searched the Kafka JIRA for relevant keywords, but found nothing.

So I can't count on a newer version to fix this, and I don't want to rely on a smaller retention size either, since large numbers of corrupted log segments could presumably still pop up.

So my question is: is there a way to do any/all of the following:

  • Prevent the SIGTERM from arriving, so Kafka can recover fully
  • Resume consuming/producing on the unaffected partitions (only 4 of the 30 partitions seem to have corrupted entries)
  • Otherwise stop this madness from happening

(If not, I'll resort to: (a) upgrading Kafka; (b) shrinking log.retention.bytes by an order of magnitude; (c) enabling replication and hoping that helps; (d) improving logging to find out what causes the crash in the first place.)
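For the second bullet, one possible workaround is to bypass consumer-group assignment and manually assign only the healthy partitions. A minimal sketch using the kafka-python client; the topic name, broker address, and the exact set of corrupted partitions are illustrative placeholders (the logs only identify my-topic-3 explicitly), not values from the real cluster:

```python
# Sketch: consume only from partitions not known to be corrupted.
# Assumes the kafka-python client; names/numbers are placeholders.

CORRUPTED = {3, 7, 19, 24}  # hypothetical: the 4 bad partitions out of 30
TOTAL_PARTITIONS = 30

def healthy_partitions(total, corrupted):
    """Return the partition ids that were not flagged as corrupted."""
    return [p for p in range(total) if p not in corrupted]

parts = healthy_partitions(TOTAL_PARTITIONS, CORRUPTED)
print(len(parts))  # 26 healthy partitions remain

# With a reachable broker, the assignment itself would look like this
# (commented out so the sketch runs standalone):
# from kafka import KafkaConsumer, TopicPartition
# consumer = KafkaConsumer(bootstrap_servers="broker:9092")
# consumer.assign([TopicPartition("my-topic", p) for p in parts])
# for record in consumer:
#     ...
```

Note that `assign()` bypasses consumer-group rebalancing entirely, so offset management and group semantics have to be handled by the application.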


Logs (first restart: loading completed but cleanup + flush interrupted; second restart: interrupted during loading):

[2019-10-10 00:05:36,562] INFO [ThrottledChannelReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,564] INFO [ThrottledChannelReaper-Produce]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,564] INFO [ThrottledChannelReaper-Request]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 00:05:36,598] INFO Loading logs. (kafka.log.LogManager)
[2019-10-10 00:05:37,802] WARN [Log partition=my-topic-3, dir=/opt/kafka/data-0/logs] Found a corrupted index file corresponding to log file /opt/kafka/data-0/logs/my-topic-3/00000000000000031038.log due to Corrupt time index found, time index file (/opt/kafka/data-0/logs/my-topic-3/00000000000000031038.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1570449760949, recovering segment and rebuilding index files... (kafka.log.Log)
...
[2019-10-10 00:42:27,037] INFO Logs loading complete in 2210438 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,052] INFO Starting log cleanup with a period of 300000 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,054] INFO Starting log flusher with a default period of 9223372036854775807 ms. (kafka.log.LogManager)
[2019-10-10 00:42:27,057] INFO Starting the log cleaner (kafka.log.LogCleaner)
[2019-10-10 00:42:27,738] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2019-10-10 00:42:27,763] INFO Shutting down SupportedServerStartable (io.confluent.support.metrics.SupportedServerStartable)  
[2019-10-10 01:55:25,502] INFO [ThrottledChannelReaper-Fetch]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,502] INFO [ThrottledChannelReaper-Produce]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,504] INFO [ThrottledChannelReaper-Request]: Starting (kafka.server.ClientQuotaManager$ThrottledChannelReaper)
[2019-10-10 01:55:25,549] INFO Loading logs. (kafka.log.LogManager)
[2019-10-10 01:55:27,123] WARN [Log partition=my-topic-3, dir=/opt/kafka/data-0/logs] Found a corrupted index file corresponding to log file /opt/kafka/data-0/logs/my-topic-3/00000000000000031038.log due to Corrupt time index found, time index file (/opt/kafka/data-0/logs/my-topic-3/00000000000000031038.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1570449760949, recovering segment and rebuilding index files... (kafka.log.Log)
...
[2019-10-10 02:17:01,249] INFO [ProducerStateManager partition=my-topic-12] Loading producer state from snapshot file '/opt/kafka/data-0/logs/my-topic-12/00000000000000004443.snapshot' (kafka.log.ProducerStateManager)
[2019-10-10 02:17:07,090] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2019-10-10 02:17:07,093] INFO Shutting down SupportedServerStartable (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,093] INFO Closing BaseMetricsReporter (io.confluent.support.metrics.BaseMetricsReporter)
[2019-10-10 02:17:07,093] INFO Waiting for metrics thread to exit (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,093] INFO Shutting down KafkaServer (io.confluent.support.metrics.SupportedServerStartable)
[2019-10-10 02:17:07,097] INFO [KafkaServer id=2] shutting down (kafka.server.KafkaServer)
[2019-10-10 02:17:07,105] ERROR [KafkaServer id=2] Fatal error during KafkaServer shutdown. (kafka.server.KafkaServer)
java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down!
    at kafka.server.KafkaServer.shutdown(KafkaServer.scala:560)
    at io.confluent.support.metrics.SupportedServerStartable.shutdown(SupportedServerStartable.java:147)
    at io.confluent.support.metrics.SupportedKafka$1.run(SupportedKafka.java:62)
[2019-10-10 02:17:07,110] ERROR Caught exception when trying to shut down KafkaServer. Exiting forcefully. (io.confluent.support.metrics.SupportedServerStartable)
java.lang.IllegalStateException: Kafka server is still starting up, cannot shut down!
    at kafka.server.KafkaServer.shutdown(KafkaServer.scala:560)
    at io.confluent.support.metrics.SupportedServerStartable.shutdown(SupportedServerStartable.java:147)
    at io.confluent.support.metrics.SupportedKafka$1.run(SupportedKafka.java:62)

Monitoring

I found your question while looking for a solution to a similar problem.
Did you ever manage to solve it?
In the meantime: who is sending the SIGTERM? It's probably Kubernetes or another orchestrator; you could tune the liveness probe to allow more attempts (or a longer startup delay) before it kills the container.

Also make sure your Xmx setting is smaller than the resources allocated to the pod/container; otherwise Kubernetes will kill the pod (if this is indeed Kubernetes).
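Both suggestions could be sketched as a pod-spec fragment like the following; all names and numbers here are illustrative assumptions, not values from the asker's chart:

```yaml
# Illustrative pod-spec fragment (placeholder values, not a recommendation):
spec:
  terminationGracePeriodSeconds: 300   # give the broker time to shut down cleanly
  containers:
    - name: kafka
      env:
        - name: KAFKA_HEAP_OPTS
          value: "-Xmx4G -Xms4G"       # keep Xmx well below resources.limits.memory
      resources:
        limits:
          memory: 6Gi                  # headroom above the 4G heap for off-heap + page cache
      livenessProbe:
        tcpSocket:
          port: 9092
        initialDelaySeconds: 3600      # index recovery took ~40 min per the logs above
        periodSeconds: 30
        failureThreshold: 10
```

The key idea is that `initialDelaySeconds` × probe timing must comfortably exceed the worst-case log-recovery time, otherwise the orchestrator will keep killing the broker mid-recovery.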

I had the same problem and solved it by increasing two values in the Kafka configuration (the server.properties file):

zookeeper.connection.timeout.ms
zookeeper.session.timeout.ms

I raised both of them to a maximum of 18000. Keeping the two values identical doesn't seem to be required (at least according to the docs), but in any case it solved the problem for me.
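In server.properties form, the change described above would look like this (the 18000 ms value is taken from this answer; the defaults differ between Kafka versions):

```properties
# Give the broker more time to (re)establish its ZooKeeper session,
# e.g. while it is busy rebuilding index files on startup.
zookeeper.connection.timeout.ms=18000
zookeeper.session.timeout.ms=18000
```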

It's been a while since I posted this. IIRC, the problem turned out to be a ZooKeeper timeout, which ended up shutting everything down. Increasing that timeout + shrinking the log size + upgrading to a newer Kafka version made this problem go away, but brought other problems (disk corruption?!).

Also having this issue… is there any updated information on it?

@toabi In my case it was a bug in Kafka 2.3 (issues.apache.org/jira/browse/KAFKA-9265), fixed in 2.4. It happened when deleting a topic and restarting the broker.

I think the SIGTERM may be coming from ZooKeeper while Kafka starts up. I don't know why. It may or may not be related to the memory-leak issue you're facing.