Apache Kafka Streams - Failed to rebalance error


I have a basic Kafka Streams application that reads from in_topic, performs a rolling aggregate, and performs a join to publish to out_topic. This had been running for weeks, but this morning it crashed and no longer starts. I don't believe it is related to the code. The logs before the error are:

2019-01-21 17:46:32,803 localhost org.apache.kafka.clients.producer.KafkaProducer: [Producer clientId=rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1-0_0-producer, transactionalId=rtt-healthscore-stream-0_0] Instantiated a transactional producer.
2019-01-21 17:46:32,803 localhost org.apache.kafka.clients.producer.KafkaProducer: [Producer clientId=rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1-0_0-producer, transactionalId=rtt-healthscore-stream-0_0] Overriding the default acks to all since idempotence is enabled.
2019-01-21 17:46:32,818 localhost org.apache.kafka.common.utils.AppInfoParser: Kafka version : 2.0.0
2019-01-21 17:46:32,818 localhost org.apache.kafka.common.utils.AppInfoParser: Kafka commitId : 3402a8361b734732
2019-01-21 17:46:32,832 localhost org.apache.kafka.clients.producer.internals.TransactionManager: [Producer clientId=rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1-0_0-producer, transactionalId=rtt-healthscore-stream-0_0] ProducerId set to -1 with epoch -1
2019-01-21 17:47:32,833 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] Error caught during partition assignment, will abort the current process and re-throw at the end of rebalance: {}
org.apache.kafka.common.errors.TimeoutException: Timeout expired while initializing transactional state in 60000ms.
2019-01-21 17:47:32,843 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] partition assignment took 60062 ms.
    current active tasks: []
    current standby tasks: []
    previous active tasks: []

2019-01-21 17:47:32,845 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] State transition from PARTITIONS_ASSIGNED to PENDING_SHUTDOWN
2019-01-21 17:47:32,845 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] Shutting down
2019-01-21 17:47:32,860 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] State transition from PENDING_SHUTDOWN to DEAD
2019-01-21 17:47:32,860 localhost org.apache.kafka.streams.KafkaStreams: stream-client [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804] State transition from REBALANCING to ERROR
2019-01-21 17:47:32,860 localhost org.apache.kafka.streams.KafkaStreams: stream-client [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804] All stream threads have died. The instance will be in error state and should be closed.
2019-01-21 17:47:32,860 localhost org.apache.kafka.streams.processor.internals.StreamThread: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] Shutdown complete
Exception in thread "rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1" org.apache.kafka.streams.errors.StreamsException: stream-thread [rtt-healthscore-stream-7d679951-913b-4976-a43e-0b437c22c804-StreamThread-1] Failed to rebalance.
    at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:870)
    at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:810)
    at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:767)
    at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:736)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired while initializing transactional state in 60000ms.

None of the Kafka settings/configs were changed, and all brokers are available. My Kafka version is 2.0. I am able to read from in_topic with a console consumer, so everything upstream of this application is fine. Thanks for any help.
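The console-consumer sanity check mentioned above looks roughly like this; the bootstrap address is a placeholder and would need to match your cluster:

```shell
# Confirm the input topic is readable outside of the Streams application.
kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic in_topic \
  --from-beginning \
  --max-messages 5
```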

Our project hit the same timeout failure after upgrading to Kafka 2.1; we don't yet know the cause.


Our temporary workaround is to disable the exactly-once configuration (processing.guarantee), which skips initializing the transactional state.
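A minimal sketch of that workaround, assuming the application id from the logs above and a placeholder bootstrap address:

```java
import java.util.Properties;

public class StreamsWorkaroundConfig {
    // Builds a Streams configuration with exactly-once disabled.
    public static Properties build() {
        Properties props = new Properties();
        props.put("application.id", "rtt-healthscore-stream"); // id taken from the logs above
        props.put("bootstrap.servers", "localhost:9092");      // placeholder address
        // "at_least_once" is the default guarantee; setting it (instead of
        // "exactly_once") skips the transactional-state initialization that timed out.
        props.put("processing.guarantee", "at_least_once");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("processing.guarantee"));
    }
}
```

These string keys correspond to StreamsConfig.APPLICATION_ID_CONFIG, BOOTSTRAP_SERVERS_CONFIG and PROCESSING_GUARANTEE_CONFIG; using the constants is preferable when the kafka-streams dependency is on the classpath.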

We also saw these errors after upgrading to 2.1 (and, I believe, previously when upgrading to earlier versions).

We run in a Kubernetes environment, where a broker can change IP address after a rolling upgrade. From the broker logs:

[2019-02-20 02:20:20,085] WARN [TransactionCoordinator id=1001] Connection 
to node 0 (khaki-joey-kafka-0.khaki-joey-kafka-headless.hyperspace-dev/10.233.124.181:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2019-02-20 02:20:57,205] WARN [TransactionCoordinator id=1001] Connection to node 1 (khaki-joey-kafka-1.khaki-joey-kafka-headless.hyperspace-dev/10.233.122.67:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
I could see that the transaction coordinator was still using stale IP addresses for the two brokers that had been restarted after the upgrade (a day after the upgrade).

Possible options:

  • Switch off exactly-once for your streams applications, as described above. They then don't use transactions and everything seems to work. This doesn't help if you require EOS or some other client code requires transactions.
  • Restart any brokers that are reporting warnings, to force them to re-resolve the IP addresses. They would need to be restarted in a way that doesn't change their own IP addresses, which is usually not possible in Kubernetes.
Defect raised.
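One way to confirm the stale-address theory, sketched under the assumption that the pod, namespace, and headless-service names match the broker log above:

```shell
# The pod's current IP, to compare against what the coordinator logged
# (10.233.124.181 for broker 0 in the log above):
kubectl -n hyperspace-dev get pod khaki-joey-kafka-0 -o jsonpath='{.status.podIP}'

# What cluster DNS resolves the headless-service name to right now:
nslookup khaki-joey-kafka-0.khaki-joey-kafka-headless.hyperspace-dev
```

If the logged IP and the current DNS answer differ, the coordinator is holding a cached, stale address.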

Update 2019-02-20: Kafka 2.1.1 (Confluent 5.1.2), released today, may have resolved this. See the linked issue.

It's resolved after the upgrade:
https://kafka.apache.org/25/documentation/streams/developer-guide/write-streams.html

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>2.5.0</version>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.5.0</version>
</dependency>
<!-- Optionally include Kafka Streams DSL for Scala for Scala 2.12 -->
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams-scala_2.12</artifactId>
    <version>2.5.0</version>
</dependency>
Can you check the broker logs for any error or warning messages?

These are all the logs from when the application stopped processing data. I tried changing only the app_id of the broken application, and everything worked. So it seems to be an access issue tied to the app_id; perhaps it is trying to access corrupted data and stalls, not knowing to look for that data elsewhere. We have a replication factor of 2 across 4 brokers.

To follow up on your situation: I tried a full application reset (global/local) and still have the same problem. Coincidentally, one of the broker nodes went down at the same time this error occurred.

In my experience (Kafka 2.3.x, Confluent 5.3.x), this problem has not been resolved.
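The "full application reset" mentioned in the comment above can be sketched as follows, assuming the application id and topic name from this question; the bootstrap address is a placeholder:

```shell
# Global reset: clears the application's committed offsets and internal topics
# using the reset tool shipped with Kafka.
kafka-streams-application-reset.sh \
  --application-id rtt-healthscore-stream \
  --bootstrap-servers localhost:9092 \
  --input-topics in_topic
```

The local part of the reset is done in application code by calling KafkaStreams#cleanUp() before start(), which wipes the local state directory for that application id.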