Apache Kafka on Kubernetes with NFS: Kafka pod fails to come back up after pod deletion

We are trying to run a Kafka cluster on Kubernetes using the NFS provisioner. The cluster comes up fine. However, when we kill one of the Kafka pods, the replacement pod never comes up.

The persistent volume before the pod deletion:

# mount
10.102.32.184:/export/pvc-ce1461b3-1b38-11e8-a88e-005056073f99 on /opt/kafka/data type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.133.40.245,local_lock=none,addr=10.102.32.184)

# ls -al /opt/kafka/data/logs
total 4
drwxr-sr-x 2 99 99 152 Feb 26 21:07 .
drwxrwsrwx 3 99 99  18 Feb 26 21:07 ..
-rw-r--r-- 1 99 99   0 Feb 26 21:07 .lock
-rw-r--r-- 1 99 99   0 Feb 26 21:07 cleaner-offset-checkpoint
-rw-r--r-- 1 99 99  57 Feb 26 21:07 meta.properties
-rw-r--r-- 1 99 99   0 Feb 26 21:07 recovery-point-offset-checkpoint
-rw-r--r-- 1 99 99   0 Feb 26 21:07 replication-offset-checkpoint

# cat /opt/kafka/data/logs/meta.properties
#
#Mon Feb 26 21:07:08 UTC 2018
version=0
broker.id=1003
Deleting the pod:

kubectl delete pod kafka-iced-unicorn-1
The persistent volume as reattached in the newly created pod:

# ls -al /opt/kafka/data/logs
total 4
drwxr-sr-x 2 99 99 180 Feb 26 21:10 .
drwxrwsrwx 3 99 99  18 Feb 26 21:07 ..
-rw-r--r-- 1 99 99   0 Feb 26 21:10 .kafka_cleanshutdown
-rw-r--r-- 1 99 99   0 Feb 26 21:07 .lock
-rw-r--r-- 1 99 99   0 Feb 26 21:07 cleaner-offset-checkpoint
-rw-r--r-- 1 99 99  57 Feb 26 21:07 meta.properties
-rw-r--r-- 1 99 99   0 Feb 26 21:07 recovery-point-offset-checkpoint
-rw-r--r-- 1 99 99   0 Feb 26 21:07 replication-offset-checkpoint

# cat /opt/kafka/data/logs/meta.properties
#
#Mon Feb 26 21:07:08 UTC 2018
version=0
broker.id=1003
We see the following error in the Kafka logs:

[2018-02-26 21:26:40,606] INFO [ThrottledRequestReaper-Produce], Starting      (kafka.server.ClientQuotaManager$ThrottledRequestReaper)
[2018-02-26 21:26:40,711] FATAL [Kafka Server 1002], Fatal error during         KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
java.io.IOException: Invalid argument
    at java.io.UnixFileSystem.createFileExclusively(Native Method)
    at java.io.File.createNewFile(File.java:1012)
    at kafka.utils.FileLock.<init>(FileLock.scala:28)
    at kafka.log.LogManager$$anonfun$lockLogDirs$1.apply(LogManager.scala:104)
    at kafka.log.LogManager$$anonfun$lockLogDirs$1.apply(LogManager.scala:103)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at kafka.log.LogManager.lockLogDirs(LogManager.scala:103)
    at kafka.log.LogManager.<init>(LogManager.scala:65)
    at kafka.server.KafkaServer.createLogManager(KafkaServer.scala:648)
    at kafka.server.KafkaServer.startup(KafkaServer.scala:208)
    at io.confluent.support.metrics.SupportedServerStartable.startup(SupportedServerStartable.java:102)
    at io.confluent.support.metrics.SupportedKafka.main(SupportedKafka.java:49)
[2018-02-26 21:26:40,713] INFO [Kafka Server 1002], shutting down (kafka.server.KafkaServer)
[2018-02-26 21:26:40,715] INFO Terminate ZkClient event thread. (org.I0Itec.zkclient.ZkEventThread)
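The fatal call in the trace is `UnixFileSystem.createFileExclusively`, i.e. an `open()` with `O_CREAT|O_EXCL`, which Kafka uses to create the `.lock` file in each log directory. As a rough sanity check, the same exclusive-create primitive can be probed from a shell inside the pod (common shells implement `noclobber` redirection with `O_EXCL`). This is only a sketch: `DIR` and the probe file name are made up here, and `DIR` defaults to a local temp directory; point it at the NFS mount (e.g. `/opt/kafka/data/logs`) to reproduce against the actual volume.

```shell
# Probe exclusive file creation, the primitive behind
# java.io.File.createNewFile() in the stack trace above.
# DIR defaults to a local temp dir; override with e.g.
#   DIR=/opt/kafka/data/logs sh probe.sh
DIR="${DIR:-$(mktemp -d)}"
set -C                                   # noclobber: '>' now fails if the target exists
if : > "$DIR/.lock.probe" 2>/dev/null; then
    echo "exclusive create OK in $DIR"
else
    echo "exclusive create FAILED in $DIR"
fi
set +C
rm -f "$DIR/.lock.probe"
```

If this fails on the NFS mount but succeeds on local disk, the problem is the NFS server's handling of `O_EXCL` (or a stale handle left behind by the killed pod), not Kafka itself.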
The only way around this seems to be deleting the persistent volume claim and then force-deleting the pod again. Or using a storage provisioner other than NFS (Rook works fine in this scenario).
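The workaround above can be sketched as the following commands. The PVC name is an assumption here (StatefulSet PVCs are conventionally named `<volumeClaimTemplate>-<pod>`); check the actual name with `kubectl get pvc` first.

```shell
# PVC name below is hypothetical; verify it first:
kubectl get pvc

# Release the NFS-backed volume, then force-delete the stuck pod
# so the StatefulSet controller recreates it with a fresh claim.
kubectl delete pvc data-kafka-iced-unicorn-1
kubectl delete pod kafka-iced-unicorn-1 --force --grace-period=0
```

Note that deleting the PVC discards that broker's log data; the replacement broker will have to re-replicate its partitions from the other brokers.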


Has anyone run into this issue with the NFS provisioner before?

Did you ever find a solution to this? No, other than switching to a different storage class (e.g. Rook block storage or Cinder). There is also a ticket open about this. The assumption is that NFS fails to close all of the file handles.