Apache Spark: How to use Spark Streaming with Kafka and Kerberos?

I am having trouble consuming messages from Kafka with a Spark Streaming application running on a Kerberized Hadoop cluster. I tried two approaches:

  • Receiver-based approach: KafkaUtils.createStream
  • Direct approach (no receiver): KafkaUtils.createDirectStream
The receiver-based approach (KafkaUtils.createStream) throws two types of exceptions, depending on whether I run in local mode (--master local[*]) or in YARN mode (--master yarn --deploy-mode client):

  • In the Spark local application, a strange kafka.common.BrokerEndPointNotAvailableException
  • In the Spark on YARN application, a Zookeeper timeout. I once managed to get further (connecting successfully to Zookeeper), but no messages were received

In both modes (local or YARN), the direct approach (KafkaUtils.createDirectStream) only returns an unexplained EOFException (see details below).

My final goal is to launch a Spark Streaming job on YARN, so I will leave the Spark local job aside.

Here is my test environment:

  • Cloudera CDH 5.7.0
  • Spark 1.6.0
  • Kafka 0.10.1.0

For testing purposes, I am working on a single-node cluster (hostname quickstart.cloudera). For those interested in reproducing the tests, I am using a custom Docker container based on cloudera/quickstart.

Here is the sample code I ran in spark-shell. Of course, this code works when Kerberos is not enabled: the Spark application receives the messages produced with kafka-console-producer.

import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel
import kafka.serializer.StringDecoder

val ssc = new StreamingContext(sc, Seconds(5))

val topics = Map("test-kafka" -> 1)

// Receiver-based approach: uses the old high-level consumer, which connects through Zookeeper
def readFromKafkaReceiver(): Unit = {
    val kafkaParams = Map(
        "zookeeper.connect" -> "quickstart.cloudera:2181",
        "group.id" -> "gid1",
        "client.id" -> "cid1",
        "zookeeper.session.timeout.ms" -> "5000",
        "zookeeper.connection.timeout.ms" -> "5000"
    )

    val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics, StorageLevel.MEMORY_ONLY_2)
    stream.print
}

// Direct approach: no receiver; the driver fetches offsets from the brokers directly
def readFromKafkaDirectStream(): Unit = {
    val kafkaDirectParams = Map(
        "bootstrap.servers" -> "quickstart.cloudera:9092",
        "group.id" -> "gid1",
        "client.id" -> "cid1"
    )

    val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaDirectParams, topics.map(_._1).toSet)
    directStream.print
}

readFromKafkaReceiver() // or readFromKafkaDirectStream()

ssc.start

Thread.sleep(20000)

ssc.stop(stopSparkContext = false, stopGracefully = true)
With Kerberos enabled, this code no longer works. Following this guide, I created two configuration files:

jaas.conf

KafkaClient {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    keyTab="/home/simpleuser/simpleuser.keytab"
    principal="simpleuser@CLOUDERA";
};
client.properties

security.protocol=SASL_PLAINTEXT
sasl.kerberos.service.name=kafka
I can produce messages with:

export KAFKA_OPTS="-Djava.security.auth.login.config=/home/simpleuser/jaas.conf"
kafka-console-producer \
    --broker-list quickstart.cloudera:9092 \
    --topic test-kafka \
    --producer.config client.properties
But I cannot consume those messages from the Spark Streaming application. To launch spark-shell in yarn-client mode, I created a new JAAS configuration file (jaas_with_zk_yarn.conf) with an additional Zookeeper section (Client), and in which the referenced keytab is only a file name (the keytab itself is then passed through the --keytab option).

This new file is passed with the --files option:

KafkaClient {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    keyTab="simpleuser.keytab"
    principal="simpleuser@CLOUDERA";
};

Client {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    keyTab="simpleuser.keytab"
    principal="simpleuser@CLOUDERA";
};
spark-shell --master yarn --deploy-mode client \
    --num-executors 2 \
    --files /home/simpleuser/jaas_with_zk_yarn.conf \
    --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=jaas_with_zk_yarn.conf" \
    --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=jaas_with_zk_yarn.conf" \
    --keytab /home/simpleuser/simpleuser.keytab \
    --principal simpleuser
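
As a sanity check at this point (a debugging sketch, not part of the original question), one can verify from inside the executors that YARN actually localized the files shipped with --files and --keytab into each container's working directory, since that is where the relative paths referenced by spark.executor.extraJavaOptions and by jaas_with_zk_yarn.conf are resolved:

sc.parallelize(1 to 2, 2).map { _ =>
    // Runs on the executors: look for the files in the container's working directory
    val host = java.net.InetAddress.getLocalHost.getHostName
    val jaasFound = new java.io.File("jaas_with_zk_yarn.conf").exists
    val keytabFound = new java.io.File("simpleuser.keytab").exists
    s"$host: jaas=$jaasFound keytab=$keytabFound"
}.collect().foreach(println)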
I used the same code as before, only adding two other Kafka parameters, matching the contents of the consumer.properties file:

"security.protocol" -> "SASL_PLAINTEXT",
"sasl.kerberos.service.name" -> "kafka"
readFromKafkaReceiver() throws the following error once the Spark Streaming context is started (it fails to connect to Zookeeper):

ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 5000
        at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1223)
        at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:155)
        at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:129)
        at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:89)
        at kafka.utils.ZkUtils$.apply(ZkUtils.scala:71)
        at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:191)
        at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:139)
        at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:156)
        at kafka.consumer.Consumer$.create(ConsumerConnector.scala:109)
        at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
        at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
        at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
        at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:575)
        at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:565)
        at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2003)
        at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:2003)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

readFromKafkaDirectStream() throws:

org.apache.spark.SparkException: java.io.EOFException
        at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
        at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
        at scala.util.Either.fold(Either.scala:97)
        at org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
        at org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
        at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.readFromKafkaDirectStream(<console>:47)

There is no more explanation than this EOFException. I presume there is a communication problem between Spark and the Kafka broker, but nothing more is said. I also tried metadata.broker.list instead of bootstrap.servers, without success.

Maybe I am missing something in the JAAS configuration files or in the Kafka parameters? Maybe the Spark options (extraJavaOptions) are invalid? I have tried so many possibilities that I am a bit lost.

I would be glad if someone could help me fix at least one of these problems (direct approach or receiver-based). Thanks :)

Spark 1.6 does not support this, as stated in the Cloudera documentation:

Spark Streaming cannot consume from secure Kafka until it starts using the Kafka 0.9 Consumer API.

Spark Streaming in 1.6 uses the old consumer API, which does not support secure consumption.

You can use Spark 2.1, which supports secure Kafka.
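
To illustrate, here is a minimal sketch of the newer integration via the spark-streaming-kafka-0-10 module available in Spark 2.x (not part of the original answer; it reuses the broker host, topic, and security parameters from the question, assumes that artifact is on the classpath, and still requires the JAAS file to be supplied through java.security.auth.login.config):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// The new consumer API takes all security settings as consumer properties
val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "quickstart.cloudera:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "gid1",
    "security.protocol" -> "SASL_PLAINTEXT",
    "sasl.kerberos.service.name" -> "kafka"
)

val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    PreferConsistent,
    Subscribe[String, String](Set("test-kafka"), kafkaParams)
)
stream.map(_.value).print()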
