Scala error: value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]]

Tags: scala, apache-spark, apache-kafka

I am trying to capture Kafka events (which I receive in serialized form) using Spark Streaming in Scala.

Here is my code snippet:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val spark = SparkSession.builder().master("local[*]").appName("Spark-Kafka-Integration").getOrCreate()
spark.conf.set("spark.driver.allowMultipleContexts", "true")

val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(5))

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val topics = Set("<topic-name>")
val brokers = "<some-list>"
val groupId = "spark-streaming-test"

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> brokers,
  "auto.offset.reset" -> "earliest",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "group.id" -> groupId,
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val messages: InputDStream[ConsumerRecord[String, String]] =
  KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
  )

messages.foreachRDD { rdd =>
  println(rdd.toDF())   // this is the line that fails to compile
}

ssc.start()
ssc.awaitTermination()
The error message I am getting is:

Error:(59, 19) value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.kafka.clients.consumer.ConsumerRecord[String,String]]
    println(rdd.toDF())

toDF comes through DatasetHolder.

I haven't reproduced it, but my guess is that there is no Encoder for ConsumerRecord[String, String], so you can either provide one yourself, or first map the records to something for which an Encoder can be derived (a case class or a primitive).
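For example, a minimal sketch of that second option, assuming the Kafka values are plain strings and using a hypothetical Message case class (define it at the top level, not inside a method, so Spark can derive an Encoder for it):

// Hypothetical wrapper type; Spark derives an Encoder for case classes automatically.
case class Message(value: String)

messages.foreachRDD { rdd =>
  // Map each ConsumerRecord to a type that has an Encoder before calling toDF().
  val df = rdd.map(record => Message(record.value())).toDF()
  df.show(10)
}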


Also, println inside foreachRDD probably will not behave the way you want, because of Spark's distributed nature.
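To illustrate what that means (my own sketch, not part of the original answer): the body passed to foreachRDD runs on the driver, but anything inside an RDD action such as rdd.foreach(println) runs on the executors, so on a real cluster the output lands in the executor logs rather than in your console. To peek at a few records on the driver, extract the values first and pull a small sample back explicitly, for example with take:

messages.foreachRDD { rdd =>
  // ConsumerRecord itself is not serializable, so extract the value on the executors
  // first; take(10) then ships at most ten strings back to the driver, where this
  // println actually appears in the console.
  rdd.map(_.value()).take(10).foreach(println)
}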

Thanks. Since I'm new to this, I couldn't follow much of that. I replaced the code and tried printing the RDD directly, like this: messages.foreachRDD { rdd => println("messages: " + rdd) }. It runs successfully and the output is: messages: KafkaRDD[1] at createDirectStream at kafkaStream.scala:51. What I want to see is the data behind these RDDs.

For a potentially infinite stream that is a bit tricky. You could convert the data into a human-readable format and save it to HDFS, then inspect it there. println is not very useful in Spark unless you materialize the data on the driver node. How you extract the information into a readable format depends on what your Kafka record looks like, but most likely you want to create a case class with the fields you are interested in and convert the ConsumerRecord into that case class. In general, I would recommend learning how Spark and RDDs work under the hood and how the computation is distributed.

I changed the code to: messages.foreachRDD { rdd => val dataFrame = rdd.map(row => row.value()).toDF(); dataFrame.show(10) } and the result is:

+--------------------+
|               value|
+--------------------+
|��srHcom.egenc…|
|��srPcom.egenc…|
|��srPcom.egenc…|
+--------------------+
only showing top 10 rows

So it is working now; I just need to deserialize this data so that it becomes readable. I used value.deserializer in the code, but it didn't work. Do you have any suggestions?
val df = messages.map(_.value)
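Not from the thread, but a hedged sketch of what that remaining deserialization step might look like: the garbled ��sr prefix in the show() output looks like a Java serialization stream header, so one option is to consume the value as raw bytes with ByteArrayDeserializer (instead of StringDeserializer) and deserialize it yourself with an ObjectInputStream. This assumes the producer wrote plain Java-serialized objects and that their class is available on the Spark classpath; the names below reuse the topics, kafkaParams and ssc values from the question.

import java.io.{ByteArrayInputStream, ObjectInputStream}
import org.apache.kafka.common.serialization.ByteArrayDeserializer

// Reuse the earlier kafkaParams, but ask Kafka for raw bytes instead of UTF-8 text.
val binaryKafkaParams = kafkaParams +
  ("value.deserializer" -> classOf[ByteArrayDeserializer])

val rawMessages = KafkaUtils.createDirectStream[String, Array[Byte]](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, Array[Byte]](topics, binaryKafkaParams)
)

// Deserialize each payload with standard Java serialization on the executors
// (the producer's class must be on their classpath), then print a small sample
// on the driver.
rawMessages.foreachRDD { rdd =>
  rdd.map { record =>
    val in = new ObjectInputStream(new ByteArrayInputStream(record.value()))
    try in.readObject().toString finally in.close()
  }.take(10).foreach(println)
}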