Apache Spark: java.io.NotSerializableException in Spark Streaming with checkpointing enabled
The code is as follows:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream
import scala.collection.mutable.ListBuffer

def main(args: Array[String]) {
  val sc = new SparkContext
  val sec = Seconds(3)
  val ssc = new StreamingContext(sc, sec)
  ssc.checkpoint("./checkpoint")
  val rdd = ssc.sparkContext.parallelize(Seq("a", "b", "c"))
  val inputDStream = new ConstantInputDStream(ssc, rdd)
  inputDStream.transform(rdd => {
    val buf = ListBuffer[String]()
    buf += "1"
    buf += "2"
    buf += "3"
    val other_rdd = ssc.sparkContext.parallelize(buf) // create a new rdd
    rdd.union(other_rdd)
  }).print()
  ssc.start()
  ssc.awaitTermination()
}
It throws this exception:
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.streaming.StreamingContext
Serialization stack:
- object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext@5626e185)
- field (class: com.mirrtalk.Test$$anonfun$main$1, name: ssc$1, type: class org.apache.spark.streaming.StreamingContext)
- object (class com.mirrtalk.Test$$anonfun$main$1, <function1>)
- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21, name: cleanedF$2, type: interface scala.Function1)
- object (class org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21, <function2>)
- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5, name: cleanedF$3, type: interface scala.Function2)
- object (class org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5, <function2>)
- field (class: org.apache.spark.streaming.dstream.TransformedDStream, name: transformFunc, type: interface scala.Function2)
When I remove the line ssc.checkpoint("./checkpoint"), the application works fine, but I need checkpointing enabled.
How can I fix this problem when checkpointing is enabled?

You can move the context initialization and configuration out of main:
object App {
  val sc = new SparkContext(new SparkConf().setAppName("foo").setMaster("local"))
  val sec = Seconds(3)
  val ssc = new StreamingContext(sc, sec)
  ssc.checkpoint("./checkpoint") // enable checkpointing

  def main(args: Array[String]) {
    val rdd = ssc.sparkContext.parallelize(Seq("a", "b", "c"))
    val inputDStream = new ConstantInputDStream(ssc, rdd)
    inputDStream.transform(rdd => {
      val buf = ListBuffer[String]()
      buf += "1"
      buf += "2"
      buf += "3"
      val other_rdd = ssc.sparkContext.parallelize(buf)
      rdd.union(other_rdd) // I want to union the other rdd
    }).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
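An alternative sketch (my own variant, not part of the answer above): since the root cause is that the transform closure captures the non-serializable ssc, you can instead obtain a SparkContext from the batch RDD passed into transform. The object name FixedApp is made up for illustration; the rest mirrors the question's code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream
import scala.collection.mutable.ListBuffer

object FixedApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("foo").setMaster("local[2]"))
    val ssc = new StreamingContext(sc, Seconds(3))
    ssc.checkpoint("./checkpoint")

    val rdd = ssc.sparkContext.parallelize(Seq("a", "b", "c"))
    val inputDStream = new ConstantInputDStream(ssc, rdd)

    inputDStream.transform { batchRdd =>
      val buf = ListBuffer("1", "2", "3")
      // Reach the SparkContext through the batch RDD instead of closing
      // over ssc, so the transform closure no longer references the
      // non-serializable StreamingContext.
      val otherRdd = batchRdd.sparkContext.parallelize(buf)
      batchRdd.union(otherRdd)
    }.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

This works because, as the comments below note, the transform function runs on the driver, where both the batch RDD and its SparkContext are accessible.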
Isn't the problem that StreamingContext cannot be serialized, and that he is using it inside transform? – @YuvalItzchakov

@YuvalItzchakov That was my first thought, but it is not used inside transform (it is only used at the stream level), so that is not the direct problem. The issue here seems more subtle: the StreamingContext gets dragged in during checkpointing.

Is transform invoked on the driver side or the worker side? – @YuvalItzchakov

@YuvalItzchakov Driver-side. Since transform takes an RDD as input, it logically cannot run on a worker. You can only access RDDs on the driver.