
Apache Spark: java.io.NotSerializableException in Spark Streaming with checkpointing enabled

Tags: apache-spark, spark-streaming, rdd

Here is the code:

import scala.collection.mutable.ListBuffer

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

def main(args: Array[String]) {
    val sc = new SparkContext
    val sec = Seconds(3)
    val ssc = new StreamingContext(sc, sec)
    ssc.checkpoint("./checkpoint")
    val rdd = ssc.sparkContext.parallelize(Seq("a","b","c"))
    val inputDStream = new ConstantInputDStream(ssc, rdd)

    inputDStream.transform(rdd => {
        val buf = ListBuffer[String]()
        buf += "1"
        buf += "2"
        buf += "3"
        val other_rdd = ssc.sparkContext.parallelize(buf)   // create a new rdd
        rdd.union(other_rdd)
    }).print()

    ssc.start()
    ssc.awaitTermination()
}
And it throws this exception:

java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.streaming.StreamingContext
Serialization stack:
    - object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext@5626e185)
    - field (class: com.mirrtalk.Test$$anonfun$main$1, name: ssc$1, type: class org.apache.spark.streaming.StreamingContext)
    - object (class com.mirrtalk.Test$$anonfun$main$1, <function1>)
    - field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21, name: cleanedF$2, type: interface scala.Function1)
    - object (class org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21, <function2>)
    - field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5, name: cleanedF$3, type: interface scala.Function2)
    - object (class org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5, <function2>)
    - field (class: org.apache.spark.streaming.dstream.TransformedDStream, name: transformFunc, type: interface scala.Function2)
When I remove the line ssc.checkpoint("./checkpoint"), the application works fine, but I need checkpointing to be enabled.


How can I fix this problem when checkpointing is enabled?

You can move the context initialization and configuration out of main:

import scala.collection.mutable.ListBuffer

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

object App {
    val sc = new SparkContext(new SparkConf().setAppName("foo").setMaster("local"))
    val sec = Seconds(3)
    val ssc = new StreamingContext(sc, sec)
    ssc.checkpoint("./checkpoint") // enable checkpointing

    def main(args: Array[String]) {
        val rdd = ssc.sparkContext.parallelize(Seq("a", "b", "c"))
        val inputDStream = new ConstantInputDStream(ssc, rdd)

        inputDStream.transform(rdd => {
            val buf = ListBuffer[String]()
            buf += "1"
            buf += "2"
            buf += "3"
            val other_rdd = ssc.sparkContext.parallelize(buf)
            rdd.union(other_rdd) // union with the other rdd
        }).print()

        ssc.start()
        ssc.awaitTermination()
    }
}
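
Another way to avoid the exception, shown here only as a minimal sketch and not part of the answer above, is to keep the initialization inside main but stop the transform closure from referencing the outer ssc at all: derive the SparkContext from the RDD that transform hands you. The closure then captures nothing non-serializable, so the DStream graph can be checkpointed.

    inputDStream.transform(rdd => {
        // Assumed alternative: use the SparkContext attached to the incoming RDD
        // instead of the outer ssc, so the closure does not capture the
        // non-serializable StreamingContext.
        val buf = ListBuffer("1", "2", "3")
        val other_rdd = rdd.sparkContext.parallelize(buf)
        rdd.union(other_rdd)
    }).print()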

Isn't the problem that the StreamingContext cannot be serialized and that he is using it inside the transform?

@YuvalItzchakov That was my first thought, but it is not used inside the transform (it is only used at the stream level), so that is not the direct problem. The issue here seems more subtle: the StreamingContext is getting pulled into the serialization during checkpointing. Is transform invoked on the driver side or on the worker side?

@YuvalItzchakov Driver side. Since transform takes an RDD as input, it logically cannot run on a worker; you can only access RDDs on the driver.
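
To illustrate the driver-side point made in the comments, here is an assumed sketch (not from the original thread): the body passed to transform runs on the driver once per batch, and only the functions handed to RDD operations are serialized and shipped to the executors.

    inputDStream.transform(rdd => {
        val prefix = "batch-"               // evaluated on the driver, once per batch interval
        rdd.map(record => prefix + record)  // this function is serialized and runs on executors
    }).print()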