Apache Spark: java.io.NotSerializableException in Spark Streaming with checkpointing enabled
The code is as follows:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream
import scala.collection.mutable.ListBuffer

def main(args: Array[String]) {
  val sc = new SparkContext
  val sec = Seconds(3)
  val ssc = new StreamingContext(sc, sec)
  ssc.checkpoint("./checkpoint")
  val rdd = ssc.sparkContext.parallelize(Seq("a", "b", "c"))
  val inputDStream = new ConstantInputDStream(ssc, rdd)
  inputDStream.transform(rdd => {
    val buf = ListBuffer[String]()
    buf += "1"
    buf += "2"
    buf += "3"
    val other_rdd = ssc.sparkContext.parallelize(buf) // create a new rdd
    rdd.union(other_rdd)
  }).print()
  ssc.start()
  ssc.awaitTermination()
}
It throws this exception:
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.streaming.StreamingContext
Serialization stack:
- object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext@5626e185)
- field (class: com.mirrtalk.Test$$anonfun$main$1, name: ssc$1, type: class org.apache.spark.streaming.StreamingContext)
- object (class com.mirrtalk.Test$$anonfun$main$1, <function1>)
- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21, name: cleanedF$2, type: interface scala.Function1)
- object (class org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21, <function2>)
- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5, name: cleanedF$3, type: interface scala.Function2)
- object (class org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5, <function2>)
- field (class: org.apache.spark.streaming.dstream.TransformedDStream, name: transformFunc, type: interface scala.Function2)
When I remove the line ssc.checkpoint("./checkpoint"), the application works fine, but I need checkpointing enabled.
How can I fix this problem when checkpointing is enabled?

You can move the context initialization and configuration out of main:
object App {
  val sc = new SparkContext(new SparkConf().setAppName("foo").setMaster("local"))
  val sec = Seconds(3)
  val ssc = new StreamingContext(sc, sec)
  ssc.checkpoint("./checkpoint") // enable checkpointing

  def main(args: Array[String]) {
    val rdd = ssc.sparkContext.parallelize(Seq("a", "b", "c"))
    val inputDStream = new ConstantInputDStream(ssc, rdd)
    inputDStream.transform(rdd => {
      val buf = ListBuffer[String]()
      buf += "1"
      buf += "2"
      buf += "3"
      val other_rdd = ssc.sparkContext.parallelize(buf)
      rdd.union(other_rdd) // I want to union the other rdd
    }).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
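An alternative sketch (my own variant, not part of the answer above): since the root cause is that the transform closure captures the non-serializable ssc, you can instead obtain a SparkContext from the batch RDD passed into transform. The object name FixedApp is made up for illustration; the rest mirrors the question's code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream
import scala.collection.mutable.ListBuffer

object FixedApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("foo").setMaster("local[2]"))
    val ssc = new StreamingContext(sc, Seconds(3))
    ssc.checkpoint("./checkpoint")

    val rdd = ssc.sparkContext.parallelize(Seq("a", "b", "c"))
    val inputDStream = new ConstantInputDStream(ssc, rdd)

    inputDStream.transform { batchRdd =>
      val buf = ListBuffer("1", "2", "3")
      // Reach the SparkContext through the batch RDD instead of closing
      // over ssc, so the transform closure no longer references the
      // non-serializable StreamingContext.
      val otherRdd = batchRdd.sparkContext.parallelize(buf)
      batchRdd.union(otherRdd)
    }.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

This works because, as the comments below note, the transform function runs on the driver, where both the batch RDD and its SparkContext are accessible.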
Isn't the problem that StreamingContext cannot be serialized, and that he is using it inside transform? – @YuvalItzchakov

@YuvalItzchakov That was my first thought, but it is not used inside transform (it is only used at the stream level), so that is not the direct problem. The issue here seems more subtle: the StreamingContext gets dragged in during checkpointing.

Is transform invoked on the driver side or the worker side? – @YuvalItzchakov

@YuvalItzchakov Driver-side. Since transform takes an RDD as input, it logically cannot run on a worker. You can only access RDDs on the driver.