Task not serializable when using a custom DataFrame class in Spark Scala
I have a strange problem with Scala/Spark (1.5) and Zeppelin: if I run the following Scala/Spark code, it runs fine:
// TEST NO PROBLEM SERIALIZATION
val rdd = sc.parallelize(Seq(1, 2, 3))
val testList = List[String]("a", "b")
rdd.map { a =>
  val aa = testList(0)
  None
}
However, after declaring a custom DataFrame type as suggested and using it as in the following example:
//READ ALL THE FILES INTO different DF and save into map
import ExtraDataFrameOperations._
val filename = "myInput.csv"
val delimiter = ","
val colToIgnore = Seq("c_9", "c_10")
val inputICFfolder = "hdfs:///group/project/TestSpark/"
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // use the first line of each file as the header
  .option("inferSchema", "false") // do not infer data types: all DFs are merged later (with potential null values), so keep strings only
  .option("delimiter", delimiter)
  .option("charset", "UTF-8")
  .load(inputICFfolder + filename)
  .drop(colToIgnore) // calls the custom DataFrame operation
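(The definition of ExtraDataFrameOperations is not shown in the question; it presumably looks roughly like the sketch below, where the wrapper class name and the foldLeft-based multi-column drop are assumptions inferred from the call .drop(colToIgnore).)

import org.apache.spark.sql.DataFrame

// Assumed shape of the custom extension imported above (not the asker's actual code).
object ExtraDataFrameOperations {
  implicit class DFWithExtraOperations(df: DataFrame) {
    // DataFrame.drop in Spark 1.5 only accepts a single column name,
    // so drop the ignored columns one by one.
    def drop(columnsToDrop: Seq[String]): DataFrame =
      columnsToDrop.foldLeft(df)((tmpDf, colName) => tmpDf.drop(colName))
  }
}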
This runs successfully.
Now, if I run the following code again (the same as above), I get this error message:
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:32
testList: List[String] = List(a, b)
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:314)
    ...
Caused by: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$
Serialization stack:
    - object not serializable (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$,
      value: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$@6c7e70e)
    - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: ExtraDataFrameOperations$module,
      type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$ExtraDataFrameOperations$)
    - object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC@4c6d0802)
    - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
    ...
I do not understand:
- why this error occurs although no operation is performed on the DataFrame
- why "ExtraDataFrameOperations" is not serializable when it was used successfully just before
Looks like Spark tries to serialize the whole scope around testList. Try inlining the data:

@inline val testList = List[String]("a", "b")

or use a different object to store the functions/data passed to the driver (see the sketch below).
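For illustration, the second suggestion might look roughly like this; the holder object name and the idea of pulling the list out of the notebook scope are assumptions, not part of the original answer:

// A minimal sketch: keep the data in a small serializable holder object
// (ideally compiled into a jar rather than defined in a notebook cell),
// so the closure only captures this object instead of the whole REPL scope.
object ClosureData extends Serializable {
  val testList: List[String] = List("a", "b")
}

rdd.map { a =>
  val aa = ClosureData.testList(0)
  None
}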
Just add "extends Serializable". This is what works for me:
import java.util.concurrent.atomic.AtomicReference

import org.apache.avro.generic.GenericRecord
import org.apache.kafka.clients.producer.{Callback, ProducerRecord, RecordMetadata}
import org.apache.spark.sql.Dataset

/**
 * A wrapper around a Dataset of ProducerRecords that allows saving it to Kafka.
 *
 * The KafkaProducer is shared by all threads in one executor.
 * Error handling strategy: remember the "last" seen exception and rethrow it to allow the task to fail.
 */
implicit class DatasetKafkaSink(ds: Dataset[ProducerRecord[String, GenericRecord]]) extends Serializable {

  class ExceptionRegisteringCallback extends Callback {
    private[this] val lastRegisteredException = new AtomicReference[Option[Exception]](None)

    override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = {
      Option(exception) match {
        case a @ Some(_) => lastRegisteredException.set(a) // (re)-register exception if the send failed
        case _           => // do nothing on a successful send
      }
    }

    def rethrowException(): Unit = lastRegisteredException.getAndSet(None).foreach(e => throw e)
  }

  /**
   * Save to Kafka, reusing the KafkaProducer from a singleton holder.
   * Returns only once all records have actually been sent to Kafka; in case of error it rethrows
   * the "last" seen exception in the same thread to allow the Spark task to fail.
   */
  def saveToKafka(kafkaProducerConfigs: Map[String, AnyRef]): Unit = {
    ds.foreachPartition { records =>
      val callback = new ExceptionRegisteringCallback
      // KafkaProducerHolder is the answer author's own singleton wrapper (not shown here)
      val producer = KafkaProducerHolder.getInstance(kafkaProducerConfigs)
      records.foreach(record => producer.send(record, callback))
      producer.flush()
      callback.rethrowException()
    }
  }
}
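Applied back to the question, the same idea would presumably mean making the extension object itself serializable (again a sketch with assumed names, since the asker's actual definition is not shown):

import org.apache.spark.sql.DataFrame

// Sketch: mark both the holder object and the implicit wrapper as Serializable
// so the ExtraDataFrameOperations$ module captured by the closure can be serialized.
object ExtraDataFrameOperations extends Serializable {
  implicit class DFWithExtraOperations(df: DataFrame) extends Serializable {
    def drop(columnsToDrop: Seq[String]): DataFrame =
      columnsToDrop.foldLeft(df)((tmpDf, colName) => tmpDf.drop(colName))
  }
}

The $iwC$... wrappers in the stack trace come from the Zeppelin/spark-shell REPL wrapping every line in nested objects, so moving the extension into a compiled jar instead of a notebook cell also tends to avoid the problem.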
Unfortunately @inline did not help, and storing the functions/data in another object does not really fit the strategy behind a custom DataFrame object.