Scala Spark: java.io.NotSerializableException: com.amazonaws.services.s3.AmazonS3Client
I am trying to read a large number of large files from S3, and doing it through DataFrame functions takes considerable time. Instead, I am trying to read the S3 objects in parallel through an RDD, as follows:
def dfFromS3Objects(s3: AmazonS3, bucket: String, prefix: String, pageLength: Int = 1000) = {
  import com.amazonaws.services.s3._
  import model._
  import spark.sqlContext.implicits._
  import scala.collection.JavaConversions._
  import scala.io.Source
  import java.io.InputStream

  val request = new ListObjectsRequest()
  request.setBucketName(bucket)
  request.setPrefix(prefix)
  request.setMaxKeys(pageLength)
  // Note that this method returns truncated data if there are more keys than
  // "pageLength" above; you might need to deal with that (e.g. by paging
  // through the listing).
  val objs: ObjectListing = s3.listObjects(request)
  spark.sparkContext
    .parallelize(objs.getObjectSummaries.map(_.getKey).toList)
    .flatMap { key =>
      Source.fromInputStream(s3.getObject(bucket, key).getObjectContent: InputStream).getLines
    }
    .toDF()
}
When I test this, the result is
Caused by: java.io.NotSerializableException: com.amazonaws.services.s3.AmazonS3Client
Serialization stack:
- object not serializable (class: com.amazonaws.services.s3.AmazonS3Client, value: com.amazonaws.services.s3.AmazonS3Client@35c8be21)
- field (class: de.smava.data.bards.anonymize.HistoricalBardAnonymization$$anonfun$dfFromS3Objects$2, name: s3$1, type: interface com.amazonaws.services.s3.AmazonS3)
- object (class de.smava.data.bards.anonymize.HistoricalBardAnonymization$$anonfun$dfFromS3Objects$2, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
... 63 more
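For reference, the same failure can be reproduced outside Spark with plain JDK serialization, which is essentially what Spark's ClosureCleaner does before shipping a closure to executors. FakeS3Client and SerializationDemo below are hypothetical stand-ins for illustration, not AWS classes:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for AmazonS3Client: a class with no Serializable marker.
class FakeS3Client {
  def fetch(key: String): String = s"contents of $key"
}

object SerializationDemo {
  // True if `value` survives Java serialization.
  def serializes(value: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val client = new FakeS3Client
    // This closure captures `client`, just as the flatMap above captures `s3`.
    val capturing: String => String = key => client.fetch(key)
    // This one captures nothing non-serializable.
    val standalone: String => String = key => key.toUpperCase
    println(serializes(capturing))  // the captured client is not serializable
    println(serializes(standalone)) // Scala lambdas themselves are serializable
  }
}
```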
I understand that the AmazonS3 object I supply needs to be shipped to the executors, and hence needs to be serializable, but this snippet comes from an example where someone got it working, so I need help figuring out what I am missing here.

In the gist, s3 is defined as a method (a def), which will create a new client for every call. That is not recommended. One way around it is to use mapPartitions:
spark
  .sparkContext
  .parallelize(objs.getObjectSummaries.map(_.getKey).toList)
  .mapPartitions { it =>
    val s3 = ... // init the client here, once per partition
    it.flatMap { key =>
      Source.fromInputStream(s3.getObject(bucket, key).getObjectContent: InputStream).getLines
    }
  }
  .toDF
This would still create more than one client per JVM, but likely far fewer than the version that creates one client per object. If you want to reuse the client across threads within a JVM, you can, for example, wrap it in a top-level object
object Foo {
  val s3 = ...
}
and use a static configuration for the client.
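Putting the pieces together, here is a minimal sketch of the top-level-object approach, assuming the AWS SDK for Java v1 and Spark are on the classpath; S3Singleton and dfFromS3Keys are illustrative names, not from the original post:

```scala
import java.io.InputStream
import scala.io.Source
import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}
import org.apache.spark.sql.{DataFrame, SparkSession}

// One client per JVM: executors reference the object by name, so the client
// itself is never captured in the serialized closure. `lazy` defers creation
// until first use on each executor.
object S3Singleton {
  lazy val client: AmazonS3 = AmazonS3ClientBuilder.defaultClient()
}

def dfFromS3Keys(spark: SparkSession, bucket: String, keys: Seq[String]): DataFrame = {
  import spark.implicits._
  spark.sparkContext
    .parallelize(keys)
    .mapPartitions { it =>
      it.flatMap { key =>
        Source
          .fromInputStream(S3Singleton.client.getObject(bucket, key).getObjectContent: InputStream)
          .getLines
      }
    }
    .toDF("line")
}
```

Because only the object's name (a static reference) crosses the serialization boundary, nothing non-serializable ends up in the closure.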