Scala Spark: java.io.NotSerializableException: com.amazonaws.services.s3.AmazonS3Client

I am trying to read a large number of large files from S3, which takes a considerable amount of time when done through the DataFrame API. Below I instead try to read the S3 objects in parallel via an RDD, as follows:

def dfFromS3Objects(s3: AmazonS3, bucket: String, prefix: String, pageLength: Int = 1000) = {
    import com.amazonaws.services.s3._
    import model._
    import spark.sqlContext.implicits._

    import java.io.InputStream
    import scala.collection.JavaConversions._
    import scala.io.Source

    val request = new ListObjectsRequest()
    request.setBucketName(bucket)
    request.setPrefix(prefix)
    request.setMaxKeys(pageLength)

    val objs: ObjectListing = s3.listObjects(request) // Note that this method returns truncated data if longer than the "pageLength" above. You might need to deal with that.

    spark.sparkContext.parallelize(objs.getObjectSummaries.map(_.getKey).toList)
      .flatMap { key => Source.fromInputStream(s3.getObject(bucket, key).getObjectContent: InputStream).getLines }.toDF()
  }
When tested, this results in:

Caused by: java.io.NotSerializableException: com.amazonaws.services.s3.AmazonS3Client
Serialization stack:
    - object not serializable (class: com.amazonaws.services.s3.AmazonS3Client, value: com.amazonaws.services.s3.AmazonS3Client@35c8be21)
    - field (class: de.smava.data.bards.anonymize.HistoricalBardAnonymization$$anonfun$dfFromS3Objects$2, name: s3$1, type: interface com.amazonaws.services.s3.AmazonS3)
    - object (class de.smava.data.bards.anonymize.HistoricalBardAnonymization$$anonfun$dfFromS3Objects$2, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:342)
    ... 63 more

I understand that the AmazonS3 object I am providing needs to be shipped to the executors and therefore has to be serializable, but this snippet comes from an example where someone apparently got it working, so I need help figuring out what I am missing here.
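As a side note, the failure does not depend on the S3 reads themselves: any task closure that merely references the driver-side s3 instance should fail the same way, because Spark's ClosureCleaner checks closures for serializability eagerly. A minimal sketch, purely for illustration and not part of the original code:

// Should throw the same NotSerializableException shown above, since the
// lambda captures `s3` from the enclosing scope and the closure must be serialized.
spark.sparkContext
  .parallelize(1 to 3)
  .map(_ => s3.toString)
  .collect()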

In the gist, s3 is defined as a method, which would create a new client for every call. That is not recommended. One way around it is to use mapPartitions:

spark
  .sparkContext
  .parallelize(objs.getObjectSummaries.map(_.getKey).toList)
  .mapPartitions { it =>
    val s3 = ... // init the client here
    it.flatMap { key => Source.fromInputStream(s3.getObject(bucket, key).getObjectContent: InputStream).getLines }
  }
  .toDF
This will still create multiple clients per JVM, but probably far fewer than the version that creates one client per object. If you want to reuse a client between threads within a JVM, you could, for example, wrap it in a top-level object

object Foo {
  val s3 = ...
}
and use a static configuration for the client.
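For example, a minimal sketch of such a holder, assuming the AWS SDK for Java v1; the object name and region below are illustrative, not from the original post:

import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}

// Built lazily, once per executor JVM, the first time a task touches it.
// Referencing a top-level object from inside a closure compiles to a static
// access, so nothing has to be serialized and shipped from the driver.
object S3ClientHolder {
  lazy val s3: AmazonS3 = AmazonS3ClientBuilder.standard()
    .withRegion("eu-central-1") // assumed example region; use your bucket's region
    .build()
}

Inside mapPartitions you would then refer to S3ClientHolder.s3 instead of capturing a client that was built on the driver.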