
Scala & Spark: java.lang.ArrayStoreException during deserialization

Tags: scala, apache-spark, json4s

I'm working in Scala & Spark, loading a large (60+ GB) JSON file and processing it. Since sparkSession.read.json caused an out-of-memory exception, I went the RDD route:

import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.json4s.jackson.Serialization.{read}

val submissions_rdd = sc.textFile("/home/user/repos/concepts/abcde/RS_2019-09")
//val columns_subset = Set("author", "title", "selftext", "score", "created_utc", "subreddit")

case class entry(title: String, 
                 selftext: String, 
                 score: Double, 
                 created_utc: Double, 
                 subreddit: String, 
                 author: String)


// Parse a single JSON line into an `entry` using json4s.
def jsonExtractObject(jsonStr: String): entry = {
  implicit val formats: Formats = org.json4s.DefaultFormats
  read[entry](jsonStr)
}
After testing my function on a single entry, I get the expected result:

val res = jsonExtractObject(submissions_rdd.take(1)(0))
The problem is that when I map the same function over the RDD, I hit an error:

val subset = submissions_rdd.map(line => jsonExtractObject(line) )
subset.take(5)
org.apache.spark.SparkDriverExecutionException: Execution error
  at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1485)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2236)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2139)
  at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1423)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1396)
  ... 41 elided
Caused by: java.lang.ArrayStoreException: [Lentry;
  at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:75)
  at org.apache.spark.SparkContext.$anonfun$runJob$4(SparkContext.scala:2120)
  at org.apache.spark.SparkContext.$anonfun$runJob$4$adapted(SparkContext.scala:2120)
  at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:59)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1481)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2236)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)


Any hints on how to solve this would be greatly appreciated. Thanks!

Use jsoniter-scala FTW! It lets you read comma-separated or newline-separated arrays of JSON objects and handle them during parsing with a callback function, without holding everything in memory; the relevant core API call is the one linked here.
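A minimal sketch of that streaming approach, assuming the call being referred to is jsoniter-scala's scanJsonValuesFromStream (the exact API name is an assumption here, since the original comment only linked to it):

import java.io.FileInputStream
import com.github.plokhotnyuk.jsoniter_scala.core._
import com.github.plokhotnyuk.jsoniter_scala.macros._

// Compile-time generated codec for the same `entry` case class as above.
implicit val codec: JsonValueCodec[entry] = JsonCodecMaker.make

// Stream newline-separated JSON objects one at a time, so the 60+ GB file
// never has to be held in memory; the callback returns true to keep scanning.
val in = new FileInputStream("/home/user/repos/concepts/abcde/RS_2019-09")
try {
  scanJsonValuesFromStream(in) { (e: entry) =>
    // handle one parsed submission here, e.g. filter or accumulate it
    true
  }
} finally in.close()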
Thanks @andriyplokhotnuk, but now I'm hitting a NotSerializableException:

implicit val codec: JsonValueCodec[entry] = JsonCodecMaker.make
val subset = submissions_rdd.map(line => readFromArray(line.getBytes("UTF-8")))

org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: $anon$1

I ended up using sparkSession.read.json after all, but with an explicitly defined schema instead of the implicitly inferred one I had used before. Thanks!
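For reference, a minimal sketch of that final approach, assuming a spark-shell-style SparkSession named spark and the same field names as the entry case class above; nothing here is quoted from the thread:

import org.apache.spark.sql.types._

// Explicit schema, so Spark does not need a schema-inference pass over the 60+ GB file.
val schema = StructType(Seq(
  StructField("title", StringType),
  StructField("selftext", StringType),
  StructField("score", DoubleType),
  StructField("created_utc", DoubleType),
  StructField("subreddit", StringType),
  StructField("author", StringType)
))

val submissions = spark.read
  .schema(schema)
  .json("/home/user/repos/concepts/abcde/RS_2019-09")

submissions.select("author", "title", "score").show(5)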