Apache Spark: ArrayIndexOutOfBoundsException when writing data to Hive with Spark SQL


I am trying to process a text file and write it to a Hive table. During the insert I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 4, 127.0.0.1, executor 0): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
    at com.inndata.services.maintenance$$anonfun$2.apply(maintenance.scala:37)
    at com.inndata.services.maintenance$$anonfun$2.apply(maintenance.scala:37)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:315)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
    ... 8 more
Here is my code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

object maintenance {

  case class event(Entity_Status_Code: String, Entity_Status_Description: String, Status: String,
                   Event_Date: String, Event_Date2: String, Event_Date3: String, Event_Description: String)

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("maintenance").setMaster("local")
    conf.set("spark.debug.maxToStringFields", "10000000")
    val context = new SparkContext(conf)
    val sqlContext = new SQLContext(context)
    val hiveContext = new HiveContext(context)
    sqlContext.clearCache()
    //hiveContext.clearCache()
    //sqlContext.clearCache()

    import hiveContext.implicits._

    // Split each line on a single space and map the seven fields into the case class.
    val rdd = context.textFile("file:///Users/hadoop/Downloads/sample.txt")
      .map(line => line.split(" "))
      .map(x => event(x(0), x(1), x(2), x(3), x(4), x(5), x(6)))

    val personDF = rdd.toDF()
    personDF.show(10)
    personDF.registerTempTable("Maintenance")
    hiveContext.sql("insert into table default.maintenance select Entity_Status_Code, Entity_Status_Description, Status, Event_Date, Event_Date2, Event_Date3, Event_Description from Maintenance")
  }
}
When I comment out all the lines related to hiveContext and run locally (I mean just personDF.show()), it works fine. But when I run spark-submit with the hiveContext lines enabled, I get the error above.

Here is my sample data:

4287053 06218896 N 19801222 19810901 19881222 M171 
4287053 06218896 N 19801222 19810901 19850211 M170 
4289713 06222552 Y 19810105 19810915 19930330 SM02 
4289713 06222552 Y 19810105 19810915 19930303 M285 
4289713 06222552 Y 19810105 19810915 19921208 RMPN 
4289713 06222552 Y 19810105 19810915 19921208 ASPN 
4289713 06222552 Y 19810105 19810915 19881116 ASPN 
4289713 06222552 Y 19810105 19810915 19881107 M171

Add -1 to the split and it will solve your problem (on the line where you compute val rdd = ...): line.split(" ", -1)
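Applied to the code in the question, the change is only the extra limit argument on the rdd line (a sketch using the same context, path, and event case class as above):

    // The only change from the original is the ", -1" limit, which keeps trailing empty
    // fields instead of silently dropping them.
    val rdd = context.textFile("file:///Users/hadoop/Downloads/sample.txt")
      .map(line => line.split(" ", -1))
      .map(x => event(x(0), x(1), x(2), x(3), x(4), x(5), x(6)))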


Empty fields are being ignored in your split, and that is what causes the ArrayIndexOutOfBoundsException.
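A minimal sketch of the difference, assuming a hypothetical row whose trailing columns are empty:

    val line = "4289713 06222552 Y 19810105  "   // hypothetical row with missing trailing fields

    line.split(" ")       // Array(4289713, 06222552, Y, 19810105)          -> 4 elements, so x(4) throws
    line.split(" ", -1)   // Array(4289713, 06222552, Y, 19810105, "", "")  -> trailing empties preserved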

This is clearly a developer-side issue: "Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 at com.inndata.services.maintenance$$anonfun$2.apply(maintenance.scala:37)" points straight at the line where the split result is indexed.
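If some rows can also be shorter than seven fields (not just have empty trailing columns), a defensive variant that skips malformed lines instead of failing the whole job could look like the sketch below. This is an illustrative alternative under that assumption, not the fix proposed above:

    // Assumption: any line that does not yield at least 7 fields should be dropped
    // rather than crash the task.
    val safeRdd = context.textFile("file:///Users/hadoop/Downloads/sample.txt")
      .map(_.split(" ", -1))
      .filter(_.length >= 7)
      .map(x => event(x(0), x(1), x(2), x(3), x(4), x(5), x(6)))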