Apache Spark: NPE when reading ORC files using the Spark 1.4 API
I am reading a number of ORC files in Spark and processing them; the files are essentially Hive partitions. Most of the time the processing goes fine, but for a few files I get the following exception, and I don't know why. The same files work fine in Hive when queried with Hive.
DataFrame df = hiveContext.read().format("orc").load("/path/in/hdfs");
java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:402)
at org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:206)
at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$8.apply(OrcRelation.scala:238)
at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$8.apply(OrcRelation.scala:238)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:238)
at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:290)
at org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:288)
at org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
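A NullPointerException is always a bug. Not in your code, but in the Java program you are using. So please file a bug with Apache Spark.

I ran into the same error on Spark 1.6.1. We have not found the root cause yet, but the first finding is that only some Hive partitions fail to return data (even though they work and return data just fine in Hive itself). That means that if you remove the partition filter, or query other tables, everything looks fine.

You get this error if the data does not follow Spark's directory structure. Consider a table named "partitionedtable" that is partitioned on partitionCol1, partitionCol2, partitionCol3.

hdfs dfs -ls /path/in/hdfs/partitionedtable/
/path/in/hdfs/partitionedtable/partitionCol1=1/partitionCol2=11/partitionCol3=111/part-00000
/path/in/hdfs/partitionedtable/partitionCol1=2/partitionCol2=22/partitionCol3=222/part-00001
/path/in/hdfs/partitionedtable/partitionCol1=3/ --> this one does not have any data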
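One rough way to check a table for the layout problem described above is to walk its directory tree and report partition directories that stop short of the expected depth or contain no data files. The sketch below is only an illustration under several assumptions: PartitionLayoutCheck and findIncompletePartitions are hypothetical names (not part of Spark or Hive), it uses java.nio.file on a local or locally mounted path rather than HDFS, and it treats files whose names start with "_" or "." as non-data files. Against HDFS itself, the same traversal would have to go through Hadoop's FileSystem API instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not part of Spark): finds partition directories without data. */
final class PartitionLayoutCheck {

    private PartitionLayoutCheck() {}

    /**
     * Returns, in sorted order, every directory under tableRoot where the
     * key=value partition tree ends before partitionDepth levels, or where
     * a leaf partition directory holds no visible data file.
     */
    static List<Path> findIncompletePartitions(Path tableRoot, int partitionDepth)
            throws IOException {
        List<Path> incomplete = new ArrayList<>();
        collect(tableRoot, partitionDepth, incomplete);
        Collections.sort(incomplete);
        return incomplete;
    }

    private static void collect(Path dir, int remainingLevels, List<Path> out)
            throws IOException {
        List<Path> children;
        try (Stream<Path> listing = Files.list(dir)) {
            children = listing.collect(Collectors.toList());
        }

        if (remainingLevels == 0) {
            // Leaf partition directory: expect at least one visible regular file.
            boolean hasData = children.stream()
                    .anyMatch(p -> Files.isRegularFile(p) && !isHidden(p));
            if (!hasData) {
                out.add(dir);
            }
            return;
        }

        // Intermediate level: expect at least one key=value subdirectory.
        List<Path> subPartitions = children.stream()
                .filter(p -> Files.isDirectory(p)
                        && p.getFileName().toString().contains("="))
                .collect(Collectors.toList());
        if (subPartitions.isEmpty()) {
            out.add(dir); // the partition tree ends early, e.g. partitionCol1=3/
            return;
        }
        for (Path sub : subPartitions) {
            collect(sub, remainingLevels - 1, out);
        }
    }

    /** Assumption: names starting with '_' or '.' (such as _SUCCESS) are not data. */
    private static boolean isHidden(Path file) {
        String name = file.getFileName().toString();
        return name.startsWith("_") || name.startsWith(".");
    }
}
```

Run against a local copy of the layout above with a partition depth of 3, findIncompletePartitions should report only the partitionCol1=3 directory. Whether to delete such directories, fill them, or leave them out of the paths passed to load is a separate decision that depends on how the table is maintained.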