
Scala: reading a Spark ORC file with a corrupted map


I have ORC files in HDFS. One of the fields is a Map(String, String). Somehow, some rows have the value Map(null, null) in this field. A null map key is a serious error for Java, so when I try to access this field I get a NullPointerException.
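
A null key blows up even in plain java.util.TreeMap, which is what the ORC reader uses to build each map value (see TreeMap.put in the stack trace below). A minimal illustration of that behaviour:

val m = new java.util.TreeMap[String, String]()
m.put(null, "value")   // throws java.lang.NullPointerException in TreeMap.compare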

I want to read these files and replace this field with an empty map.
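
For reference, the end state I am after would look roughly like the sketch below (the column name "bad_map" is hypothetical; map() from org.apache.spark.sql.functions builds an empty map):

import org.apache.spark.sql.functions.map

// Sketch only: overwrite the corrupted field (hypothetically named "bad_map")
// with an empty map<string,string>. This can only run once the rows can be read at all.
val cleaned = df.withColumn("bad_map", map().cast("map<string,string>"))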

I tried this:

val df = spark.read.format("orc").load("/tmp/bad_orc")

def func(s: org.apache.spark.sql.Row): String = {
    try {
        if (s(14) == null) {   // the 14th column is the Map(String, String) column
            "Ok"
        } else {
            "Zero"
        }
    } catch {
        case x: Exception => "Erro"
    }
}

df.rdd.map(func).take(20)
When I run this script, I get this exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 417.0 failed 4 times, most recent failure: Lost task 0.3 in stage 417.0 (TID 97094, srvg1076.local.odkl.ru, executor 86): java.lang.NullPointerException
    at java.util.TreeMap.compare(TreeMap.java:1294)
    at java.util.TreeMap.put(TreeMap.java:538)
    at org.apache.orc.mapred.OrcMapredRecordReader.nextMap(OrcMapredRecordReader.java:507)
    at org.apache.orc.mapred.OrcMapredRecordReader.nextValue(OrcMapredRecordReader.java:554)
    at org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:104)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
When I try to access other columns of this ORC file, everything works fine.
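
For example, a projection that never touches the map column runs without errors (a sketch; "other_col" stands for any of the healthy column names):

// Reading other columns works; it is only touching the map column that triggers the NPE.
spark.read.format("orc").load("/tmp/bad_orc")
    .select("other_col")   // "other_col" is a hypothetical column name
    .show(20)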


How can I catch this exception, and how can I fix these files? Please help me.

Please provide sample data for the DataFrame.

I don't know how to reproduce how this file was created; I only get the NPE when I try to process it.