
Python job keeps failing with cryptic error message "org.xerial.snappy.SnappyIOException: [EMPTY_INPUT] Cannot decompress empty stream" - how to debug?

Tags: python, apache-spark, pyspark

I am using pyspark 2.4.0, and a large job keeps crashing with the following error message (when saving to Parquet or when trying to collect results):

    py4j.protocol.Py4JJavaError: An error occurred while calling o2495.collectToPython.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 184 in stage 290.0 failed 4 times, most recent failure: Lost task 184.3 in stage 290.0 (TID 17345, 53.62.154.250, executor 5): org.xerial.snappy.SnappyIOException: [EMPTY_INPUT] Cannot decompress empty stream
        at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:94)
        at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:59)
        at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:164)
        at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:163)
        at org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:209)
        at org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:698)
        at org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:696)
        at scala.Option.map(Option.scala:146)
        at org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:696)
        at org.apache.spark.storage.BlockManager.get(BlockManager.scala:820)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:875)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

My problem is that I have no idea which operation is causing the trouble. The error message gives no indication of it, and the stack trace does not contain any of my own code.


Any ideas what could cause this, or how I could find out what exactly is making the job fail?
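One way to at least localize the failing operation is to break the pipeline into named, individually materialized steps, so that the Spark UI attributes a failed stage to a concrete step instead of an anonymous collectToPython call. A minimal sketch (the input path and column names are placeholders, not from the original job):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("debug-snappy-failure").getOrCreate()
    sc = spark.sparkContext

    df = spark.read.parquet("/path/to/input")  # placeholder input

    # Give each job a human-readable label; the Spark UI shows it for
    # every stage the job spawns, so a failed stage points at a step.
    sc.setJobDescription("step 1: filter nulls")
    step1 = df.filter(df["some_col"].isNotNull())  # placeholder column
    step1.persist()
    step1.count()  # force materialization of just this step

    sc.setJobDescription("step 2: aggregate")
    step2 = step1.groupBy("some_key").count()  # placeholder column
    step2.count()

    sc.setJobDescription("step 3: write parquet")
    step2.write.mode("overwrite").parquet("/path/to/output")

With that in place, the job description attached to the failed stage in the Spark UI tells you which step to look at.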

While searching the web, I found a link:

It can be summarized as follows:

Spark 2.4 uses snappy-java 1.1.7.x, whose behavior differs from the 1.1.2.x version used in Spark 2.0.x. SnappyOutputStream in 1.1.2.x always writes a snappy header whether or not any value is written, but SnappyOutputStream in 1.1.7.x does not generate a header if no value is written into it. So in Spark 2.4, if an RDD caches an empty value, the memoryStore will not cache any bytes (no snappy header), and reading the block back then throws the empty-stream error.
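Based on that summary, the failure mode can be illustrated with a small sketch: cache an RDD in which some partitions end up completely empty, materialize the cache, then read it back. On an affected Spark 2.4.0 setup with snappy as the block codec (and, judging by the getRemoteValues frames in the stack trace, mainly when cached blocks are fetched from a remote executor) this is the pattern that can trigger the EMPTY_INPUT error:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("empty-cache-repro").getOrCreate()
    sc = spark.sparkContext

    # 10 partitions, but the filter leaves most of them completely empty.
    rdd = sc.parallelize(range(100), 10).filter(lambda x: x < 3)

    rdd.cache()
    rdd.count()    # materializes the cache; empty partitions store zero bytes (no snappy header)
    rdd.collect()  # reading the cached blocks back is where [EMPTY_INPUT] can surface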

Also, if you find a solution (other than downgrading Spark to 2.0.x), please do share it here.

Did you ever fix this or find a workaround?

I added some additional checks to prevent empty DataFrames from being written. Since then the problem apparently has not occurred again, but that is not really a "solution".

Downgrading the Spark version worked for me. Not sure whether this is a bug in the latest snappy release or something that can be tuned in the Spark configuration. Any idea?
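For reference, the kind of guard mentioned above ("checks to prevent writing empty DataFrames") might look like the following sketch; write_if_nonempty is a hypothetical helper, not the commenter's actual code:

    def write_if_nonempty(df, path):
        # Hypothetical guard, not the commenter's actual code: skip the
        # write entirely when the DataFrame has no rows, so no empty
        # snappy-compressed blocks get produced.
        if df.head(1):  # fetches at most one row; cheaper than a full count()
            df.write.mode("overwrite").parquet(path)
        else:
            print("skipping write to %s: DataFrame is empty" % path)

Another knob that may sidestep the issue entirely (untested here) is submitting the job with --conf spark.io.compression.codec=lz4, so that block compression avoids snappy altogether.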