PySpark: resolving org.apache.spark.SparkException: Job aborted due to stage failure
Hello, I am facing a problem with PySpark. df.show() still gives me a result, but when I use functions such as count() or groupBy(), it throws an error. I think the reason is that "df" is too large. Please help me fix it. Thank you.
import datetime
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("box") \
    .config("spark.driver.memory", "25g") \
    .getOrCreate()
basepath = '/mnt/raw_data/play/log_stream/playstats_v100/topic=play_map_play_vod'
path = ['/mnt/raw_data/play/log_stream/playstats_v100/topic=play_map_play_vod/date=2021-01*']
df = spark.read.option("basePath", basepath).parquet(*path)
df.count()
Error:
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-321-3c9a60fd698f> in <module>()
----> 1 df.count()

~/anaconda3/lib/python3.6/site-packages/pyspark/sql/dataframe.py in count(self)
    453         2
    454         """
--> 455         return int(self._jdf.count())
    456
    457     @ignore_unicode_prefix

~/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258
   1259         for temp_arg in temp_args:

~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

~/anaconda3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329                 else:
    330                     raise Py4JError(

Py4JJavaError: An error occurred while calling o2635.count. :
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 312 in stage 1079.0 failed 1 times, most recent failure: Lost
task 312.0 in stage 1079.0 (TID 54105, localhost, executor driver):
org.apache.hadoop.fs.FSError: java.io.IOException: No such device or
address at
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:163)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.readFully(DataInputStream.java:195) at
java.io.DataInputStream.readFully(DataInputStream.java:169) at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:151)
at
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) at
org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:65)
at
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:443)
at
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:401)
at
org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:106)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:404)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:345)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:128)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown
Source) at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown
Source) at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
Source) at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748) Caused by:
java.io.IOException: No such device or address at
java.io.FileInputStream.readBytes(Native Method) at
java.io.FileInputStream.read(FileInputStream.java:255) at
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:156)
... 32 more Driver stacktrace: at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257) at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) at
org.apache.spark.rdd.RDD.collect(RDD.scala:944) at
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297)
at
org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2775)
at
org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2774)
at
org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3259)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3258) at
org.apache.spark.sql.Dataset.count(Dataset.scala:2774) at
sun.reflect.GeneratedMethodAccessor369.invoke(Unknown Source) at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282) at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79) at
py4j.GatewayConnection.run(GatewayConnection.java:238) at
java.lang.Thread.run(Thread.java:748) Caused by:
org.apache.hadoop.fs.FSError: java.io.IOException: No such device or
address at
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:163)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.readFully(DataInputStream.java:195) at
java.io.DataInputStream.readFully(DataInputStream.java:169) at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:151)
at
org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) at
org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:65)
at
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:443)
at
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:401)
at
org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:106)
at
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:404)
at
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:345)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:128)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
at
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown
Source) at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown
Source) at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
Source) at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more Caused by: java.io.IOException: No such device or
address at java.io.FileInputStream.readBytes(Native Method) at
java.io.FileInputStream.read(FileInputStream.java:255) at
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:156)
... 32 more