Loop in PySpark causing SparkException


In Zeppelin with PySpark. Before I found the right way to do this (last over a Window), I had a loop that extends the previous row's value into the current row, one row at a time (I know loops are bad practice). However, after a few hundred iterations it fails with a nullPointerException before reaching the best case of 0 remaining.

To avoid that error (before I found the last approach), I let the loop run a few hundred times down to a midpoint of condition = 1000, dumped the results, ran it again down to condition = 500, rinse and repeat until condition = 0.
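
For reference, the "last over a Window" approach mentioned above amounts to roughly the following. This is only a sketch, not the exact code, and it assumes the same id / myTime / target columns used in the loop below:

from pyspark.sql import Window
from pyspark.sql.functions import col, last, lit, when

w = Window.partitionBy("id").orderBy("myTime").rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Treat "unknown" as missing, then carry the last known target forward in a single pass.
# (Rows with no earlier known value come out as null here, unlike the loop.)
filled = (myDF
    .withColumn("targetKnown", when(col("target") == "unknown", lit(None)).otherwise(col("target")))
    .withColumn("target", last("targetKnown", ignorenulls=True).over(w))
    .drop("targetKnown"))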

from pyspark.sql import Window
from pyspark.sql.functions import col, lag, when

def extendTarget(myDF, loop, lessThan):
    i = myDF.filter(col("target") == "unknown").count()
    while (i > lessThan):
        cc = loop
        while (cc > 0):
            myDF = myDF.withColumn("targetPrev", lag("target", 1).over(Window.partitionBy("id").orderBy("myTime")))
            myDF = myDF.withColumn("targetNew", when(col("target") == "unknown", col("targetPrev")).otherwise(col("target")))
            myDF = myDF.select(
                "id",
                "myTime",
                col("targetNew").alias("target"))
            cc = cc - 1
        i = myDF.filter(col("target") == "unknown").count()
        print i
    return myDF

myData = spark.read.load(myPath)
myData = extendTarget(myData, 20, 0)
myData.write.parquet(myPathPart1)

I expect this to take a very long time (because I am doing it the wrong way), but I do not expect it to throw an exception.

Output (given inputs (myData, 20, 0)):
38160
22130
11375
6625
5085
4522
4216
3936
3662
3419
3202

Error 
Py4JJavaError: An error occurred while calling o26814.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 32 in stage 1539.0 failed 4 times, most recent failure: Lost task 32.3 in stage 1539.0 (TID XXXX, ip-XXXX, executor 17): ExecutorLostFailure (executor 17 exited caused by one of the running tasks) Reason: Container from a bad node: container_XXXX_0001_01_000033 on host: ip-XXXX. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_XXXX_0001_01_000033
Exit code: 50
Stack trace: ExitCodeException exitCode=50: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
    at org.apache.hadoop.util.Shell.run(Shell.java:869)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:235)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:83)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)


Container exited with a non-zero exit code 50
.
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2262)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2211)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2200)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:777)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
    at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2830)
    at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2829)
    at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
    at org.apache.spark.sql.Dataset.count(Dataset.scala:2829)
    at sun.reflect.GeneratedMethodAccessor388.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling o26814.count.\n', JavaObject id=o26815), <traceback object at 0x7efc521b11b8>)

Sample input:

| id | time | target |
| a  | 1:00 | 1      |
| a  | 1:01 | unknown|
| a  | .    | .      |
| a  | 5:00 | unknown|
| a  | 5:01 | 2      |

Desired output, with the target carried forward:

| id | time | target |
| a  | 1:00 | 1      |
| a  | 1:01 | 1      |
| a  | .    | 1      |
| a  | 5:00 | 1      |
| a  | 5:01 | 2      |
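
The version below keeps the loop from the question but sets a checkpoint directory and checkpoints the DataFrame after every outer pass. Each withColumn/select inside the loop stacks another step onto the logical plan, and checkpointing truncates that ever-growing lineage, which is the likely reason the executors above eventually die:
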
spark.sparkContext.setCheckpointDir(".../myCheckpointsPath/")

def extendTarget(myDF, loop, lessThan):
    i = myDF.filter(col("target") == "unknown").count()
    while (i > lessThan):
        cc = loop
        while (cc > 0):
            # Shift the previous row's target forward by one row per pass
            myDF = myDF.withColumn("targetPrev", lag("target", 1).over(Window.partitionBy("id").orderBy("myTime")))
            myDF = myDF.withColumn("targetNew", when(col("target") == "unknown", col("targetPrev")).otherwise(col("target")))
            myDF = myDF.select(
                "id",
                "myTime",
                col("targetNew").alias("target"))
            cc = cc - 1
        i = myDF.filter(col("target") == "unknown").count()
        print i
        # Checkpoint once per outer pass so the plan is materialized and the lineage is cut
        myDF = myDF.checkpoint()
    return myDF

myData = spark.read.load(myPath)
myData = extendTarget(myData, 20, 0)
myData.write.parquet(myPathPart1)
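
As a sanity check, here is a minimal end-to-end sketch with made-up data (the tiny DataFrame and the /tmp checkpoint path are hypothetical); everything else reuses extendTarget as defined above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/myCheckpointsPath")

# Tiny frame in the same shape as the sample tables: one id, gaps marked 'unknown'
demo = spark.createDataFrame(
    [("a", 1, "1"), ("a", 2, "unknown"), ("a", 3, "unknown"), ("a", 4, "2")],
    ["id", "myTime", "target"])

demo = extendTarget(demo, 20, 0)  # prints the remaining 'unknown' count after each outer pass
demo.show()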