
How to check in Python whether a cell value of a pyspark dataframe column is None or NaN inside a UDF, in order to implement forward fill?


I am basically trying to do a forward-fill imputation. Below is the code.

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

df = spark.createDataFrame([(1, 1, None), (1, 2, 5), (1, 3, None), (1, 4, None), (1, 5, 10), (1, 6, None)], ("session", "timestamp", "id"))

PRV_RANK = 0.0
def fun(rank):
    # How to check if None or NaN?
    if rank is None or rank is NaN:
        return PRV_RANK
    else:
        PRV_RANK = rank
        return rank

fuN = F.udf(fun, IntegerType())

df.withColumn("ffill_new", fuN(df["id"])).show()
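I am getting a strange error in the log.

Edit: The question is about how to identify null and NaN values in a Spark dataframe using Python.

Edit: I assumed that the line of code that checks for NaN and null was causing the problem, so I titled the question accordingly.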
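On the check itself, here is a minimal sketch in plain Python. The helper name `is_missing` is hypothetical and not part of the question. It assumes the UDF receives Spark nulls as `None` and that NaN can only show up as a Python `float`, which would be the case for a float or double column. Note that `NaN` is not a built-in Python name, and `is` compares object identity, so `rank is NaN` is not a reliable test even when a NaN constant is imported.

```python
import math


def is_missing(value):
    """Return True when value is None or a floating-point NaN."""
    if value is None:
        return True
    # Only floats can be NaN; math.isnan would raise TypeError on
    # non-numeric input, so guard with isinstance first.
    return isinstance(value, float) and math.isnan(value)


nan = float("nan")
# Identity and equality both fail for NaN, which is why math.isnan is needed.
print(nan is float("nan"), nan == nan)
print(is_missing(None), is_missing(nan), is_missing(5))
```

Inside the UDF from the question, the condition would then read `if is_missing(rank):`.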

Traceback (most recent call last):

  File "", line 1, in
    df_na.withColumn("ffill_new", forwardFill(df_na["id"])).show()

  File "C:\Spark\python\pyspark\sql\dataframe.py", line 318, in show
    print(self._jdf.showString(n, 20))

  File "C:\Spark\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)

  File "C:\Spark\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)

  File "C:\Spark\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o806.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 47.0 failed 1 times, most recent failure: Lost task 0.0 in stage 47.0 (TID 83, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 174, in main
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 169, in process
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 106, in
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 92, in
  File "C:\Spark\python\lib\pyspark.zip\pyspark\worker.py", line 70, in
  File "", line 5, in forwardfil
UnboundLocalError: local variable 'PRV_RANK' referenced before assignment

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2386)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
    at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2788)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Datase
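The `UnboundLocalError` in the trace above does not come from the None/NaN check. Because the function assigns to `PRV_RANK`, Python treats that name as local to the whole function body, so reading it before the assignment fails. The following plain-Python sketch reproduces this without Spark; the function names `broken` and `fixed` are hypothetical and used only for illustration.

```python
PRV_RANK = 0


def broken(rank):
    # The assignment further down makes PRV_RANK a local variable for the
    # entire function, so this read fails when rank is None.
    if rank is None:
        return PRV_RANK
    PRV_RANK = rank
    return rank


def fixed(rank):
    global PRV_RANK  # read and rebind the module-level name instead
    if rank is None:
        return PRV_RANK
    PRV_RANK = rank
    return rank


try:
    broken(None)
except UnboundLocalError as exc:
    print("broken:", exc)

print([fixed(v) for v in [None, 5, None, None, 10, None]])
```

Declaring the name `global` removes the error, but it is probably not a dependable way to forward fill in Spark: each Python worker process keeps its own copy of the global, and a UDF has no guarantee about the order in which it sees rows.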
df.withColumn("ffill_new", f.UserDefinedFunction(lambda x: x or 0, IntegerType())(df["id"])).show()
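The one-liner above replaces `None` with 0 rather than carrying the previous value forward. For an actual forward fill, one possible approach is a window function instead of a stateful UDF. The sketch below is an assumption-laden illustration, not the answer's method: `forward_fill` is a hypothetical pure-Python reference for the intended behaviour, and `forward_fill_spark` is a hypothetical wrapper that assumes the `session`, `timestamp`, and `id` columns from the question and imports pyspark only when called.

```python
def forward_fill(values, initial=None):
    """Pure-Python reference: carry the last non-None value forward."""
    filled, last = [], initial
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def forward_fill_spark(df, value_col="id", order_col="timestamp",
                       group_col="session", out_col="ffill_new"):
    """Same idea with Spark window functions (pyspark imported lazily)."""
    from pyspark.sql import Window, functions as F

    # Frame: every row from the start of the session up to the current row.
    w = (Window.partitionBy(group_col)
               .orderBy(order_col)
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    # last(..., ignorenulls=True) returns the most recent non-null value.
    return df.withColumn(out_col, F.last(value_col, ignorenulls=True).over(w))


print(forward_fill([None, 5, None, None, 10, None]))
```

With the dataframe from the question this would be called as `forward_fill_spark(df).show()`. Leading nulls stay null because there is no earlier value to carry; wrapping the result in `F.coalesce(..., F.lit(0))` would be one way to get a default of 0. Note that `ignorenulls` skips nulls only, so NaN values in a double column would need to be converted to null first.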