
Python: Find the latest non-null value per row in PySpark


I have a PySpark dataframe like this:

+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|id        |201806|201807|201808|201809|201810|201811|201812|201901|201902|201903|201904|201905|201906|
+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+
|  1       |    15|    15|    15|    15|    15|    15|    15|    15|    15|  null|    15|    15|    15|
|  2       |     4|     4|     4|     4|     4|     4|     4|     4|     4|     4|     4|     4|     4|
|  3       |     7|     7|     7|     7|     7|     7|     7|     7|  null|  null|  null|  null|  null|
+----------+------+------+------+------+------+------+------+------+------+------+------+------+------+
From this data, I want to find the latest non-null value for each row.

I expect the following result:

+----------+------+
|id        |latest|
+----------+------+
|  1       |    15|
|  2       |     4|
|  3       |     7|
+----------+------+
I followed this, but I could not make it work per row.

I used

df.select([last(x, ignorenulls=True).alias(x) for x in df.columns])

but this code only works per column, and I want to do the same thing row-wise.

Assuming your columns are ordered from oldest to newest, you can use the code below, which uses coalesce to get the latest value:

from pyspark.sql.functions import coalesce

# Reverse the month columns so coalesce returns the newest non-null value
df.select('id', coalesce(*[i for i in df.columns[::-1] if i != 'id']).alias('latest')).show()
Output:

+---+------+
| id|latest|
+---+------+
|  1|    15|
|  2|     4|
|  3|     7|
+---+------+


What have you tried for the same? I have updated…