Sum of PySpark columns, ignoring NaN values


I have a PySpark DataFrame that looks like this:

+---+----+----+
| id|col1|col2|
+---+----+----+
|  1|   1|   3|
|  2| NaN|   4|
|  3|   3|   5|
+---+----+----+
I want to sum col1 and col2 so that the result looks like this:

+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
|  1|   1|   3|  4|
|  2| NaN|   4|  4|
|  3|   3|   5|  8|
+---+----+----+---+
Here is what I tried:

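import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# None in a numeric pandas column becomes NaN (float64)
test = pd.DataFrame({
    'id': [1, 2, 3],
    'col1': [1, None, 3],
    'col2': [3, 4, 5]
})
test = spark.createDataFrame(test)
# Plain addition propagates NaN into the sum
test.withColumn('sum', F.col('col1') + F.col('col2')).show()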
This code returns:

+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
|  1|   1|   3|  4|
|  2| NaN|   4|NaN| # <-- I want a 4 here, not this NaN
|  3|   3|   5|  8|
+---+----+----+---+
Use F.nanvl to replace NaN with a given value (0 here).

Following up on the comments, this version also keeps the sum as NaN when both columns are NaN:

result = test.withColumn('sum',
    F.when(
        # Both inputs are NaN: keep NaN as the sum
        F.isnan(F.col('col1')) & F.isnan(F.col('col2')),
        F.lit(float('nan'))
    ).otherwise(
        # Otherwise treat a NaN input as 0 before adding
        F.nanvl(F.col('col1'), F.lit(0)) + F.nanvl(F.col('col2'), F.lit(0))
    )
)
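
To check expected values without a Spark session, the row-wise rule used above (treat NaN as 0, but return NaN when every input is NaN) can be modelled in plain Python. This is only a sketch of the logic, not PySpark code: the helper name nan_safe_sum is made up for illustration, and it assumes the inputs are floats with no None/null values, which Spark treats differently from NaN.

```python
import math


def nan_safe_sum(values):
    """Sum floats, treating NaN as 0; return NaN only if every value is NaN."""
    present = [v for v in values if not math.isnan(v)]
    if not present:
        # Nothing but NaN (or no values at all): keep NaN as the result
        return float("nan")
    return sum(present)


# The (col1, col2) pairs from the question
rows = [(1.0, 3.0), (float("nan"), 4.0), (3.0, 5.0)]

# Plain addition: NaN propagates, as in the question's output
naive = [a + b for a, b in rows]

# NaN-ignoring sum: matches the desired output
safe = [nan_safe_sum(r) for r in rows]

print(naive)
print(safe)
```

Because the helper takes any sequence of values, the same rule extends to more than two columns; in PySpark that would mean building the equivalent expression from F.isnan and F.nanvl over each column.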