Python: sum of PySpark columns ignoring NaN values
I have a PySpark DataFrame like the following:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| 1| 3|
| 2| NaN| 4|
| 3| 3| 5|
+---+----+----+
I want to sum col1 and col2 so that the result looks like this:
+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
| 1| 1| 3| 4|
| 2| NaN| 4| 4|
| 3| 3| 5| 8|
+---+----+----+---+
Here is what I have tried:
import pandas as pd
import pyspark.sql.functions as F  # needed for F.col below

test = pd.DataFrame({
    'id': [1, 2, 3],
    'col1': [1, None, 3],  # None becomes NaN in this float column
    'col2': [3, 4, 5]
})
test = spark.createDataFrame(test)
test.withColumn('sum', F.col('col1') + F.col('col2')).show()
This code returns:
+---+----+----+---+
| id|col1|col2|sum|
+---+----+----+---+
| 1| 1| 3| 4|
| 2| NaN| 4|NaN| # <-- I want a 4 here, not this NaN
| 3| 3| 5| 8|
+---+----+----+---+
Answer: use F.nanvl to replace NaN with a given value (here, 0).
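A minimal sketch of that approach (assuming the test DataFrame built above and pyspark.sql.functions imported as F):

import pyspark.sql.functions as F

# nanvl(col, fallback) returns col unless it is NaN, in which case it
# returns fallback, so NaN no longer propagates through the addition.
result = test.withColumn(
    'sum',
    F.nanvl(F.col('col1'), F.lit(0)) + F.nanvl(F.col('col2'), F.lit(0))
)
result.show()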
And, per the comment, to keep the sum as NaN when both columns are NaN:
result = test.withColumn(
    'sum',
    F.when(
        # If both columns are NaN, keep the sum as NaN ...
        F.isnan(F.col('col1')) & F.isnan(F.col('col2')),
        F.lit(float('nan'))
    ).otherwise(
        # ... otherwise treat each NaN as 0 and add the columns.
        F.nanvl(F.col('col1'), F.lit(0)) + F.nanvl(F.col('col2'), F.lit(0))
    )
)
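With the sample data above, result.show() should match the desired table from the question (row 2 sums to 4 rather than NaN), and a row where both col1 and col2 were NaN would keep NaN as its sum. Note that F.nanvl handles floating-point NaN specifically; if the columns contained SQL nulls instead, F.coalesce would be the tool to reach for, since Spark treats NaN and null as different things.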