Mean of different columns ignoring nulls, Spark Scala
I have a dataframe with several columns, and what I am trying to do is compute the mean of those columns while ignoring null values. For example:
+--------+-------+---------+-------+
| Baller | Power | Vision | KXD |
+--------+-------+---------+-------+
| John | 5 | null | 10 |
| Bilbo | 5 | 3 | 2 |
+--------+-------+---------+-------+
The output must be:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | 7.5 |
| Bilbo | 5 | 3 | 2 | 3.33 |
+--------+-------+---------+-------+-----------+
What I am doing:
val a_cols = Array(col("Power"), col("Vision"), col("KXD"))
val avgFunc = a_cols.foldLeft(lit(0)){(x, y) => x+y}/a_cols.length
val avg_calc = df.withColumn("MEAN", avgFunc)
But I am getting null values:
+--------+-------+---------+-------+-----------+
| Baller | Power | Vision | KXD | MEAN |
+--------+-------+---------+-------+-----------+
| John | 5 | null | 10 | null |
| Bilbo | 5 | 3 | 2 | 3.33 |
+--------+-------+---------+-------+-----------+
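The null in `MEAN` appears because arithmetic on a null column yields null, so `lit(0) + Vision` is null for John. One way to fix this without a join (a sketch, reusing the `df` and column names from the question) is to coalesce each null to 0 for the sum while counting only the non-null entries per row:

```scala
// Null-safe row-wise mean: sum the columns with nulls treated as 0,
// then divide by the number of non-null values in that row.
import org.apache.spark.sql.functions._

val cols = Seq("Power", "Vision", "KXD")

// Numerator: per-row sum with nulls coalesced to 0.
val sumExpr = cols.map(c => coalesce(col(c), lit(0))).reduce(_ + _)

// Denominator: per-row count of non-null columns.
val cntExpr = cols.map(c => when(col(c).isNotNull, lit(1)).otherwise(lit(0))).reduce(_ + _)

val withMean = df.withColumn("MEAN", sumExpr / cntExpr)
```

Note that if every column in a row is null, `cntExpr` is 0 and the division yields null, which matches Spark's usual semantics for an empty mean.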
You can explode the columns, group by and take the average, then join back to the original dataframe on the Baller column:
val result = df.join(
  df.select(
    col("Baller"),
    explode(array(col("Power"), col("Vision"), col("KXD")))
  ).groupBy("Baller").agg(mean("col").as("MEAN")),
  Seq("Baller")
)
result.show
+------+-----+------+---+------------------+
|Baller|Power|Vision|KXD| MEAN|
+------+-----+------+---+------------------+
| John| 5| null| 10| 7.5|
| Bilbo| 5| 3| 2|3.3333333333333335|
+------+-----+------+---+------------------+
What is your Spark version? Hi! My version is 2.2.1.