Mean of different columns ignoring nulls, Spark Scala

I have a dataframe with several columns, and what I'm trying to do is compute the row-wise mean of those columns while ignoring null values. For example:

+--------+-------+---------+-------+
| Baller | Power | Vision  | KXD   |
+--------+-------+---------+-------+
| John   |   5   |    null |   10  |
| Bilbo  |   5   |    3    |    2  |
+--------+-------+---------+-------+
The output must be:

+--------+-------+---------+-------+-----------+
| Baller | Power | Vision  | KXD   | MEAN      |
+--------+-------+---------+-------+-----------+
| John   |   5   |    null |   10  |    7.5    |
| Bilbo  |   5   |    3    |    2  |    3.33   |
+--------+-------+---------+-------+-----------+
Here is what I'm doing:

import org.apache.spark.sql.functions._

val a_cols = Array(col("Power"), col("Vision"), col("KXD"))

// null + anything is null in Spark SQL, so a single null column
// makes the whole sum (and therefore the mean) null
val avgFunc = a_cols.foldLeft(lit(0)){(x, y) => x + y} / a_cols.length

val avg_calc = df.withColumn("MEAN", avgFunc)
But I get null for any row that contains a null, because null propagates through arithmetic:

+--------+-------+---------+-------+-----------+
| Baller | Power | Vision  | KXD   | MEAN      |
+--------+-------+---------+-------+-----------+
| John   |   5   |    null |   10  |    null   |
| Bilbo  |   5   |    3    |    2  |    3.33   |
+--------+-------+---------+-------+-----------+
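For reference, here is a minimal way to build this example dataframe (a sketch assuming a local SparkSession; the setup is mine, not from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Option.empty[Int] becomes a null cell in the resulting dataframe
val df = Seq(
  ("John",  Option(5), Option.empty[Int], Option(10)),
  ("Bilbo", Option(5), Option(3),         Option(2))
).toDF("Baller", "Power", "Vision", "KXD")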

You can explode the columns, group by Baller and take the mean, then join back to the original dataframe on the Baller column:

val result = df.join(
    df.select(
        col("Baller"),
        // explode yields one row per array element (output column is named "col")
        explode(array(col("Power"), col("Vision"), col("KXD")))
    ).groupBy("Baller")
     // mean skips nulls, so null cells simply don't count
     .agg(mean("col").as("MEAN")),
    Seq("Baller")
)

result.show
+------+-----+------+---+------------------+
|Baller|Power|Vision|KXD|              MEAN|
+------+-----+------+---+------------------+
|  John|    5|  null| 10|               7.5|
| Bilbo|    5|     3|  2|3.3333333333333335|
+------+-----+------+---+------------------+
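For completeness, a join-free alternative is also possible (a sketch on my part, not part of the answer above): coalesce nulls to 0 in the sum and divide by the per-row count of non-null columns, reusing a_cols from the question:

import org.apache.spark.sql.functions._

// Per-row count of non-null columns
val nonNullCount = a_cols.map(c => when(c.isNotNull, lit(1)).otherwise(lit(0))).reduce(_ + _)

// Sum with nulls coalesced to 0 so they no longer poison the addition
val nonNullSum = a_cols.map(c => coalesce(c, lit(0))).reduce(_ + _)

// If every column in a row is null, the division is 0/0 and yields null
val result2 = df.withColumn("MEAN", nonNullSum / nonNullCount)

This keeps the computation per-row and avoids the shuffle that the explode + groupBy + join approach incurs.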

What's your Spark version?
Hi! My version is 2.2.1.