
Spark Scala SQL: averaging over non-null columns


How can I compute, for a dataframe df, the average of the columns in the array cols using only the non-null values? I can compute it across all the columns, but as soon as any value is null the result becomes null:

val cols = Array($"col1", $"col2", $"col3")
df.withColumn("avgCols", cols.foldLeft(lit(0)){(x, y) => x + y} / cols.length)

I don't want to use na.fill, because I want to preserve the true average.

I think you can do it like this:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{col, struct, udf}

    val cols = Array("col1", "col2", "col3")
    def countAvg =
      udf((data: Row) => {
        // keep only the positions whose value is not null
        val notNullIndices = cols.indices.filterNot(i => data.isNullAt(i))
        // divide by the count of non-null values
        notNullIndices.map(i => data.getDouble(i)).sum / notNullIndices.length
      })

    df.withColumn("seqNull", struct(cols.map(col): _*))
      .withColumn("avg", countAvg(col("seqNull")))
      .show(truncate = false)
But be careful: here the average is computed only over the non-null elements.
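The averaging logic inside the UDF can be sketched in plain Scala (no Spark needed), using `Option[Double]` to stand in for a nullable column value. Note one deliberate difference in this sketch: for an all-null row it returns `None`, whereas the UDF above would divide by zero and produce `NaN`.

```scala
// Plain-Scala sketch of the UDF's logic; Option[Double] models a nullable column.
def avgNonNull(values: Seq[Option[Double]]): Option[Double] = {
  val present = values.flatten            // drop the nulls
  if (present.isEmpty) None               // all-null row: no average to report
  else Some(present.sum / present.length) // divide by the non-null count only
}

println(avgNonNull(Seq(Some(1.0), None, Some(3.0)))) // Some(2.0)
```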

If you need a solution that behaves exactly like the code in your question:

    val cols = Array("col1", "col2", "col3")
    def countAvg =
      udf((data: Row) => {
        val notNullIndices = cols.indices.filterNot(i => data.isNullAt(i))
        // divide by the total number of columns, nulls included
        notNullIndices.map(i => data.getDouble(i)).sum / cols.length
      })

    df.withColumn("seqNull", struct(cols.map(col): _*))
      .withColumn("avg", countAvg(col("seqNull")))
      .show(truncate = false)
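A quick worked example (plain Scala, no Spark) of how the two variants differ on a hypothetical row with values (1.0, null, 3.0) across three columns:

```scala
// Row values 1.0, null, 3.0 across 3 columns, modeled with Option[Double].
val values  = Seq(Some(1.0), None, Some(3.0))
val present = values.flatten

val avgNonNull = present.sum / present.length // first variant:  4.0 / 2 = 2.0
val avgAllCols = present.sum / values.length  // second variant: 4.0 / 3 ≈ 1.33

println(avgNonNull) // 2.0
println(avgAllCols)
```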