Spark Scala SQL: taking the average of non-null columns
How do I compute the average of the columns in the array cols of dataframe df, using only the non-null values? I can take the average over all columns, but the result is null whenever any of the values is null:
val cols = Array($"col1", $"col2", $"col3")
df.withColumn("avgCols", cols.foldLeft(lit(0)){(x, y) => x + y} / cols.length)
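The null result comes from SQL's null-propagating arithmetic: adding null to anything yields null, so one null column poisons the whole fold. A minimal plain-Scala sketch of that semantics (modelling a nullable column value as Option[Double]; not Spark code, illustration only):

```scala
// SQL-style addition: if either operand is "null" (None), the result is None.
def plus(a: Option[Double], b: Option[Double]): Option[Double] =
  for (x <- a; y <- b) yield x + y

val fullRow     = Seq(Option(1.0), Option(2.0), Option(3.0))
val rowWithNull = Seq(Option(1.0), None, Option(3.0))

val sumFull     = fullRow.reduce(plus)     // Some(6.0)
val sumWithNull = rowWithNull.reduce(plus) // None -- a single null poisons the sum
```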
I don't want to use na.fill, because I want to keep the true average. I think you can do it like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

val cols = Array("col1", "col2", "col3")

def countAvg =
  udf((data: Row) => {
    val notNullIndices = cols.indices.filterNot(i => data.isNullAt(i))
    notNullIndices.map(i => data.getDouble(i)).sum / notNullIndices.length
  })

df.withColumn("seqNull", struct(cols.map(col): _*))
  .withColumn("avg", countAvg(col("seqNull")))
  .show(truncate = false)
But be careful: here the average is computed over the non-null elements only.
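The difference between the two denominators can be sketched in plain Scala, again modelling the row's values as Option[Double] (illustration only, no Spark needed):

```scala
val row: Seq[Option[Double]] = Seq(Option(2.0), None, Option(4.0))
val present = row.flatten                   // drop the nulls: Seq(2.0, 4.0)

val avgNonNull = present.sum / present.size // 6.0 / 2 = 3.0 (first variant)
val avgOverAll = present.sum / row.size     // 6.0 / 3 = 2.0 (second variant)
```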
If you need exactly the same behaviour as in your code (dividing by the total number of columns):
val cols = Array("col1", "col2", "col3")

def countAvg =
  udf((data: Row) => {
    val notNullIndices = cols.indices.filterNot(i => data.isNullAt(i))
    notNullIndices.map(i => data.getDouble(i)).sum / cols.length
  })

df.withColumn("seqNull", struct(cols.map(col): _*))
  .withColumn("avg", countAvg(col("seqNull")))
  .show(truncate = false)