Apache Spark: counting null values in a Spark DataFrame
I am new to Spark, and I want to compute the null-value rate of each column (I have 200 columns). My function is as follows:
def nullCount(dataFrame: DataFrame): Unit = {
  val cols = dataFrame.columns
  val total = dataFrame.count()
  println("Null value rate of each column:")
  // One full pass over the RDD per column -- 200 columns means 200 scans
  for (i <- cols.indices) {
    val nullRate = dataFrame.rdd.filter(r => r(i) == -900).count.toDouble / total
    println(cols(i), nullRate)
  }
}
But I found it is too slow. Is there a more efficient way?
Adapted from:
Using -900:
df.select(df.columns.map(
c => (count(when(col(c) === -900, col(c))) / count("*")).alias(c)): _*)
I set null values to -900 to avoid losing information during model training.
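For reference, a minimal self-contained sketch of the single-pass approach (the sample data, column names, and local SparkSession setup are assumptions for illustration; the -900 sentinel follows the question). The key point is that one `select` with an aggregate expression per column runs as a single job, instead of one RDD scan per column:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, when}

object NullRate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("null-rate")
      .getOrCreate()
    import spark.implicits._

    // Toy data: -900 is the sentinel the question uses in place of null
    val df = Seq((1, -900), (-900, 2), (3, 4)).toDF("a", "b")

    // count(when(cond, col)) counts only the rows matching cond, because
    // `when` without `otherwise` yields null, and count() skips nulls.
    // All 200 aggregates are computed in one pass over the data.
    val rates = df.select(df.columns.map(
      c => (count(when(col(c) === -900, col(c))) / count("*")).alias(c)): _*)

    rates.show()
    spark.stop()
  }
}
```

With the toy data above, each column has one sentinel out of three rows, so both rates come out to roughly 0.33. The same `rates` row can be collected and zipped with `df.columns` if you need the values back on the driver.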