
Apache Spark: counting null values in a Spark DataFrame


I'm new to Spark, and I want to compute the null-value rate of each column (I have 200 columns). My function is as follows:

def nullCount(dataFrame: DataFrame): Unit = {
  val cols = dataFrame.columns
  val total = dataFrame.count()
  println("Below are the null value rates of each column")
  for (i <- cols.indices) {
    // -900 is the sentinel value used in place of null
    val nullRate = dataFrame.rdd.filter(r => r(i) == -900).count.toDouble / total
    println(cols(i), nullRate)
  }
}

But I found it too slow. Is there a more efficient way to do this?

Adapted from:

Using -900:

df.select(df.columns.map(
  c => (count(when(col(c) === -900, col(c))) / count("*")).alias(c)): _*)

I set null values to -900 to avoid losing information during model training.
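To make the answer above concrete, here is a minimal, self-contained sketch of the one-pass approach on a toy DataFrame. The helper name `sentinelRates` and the sample data are my own illustrations, not from the original post; the key point is that a single `select` with one aggregate expression per column scans the data once, instead of launching one RDD job per column as the loop in the question does.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, when}

// Hypothetical helper: compute, in one pass, the fraction of rows in each
// column that equal the given sentinel value (here standing in for null).
def sentinelRates(df: DataFrame, sentinel: Int): Map[String, Double] = {
  // One aggregate column per input column; Spark evaluates them all
  // in a single scan of the data.
  val row = df.select(df.columns.map(
    c => (count(when(col(c) === sentinel, col(c))) / count("*")).alias(c)): _*
  ).head()
  df.columns.zipWithIndex.map { case (c, i) => c -> row.getDouble(i) }.toMap
}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("null-rate")
  .getOrCreate()
import spark.implicits._

// Toy data: -900 stands in for missing values, as in the question.
val df = Seq((1, -900), (2, 3), (-900, -900), (4, 5)).toDF("a", "b")

val rates = sentinelRates(df, -900)
// Column "a" has 1 sentinel out of 4 rows, column "b" has 2 out of 4.
```

If the columns held real nulls rather than a sentinel, the condition would be `col(c).isNull` instead of `col(c) === sentinel`; the single-scan structure is the same either way.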