Scala: How do I get the names of columns whose values are all null?

Tags: scala, apache-spark, apache-spark-sql

I can't figure out how to get the names of the columns whose values are all null.

For example,

case class A(name: String, id: String, email: String, company: String)

val e1 = A("n1", null, "n1@c1.com", null)
val e2 = A("n2", null, "n2@c1.com", null)
val e3 = A("n3", null, "n3@c1.com", null)
val e4 = A("n4", null, "n4@c2.com", null)
val e5 = A("n5", null, "n5@c2.com", null)
val e6 = A("n6", null, "n6@c2.com", null)
val e7 = A("n7", null, "n7@c3.com", null)
val e8 = A("n8", null, "n8@c3.com", null)
val As = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
val df = sc.parallelize(As).toDF
This code produces the following DataFrame:

+----+----+---------+-------+
|name|  id|    email|company|
+----+----+---------+-------+
|  n1|null|n1@c1.com|   null|
|  n2|null|n2@c1.com|   null|
|  n3|null|n3@c1.com|   null|
|  n4|null|n4@c2.com|   null|
|  n5|null|n5@c2.com|   null|
|  n6|null|n6@c2.com|   null|
|  n7|null|n7@c3.com|   null|
|  n8|null|n8@c3.com|   null|
+----+----+---------+-------+
I want to get the names of the columns in which every row is null: id, company.

I don't care about the type of the output. Array, String, RDD, whatever.

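You can do a simple count over all the columns, then use the indices of the columns whose count is 0 to subset df.columns. This works because count(col) counts only non-null values, so a column that is entirely null has a count of 0: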

import org.apache.spark.sql.functions.{count,col}
// Get column indices
val col_inds = df.select(df.columns.map(c => count(col(c)).alias(c)): _*)
                 .collect()(0)
                 .toSeq.zipWithIndex
                 .filter(_._1 == 0).map(_._2)
// Subset column names using the indices
col_inds.map(i => df.columns.apply(i))
//Seq[String] = ArrayBuffer(id, company)
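
The same idea can be wrapped in a small reusable function. This is only a sketch: the name allNullColumns is hypothetical (it does not appear in the original answer), and it assumes plain column names without dots or backticks, since col(c) would otherwise try to resolve them as nested fields.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count}

// Hypothetical helper, not part of the original answer.
// Returns the names of the columns that contain no non-null value.
def allNullColumns(df: DataFrame): Seq[String] = {
  // One aggregation job: count(col) ignores nulls, so an all-null
  // column produces a count of 0.
  val counts = df
    .select(df.columns.map(c => count(col(c)).alias(c)): _*)
    .head()

  // Pair each column name with its position in the single result row
  // and keep the names whose count is 0.
  df.columns.zipWithIndex.collect {
    case (name, i) if counts.getLong(i) == 0L => name
  }.toSeq
}
```

With the df from the question, allNullColumns(df) should give id and company. Note that on an empty DataFrame every count is 0, so every column is reported.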

Another solution is as follows (though I'm afraid the performance may not be satisfactory):

val ids = Seq(
  ("1", null: String),
  ("1", null: String),
  ("10", null: String)
).toDF("id", "all_nulls")

scala> ids.show
+---+---------+
| id|all_nulls|
+---+---------+
|  1|     null|
|  1|     null|
| 10|     null|
+---+---------+

val s = ids.columns.
  map { c =>
    (c, ids.select(c).dropDuplicates(c).na.drop.count) }. // <-- performance here!
  collect { case (c, cnt) if cnt == 0 => c }

scala> s.foreach(println)
all_nulls

I think dropDuplicates(c) is where the performance falls short. Am I right?

I am more concerned about iterating over the columns and running the same distributed count for each one. With many columns this may take a while, and it might be faster with parallel job submission (e.g. ids.columns.par).
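
The parallel job submission mentioned for the per-column approach could be sketched with Scala Futures instead of .par (which needs a separate module on Scala 2.13). This is an illustration under assumptions, not the answerer's code: the name allNullColumnsConcurrent is hypothetical, the global execution context is used only for brevity, and dropDuplicates is left out because it does not change whether the non-null count is 0. Whether this is actually faster depends on the cluster and scheduler; a single aggregation over all columns needs only one job.

```scala
import org.apache.spark.sql.DataFrame
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical helper, not part of the original answer.
// Submits one count job per column from separate threads, so Spark
// can schedule the jobs concurrently.
def allNullColumnsConcurrent(df: DataFrame): Seq[String] = {
  val jobs = df.columns.toSeq.map { c =>
    Future {
      // na.drop() removes rows where the selected column is null,
      // so the count is the number of non-null values in column c.
      (c, df.select(c).na.drop().count())
    }
  }

  // Wait for every job, then keep the columns with no non-null value.
  Await.result(Future.sequence(jobs), Duration.Inf)
    .collect { case (c, 0L) => c }
}
```

For the ids DataFrame above, allNullColumnsConcurrent(ids) should contain only all_nulls.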