How to efficiently check all values in a column using Spark (Scala)?

scala apache-spark

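I'd like to know how to build a dynamic filter in Spark when the columns are not known in advance.

For example, the DataFrame looks like this: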

+-------+-------+-------+-------+-------+-------+
| colA  | colB  |  colC |  colD |  colE |  colF | 
+-------+-------+-------+-------+-------+-------+
| Red   | Red   | Red   | Red   | Red   | Red   |
| Red   | Red   | Red   | Red   | Red   | Red   |
| Red   | Blue  | Red   | Red   | Red   | Red   |
| Red   | Red   | Red   | Red   | Red   | Red   |
| Red   | Red   | Red   | Red   | Blue  | Red   |
| Red   | Red   | White | Red   | Red   | Red   |
+-------+-------+-------+-------+-------+-------+

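The columns are only known at runtime, which means there could also be colG, colH, and so on. I need to check whether every value in a column is Red and then get a count of such columns. In the example above the count is 3, because colA, colD and colF are all Red.

What I'm doing now is something like the following, and it is very slow: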

   val allColumns = df.columns
   var count = 0
   allColumns.foldLeft(df) {

      (df, column) =>
        // One full scan per column: the column is all "Red" if no other value is found
        val tmpDf = df.filter(df(column) =!= "Red")
        if (tmpDf.rdd.isEmpty) {
          count += 1
        }
        df
    }

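I'm wondering whether there is a better way. Many thanks.

You are doing N RDD scans, where N is the number of columns. You can scan all the columns in a single pass and reduce in parallel. For example: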

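import org.apache.spark.sql.Row

// Combine rows pairwise: a column stays "Red" only if it is "Red" in both rows
df.reduce((acc, row) => Row.fromSeq(acc.toSeq.zip(row.toSeq)
    .map { case (a, r) =>
          if (a == "Red" && r == "Red") "Red" else "Not"
    }
))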

res11: org.apache.spark.sql.Row = [Red,Not,Not]

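This code performs a single RDD scan and then iterates over the columns of each row inside reduce. Row.toSeq gets a Seq from a Row, and Row.fromSeq rebuilds a Row so that reduce returns the same type of object.

Edit: to get the count, just append:

.toSeq.filter(_ == "Red").size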
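
To make the merge step easier to follow, here is a minimal sketch of the same reduce logic on plain Scala collections, with no Spark involved. The object name `AllRedSketch` and the helpers `merge` and `countAllRed` are hypothetical names introduced for illustration. Each row is modelled as a `Seq[String]`, standing in for `Row.toSeq`, and the data is the sample table from the question.

```scala
object AllRedSketch {
  // Sample table from the question; each inner Seq is one row (colA..colF).
  val rows: Seq[Seq[String]] = Seq(
    Seq("Red", "Red",  "Red",   "Red", "Red",  "Red"),
    Seq("Red", "Red",  "Red",   "Red", "Red",  "Red"),
    Seq("Red", "Blue", "Red",   "Red", "Red",  "Red"),
    Seq("Red", "Red",  "Red",   "Red", "Red",  "Red"),
    Seq("Red", "Red",  "Red",   "Red", "Blue", "Red"),
    Seq("Red", "Red",  "White", "Red", "Red",  "Red")
  )

  // Merge two rows column by column: "Red" survives only if both sides are "Red".
  def merge(a: Seq[String], b: Seq[String]): Seq[String] =
    a.zip(b).map { case (x, y) => if (x == "Red" && y == "Red") "Red" else "Not" }

  // One pass over the rows, then count the columns still marked "Red".
  // Note: reduce throws on an empty collection, so this assumes at least one row.
  def countAllRed(data: Seq[Seq[String]]): Int =
    data.reduce(merge).count(_ == "Red")

  def main(args: Array[String]): Unit =
    println(countAllRed(rows)) // prints 3 for the sample table
}
```

Because "Not" can never turn back into "Red", the merge gives the same result regardless of the order in which rows or partial results are combined, which is what allows the reduce to run in parallel.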

Why not just use the DataFrame API, with df.filter and df.count:

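import org.apache.spark.sql.functions.{col, lit}

// Build one predicate that requires every column to equal "Red"
val filter_expr = df.columns.map(c => col(c) === lit("Red")).reduce(_ and _)

val count = df.filter(filter_expr).count

//count: Long = 3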
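
Note that the filter above keeps the rows in which every column is Red, so its count is the number of all-Red rows. For the sample data that is also 3, but the question asks for the number of all-Red columns. A possible DataFrame-only sketch for the column count is below. It assumes Spark 2.x or later with a local `SparkSession`; the object name `AllRedColumns` and the helpers `sample` and `allRedColumns` are hypothetical. It computes one aggregate per column, all within a single job.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum, when}

object AllRedColumns {
  // Rebuilds the sample table from the question (hypothetical helper).
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("Red", "Red",  "Red",   "Red", "Red",  "Red"),
      ("Red", "Red",  "Red",   "Red", "Red",  "Red"),
      ("Red", "Blue", "Red",   "Red", "Red",  "Red"),
      ("Red", "Red",  "Red",   "Red", "Red",  "Red"),
      ("Red", "Red",  "Red",   "Red", "Blue", "Red"),
      ("Red", "Red",  "White", "Red", "Red",  "Red")
    ).toDF("colA", "colB", "colC", "colD", "colE", "colF")
  }

  // Names of the columns whose values are all "Red".
  // Assumes a non-empty DataFrame; nulls are counted as "not Red".
  def allRedColumns(df: DataFrame): Seq[String] = {
    // One aggregate per column: how many values are not "Red".
    val nonRed = df.columns.map { c =>
      sum(when(col(c) === "Red", 0).otherwise(1)).as(c)
    }
    // All aggregates are evaluated together in one pass over the data.
    val result = df.agg(nonRed.head, nonRed.tail: _*).head()
    df.columns.filter(c => result.getAs[Long](c) == 0L).toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("all-red-columns")
      .getOrCreate()
    val cols = allRedColumns(sample(spark))
    println(s"${cols.size} all-Red columns: ${cols.mkString(", ")}")
    spark.stop()
  }
}
```

Since the column list is read from `df.columns` at runtime, the same code works unchanged if colG, colH, and so on are added.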
