Scala: how can I filter a large DataFrame many times (as many times as the small DataFrame has rows) using another small DataFrame, row by row?

Tags: scala, apache-spark, apache-spark-sql

I have two Spark DataFrames, dfA and dfB.
I want to filter dfA by every row of dfB, meaning that if dfB has 10,000 rows, I need to filter dfA 10,000 times with 10,000 different filter conditions generated from dfB. Then, after each filter, I need to collect the filter result as a column in dfB.

dfA                                dfB
+------+---------+---------+ +-----+-------------+--------------+
| id | value1 | value2 | | id | min_value1 | max_value1 |
+------+---------+---------+ +-----+-------------+--------------+
| 1 | 0 | 4345 | | 1 | 0 | 3 |
| 1 | 1 | 3434 | | 1 | 5 | 9 |
| 1 | 2 | 4676 | | 2 | 1 | 4 |
| 1 | 3 | 3454 | | 2 | 6 | 8 |
| 1 | 4 | 9765 | +-----+-------------+--------------+
| 1 | 5 | 5778 | ....more rows, nearly 10000 rows.
| 1 | 6 | 5674 |
| 1 | 7 | 3456 |
| 1 | 8 | 6590 |
| 1 | 9 | 5461 |
| 1 | 10 | 4656 |
| 2 | 0 | 2324 |
| 2 | 1 | 2343 |
| 2 | 2 | 4946 |
| 2 | 3 | 4353 |
| 2 | 4 | 4354 |
| 2 | 5 | 3234 |
| 2 | 6 | 8695 |
| 2 | 7 | 6587 |
| 2 | 8 | 5688 |
+------+---------+---------+
......more rows, nearly one billion rows
So my expected result is:
resultDF
+-----+-------------+--------------+----------------------------+
| id | min_value1 | max_value1 | results |
+-----+-------------+--------------+----------------------------+
| 1 | 0 | 3 | [4345,3434,4676,3454] |
| 1 | 5 | 9 | [5778,5674,3456,6590,5461] |
| 2 | 1 | 4 | [2343,4946,4353,4354] |
| 2 | 6 | 8 | [8695,6587,5688] |
+-----+-------------+--------------+----------------------------+
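For reference, toy versions of dfA and dfB matching the tables above can be built like this. This is a minimal sketch of my own, not part of the original job (which reads both frames from parquet), assuming a local SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("range-filter").getOrCreate()
import spark.implicits._

// Toy versions of the two frames (the real dfA has nearly one billion rows).
val dfA = Seq(
  (1, 0, 4345), (1, 1, 3434), (1, 2, 4676), (1, 3, 3454), (1, 4, 9765),
  (1, 5, 5778), (1, 6, 5674), (1, 7, 3456), (1, 8, 6590), (1, 9, 5461),
  (1, 10, 4656), (2, 0, 2324), (2, 1, 2343), (2, 2, 4946), (2, 3, 4353),
  (2, 4, 4354), (2, 5, 3234), (2, 6, 8695), (2, 7, 6587), (2, 8, 5688)
).toDF("id", "value1", "value2")

val dfB = Seq((1, 0, 3), (1, 5, 9), (2, 1, 4), (2, 6, 8))
  .toDF("id", "min_value1", "max_value1")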
My stupid solution is:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions._

def tempFunction(id: Int, dfA: DataFrame, dfB: DataFrame): DataFrame = {
  val dfa = dfA.filter("id = " + id)
  val dfb = dfB.filter("id = " + id)
  val arr = dfb.groupBy("id")
    .agg(collect_list(struct("min_value1", "max_value1")))
    .collect()
  val rangeArray = arr(0)(1).asInstanceOf[Seq[Row]] // all (min, max) ranges of this id

  // initialize a resultDF with the first range's results
  val min_value1 = rangeArray(0).getInt(0)
  val max_value1 = rangeArray(0).getInt(1)
  var resultDF = dfa.filter("value1 between " + min_value1 + " and " + max_value1)
    .groupBy("id")
    .agg(collect_list("value2").as("results"), // the expected output collects value2
      min("value1").as("min_value1"),
      max("value1").as("max_value1"))

  // one more filter + aggregation per remaining range
  for (i <- 1 until rangeArray.length) {
    val tempMin = rangeArray(i).getInt(0)
    val tempMax = rangeArray(i).getInt(1)
    val tempResultDF = dfa.filter("value1 between " + tempMin + " and " + tempMax)
      .groupBy("id")
      .agg(collect_list("value2").as("results"),
        min("value1").as("min_value1"),
        max("value1").as("max_value1"))
    resultDF = resultDF.union(tempResultDF)
  }
  resultDF
}

def myFunction(): DataFrame = {
  val dfA = spark.read.parquet(routeA) // routeA/routeB are the parquet paths
  val dfB = spark.read.parquet(routeB)
  val idArrays = dfB.select("id").distinct().collect()

  // initialize the result with the first id, then traverse all remaining ids
  var resultDF = tempFunction(idArrays(0).getInt(0), dfA, dfB)
  for (i <- 1 until idArrays.length) {
    val tempDF = tempFunction(idArrays(i).getInt(0), dfA, dfB)
    resultDF = resultDF.union(tempDF)
  }
  resultDF
}
In pseudocode:

finalResult = null
for each id in dfB:
    for each query condition of this id:
        tempResult = query dfA
        union tempResult into finalResult
I have tried my algorithm, and it took almost 50 hours.
Does anyone have a more efficient way? Thank you very much.

Assuming your dfB is a small dataset, I try to give the solution below. Try using a broadcast join, like this:
import org.apache.spark.sql.functions._

dfA.alias("a")
  .join(broadcast(dfB.alias("b")),
    col("a.id") === col("b.id") &&
      col("a.value1") >= col("b.min_value1") &&
      col("a.value1") <= col("b.max_value1"))
  .groupBy(col("b.id"), col("b.min_value1"), col("b.max_value1")) // keep one row per range in dfB
  .agg(collect_list(col("a.value2")).as("results"))
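As a side note, Spark also broadcasts the smaller side of a join automatically whenever its estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default), so raising that threshold is an alternative to the explicit broadcast() hint. A minimal sketch, assuming a SparkSession named spark:

// Value is in bytes; here ~100 MB. Set it to -1 to disable auto-broadcasting.
// With a ~10,000-row dfB this makes Spark broadcast it without an explicit hint.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)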
Clearly, when two DataFrames/Datasets are involved in a computation, a join should be performed. So the join is a step you will have to take. But when you should join is the important question.
I would suggest you aggregate and reduce the rows in the DataFrames as much as possible before joining, as that reduces shuffling.

In your case you can reduce only dfA, because you need dfB exactly as it is, with an extra column built from dfA's rows that satisfy the condition.

So you can groupBy id and aggregate dfA so that you get one row per id; then you can perform the join. After that, a udf function can do your calculation logic.

Comments are provided in the code for clarity and explanation:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// udf function to keep only the value2 whose value1 falls within [min_value1, max_value1]
def selectRangedValue2Udf = udf((minValue: Int, maxValue: Int, list: Seq[Row]) =>
  list.filter(row => row.getAs[Int]("value1") <= maxValue && row.getAs[Int]("value1") >= minValue)
    .map(_.getAs[Int]("value2")))

dfA.groupBy("id")                                                  // grouping by id
  .agg(collect_list(struct("value1", "value2")).as("collection"))  // collecting all the value1 and value2 as structs
  .join(dfB, Seq("id"), "right")                                   // joining both dataframes on id
  .select(col("id"), col("min_value1"), col("max_value1"),
    selectRangedValue2Udf(col("min_value1"), col("max_value1"), col("collection")).as("results")) // calling the udf function defined above
which should give you

+---+----------+----------+------------------------------+
|id |min_value1|max_value1|results                       |
+---+----------+----------+------------------------------+
|1  |0         |3         |[4345, 3434, 4676, 3454]      |
|1  |5         |9         |[5778, 5674, 3456, 6590, 5461]|
|2  |1         |4         |[2343, 4946, 4353, 4354]      |
|2  |6         |8         |[8695, 6587, 5688]            |
+---+----------+----------+------------------------------+

I hope the answer is helpful.

Comment: OK, thank you for your help! Before I read your answer, I didn't know Spark had so many join operations. Thanks also for your advice about collect() and broadcast. Yes, it works. But I still have one small question: is collect_list() a shuffle operation? Thanks for your help once more.

Reply: groupBy is what shuffles; collect_list is the aggregation performed on the shuffled groups.
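If you want to verify that yourself, you can inspect the physical plan: shuffles show up as Exchange operators. A minimal sketch, using the same frames as above:

// The plan printed by explain() contains an "Exchange hashpartitioning(id, ...)"
// node for the groupBy's shuffle; collect_list appears inside the aggregate
// operators rather than as a separate exchange.
dfA.groupBy("id")
  .agg(collect_list(struct("value1", "value2")).as("collection"))
  .explain()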