Scala: how can I filter a large DataFrame many times (as many times as the small DataFrame has rows) using another small DataFrame, row by row?

Tags: scala, apache-spark, apache-spark-sql

I have two Spark DataFrames, dfA and dfB.
I want to filter dfA by every row of dfB, meaning that if dfB has 10,000 rows, I need to filter dfA 10,000 times with 10,000 different filter conditions generated from dfB. Then, after each filter, I need to collect the filter result as a column in dfB.

dfA                                dfB
+------+---------+---------+ +-----+-------------+--------------+
| id | value1 | value2 | | id | min_value1 | max_value1 |
+------+---------+---------+ +-----+-------------+--------------+
| 1 | 0 | 4345 | | 1 | 0 | 3 |
| 1 | 1 | 3434 | | 1 | 5 | 9 |
| 1 | 2 | 4676 | | 2 | 1 | 4 |
| 1 | 3 | 3454 | | 2 | 6 | 8 |
| 1 | 4 | 9765 | +-----+-------------+--------------+
| 1 | 5 | 5778 | ....more rows, nearly 10000 rows.
| 1 | 6 | 5674 |
| 1 | 7 | 3456 |
| 1 | 8 | 6590 |
| 1 | 9 | 5461 |
| 1 | 10 | 4656 |
| 2 | 0 | 2324 |
| 2 | 1 | 2343 |
| 2 | 2 | 4946 |
| 2 | 3 | 4353 |
| 2 | 4 | 4354 |
| 2 | 5 | 3234 |
| 2 | 6 | 8695 |
| 2 | 7 | 6587 |
| 2 | 8 | 5688 |
+------+---------+---------+
......more rows, nearly one billion rows
So my expected result is:
resultDF
+-----+-------------+--------------+----------------------------+
| id | min_value1 | max_value1 | results |
+-----+-------------+--------------+----------------------------+
| 1 | 0 | 3 | [4345,3434,4676,3454] |
| 1 | 5 | 9 | [5778,5674,3456,6590,5461] |
| 2 | 1 | 4 | [2343,4946,4353,4354] |
| 2 | 6 | 8 | [8695,6587,5688] |
+-----+-------------+--------------+----------------------------+
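For reference, toy versions of dfA and dfB matching the tables above can be built like this. This is a minimal sketch of my own, not part of the original job (which reads both frames from parquet), assuming a local SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("range-filter").getOrCreate()
import spark.implicits._

// Toy versions of the two frames (the real dfA has nearly one billion rows).
val dfA = Seq(
  (1, 0, 4345), (1, 1, 3434), (1, 2, 4676), (1, 3, 3454), (1, 4, 9765),
  (1, 5, 5778), (1, 6, 5674), (1, 7, 3456), (1, 8, 6590), (1, 9, 5461),
  (1, 10, 4656), (2, 0, 2324), (2, 1, 2343), (2, 2, 4946), (2, 3, 4353),
  (2, 4, 4354), (2, 5, 3234), (2, 6, 8695), (2, 7, 6587), (2, 8, 5688)
).toDF("id", "value1", "value2")

val dfB = Seq((1, 0, 3), (1, 5, 9), (2, 1, 4), (2, 6, 8))
  .toDF("id", "min_value1", "max_value1")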
My stupid solution is:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions._

def tempFunction(id: Int, dfA: DataFrame, dfB: DataFrame): DataFrame = {
  val dfa = dfA.filter("id = " + id)
  val dfb = dfB.filter("id = " + id)
  val arr = dfb.groupBy("id")
    .agg(collect_list(struct("min_value1", "max_value1")))
    .collect()
  val rangeArray = arr(0)(1).asInstanceOf[Seq[Row]] // all (min, max) ranges of this id

  // initialize a resultDF with the first range's results
  val min_value1 = rangeArray(0).getInt(0)
  val max_value1 = rangeArray(0).getInt(1)
  var resultDF = dfa.filter("value1 between " + min_value1 + " and " + max_value1)
    .groupBy("id")
    .agg(collect_list("value2").as("results"), // the expected output collects value2
      min("value1").as("min_value1"),
      max("value1").as("max_value1"))

  // one more filter + aggregation per remaining range
  for (i <- 1 until rangeArray.length) {
    val tempMin = rangeArray(i).getInt(0)
    val tempMax = rangeArray(i).getInt(1)
    val tempResultDF = dfa.filter("value1 between " + tempMin + " and " + tempMax)
      .groupBy("id")
      .agg(collect_list("value2").as("results"),
        min("value1").as("min_value1"),
        max("value1").as("max_value1"))
    resultDF = resultDF.union(tempResultDF)
  }
  resultDF
}

def myFunction(): DataFrame = {
  val dfA = spark.read.parquet(routeA) // routeA/routeB are the parquet paths
  val dfB = spark.read.parquet(routeB)
  val idArrays = dfB.select("id").distinct().collect()

  // initialize the result with the first id, then traverse all remaining ids
  var resultDF = tempFunction(idArrays(0).getInt(0), dfA, dfB)
  for (i <- 1 until idArrays.length) {
    val tempDF = tempFunction(idArrays(i).getInt(0), dfA, dfB)
    resultDF = resultDF.union(tempDF)
  }
  resultDF
}
In pseudocode:

finalResult = null
for each id in dfB:
    for each query condition of this id:
        tempResult = query dfA
        union tempResult into finalResult
I have tried my algorithm, and it took almost 50 hours.
Does anyone have a more efficient way? Thank you very much.

Assuming your dfB is a small dataset, I try to give the solution below. Try using a broadcast join, like this:
import org.apache.spark.sql.functions._

dfA.alias("a")
  .join(broadcast(dfB.alias("b")),
    col("a.id") === col("b.id") &&
      col("a.value1") >= col("b.min_value1") &&
      col("a.value1") <= col("b.max_value1"))
  .groupBy(col("b.id"), col("b.min_value1"), col("b.max_value1")) // keep one row per range in dfB
  .agg(collect_list(col("a.value2")).as("results"))
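As a side note, Spark also broadcasts the smaller side of a join automatically whenever its estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default), so raising that threshold is an alternative to the explicit broadcast() hint. A minimal sketch, assuming a SparkSession named spark:

// Value is in bytes; here ~100 MB. Set it to -1 to disable auto-broadcasting.
// With a ~10,000-row dfB this makes Spark broadcast it without an explicit hint.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)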
Clearly, when two DataFrames/Datasets are involved in a computation, a join should be performed. So the join is a step you will have to take. But when you should join is the important question.
I would suggest you aggregate and reduce the rows in the DataFrames as much as possible before joining, as that reduces shuffling.

In your case you can reduce only dfA, because you need dfB exactly as it is, with an extra column built from dfA's rows that satisfy the condition.

So you can groupBy id and aggregate dfA so that you get one row per id; then you can perform the join. After that, a udf function can do your calculation logic.

Comments are provided in the code for clarity and explanation:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// udf function to keep only the value2 whose value1 falls within [min_value1, max_value1]
def selectRangedValue2Udf = udf((minValue: Int, maxValue: Int, list: Seq[Row]) =>
  list.filter(row => row.getAs[Int]("value1") <= maxValue && row.getAs[Int]("value1") >= minValue)
    .map(_.getAs[Int]("value2")))

dfA.groupBy("id")                                                  // grouping by id
  .agg(collect_list(struct("value1", "value2")).as("collection"))  // collecting all the value1 and value2 as structs
  .join(dfB, Seq("id"), "right")                                   // joining both dataframes on id
  .select(col("id"), col("min_value1"), col("max_value1"),
    selectRangedValue2Udf(col("min_value1"), col("max_value1"), col("collection")).as("results")) // calling the udf function defined above
which should give you

+---+----------+----------+------------------------------+
|id |min_value1|max_value1|results                       |
+---+----------+----------+------------------------------+
|1  |0         |3         |[4345, 3434, 4676, 3454]      |
|1  |5         |9         |[5778, 5674, 3456, 6590, 5461]|
|2  |1         |4         |[2343, 4946, 4353, 4354]      |
|2  |6         |8         |[8695, 6587, 5688]            |
+---+----------+----------+------------------------------+

I hope the answer is helpful.

Comment: OK, thank you for your help! Before I read your answer, I didn't know Spark had so many join operations. Thanks also for your advice about collect() and broadcast. Yes, it works. But I still have one small question: is collect_list() a shuffle operation? Thanks for your help once more.

Reply: groupBy is what shuffles; collect_list is the aggregation performed on the shuffled groups.
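If you want to verify that yourself, you can inspect the physical plan: shuffles show up as Exchange operators. A minimal sketch, using the same frames as above:

// The plan printed by explain() contains an "Exchange hashpartitioning(id, ...)"
// node for the groupBy's shuffle; collect_list appears inside the aggregate
// operators rather than as a separate exchange.
dfA.groupBy("id")
  .agg(collect_list(struct("value1", "value2")).as("collection"))
  .explain()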