Scala Spark lookup returning values less than or equal to the key
I am trying to create a lookup inside a Spark UDF: look up values in table A using col1 and col2, and fetch the remaining columns from table B under the condition tableA.col1 = tableB.col1 and tableA.col2 less than or equal to tableB.col2.
// get values
val values = df.select("col1", "col2").map(r => r.toString()).collect.toList
// get keys
val keys = enriched_2080.select($"col3", $"col4").map(r => (r.getString(0), r.getLong(1))).collect.toList
// create a map
val lookup_map = keys.zip(values).toMap
// udf (the redundant nested match always succeeded, so a direct lookup is equivalent)
val lookup_udf = udf { (a: String, b: Long) =>
  lookup_map.getOrElse((a, b), "")
}
// call udf
df1.withColumn("result", lookup_udf(df1("col1"), df1("col2"))).show(false)
Comment: if you look at `A` in table A, 123 and 134 are both less than 147 in table B, so how do you assume it will combine with 134 only?
@RameshMaharjan - the output should take the largest col2 that is less than or equal to the joined col2, joining on col1.
Output:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|A   |129 |d1  |d2  |
|A   |147 |d3  |d4  |
|B   |199 |d7  |d8  |
|B   |175 |d5  |d6  |
+----+----+----+----+
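The expected output above can be sketched in plain Scala to pin down the rule from the comments: for each B row, take the A row (same col1) with the largest col2 not exceeding B's col2. This is a minimal sketch with illustrative data copied from the tables in this question, not the poster's actual code:

```scala
// Plain-Scala sketch of the assumed matching rule: for each lookup
// (col1, col2) pair, pick the table A row with the same col1 whose col2
// is the largest value <= the lookup col2, and return its (col3, col4).
object LookupRule {
  val tableA = Seq(("A", 123L, "d1", "d2"), ("A", 134L, "d3", "d4"),
                   ("B", 156L, "d5", "d6"), ("B", 178L, "d7", "d8"))

  def lookup(key: String, value: Long): Option[(String, String)] =
    tableA.collect { case (k, v, c3, c4) if k == key && v <= value => (v, c3, c4) }
      .sortBy(_._1)            // order candidates by col2
      .lastOption              // keep the largest col2 <= value, if any
      .map { case (_, c3, c4) => (c3, c4) }

  def main(args: Array[String]): Unit =
    Seq(("A", 129L), ("A", 147L), ("B", 199L), ("B", 175L)).foreach {
      case (k, v) => println(s"$k $v -> ${lookup(k, v)}")
    }
}
```

Running this reproduces the pairing in the expected output: ("A", 147) maps to (d3, d4), not (d1, d2), because 134 is the largest A.col2 not exceeding 147.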
Is there any reason you need a lookup UDF that requires collecting the DataFrame's data, thereby limiting the size of the DataFrame? If your goal is just to produce the desired output DataFrame, the following approach imposes no unnecessary limit on DataFrame size:
val dfA = Seq(
("A", 123L, "d1", "d2"),
("A", 134L, "d3", "d4"),
("B", 156L, "d5", "d6"),
("B", 178L, "d7", "d8")
).toDF("col1", "col2", "col3", "col4")
val dfB = Seq(
("A", 129L),
("A", 147L),
("B", 199L),
("B", 175L)
).toDF("col1", "col2")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
// Create a DataFrame with all `dfB.col2 - dfA.col2` values that are >= 0
val dfDiff = dfA.join(dfB, Seq("col1")).
select(
dfA("col1"), dfA("col2").as("col2a"), dfB("col2").as("col2b"),
dfA("col3"), dfA("col4"), (dfB("col2") - dfA("col2")).as("diff")
).
where($"diff" >= 0)
dfDiff.show
// +----+-----+-----+----+----+----+
// |col1|col2a|col2b|col3|col4|diff|
// +----+-----+-----+----+----+----+
// | A| 123| 147| d1| d2| 24|
// | A| 123| 129| d1| d2| 6|
// | A| 134| 147| d3| d4| 13|
// | B| 156| 175| d5| d6| 19|
// | B| 156| 199| d5| d6| 43|
// | B| 178| 199| d7| d8| 21|
// +----+-----+-----+----+----+----+
// Create result dataset with minimum `diff` for every `(col1, col2)` in dfA
// and assign corresponding `dfB.col2` as the new `col2`
val dfResult = dfDiff.withColumn("rank",
    rank().over(Window.partitionBy($"col1", $"col2a").orderBy($"diff"))
  ).
  where($"rank" === 1).
  select($"col1", $"col2b".as("col2"), $"col3", $"col4")
dfResult.show
// +----+----+----+----+
// |col1|col2|col3|col4|
// +----+----+----+----+
// | A| 147| d3| d4|
// | B| 175| d5| d6|
// | A| 129| d1| d2|
// | B| 199| d7| d8|
// +----+----+----+----+
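One caveat worth noting about the window step: `rank` assigns the same rank to ties, so if two candidates in a partition share the minimal `diff`, `where($"rank" === 1)` keeps both rows; `row_number` with a deterministic ordering would keep exactly one. A plain-Scala illustration of the difference, with hypothetical tied data (not from the tables above):

```scala
// Demonstrates rank-vs-row_number semantics on tied values:
// two candidates share the minimal diff of 5.
object TieDemo {
  val candidates = Seq(("d1", 5L), ("d3", 5L), ("d5", 9L))

  // rank-like: every candidate sharing the minimum diff survives.
  val byRank: Seq[String] = {
    val minDiff = candidates.map(_._2).min
    candidates.filter(_._2 == minDiff).map(_._1)
  }

  // row_number-like: exactly one survivor after an explicit tiebreak sort.
  val byRowNumber: Seq[String] =
    candidates.sortBy { case (name, diff) => (diff, name) }.take(1).map(_._1)
}
```

With the sample dfA/dfB above no ties occur, so `rank` and `row_number` give the same result; on real data a `row_number().over(...)` with a secondary sort key is the safer choice if a single match per partition is required.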