How to build a lookup function with multiple keys in Spark

sql scala hadoop apache-spark

I'm new to Spark, and I asked a similar question last week. The code compiled but did not work, so I really don't know what to do. My problem is this: I have a table A with 3 columns, like this

-----------
A1  A2  A3
-----------
a    b   c

And a table B like this

------------------------------------
B1  B2  B3  B4  B5  B6  B7  B8  B9
------------------------------------
1   a   3   4   5   b   7   8    c

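My logic is: A1, A2, A3 are my keys, and they correspond to B2, B6, B9 in table B. I need to build a lookup function that takes A1, A2, A3 as the key and returns B8.

This is what I tried last week: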

//getting the data in to dataframe
val clsrowRDD = clsfile.map(_.split("\t")).map(p => Row(p(0),p(1),p(2),p(3),p(4),p(5),p(6),p(7),p(8)))
val clsDataFrame = sqlContext.createDataFrame(clsrowRDD, clsschema)

//mapping the three key with the value
val smallRdd = clsDataFrame.rdd.map{row: Row => (mutable.WrappedArray.make[String](Array(row.getString(1), row.getString(5), row.getString(8))), row.getString(7))}

val lookupMap:Map[mutable.WrappedArray[String], String] = smallRdd.collectAsMap()

//build the look up function
def lookup(lookupMap: Map[mutable.WrappedArray[String],String]) =
udf((input: mutable.WrappedArray[String]) => lookupMap.lift(input))

//call the function
val combinedDF  = mstrDataFrame.withColumn("ENTP_CLS_CD",lookup(lookupMap)($"SRC_SYS_CD",$"ORG_ID",$"ORG_CD"))

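This code compiles, but it does not return the result I need. I think that is because I pass in an array as the key, while my table does not actually contain an array. But when I tried changing the map type to Map[(String, String, String), String], I did not know how to pass it into the function.

Thanks a lot.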

If you are trying to get the B8 value for every match of A1 with B2, A2 with B6, and A3 with B9, then a simple join and select should do the trick. Creating a lookup map adds complexity.

As you explained, you should have the dataframes df1 and df2 as

+---+---+---+
|A1 |A2 |A3 |
+---+---+---+
|a  |b  |c  |
+---+---+---+

+---+---+---+---+---+---+---+---+---+
|B1 |B2 |B3 |B4 |B5 |B6 |B7 |B8 |B9 |
+---+---+---+---+---+---+---+---+---+
|1  |a  |3  |4  |5  |b  |7  |8  |c  |
|1  |a  |3  |4  |5  |b  |7  |8  |e  |
+---+---+---+---+---+---+---+---+---+

You can then do a simple join and select

df1.join(df2, $"A1" === $"B2" && $"A2" === $"B6" && $"A3" === $"B9", "inner").select("B8")

which should give you

+---+
|B8 |
+---+
|8  |
+---+

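I hope the answer is helpful.

Updated

From what I understand from your question and the comments below, you are confused about how to pass an array to the lookup udf function. For that you can use the array function. I have modified some parts of your almost perfect code to make it work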

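// imports needed by the snippet below
import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{array, udf}
import sqlContext.implicits._ // enables the $"col" syntax

//mapping the three key with the value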
val smallRdd = clsDataFrame.rdd
  .map{row: Row => (mutable.WrappedArray.make[String](Array(row.getString(1), row.getString(5), row.getString(8))), row.getString(7))}

val lookupMap: collection.Map[mutable.WrappedArray[String], String] = smallRdd.collectAsMap()

//build the look up function
def lookup(lookupMap: collection.Map[mutable.WrappedArray[String],String]) =
udf((input: mutable.WrappedArray[String]) => lookupMap.lift(input))

//call the function
val combinedDF  = mstrDataFrame.withColumn("ENTP_CLS_CD",lookup(lookupMap)(array($"SRC_SYS_CD",$"ORG_ID",$"ORG_CD")))

You should get

+----------+------+------+-----------+
|SRC_SYS_CD|ORG_ID|ORG_CD|ENTP_CLS_CD|
+----------+------+------+-----------+
|a         |b     |c     |8          |
+----------+------+------+-----------+
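
The question also asked how a Map[(String, String, String), String] could be used instead of an array key. The sketch below shows only the lookup logic in plain Scala, without Spark. The object name TupleKeyLookup, the sample rows in rowsB, and the second row's values are assumptions made up for illustration; they are not from the original post.

```scala
object TupleKeyLookup {
  // Hypothetical sample rows standing in for table B (columns B1..B9).
  val rowsB: Seq[Array[String]] = Seq(
    Array("1", "a", "3", "4", "5", "b", "7", "8", "c"),
    Array("1", "a", "3", "4", "5", "b", "7", "9", "e")
  )

  // Key on (B2, B6, B9); the value is B8.
  // Tuples are compared by value, so they are safe to use as Map keys.
  val lookupMap: Map[(String, String, String), String] =
    rowsB.map(r => (r(1), r(5), r(8)) -> r(7)).toMap

  // Takes the three key parts as separate arguments
  // and returns None when no row matches.
  def lookup(a1: String, a2: String, a3: String): Option[String] =
    lookupMap.get((a1, a2, a3))

  def main(args: Array[String]): Unit = {
    println(lookup("a", "b", "c")) // Some(8)
    println(lookup("a", "b", "x")) // None
  }
}
```

In Spark, the same idea should carry over by wrapping a three-argument function in udf, for example udf((a: String, b: String, c: String) => lookupMap.get((a, b, c))), and calling it with the three columns directly rather than with array(...). The map is then captured in the closure, so this only suits lookup tables small enough to collect to the driver.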

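Do you prefer lookupMap over join? :)

No, I would rather use a join; it's just that my requirement is not to use a join...

So does my answer help? But you just said you would use a join, didn't you?

It's more that they already use joins and want me to explore how to use a lookup map.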