Apache Spark: get columns from a reference table as concatenated columns, joined on ids from a dataset
I am trying to get the joined data as single concatenated columns, using the dataset below. Sample DS:
val df = sc.parallelize(Seq(
  ("a", 1, 2, 3),
  ("b", 4, 6, 5)
)).toDF("value", "id1", "id2", "id3")
+-------+-----+-----+-----+
| value | id1 | id2 | id3 |
+-------+-----+-----+-----+
| a | 1 | 2 | 3 |
| b | 4 | 6 | 5 |
+-------+-----+-----+-----+
and the reference dataset:
+----+----------+--------+
| id | descr | parent|
+----+----------+--------+
| 1 | apple | fruit |
| 2 | banana | fruit |
| 3 | cat | animal |
| 4 | dog | animal |
| 5 | elephant | animal |
| 6 | Flight | object |
+----+----------+--------+
val ref = sc.parallelize(Seq(
  (1, "apple", "fruit"),
  (2, "banana", "fruit"),
  (3, "cat", "animal"),
  (4, "dog", "animal"),
  (5, "elephant", "animal"),
  (6, "Flight", "object")
)).toDF("id", "descr", "parent")
I am trying to get the desired output below:
+-----------------------+--------------------------+
| desc | parent |
+-----------------------+--------------------------+
| apple+banana+cat/M | fruit+fruit+animal/M |
| dog+Flight+elephant/M | animal+object+animal/M |
+-----------------------+--------------------------+
And I need to concatenate only when (id2, id3) are not null; otherwise only id1 should be used. I have racked my brain looking for a solution.

Exploding the first dataframe df, joining it with ref, and then doing a groupBy should work as you expect:
val dfNew = df
  .withColumn("id", explode(array("id1", "id2", "id3")))
  .select("id", "value")

ref.join(dfNew, Seq("id"))
  .groupBy("value")
  .agg(
    concat_ws("+", collect_list("descr")) as "desc",
    concat_ws("+", collect_list("parent")) as "parent"
  )
  .drop("value")
  .show()
Output:
+-------------------+--------------------+
|desc |parent |
+-------------------+--------------------+
|Flight+elephant+dog|object+animal+animal|
|apple+cat+banana |fruit+animal+fruit |
+-------------------+--------------------+
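The question also asks to concatenate only when id2 and id3 are not null, falling back to id1 alone. Since explode(array(...)) produces one row per element, nulls included, one way is simply to filter out null ids after the explode. Below is a self-contained sketch of that idea; the row ("c", 2, null, null) and the local SparkSession setup are my own assumptions for illustration, not part of the original data:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").appName("concat-sketch").getOrCreate()
import spark.implicits._

// id2/id3 must be nullable, so use boxed java.lang.Integer for those columns.
// The row ("c", 2, null, null) is hypothetical, added to exercise the null path.
val df = Seq(
  ("a", Integer.valueOf(1), Integer.valueOf(2), Integer.valueOf(3)),
  ("c", Integer.valueOf(2), null: Integer, null: Integer)
).toDF("value", "id1", "id2", "id3")

val ref = Seq(
  (1, "apple", "fruit"),
  (2, "banana", "fruit"),
  (3, "cat", "animal")
).toDF("id", "descr", "parent")

// Drop null ids after the explode, so a row whose id2/id3 are null
// contributes only its id1 lookup to the concatenation.
val dfNew = df
  .withColumn("id", explode(array("id1", "id2", "id3")))
  .filter(col("id").isNotNull)
  .select("id", "value")

ref.join(dfNew, Seq("id"))
  .groupBy("value")
  .agg(
    concat_ws("+", collect_list("descr")) as "desc",
    concat_ws("+", collect_list("parent")) as "parent"
  )
  .drop("value")
  .show(false)
```

Note that collect_list gives no ordering guarantee after a shuffle, which is why the "+"-joined values in the output above may appear in a different order than in the desired output; if a stable order matters, sort the collected values explicitly.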