Apache Spark: get columns from a reference table as concatenated data, joined on ids from a dataset

Tags: apache-spark, apache-spark-sql

I'm trying to get the concatenated data as a single column, using the datasets below.

Sample dataset:

val df = sc.parallelize(Seq(
  ("a", 1,2,3),
  ("b", 4,6,5)
)).toDF("value", "id1", "id2", "id3")

+-------+-----+-----+-----+
| value | id1 | id2 | id3 |
+-------+-----+-----+-----+
| a     |   1 |   2 |   3 |
| b     |   4 |   6 |   5 |
+-------+-----+-----+-----+
From the reference dataset:

+----+----------+--------+
| id | descr    | parent |
+----+----------+--------+
|  1 | apple    | fruit  |
|  2 | banana   | fruit  |
|  3 | cat      | animal |
|  4 | dog      | animal |
|  5 | elephant | animal |
|  6 | Flight   | object |
+----+----------+--------+

val ref = sc.parallelize(Seq(
  (1, "apple", "fruit"),
  (2, "banana", "fruit"),
  (3, "cat", "animal"),
  (4, "dog", "animal"),
  (5, "elephant", "animal"),
  (6, "Flight", "object")
)).toDF("id", "descr", "parent")
I'm trying to get the following desired output:

+-----------------------+--------------------------+
|         desc          |          parent          |
+-----------------------+--------------------------+
| apple+banana+cat/M    | fruit+fruit+animal/M     |
| dog+Flight+elephant/M | animal+object+animal/M   |
+-----------------------+--------------------------+
Also, I only need to concatenate when (id2, id3) are not null; otherwise use only id1.
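One way to handle the null requirement (a sketch, assuming the `df` defined above and Spark's built-in SQL functions; `dfNonNull` is a hypothetical name) is to drop null ids right after the explode, so a row with null `id2`/`id3` contributes only `id1` to the join:

```scala
import org.apache.spark.sql.functions._

// Explode the three id columns into a single "id" column,
// then filter out nulls so they never reach the join.
val dfNonNull = df
  .withColumn("id", explode(array("id1", "id2", "id3")))
  .filter(col("id").isNotNull)
  .select("id", "value")
```

Rows that only had `id1` then group back into a single-element list, so the concatenation naturally degrades to just that one value.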


I've racked my brain looking for a solution.

Exploding the first dataframe `df`, joining it with `ref`, and then doing a `groupBy` should work as you expect:

// Explode the three id columns into a single "id" column.
val dfNew = df.withColumn("id", explode(array("id1", "id2", "id3")))
  .select("id", "value")

// Join with the reference table, then aggregate back per original row.
ref.join(dfNew, Seq("id"))
  .groupBy("value")
  .agg(
    concat_ws("+", collect_list("descr")) as "desc",
    concat_ws("+", collect_list("parent")) as "parent"
  )
  .drop("value")
  .show()
Output:

+-------------------+--------------------+
|desc               |parent              |
+-------------------+--------------------+
|Flight+elephant+dog|object+animal+animal|
|apple+cat+banana   |fruit+animal+fruit  |
+-------------------+--------------------+
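Note that `collect_list` after a `groupBy` does not guarantee element order, which is why the output above differs from the desired `id1+id2+id3` order. A sketch that preserves the original column order (assuming Spark 2.x; `ordered` is a hypothetical name) uses `posexplode` to tag each id with its position, then sorts on it before concatenating:

```scala
import org.apache.spark.sql.functions._

// posexplode emits each id with its position (0 for id1, 1 for id2, 2 for id3).
// Collecting (pos, descr) structs and sorting them restores the column order.
val ordered = df
  .select(col("value"), posexplode(array("id1", "id2", "id3")).as(Seq("pos", "id")))
  .join(ref, Seq("id"))
  .groupBy("value")
  .agg(
    concat_ws("+",
      sort_array(collect_list(struct(col("pos"), col("descr")))).getField("descr")) as "desc",
    concat_ws("+",
      sort_array(collect_list(struct(col("pos"), col("parent")))).getField("parent")) as "parent"
  )
  .drop("value")

ordered.show(false)
```

`sort_array` orders the structs by their first field (`pos`), and `getField` then extracts just the `descr`/`parent` strings from the sorted array.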