Apache Spark SQL - joins producing an array of matches instead of one row per match
I'm working in Spark SQL (specifically in Java) and I'm running into a problem with joins when the join condition has multiple matches. I get one row per match in the output, but I'd like those matches combined into a single array of the values that matched the join condition. Say I have the following two tables:

locations
location | animal1 | animal2 | animal3
---------------------------------------
australia | badger | duck | penguin
thailand | moose | penguin | horse
brazil | zebra | cow | pigeon
mexico | rhino | donkey | cat
banned_animals

banned_animal | banned_animal_ID
--------------------------------
penguin | 1
zebra | 2
moose | 3
What I want to do is assemble a table containing the location, plus a column containing the IDs of all the animals banned there. For example, the two tables above would produce:
location | banned_animal_IDs
--------------------------------
australia | [1]
thailand | [1,3]
brazil | [2]
I don't care about the order of the IDs in the array; if there are multiple, as in the thailand entry, I'm equally happy with [1,3] and [3,1].
What I'm getting now, which is not what I want, is:
location | banned_animal_IDs
--------------------------------
australia | 1
thailand | 1
thailand | 3
brazil | 2
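What the question is asking for is a group-by over that join output: collapse the one-row-per-match (location, ID) pairs into a set of IDs per location. The core of that transformation can be sketched in plain Java (no Spark; the data is hard-coded from the undesired output above):

```java
import java.util.*;
import java.util.stream.*;

public class GroupMatches {
    // Collapse (location, banned_animal_ID) pairs into location -> {IDs}
    public static Map<String, Set<Integer>> group(List<Map.Entry<String, Integer>> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toSet())));
    }

    public static void main(String[] args) {
        // The one-row-per-match join output shown above
        List<Map.Entry<String, Integer>> joined = List.of(
                Map.entry("australia", 1),
                Map.entry("thailand", 1),
                Map.entry("thailand", 3),
                Map.entry("brazil", 2));
        System.out.println(group(joined));
    }
}
```

Using a set rather than a list reflects that the order of IDs doesn't matter; this mirrors what `collect_set` does in Spark, as the answer below shows.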
The way I'm doing it:
Dataset<Row> bannedAnimalsByLocation = locations
.join(bannedAdminals, joinColumn, "INNER");
where joinColumn is the banned animal column. The locations table may have many other columns, so I can't just do a .groupBy on the location column.

You can try the following:
import org.apache.spark.sql.functions._

// Sample data matching the tables above
val locations = sc.parallelize(Seq(
  ("australia", "badger", "duck", "penguin"),
  ("thailand", "moosen", "penguin", "horse"),
  ("brazil", "zebra", "cow", "pigeon"),
  ("mexico", "rhino", "donkey", "cat")
)).toDF("location", "animal1", "animal2", "animal3")

val bannedAdminals = sc.parallelize(Seq(
  ("penguin", "1"),
  ("zebra", "2"),
  ("moosen", "3")
)).toDF("banned_animal", "banned_animal_ID")

// Join when any of the three animal columns matches a banned animal
val dfJoined = locations.join(bannedAdminals,
    locations("animal1") === bannedAdminals("banned_animal")
    or locations("animal2") === bannedAdminals("banned_animal")
    or locations("animal3") === bannedAdminals("banned_animal"))
  .select("location", "banned_animal_ID")

// Collapse the one-row-per-match output into a set of IDs per location
dfJoined.groupBy("location").agg(collect_set("banned_animal_ID")).show
The result is:
+---------+-----------------------------+
| location|collect_set(banned_animal_ID)|
+---------+-----------------------------+
|australia| [1]|
| thailand| [3, 1]|
| brazil| [2]|
+---------+-----------------------------+
If you don't want to lose the other columns, try a window function with collect_set as the aggregation function. Then pick the row with the most entries and filter out all the others.

How can banned_animal be the join column when it doesn't appear in the locations dataframe?
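The idea of keeping the other columns while still collecting the IDs (which the window-function suggestion above achieves in Spark) can be sketched in plain Java: for each location, carry one full row alongside the accumulated set of its IDs. The `Row` record here is a hypothetical flattened join row, not part of any Spark API:

```java
import java.util.*;

public class KeepColumns {
    // Hypothetical flattened join row: location, one other column, one matched ID
    record Row(String location, String animal1, int bannedId) {}
    // One representative row per location plus the full collected ID set
    record Grouped(Row first, Set<Integer> ids) {}

    // Mirrors collect_set over a window partitioned by location:
    // the other columns survive, and every matched ID lands in one set
    public static Map<String, Grouped> collect(List<Row> rows) {
        Map<String, Grouped> out = new LinkedHashMap<>();
        for (Row r : rows) {
            out.computeIfAbsent(r.location(), k -> new Grouped(r, new HashSet<>()))
               .ids().add(r.bannedId());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> joined = List.of(
                new Row("thailand", "moosen", 1),
                new Row("thailand", "moosen", 3));
        System.out.println(collect(joined).get("thailand").ids());
    }
}
```

Keeping the first row per location is a simplification of "pick the row with the most entries and filter all the others"; after the set is fully accumulated, every row of a location carries the same IDs, so any representative works.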