Scala: how to join Datasets on multiple columns?
Given two Spark Datasets, A and B, I can perform a join on a single column as follows:
a.joinWith(b, $"a.col" === $"b.col", "left")
My question is whether it is possible to join on multiple columns; essentially, the equivalent of the following DataFrame API code:
a.join(b, a("col") === b("col") && a("col2") === b("col2"), "left")
You can do it using exactly the same approach as with DataFrames:
val xs = Seq(("a", "foo", 2.0), ("x", "bar", -1.0)).toDS
val ys = Seq(("a", "foo", 2.0), ("y", "bar", 1.0)).toDS
xs.joinWith(ys, xs("_1") === ys("_1") && xs("_2") === ys("_2"), "left").show
// +------------+-----------+
// |          _1|         _2|
// +------------+-----------+
// | [a,foo,2.0]|[a,foo,2.0]|
// |[x,bar,-1.0]|       null|
// +------------+-----------+
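The example above joins on the tuples' positional columns; since the question uses named columns, the same pattern carries over unchanged. A minimal sketch, assuming a hypothetical case class with the question's column names:
// Hypothetical case class and sample values, for illustration only
case class Rec(col: String, col2: String)

val a = Seq(Rec("k1", "x"), Rec("k2", "y")).toDS
val b = Seq(Rec("k1", "x"), Rec("k3", "z")).toDS

// Same joinWith call as before, just with named columns
a.joinWith(b, a("col") === b("col") && a("col2") === b("col2"), "left").show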
In Spark < 2.0.0 you could use something like this:
xs.as("xs").joinWith(
ys.as("ys"), ($"xs._1" === $"ys._1") && ($"xs._2" === $"ys._2"), "left")
There is another way of joining: chaining where operators one after another. You first specify the join (and optionally its type) and then the where operators, as in the sketch and plans below. What makes this so nice is that the Spark optimizer will join (no pun intended) the consecutive wheres into the join itself. Use the explain operator to see the underlying logical and physical plans.
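For reference, here is a minimal sketch of datasets shaped like the as and bs used below; the case class name and values are assumptions, only the schema (id: bigint, name: string) is taken from the plan output:
// Hypothetical case class and values; schema matches the plans below
case class Entry(id: Long, name: String)

val as = Seq(Entry(0L, "zero"), Entry(1L, "one")).toDS
val bs = Seq(Entry(0L, "zero"), Entry(1L, "uno")).toDS

// Join first, then chain one where per column
as.join(bs).where(as("id") === bs("id")).where(as("name") === bs("name")).show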
scala> as.join(bs).where(as("id") === bs("id")).where(as("name") === bs("name")).explain(extended = true)
== Parsed Logical Plan ==
Filter (name#31 = name#36)
+- Filter (id#30L = id#35L)
   +- Join Inner
      :- LocalRelation [id#30L, name#31]
      +- LocalRelation [id#35L, name#36]
== Analyzed Logical Plan ==
id: bigint, name: string, id: bigint, name: string
Filter (name#31 = name#36)
+- Filter (id#30L = id#35L)
   +- Join Inner
      :- LocalRelation [id#30L, name#31]
      +- LocalRelation [id#35L, name#36]
== Optimized Logical Plan ==
Join Inner, ((name#31 = name#36) && (id#30L = id#35L))
:- Filter isnotnull(name#31)
:  +- LocalRelation [id#30L, name#31]
+- Filter isnotnull(name#36)
   +- LocalRelation [id#35L, name#36]
== Physical Plan ==
*BroadcastHashJoin [name#31, id#30L], [name#36, id#35L], Inner, BuildRight
:- *Filter isnotnull(name#31)
:  +- LocalTableScan [id#30L, name#31]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, false], input[0, bigint, false]))
   +- *Filter isnotnull(name#36)
      +- LocalTableScan [id#35L, name#36]
In Java, the && operator does not work. The correct way to join on multiple columns in Spark's Java API is as follows:
Dataset<Row> datasetRf1 = joinedWithDays.join(
datasetFreq,
datasetFreq.col("userId").equalTo(joinedWithDays.col("userId"))
.and(datasetFreq.col("artistId").equalTo(joinedWithDays.col("artistId"))),
"inner"
);
The and function works the same way as the && operator.
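For completeness, the and and equalTo methods also exist on Column in the Scala API, so the same condition can be written without operators. A minimal sketch reusing the xs and ys datasets from the earlier example:
// Equivalent to xs("_1") === ys("_1") && xs("_2") === ys("_2")
xs.joinWith(
  ys,
  xs("_1").equalTo(ys("_1")).and(xs("_2").equalTo(ys("_2"))),
  "left"
).show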
Dataset<Row> datasetRf1 = joinedWithDays.join(
datasetFreq,
datasetFreq.col("userId").equalTo(joinedWithDays.col("userId"))
.and(datasetFreq.col("artistId").equalTo(joinedWithDays.col("artistId"))),
"inner"
);