
Scala: How to join Datasets on multiple columns?


Given two Spark Datasets, A and B, I can join on a single column as follows:

a.joinWith(b, $"a.col" === $"b.col", "left")
My question is whether it is possible to join using multiple columns, essentially the equivalent of the following DataFrame API code:

a.join(b, a("col") === b("col") && a("col2") === b("col2"), "left")

You can do this exactly the same way as with DataFrames:

val xs = Seq(("a", "foo", 2.0), ("x", "bar", -1.0)).toDS
val ys = Seq(("a", "foo", 2.0), ("y", "bar", 1.0)).toDS

xs.joinWith(ys, xs("_1") === ys("_1") && xs("_2") === ys("_2"), "left").show
// +------------+-----------+
// |          _1|         _2|
// +------------+-----------+
// | [a,foo,2.0]|[a,foo,2.0]|
// |[x,bar,-1.0]|       null|
// +------------+-----------+
In Spark < 2.0.0 you can use something like this:

xs.as("xs").joinWith(
  ys.as("ys"), ($"xs._1" === $"ys._1") && ($"xs._2" === $"ys._2"), "left")

There is also another way of joining Datasets, by chaining where clauses one after another: you first specify the join (and optionally its type), followed by one or more where operators, i.e. as sketched below.
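The original answer does not show how the as and bs Datasets are created; a minimal sketch of the setup and the chained-where style, assuming a spark-shell session (implicits in scope) and an illustrative case class matching the schema in the plans below:

case class Rec(id: Long, name: String)  // illustrative; not part of the original answer

val as = Seq(Rec(0L, "zero"), Rec(1L, "one")).toDS
val bs = Seq(Rec(1L, "one"), Rec(2L, "two")).toDS

// Specify the join first, then add one where clause per column.
val joined = as.join(bs)
  .where(as("id") === bs("id"))
  .where(as("name") === bs("name"))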

What makes this nice is that the Spark optimizer will join (no pun intended) consecutive where clauses into a single multi-column condition on the join. Use the explain operator to see the underlying logical and physical plans:

scala> as.join(bs).where(as("id") === bs("id")).where(as("name") === bs("name")).explain(extended = true)
== Parsed Logical Plan ==
Filter (name#31 = name#36)
+- Filter (id#30L = id#35L)
   +- Join Inner
      :- LocalRelation [id#30L, name#31]
      +- LocalRelation [id#35L, name#36]

== Analyzed Logical Plan ==
id: bigint, name: string, id: bigint, name: string
Filter (name#31 = name#36)
+- Filter (id#30L = id#35L)
   +- Join Inner
      :- LocalRelation [id#30L, name#31]
      +- LocalRelation [id#35L, name#36]

== Optimized Logical Plan ==
Join Inner, ((name#31 = name#36) && (id#30L = id#35L))
:- Filter isnotnull(name#31)
:  +- LocalRelation [id#30L, name#31]
+- Filter isnotnull(name#36)
   +- LocalRelation [id#35L, name#36]

== Physical Plan ==
*BroadcastHashJoin [name#31, id#30L], [name#36, id#35L], Inner, BuildRight
:- *Filter isnotnull(name#31)
:  +- LocalTableScan [id#30L, name#31]
+- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, false], input[0, bigint, false]))
   +- *Filter isnotnull(name#36)
      +- LocalTableScan [id#35L, name#36]

In Java, the && operator does not work. The correct way to join on multiple columns in Spark's Java API is as follows:

Dataset<Row> datasetRf1 = joinedWithDays.join(
    datasetFreq,
    datasetFreq.col("userId").equalTo(joinedWithDays.col("userId"))
        .and(datasetFreq.col("artistId").equalTo(joinedWithDays.col("artistId"))),
    "inner");
The and function works similarly to the && operator.
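For comparison, Column also has an and method in the Scala API, which builds the same conjunction as the && operator. A minimal sketch, assuming a spark-shell session; the a and b DataFrames here are illustrative and not taken from the answers above:

val a = Seq(("a", "foo", 2.0), ("x", "bar", -1.0)).toDF("col", "col2", "v")
val b = Seq(("a", "foo", 2.0), ("y", "bar", 1.0)).toDF("col", "col2", "v")

// Both conditions express the same multi-column join predicate.
val viaOperator = a("col") === b("col") && a("col2") === b("col2")
val viaAnd      = (a("col") === b("col")).and(a("col2") === b("col2"))

a.join(b, viaAnd, "left").show()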
