Apache Spark: why does the union order of withColumn DataFrames produce different join results?
Environment:
- OS: Windows 7
- Spark: version 2.1.0
- Scala: 2.11.8
- Java: 1.8

In spark-shell:
scala> val info = Seq((10, "A"), (100, "B")).toDF("id", "type")
info: org.apache.spark.sql.DataFrame = [id: int, type: string]
scala> val statC = Seq((1)).toDF("sid").withColumn("stype", lit("A"))
statC: org.apache.spark.sql.DataFrame = [sid: int, stype: string]
scala> val statD = Seq((2)).toDF("sid").withColumn("stype", lit("B"))
statD: org.apache.spark.sql.DataFrame = [sid: int, stype: string]
scala> info.join(statC.union(statD), $"id"/10 === $"sid" and $"type"===$"stype").show
+---+----+---+-----+
| id|type|sid|stype|
+---+----+---+-----+
| 10| A| 1| A|
+---+----+---+-----+
scala> info.join(statD.union(statC), $"id"/10 === $"sid" and $"type"===$"stype").show
+---+----+---+-----+
| id|type|sid|stype|
+---+----+---+-----+
+---+----+---+-----+
statC and statD get their stype column via withColumn, and the REPL session above shows that statC.union(statD) and statD.union(statC) produce different join results.
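As a sanity check (not from the original post), the same join expressed with plain Scala collections shows that reversing the union order of the right-hand side must not change an inner join's result set; all names here are illustrative and no Spark is involved:

```scala
// Plain-collections analog of the Spark query above. Reversing the
// right-hand union (++) order must not change the inner-join rows.
val info  = Seq((10, "A"), (100, "B"))
val statC = Seq((1, "A"))
val statD = Seq((2, "B"))

def innerJoin(left: Seq[(Int, String)], right: Seq[(Int, String)]) =
  for {
    (id, tpe)    <- left
    (sid, stype) <- right
    if id / 10 == sid && tpe == stype  // same condition as the Spark join
  } yield (id, tpe, sid, stype)

val r1 = innerJoin(info, statC ++ statD)
val r2 = innerJoin(info, statD ++ statC)
assert(r1.toSet == r2.toSet)  // both are Set((10, "A", 1, "A"))
println(r1)
```

Both orders yield the single row (10, "A", 1, "A"), which is what one would expect Spark to return as well.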
I inspected the physical plans of the two joins with explain:
scala> info.join(statC.union(statD), $"id"/10 === $"sid" and $"type"===$"stype").explain
== Physical Plan ==
*BroadcastHashJoin [(cast(id#342 as double) / 10.0)], [cast(sid#420 as double)], Inner, BuildRight
:- *Project [_1#339 AS id#342, _2#340 AS type#343]
: +- *Filter ((isnotnull(_2#340) && ((A <=> _2#340) || (B <=> _2#340))) && (_2#340 = A))
: +- LocalTableScan [_1#339, _2#340]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as double)))
+- Union
:- LocalTableScan [sid#420, stype#423]
+- LocalTableScan [sid#430, stype#433]
scala> info.join(statD.union(statC), $"id"/10 === $"sid" and $"type"===$"stype").explain
== Physical Plan ==
*BroadcastHashJoin [(cast(id#342 as double) / 10.0)], [cast(sid#430 as double)], Inner, BuildRight
:- *Project [_1#339 AS id#342, _2#340 AS type#343]
: +- *Filter ((isnotnull(_2#340) && ((B <=> _2#340) || (A <=> _2#340))) && (_2#340 = B))
: +- LocalTableScan [_1#339, _2#340]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as double)))
+- Union
:- LocalTableScan [sid#430, stype#433]
+- LocalTableScan [sid#420, stype#423]
With statC.union(statD) the filter pushed onto info is:

Filter ((isnotnull(_2#340) && ((A <=> _2#340) || (B <=> _2#340))) && (_2#340 = A))

while with statD.union(statC) it becomes:

Filter ((isnotnull(_2#340) && ((B <=> _2#340) || (A <=> _2#340))) && (_2#340 = B))
The plans show that in the statA/statB case both id/type and sid/stype are part of the BroadcastHashJoin keys, but in the statC/statD case only id and sid are; the type/stype comparison has been folded into the literal filters above.
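The practical effect of those folded literals can be sketched with plain collections (illustrative only, not Spark code): each plan filters info down to the rows whose type equals the literal taken from the first union branch, and with statD first the surviving row can no longer satisfy id/10 === sid:

```scala
// Sketch of what the buggy plans do: the literal from the FIRST union
// branch is pushed into a filter on `info` before the join runs.
val info = Seq((10, "A"), (100, "B"))
val sids = Seq(1, 2)  // sid values available on the right-hand side

// statC.union(statD): filter (_2 = "A") keeps (10, "A");
// 10 / 10 == 1 matches sid 1, so the join returns one row.
val keptCD = info.filter(_._2 == "A")

// statD.union(statC): filter (_2 = "B") keeps (100, "B");
// 100 / 10 == 10 matches no sid, so the join is empty.
val keptDC = info.filter(_._2 == "B")

println(keptCD)  // List((10,A))
println(keptDC)  // List((100,B))
```

This matches the REPL output: the statC-first order returns the (10, A, 1, A) row, and the statD-first order returns nothing.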
Why does the join have different semantics when the union order over withColumn-generated DataFrames changes?

This is definitely a bug. I can confirm that the bug is fully reproducible in version 2.1.0, but does not exist in version 2.0.0. I think you should file a bug report.

Thanks @GlennieHellesSindholt, I found a bug report in the Spark issue tracker: , it may be fixed in 2.1.x.

For comparison, the same experiment with statA and statB, where stype is part of the source tuples rather than added via withColumn, returns the same result for both union orders:
scala> val info = Seq((10, "A"), (100, "B")).toDF("id", "type")
info: org.apache.spark.sql.DataFrame = [id: int, type: string]
scala> val statA = Seq((1, "A")).toDF("sid", "stype")
statA: org.apache.spark.sql.DataFrame = [sid: int, stype: string]
scala> val statB = Seq((2, "B")).toDF("sid", "stype")
statB: org.apache.spark.sql.DataFrame = [sid: int, stype: string]
scala> info.join(statA.union(statB), $"id"/10 === $"sid" and $"type"===$"stype").show
+---+----+---+-----+
| id|type|sid|stype|
+---+----+---+-----+
| 10| A| 1| A|
+---+----+---+-----+
scala> info.join(statB.union(statA), $"id"/10 === $"sid" and $"type"===$"stype").show
+---+----+---+-----+
| id|type|sid|stype|
+---+----+---+-----+
| 10| A| 1| A|
+---+----+---+-----+
scala> info.join(statA.union(statB), $"id"/10 === $"sid" and $"type"===$"stype").explain
== Physical Plan ==
*BroadcastHashJoin [(cast(id#342 as double) / 10.0), type#343], [cast(sid#352 as double), stype#353], Inner, BuildRight
:- *Project [_1#339 AS id#342, _2#340 AS type#343]
: +- *Filter isnotnull(_2#340)
: +- LocalTableScan [_1#339, _2#340]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as double), input[1, string, true]))
+- Union
:- *Project [_1#349 AS sid#352, _2#350 AS stype#353]
: +- *Filter isnotnull(_2#350)
: +- LocalTableScan [_1#349, _2#350]
+- *Project [_1#359 AS sid#362, _2#360 AS stype#363]
+- *Filter isnotnull(_2#360)
+- LocalTableScan [_1#359, _2#360]
scala> info.join(statB.union(statA), $"id"/10 === $"sid" and $"type"===$"stype").explain
== Physical Plan ==
*BroadcastHashJoin [(cast(id#342 as double) / 10.0), type#343], [cast(sid#362 as double), stype#363], Inner, BuildRight
:- *Project [_1#339 AS id#342, _2#340 AS type#343]
: +- *Filter isnotnull(_2#340)
: +- LocalTableScan [_1#339, _2#340]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as double), input[1, string, true]))
+- Union
:- *Project [_1#359 AS sid#362, _2#360 AS stype#363]
: +- *Filter isnotnull(_2#360)
: +- LocalTableScan [_1#359, _2#360]
+- *Project [_1#349 AS sid#352, _2#350 AS stype#353]
+- *Filter isnotnull(_2#350)
+- LocalTableScan [_1#349, _2#350]