Apache Spark: merging equally partitioned DataFrames

In Hadoop, a join/merge of large, equally partitioned datasets can be done entirely map-side using CompositeInputFormat, with no shuffle and no reduce stage.

Trying to figure out how to do the same thing in Spark:

val x = sc.parallelize(Seq(("D", 1), ("C", 2), ("B", 3), ("A", 4))).toDF("k", "v")
    .repartition(col("k")).cache()
val y = sc.parallelize(Seq(("F", 5), ("E", 6), ("D", 7), ("C", 8))).toDF("k", "v")
    .repartition(col("k")).cache()

val xy = x.join(y, x.col("k") === y.col("k"), "outer")

x.show()    y.show()    xy.show()

+---+---+   +---+---+   +----+----+----+----+
|  k|  v|   |  k|  v|   |   k|   v|   k|   v|
+---+---+   +---+---+   +----+----+----+----+
|  A|  6|   |  C| 12|   |   A|   4|null|null|
|  B|  5|   |  D| 11|   |   B|   3|null|null|
|  C|  4|   |  E| 10|   |   C|   2|   C|   8|
|  D|  3|   |  F|  9|   |   D|   1|   D|   7|
|  E|  2|   |  G|  8|   |null|null|   E|   6|
|  F|  1|   |  H|  7|   |null|null|   F|   5|
+---+---+   +---+---+   +----+----+----+----+
So far so good. But when I check the execution plan, I see unnecessary sorts:

xy.explain

== Physical Plan ==
SortMergeOuterJoin [k#1283], [k#1297], FullOuter, None
:- Sort [k#1283 ASC], false, 0
:  +- InMemoryColumnarTableScan [k#1283,v#1284], InMemoryRelation [k#1283,v#1284], true, 10000, StorageLevel(true, true, false, true, 1), TungstenExchange hashpartitioning(k#1283,200), None, None
+- Sort [k#1297 ASC], false, 0
   +- InMemoryColumnarTableScan [k#1297,v#1298], InMemoryRelation [k#1297,v#1298], true, 10000, StorageLevel(true, true, false, true, 1), TungstenExchange hashpartitioning(k#1297,200), None, None
Can the sorting be avoided here?

EDIT

For reference, Hadoop has provided this capability since 2007:

UPDATE

As Lezzar pointed out, repartition alone is not enough to reach an equally partitioned, sorted state. I guess it now needs to be followed by sortWithinPartitions. So this should do it:

val x = sc.parallelize(Seq(("F", 1), ("E", 2), ("D", 3), ("C", 4), ("B", 5), ("A", 6))).toDF("k", "v")
    .repartition(col("k")).sortWithinPartitions(col("k")).cache()
val y = sc.parallelize(Seq(("H", 7), ("G", 8), ("F", 9), ("E",10), ("D",11), ("C",12))).toDF("k", "v")
    .repartition(col("k")).sortWithinPartitions(col("k")).cache()
xy.explain

== Physical Plan ==
SortMergeOuterJoin [k#1055], [k#1069], FullOuter, None
:- InMemoryColumnarTableScan [k#1055,v#1056], InMemoryRelation [k#1055,v#1056], true, 10000, StorageLevel(true, true, false, true, 1), Sort [k#1055 ASC], false, 0, None
+- InMemoryColumnarTableScan [k#1069,v#1070], InMemoryRelation [k#1069,v#1070], true, 10000, StorageLevel(true, true, false, true, 1), Sort [k#1069 ASC], false, 0, None

No more sorting!
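
For completeness, a minimal, self-contained sketch of the whole pattern — repartition by the join key, sort within partitions, cache, then re-create the join so explain reflects the new inputs. The data and names are just illustrative, assuming a spark-shell with sqlContext.implicits._ in scope:

import org.apache.spark.sql.functions.col

val x = sc.parallelize(Seq(("F", 1), ("E", 2), ("D", 3), ("C", 4), ("B", 5), ("A", 6))).toDF("k", "v")
    .repartition(col("k"))            // co-partition by the join key
    .sortWithinPartitions(col("k"))   // record a per-partition sort order the planner can reuse
    .cache()
val y = sc.parallelize(Seq(("H", 7), ("G", 8), ("F", 9), ("E", 10), ("D", 11), ("C", 12))).toDF("k", "v")
    .repartition(col("k"))
    .sortWithinPartitions(col("k"))
    .cache()

val xy = x.join(y, x.col("k") === y.col("k"), "outer")   // re-create the join on the new x and y
xy.explain                                               // no extra Sort or Exchange steps expected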

Similar to map-side joins in Hadoop, Spark has broadcast joins, which ship one table's data to all workers, much like the distributed cache does in Hadoop MapReduce. See the Spark documentation, or search for "Spark broadcast hash join". Unlike Hive, Spark takes care of this automatically, so there is nothing to worry about.
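
For reference, a broadcast hash join can also be requested explicitly from the DataFrame API with the broadcast function; here is a minimal sketch (the DataFrames and names below are made up, assuming a spark-shell with sqlContext.implicits._ in scope):

import org.apache.spark.sql.functions.broadcast

val big   = sc.parallelize(1 to 1000000).map(i => (i % 100, i)).toDF("key", "v")   // large side
val small = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"))).toDF("key", "w")     // small enough to broadcast

val joined = big.join(broadcast(small), "key")   // hint Spark to broadcast the small side
joined.explain                                   // plan should show BroadcastHashJoin, with no shuffle of the big side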

However, there are a few parameters you need to be aware of:

-> spark.sql.autoBroadcastJoinThreshold: the maximum table size (in bytes) up to which Spark will broadcast it automatically
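
For example, the threshold can be adjusted on the SQL context; the 50 MB value below is purely illustrative, and setting it to -1 disables automatic broadcasting:

// raise the automatic broadcast threshold to ~50 MB
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

// or disable automatic broadcast joins entirely
// sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")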

You can try the code below to get a feel for broadcast joins; also see the Spark documentation on broadcast joins, or google it for more details.

Sample code to try:

val sqlContext = new HiveContext(sc);
1) sqlContext.sql("CREATE TABLE IF NOT EXISTS tab3 (key INT, value STRING)")

2) sqlContext.sql("INSERT INTO tab4 select 1,\"srini\" from sr23");
(I created another table just to insert a record into the table; as Hive only supports INSERT INTO ... SELECT, I used this trick to get some data in.) You can skip this step as well, since you just want to see the physical plan.

------ You can also use any Hive table that is already created instead. I am just trying to simulate a Hive table, that's it. ---

3) val srini_df1 = sqlContext.sql("ANALYZE TABLE tab4 COMPUTE STATISTICS NOSCAN");

4) val df2 = sc.parallelize(Seq((5,"F"), (6,"E"), (7,"sri"), (1,"test"))).toDF("key", "value")

5) val join_df = sqlContext.sql("SELECT * FROM tab5").join(df2,"key");

6) join_df.explain
16/03/15 22:40:09 INFO storage.MemoryStore: ensureFreeSpace(530360) called with curMem=238151, maxMem=555755765
16/03/15 22:40:09 INFO storage.MemoryStore: Block broadcast_23 stored as values in memory (estimated size 517.9 KB, free 529.3 MB)
16/03/15 22:40:09 INFO storage.MemoryStore: ensureFreeSpace(42660) called with curMem=768511, maxMem=555755765
16/03/15 22:40:09 INFO storage.MemoryStore: Block broadcast_23_piece0 stored as bytes in memory (estimated size 41.7 KB, free 529.2 MB)
16/03/15 22:40:09 INFO storage.BlockManagerInfo: Added broadcast_23_piece0 in memory on localhost:63721 (size: 41.7 KB, free: 529.9 MB)
16/03/15 22:40:09 INFO spark.SparkContext: Created broadcast 23 from explain at <console>:28
== Physical Plan ==
Project [key#374,value#375,value#370]
 BroadcastHashJoin [key#374], [key#369], BuildLeft
  HiveTableScan [key#374,value#375], (MetastoreRelation default, tab5, None)
  Project [_1#367 AS key#369,_2#368 AS value#370]
   Scan PhysicalRDD[_1#367,_2#368]

Why do you say unnecessary sorting? A merge join needs the data to be sorted. IMHO there is no better strategy than a sort-merge join for a full outer join, unless one of your DataFrames is small enough to be broadcast.

Thanks for the attention, but my question is about two large sets; a broadcast join makes no sense unless one of the join sides is small enough. Merging two already sorted sets needs no sorting — it is roughly O(n). Moreover, for equally partitioned sets the merge can be done locally, per partition.

Yes, but when did you actually sort your data? You only repartitioned it by the join key; you never performed any sort step, so Spark has no way of knowing that your data is sorted.

My understanding was that repartition sorts the data within each partition — try the x.show command.

Repartition does not necessarily sort the data; it only distributes it across nodes based on the key. Even if the data happens to look sorted after repartitioning, the Spark engine cannot be aware of that, because there was no sort step guaranteeing to Spark that the data is sorted.

You are right that repartition alone does not sort the partitions; my assumption was wrong. According to the documentation, this function is the same as DISTRIBUTE BY in Hive. To guarantee ordering, DISTRIBUTE BY should be followed by SORT BY to reach a per-partition, equally partitioned and sorted state.
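
As a rough illustration of that last point, the same "distribute then sort per partition" state can be expressed in HiveQL through a HiveContext; the table and column names here are made up:

// HiveQL analogue of repartition(col("k")).sortWithinPartitions(col("k"))
val prepared = sqlContext.sql("SELECT k, v FROM some_table DISTRIBUTE BY k SORT BY k")
prepared.explain   // shows an Exchange (hash partitioning on k) followed by a per-partition Sort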