Apache Spark: why does a collect_set aggregation add an Exchange operator to a query on a bucketed table?


I'm using Spark 2.2 and experimenting with Spark's bucketing. I have created a bucketed table; here is the output of desc formatted my_bucketed_tbl:

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|              bundle|              string|   null|
|                 ifa|              string|   null|
|               date_|                date|   null|
|                hour|                 int|   null|
|                    |                    |       |
|# Detailed Table ...|                    |       |
|            Database|             default|       |
|               Table|     my_bucketed_tbl|       |
|               Owner|            zeppelin|       |
|             Created|Thu Dec 21 13:43:...|       |
|         Last Access|Thu Jan 01 00:00:...|       |
|                Type|            EXTERNAL|       |
|            Provider|                 orc|       |
|         Num Buckets|                  16|       |
|      Bucket Columns|             [`ifa`]|       |
|        Sort Columns|             [`ifa`]|       |
|    Table Properties|[transient_lastDd...|       |
|            Location|hdfs:/user/hive/w...|       |
|       Serde Library|org.apache.hadoop...|       |
|         InputFormat|org.apache.hadoop...|       |
|        OutputFormat|org.apache.hadoop...|       |
|  Storage Properties|[serialization.fo...|       |
+--------------------+--------------------+-------+
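
For context, a table matching the describe output above could have been produced with the DataFrameWriter bucketing API roughly as follows. This is only a sketch: the source DataFrame df and the HDFS path are assumptions, while the ORC format, the 16 buckets and the ifa bucket/sort column come from the output above.

// Sketch: write a bucketed, sorted, external ORC table (df and the path are hypothetical).
df.write
  .format("orc")
  .bucketBy(16, "ifa")   // Num Buckets = 16, Bucket Columns = [`ifa`]
  .sortBy("ifa")         // Sort Columns = [`ifa`]
  .option("path", "hdfs:///user/hive/warehouse/my_bucketed_tbl")  // explicit path makes the table EXTERNAL
  .saveAsTable("my_bucketed_tbl")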
When I run explain on a group-by query, I can see that the Exchange phase has been avoided:

sql("select ifa,max(bundle) from my_bucketed_tbl group by ifa").explain

== Physical Plan ==
SortAggregate(key=[ifa#932], functions=[max(bundle#920)])
+- SortAggregate(key=[ifa#932], functions=[partial_max(bundle#920)])
   +- *Sort [ifa#932 ASC NULLS FIRST], false, 0
      +- *FileScan orc default.level_1[bundle#920,ifa#932] Batched: false, Format: ORC, Location: InMemoryFileIndex[hdfs://ip-10-44-9-73.ec2.internal:8020/user/hive/warehouse/level_1/date_=2017-1..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<bundle:string,ifa:string>
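
As a quick sanity check (my own sketch, not part of the original post), you can also walk the executed plan programmatically and confirm that no Exchange node is present:

import org.apache.spark.sql.execution.exchange.Exchange

// Collect every Exchange node in the executed plan; an empty result means no shuffle.
val plan = sql("select ifa,max(bundle) from my_bucketed_tbl group by ifa").queryExecution.executedPlan
val hasExchange = plan.collect { case e: Exchange => e }.nonEmpty
println(s"plan contains Exchange: $hasExchange")   // expected: false for this bucketed table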

However, when I add a collect_set aggregation to the same group-by query, the Exchange phase appears again:

sql("select ifa,collect_set(bundle) from my_bucketed_tbl group by ifa").explain

== Physical Plan ==
ObjectHashAggregate(keys=[ifa#1010], functions=[collect_set(bundle#998, 0, 0)])
+- Exchange hashpartitioning(ifa#1010, 200)
   +- ObjectHashAggregate(keys=[ifa#1010], functions=[partial_collect_set(bundle#998, 0, 0)])
      +- *FileScan orc default.level_1[bundle#998,ifa#1010] Batched: false, Format: ORC, Location: InMemoryFileIndex[hdfs://ip-10-44-9-73.ec2.internal:8020/user/hive/warehouse/level_1/date_=2017-1..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<bundle:string,ifa:string>

Is there any configuration I have missed, or is this a current limitation of Spark's bucketing?

The issue has been fixed in version 2.2.1; you can find the details in the corresponding JIRA issue.

Works for managed tables. Do you know what the issue was? Do you have a link to the JIRA?
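
If you want to verify the fix yourself, a minimal check (assuming the usual spark session available in spark-shell or Zeppelin) is to print the running version and re-run the explain:

println(spark.version)   // the fix described above should apply from 2.2.1 onward
sql("select ifa,collect_set(bundle) from my_bucketed_tbl group by ifa").explain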