
Apache Spark: removing duplicates with DISTINCT vs GROUP BY in Spark SQL


I am using Spark SQL 2.4.

A question has bothered me for a long time: to remove duplicates from a table efficiently, and with the better query performance, should I use DISTINCT or GROUP BY (without any aggregation)?

With DISTINCT, I would use the following:

select distinct 
       id, 
       fname, 
       lname, 
       age
from emp_table;
With GROUP BY, I would simply use:

select id,
       fname,
       lname,
       age
from emp_table
group by 1, 2, 3, 4;
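To make the comparison concrete, here is a minimal spark-shell (Scala) sketch; the table and column names match the queries above, but the sample rows are made up:

import spark.implicits._  // `spark` is the SparkSession provided by spark-shell

// Made-up sample data, including one duplicate row
val emp = Seq(
  (1, "John", "Doe", 30),
  (1, "John", "Doe", 30),
  (2, "Jane", "Roe", 25)
).toDF("id", "fname", "lname", "age")
emp.createOrReplaceTempView("emp_table")

// Both statements return the same two de-duplicated rows
spark.sql("select distinct id, fname, lname, age from emp_table").show()
spark.sql("select id, fname, lname, age from emp_table group by 1, 2, 3, 4").show()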
I read somewhere that in Spark SQL, DISTINCT should be used only when the cardinality of the data set is high, and GROUP BY otherwise. In my day-to-day work, however, I have found DISTINCT outperforming GROUP BY even when the cardinality is low.

So my question is: which one performs better, and under what conditions? Can someone shed light on this, i.e. under what conditions does a query using DISTINCT perform better than one using GROUP BY?

Thanks.

They are functionally equivalent and will generate the same query plan. Use DISTINCT for clarity.
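One way to verify this claim from spark-shell (a sketch reusing the hypothetical emp_table above): compare the optimized logical plans of the two statements directly. Catalyst rewrites DISTINCT into an aggregate (the ReplaceDistinctWithAggregate rule), which is why the two plans converge:

// Compare the optimized logical plans of the two statements
val viaDistinct = spark.sql("select distinct id, fname, lname, age from emp_table")
val viaGroupBy  = spark.sql("select id, fname, lname, age from emp_table group by 1, 2, 3, 4")

// Both print the same Aggregate node over all four columns
println(viaDistinct.queryExecution.optimizedPlan)
println(viaGroupBy.queryExecution.optimizedPlan)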

Here are the query plans for both queries. As @thebluephantom said, they are identical, so there should not be any performance difference.

create table t1 (a int, b int, c int, d int);

explain select a,b,c,d from t1 group by 1,2,3,4;
== Physical Plan ==
*(2) HashAggregate(keys=[a#14, b#15, c#16, d#17], functions=[])
+- Exchange hashpartitioning(a#14, b#15, c#16, d#17, 200), true, [id=#33]
   +- *(1) HashAggregate(keys=[a#14, b#15, c#16, d#17], functions=[])
      +- Scan hive default.t1 [a#14, b#15, c#16, d#17], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#14, b#15, c#16, d#17], Statistics(sizeInBytes=8.0 EiB)
Note the two HashAggregate stages in the physical plan: Spark computes a partial aggregate before the shuffle (the Exchange hashpartitioning step) and a final aggregate after it. explain extended shows that, after optimization, the two queries become exactly the same. First, the GROUP BY version:

explain extended select a,b,c,d from t1 group by 1,2,3,4;
== Parsed Logical Plan ==
'Aggregate [1, 2, 3, 4], ['a, 'b, 'c, 'd]
+- 'UnresolvedRelation [t1]

== Analyzed Logical Plan ==
a: int, b: int, c: int, d: int
Aggregate [a#41, b#42, c#43, d#44], [a#41, b#42, c#43, d#44]
+- SubqueryAlias spark_catalog.default.t1
   +- HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#41, b#42, c#43, d#44], Statistics(sizeInBytes=8.0 EiB)

== Optimized Logical Plan ==
Aggregate [a#41, b#42, c#43, d#44], [a#41, b#42, c#43, d#44]
+- HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#41, b#42, c#43, d#44], Statistics(sizeInBytes=8.0 EiB)

== Physical Plan ==
*(2) HashAggregate(keys=[a#41, b#42, c#43, d#44], functions=[], output=[a#41, b#42, c#43, d#44])
+- Exchange hashpartitioning(a#41, b#42, c#43, d#44, 200), true, [id=#108]
   +- *(1) HashAggregate(keys=[a#41, b#42, c#43, d#44], functions=[], output=[a#41, b#42, c#43, d#44])
      +- Scan hive default.t1 [a#41, b#42, c#43, d#44], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#41, b#42, c#43, d#44], Statistics(sizeInBytes=8.0 EiB)

If anything, this suggests that the query engine prefers the GROUP BY form: the optimized logical plan matches the parsed GROUP BY query as-is, whereas the parsed DISTINCT query (shown after the comments below) first has to be rewritten from Distinct over Project into that Aggregate.

Comments:

- Try .explain and tell us your conclusions.
- Thanks for the reply. Can you help me understand why, based on the above query plans, you say the GROUP BY clause is preferred?
- @Matthew Sorry if I worded that poorly. I meant that the optimized query looks more like the unoptimized GROUP BY query than the unoptimized DISTINCT query. That is only an observation and has no bearing on how the query actually runs. I would suggest using DISTINCT in real queries for readability.
- @thebluephantom Unlikely, I was just trying to understand this; it is genuinely the first time I have run explain extended.
- No, I meant in general? You have a meteoric rise, just interested...

For comparison, here is the corresponding explain extended output for the DISTINCT query:
explain extended select distinct a,b,c,d from t1;
== Parsed Logical Plan ==
'Distinct
+- 'Project ['a, 'b, 'c, 'd]
   +- 'UnresolvedRelation [t1]

== Analyzed Logical Plan ==
a: int, b: int, c: int, d: int
Distinct
+- Project [a#50, b#51, c#52, d#53]
   +- SubqueryAlias spark_catalog.default.t1
      +- HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#50, b#51, c#52, d#53], Statistics(sizeInBytes=8.0 EiB)

== Optimized Logical Plan ==
Aggregate [a#50, b#51, c#52, d#53], [a#50, b#51, c#52, d#53]
+- HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#50, b#51, c#52, d#53], Statistics(sizeInBytes=8.0 EiB)

== Physical Plan ==
*(2) HashAggregate(keys=[a#50, b#51, c#52, d#53], functions=[], output=[a#50, b#51, c#52, d#53])
+- Exchange hashpartitioning(a#50, b#51, c#52, d#53, 200), true, [id=#133]
   +- *(1) HashAggregate(keys=[a#50, b#51, c#52, d#53], functions=[], output=[a#50, b#51, c#52, d#53])
      +- Scan hive default.t1 [a#50, b#51, c#52, d#53], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#50, b#51, c#52, d#53], Statistics(sizeInBytes=8.0 EiB)
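As a side note (not from the original thread): the DataFrame API's distinct() and dropDuplicates() are rewritten into the same Aggregate node seen in these plans, so the choice there is likewise a matter of readability rather than performance. A short sketch, reusing the hypothetical emp DataFrame from the question:

// DataFrame-side equivalents of the SQL above
val deduped1 = emp.distinct()            // same Aggregate over all columns
val deduped2 = emp.dropDuplicates()      // the no-arg form is an alias for distinct()
val deduped3 = emp.dropDuplicates("id")  // keeps one (arbitrary) row per id

deduped1.explain(true)  // prints the same four plan sections as explain extended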