Apache Spark: How does count distinct work in Apache Spark SQL?

Tags: apache-spark, apache-spark-sql

I am trying to count the distinct number of entities over different date ranges.

I need to understand how Spark executes this.

val distinct_daily_cust_12month = sqlContext.sql(s"select distinct day_id,txn_type,customer_id from ${db_name}.fact_customer where day_id>='${start_last_12month}' and day_id<='${start_date}' and txn_type not in (6,99)")

val category_mapping = sqlContext.sql(s"select * from datalake.category_mapping");

import org.apache.spark.sql.functions.broadcast

val daily_cust_12month_ds = distinct_daily_cust_12month.join(broadcast(category_mapping), distinct_daily_cust_12month("txn_type") === category_mapping("id")).select("category", "sub_category", "customer_id", "day_id")

daily_cust_12month_ds.createOrReplaceTempView("daily_cust_12month_ds")

val total_cust_metrics = sqlContext.sql(s"""select 'total' as category,
count(distinct(case when day_id='${start_date}' then customer_id end)) as yest,
count(distinct(case when day_id>='${start_week}' and day_id<='${end_week}' then customer_id end)) as week,
count(distinct(case when day_id>='${start_month}' and day_id<='${start_date}' then customer_id end)) as mtd,
count(distinct(case when day_id>='${start_last_month}' and day_id<='${end_last_month}' then customer_id end)) as ltd,
count(distinct(case when day_id>='${start_last_6month}' and day_id<='${start_date}' then customer_id end)) as lsm,
count(distinct(case when day_id>='${start_last_12month}' and day_id<='${start_date}' then customer_id end)) as ltm
from daily_cust_12month_ds
""")

Count distinct works by hash-partitioning the data, then counting the distinct elements within each partition, and finally summing those counts. In general it is a heavy operation because of the full shuffle, and there is no silver bullet for it in Spark, or most likely in any fully distributed system; operations that rely on distinct are inherently hard to solve in a distributed setting.
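As a rough illustration of those phases (a minimal sketch with toy data and made-up values, not the asker's tables), an exact distinct count is logically the same as deduplicating the values and then counting what is left; both forms hash-partition on the value being counted:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

val spark = SparkSession.builder().appName("count-distinct-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Toy stand-in for fact_customer (illustrative only).
val df = Seq(("2019-06-01", 1L), ("2019-06-01", 1L), ("2019-06-02", 2L))
  .toDF("day_id", "customer_id")

// Exact distinct count: shuffle customer_id, count distinct values per partition, sum the partial counts.
df.select(countDistinct($"customer_id")).show()

// Logically equivalent two-step form that makes the phases explicit:
// 1) hash-partition and deduplicate, 2) count the survivors.
df.select($"customer_id").distinct().count()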

In some cases there are faster ways to do it:

  • If an approximation is acceptable, approx_count_distinct is usually much faster, as it is based on HyperLogLog, and the amount of data to shuffle is far smaller than with the exact implementation (see the sketch after this list)
  • If you can design the pipeline so that the data source is already partitioned in a way that leaves no duplicates across partitions, the slow hash-partitioning step on the DataFrame is not needed
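For example, the approximate variant could be dropped into the pipeline above roughly like this (a sketch that reuses the daily_cust_12month_ds view from the question; the 0.01 relative error is an assumed, tunable parameter):

import org.apache.spark.sql.functions.approx_count_distinct

// HyperLogLog-based estimate: shuffles compact sketches instead of every distinct customer_id.
val approx_total_cust = sqlContext.sql("select customer_id from daily_cust_12month_ds")
  .agg(approx_count_distinct("customer_id", 0.01).as("approx_customers"))

approx_total_cust.show()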
Note: to understand how the count distinct works, you can always use explain:

df.select(countDistinct("foo")).explain()

Example output:

== Physical Plan ==
*(3) HashAggregate(keys=[], functions=[count(distinct foo#3)])
+- Exchange SinglePartition
   +- *(2) HashAggregate(keys=[], functions=[partial_count(distinct foo#3)])
      +- *(2) HashAggregate(keys=[foo#3], functions=[])
         +- Exchange hashpartitioning(foo#3, 200)
            +- *(1) HashAggregate(keys=[foo#3], functions=[])
               +- LocalTableScan [foo#3]
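The same explain trick can be used to check the second suggestion above: if the DataFrame is already hash-partitioned on the counted column, the planner can reuse that partitioning and the inner Exchange hashpartitioning step should drop out of the plan. A sketch, reusing the df and foo column from the example above and worth verifying on your own Spark version:

import org.apache.spark.sql.functions.{col, countDistinct}

// Pre-partition on the counted column, then compare the physical plans.
val prePartitioned = df.repartition(col("foo"))
prePartitioned.select(countDistinct(col("foo"))).explain()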

Thanks ollik1. I guess I can't tune it any further then. Also, when I use explain or toDebugString, the plan often gets truncated and shown with "...". Is there a way to see the full plan?