Spark 3.0: Spark aggregate function gives a different expression than expected

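Question:

For sumDistinct, the printed expression is -> sum(DISTINCT x)
But for countDistinct, the printed expression is -> count(x)

Is this some kind of bug, or is it a feature?

Note: in Spark versions < 3.0, countDistinct gives the correct expression -> count(DISTINCT x)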

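As @Shaido mentioned in the comments, I have verified a few things that point to a bug in toString in the latest Spark code. (It may be a bug or a feature; I'm not completely sure.)

The spark-shell session below shows the Spark 3.0.1 output; the source for Spark versions < 3.X is examined after it.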

/Downloads/spark-3.0.1-bin-hadoop2.7/bin$ ./spark-shell


20/09/23 10:58:45 WARN Utils: Your hostname, byte-nihal resolves to a loopback address: 127.0.1.1; using 192.168.2.103 instead (on interface enp2s0)
20/09/23 10:58:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/09/23 10:58:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.2.103:4040
Spark context available as 'sc' (master = local[*], app id = local-1600838949311).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_265)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> println(countDistinct("x"))
count(x)

scala> println(sumDistinct("x"))
sum(DISTINCT x)

scala> println(sum("x"))
sum(x)

scala> println(count("x"))
count(x)
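
The printed expression only shows how the Column renders itself. Whether the aggregation is actually distinct can be checked in the same spark-shell session. The following is a minimal sketch, assuming a hypothetical three-row sample DataFrame that is not part of the original question:

```scala
// Paste into spark-shell, where `spark` is already defined.
import org.apache.spark.sql.functions.{count, countDistinct}
import spark.implicits._

// Hypothetical sample data: three rows, two distinct values.
val df = Seq(1, 1, 2).toDF("x")

val result = df.agg(
  count("x").as("n_rows"),            // explicit aliases avoid relying on the generated name
  countDistinct("x").as("n_distinct")
).first()

println(result.getLong(0)) // 3: every non-null row is counted
println(result.getLong(1)) // 2: only distinct values are counted
```

Giving each aggregate an explicit alias is one way to avoid depending on the auto-generated expression text when the column is referenced later.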

Looking at the source of countDistinct("x") in Spark versions before 3.0:

  def countDistinct(columnName: String, columnNames: String*): Column =
    countDistinct(Column(columnName), columnNames.map(Column.apply) : _*)

  def countDistinct(expr: Column, exprs: Column*): Column = {
    withAggregateFunction(Count.apply((expr +: exprs).map(_.expr)), isDistinct = true)
  }

As the second overload shows, the Count.apply aggregate function is used with isDistinct = true, so that only distinct values are counted.

withAggregateFunction returns a Column:

  private def withAggregateFunction(
      func: AggregateFunction,
      isDistinct: Boolean = false): Column = {
    Column(func.toAggregateExpression(isDistinct))
  }

The toString method of Column goes through toPrettySQL, which calls the .sql method on the AggregateExpression:

  def toPrettySQL(e: Expression): String = usePrettyExpression(e).sql

AggregateExpression in turn calls back into the sql method of its aggregateFunction:

  override def sql: String = aggregateFunction.sql(isDistinct)

In our case the AggregateFunction is Count, and the sql method that gets called is:

  def sql(isDistinct: Boolean): String = {
    val distinct = if (isDistinct) "DISTINCT " else ""
    s"$prettyName($distinct${children.map(_.sql).mkString(", ")})"
  }

According to this code, the result should be count(DISTINCT x).

In Spark versions >= 3.X, the source shows slightly different toString behaviour. countDistinct now uses UnresolvedFunction instead of withAggregateFunction:

  @scala.annotation.varargs
  def countDistinct(expr: Column, exprs: Column*): Column =
    // For usage like countDistinct("*"), we should let analyzer expand star and
    // resolve function.
    Column(UnresolvedFunction("count", (expr +: exprs).map(_.expr), isDistinct = true))

The toString method of UnresolvedFunction is very simple:

  override def toString: String = s"'$name(${children.mkString(", ")})"

This prints count(x), which is why the output you are getting is count(x).

Comments:

It is probably related to how countDistinct is implemented. It needs to allow multiple input columns as well as "*", whereas sum and sumDistinct only work on a single column at a time. Note in particular that the UnresolvedFunction in countDistinct is named count; could that be the reason?

@Shaido it is implemented the same way in Spark 2.4.

No, the 2.4 version looks different: it does not use UnresolvedFunction. The change came with a merge that allows "*" as input.

OK, my bad. It is implemented differently.
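
To make the explanation in the answer concrete, the two rendering paths can be imitated without Spark. This is only a sketch: AggregateLike and UnresolvedLike are hypothetical stand-ins invented for illustration, not Spark's classes, and the leading quote mark from the quoted toString is left out for simplicity.

```scala
// Simplified stand-ins for illustration only; these are NOT Spark's classes.
object DistinctRenderingSketch {

  // Models the pre-3.0 path: the rendering method reads the isDistinct flag.
  final case class AggregateLike(name: String, args: Seq[String], isDistinct: Boolean) {
    def sql: String = {
      val distinct = if (isDistinct) "DISTINCT " else ""
      s"$name($distinct${args.mkString(", ")})"
    }
  }

  // Models the 3.x path: the flag is stored, but the rendering never reads it.
  final case class UnresolvedLike(name: String, args: Seq[String], isDistinct: Boolean) {
    override def toString: String = s"$name(${args.mkString(", ")})"
  }

  def main(args: Array[String]): Unit = {
    println(AggregateLike("count", Seq("x"), isDistinct = true).sql) // count(DISTINCT x)
    println(UnresolvedLike("count", Seq("x"), isDistinct = true))    // count(x)
  }
}
```

In both stand-ins the isDistinct flag is set to true; only the string rendering differs, which matches the answer's point that the printed text changed rather than the flag passed by countDistinct.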