How to use countDistinct in Scala with Spark?


I have tried to use the countDistinct function, which should be available in Spark 1.5. However, I got the following exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: undefined function countDistinct;
I found a suggestion to use the count and distinct functions instead, to get the same result that countDistinct should produce:
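A minimal sketch of that workaround (df, colA and colB are placeholder names; note that in SQL expressions the aggregate is spelled count(distinct ...), so expr("countDistinct(colB)") does not resolve, which is one way to hit the exception above):

import org.apache.spark.sql.functions.expr

// count(distinct ...) is the SQL spelling of the same aggregation
df.groupBy("colA").agg(expr("count(distinct colB)"))

// or, without grouping: deduplicate first, then count
df.select("colB").distinct().count()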


Not sure I really understood your question, but here is an example of the countDistinct aggregate function:

import org.apache.spark.sql.functions.countDistinct

val values = Array((1, 2), (1, 3), (2, 2), (1, 2))
val myDf = sc.parallelize(values).toDF("id", "foo")
myDf.groupBy('id).agg(countDistinct('foo) as 'distinctFoo).show()
/**
+---+-----------+
| id|distinctFoo|
+---+-----------+
|  1|          2|
|  2|          1|
+---+-----------+
*/
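For completeness, a sketch of the same aggregation without grouping, assuming the myDf defined above (the distinct values of foo are 2 and 3):

myDf.agg(countDistinct('foo)).show()
/**
+-------------------+
|COUNT(DISTINCT foo)|
+-------------------+
|                  2|
+-------------------+
*/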

countDistinct can be used in two different forms:

df.groupBy("A").agg(expr("count(distinct B)")

However, neither of these forms works when you want to combine them with a custom UDAF (implemented as a UserDefinedAggregateFunction in Spark 1.5) on the same column:

// Assume that we have already implemented and registered a StdDev UDAF
df.groupBy("A").agg(countDistinct("B"), expr("StdDev(B)"))

// Will cause
Exception in thread "main" org.apache.spark.sql.AnalysisException: StdDev is implemented based on the new Aggregate Function interface and it cannot be used with functions implemented based on the old Aggregate Function interface.;

Given these limitations, the most reasonable approach seems to be implementing countDistinct as a UDAF itself: that way all the aggregates go through the same (new) interface, and countDistinct can be used together with other UDAFs.

A sample implementation could look like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class CountDistinct extends UserDefinedAggregateFunction {
  // Input: a single string column to count distinct values of
  override def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)

  // Intermediate buffer: the values seen so far, kept as an array
  override def bufferSchema: StructType = StructType(
    StructField("items", ArrayType(StringType, true)) :: Nil
  )

  // Result type: the number of distinct values
  override def dataType: DataType = IntegerType

  // The same input always produces the same result
  override def deterministic: Boolean = true

  // Start from an empty collection
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Seq[String]()
  }

  // Add the incoming value, deduplicating through a Set
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = (buffer.getSeq[String](0).toSet + input.getString(0)).toSeq
  }

  // Union the partial results from two partitions
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = (buffer1.getSeq[String](0).toSet ++ buffer2.getSeq[String](0).toSet).toSeq
  }

  // The distinct count is the size of the accumulated set
  override def evaluate(buffer: Row): Any = {
    buffer.getSeq[String](0).length
  }
}
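A usage sketch (hypothetical names, assuming a Spark 1.5 sqlContext and that the StdDev UDAF from the earlier example has been registered): once registered, this UDAF can be mixed freely with other UDAFs, which is exactly what failed with the built-in countDistinct:

// Sketch only: register the UDAF under a hypothetical name and combine it with another UDAF
sqlContext.udf.register("countDistinctUDAF", new CountDistinct)
df.groupBy("A").agg(expr("countDistinctUDAF(B)"), expr("StdDev(B)"))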

Comments:
- Do you mean groupBy and count?
- I want to use it as: dataframe.groupBy("colA").agg(expr("countDistinct(colB)"))
- Can you share your imports and what you are trying to do? It works for me.
- It works, but only for simple cases. When I want to use countDistinct and a custom UDAF on the same column, it fails because of the differences between the two interfaces.
df.groupBy("A").agg(expr("count(distinct B)")
df.groupBy("A").agg(countDistinct("B"))
// Assume that we have already implemented and registered StdDev UDAF 
df.groupBy("A").agg(countDistinct("B"), expr("StdDev(B)"))

// Will cause
Exception in thread "main" org.apache.spark.sql.AnalysisException: StdDev is implemented based on the new Aggregate Function interface and it cannot be used with functions implemented based on the old Aggregate Function interface.;