Scala: writing an aggregator that accepts any orderable Spark data type

I'm learning about custom Spark aggregators, and I'm trying to implement a "MinN" function that returns an array of the N smallest items in a column. I'd like it to work for integers, doubles, and timestamps.

This works only for Double:

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.{Encoder, Row}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

case class MinN(col: String, cutoff: Int = 5)
  extends Aggregator[Row, ArrayBuffer[Double], ArrayBuffer[Double]] with Serializable {

  def zero = ArrayBuffer[Double]()

  def reduce(acc: ArrayBuffer[Double], x: Row) = {
    val curval = x.getAs[Double](col)
    if (acc.length < cutoff) {
      // Buffer not full yet: just collect the value.
      acc.append(curval)
    } else {
      // Buffer full: replace the largest of the current minima if this value is smaller.
      val maxOfMins = acc.max
      if (curval < maxOfMins) {
        acc(acc.indexOf(maxOfMins)) = curval
      }
    }
    acc
  }

  def merge(acc1: ArrayBuffer[Double], acc2: ArrayBuffer[Double]) =
    (acc1 ++ acc2).sorted.take(cutoff)

  def finish(acc: ArrayBuffer[Double]) = acc

  override def bufferEncoder: Encoder[ArrayBuffer[Double]] = ExpressionEncoder()
  // The output is the buffer itself, so this must encode ArrayBuffer[Double] as well.
  override def outputEncoder: Encoder[ArrayBuffer[Double]] = ExpressionEncoder()
}
I feel like I'm missing something basic. I don't even want the MinN function to be parameterized like this (so that the caller has to write MinN[Double]). I'd like to create something like the built-in min function, which preserves the (Spark) data type of its input.
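To illustrate, this is the call shape I'm after (the minN helper below is hypothetical, shown only to contrast with the built-in min, which adapts to whatever type the column holds):

import org.apache.spark.sql.functions.{col, min}

// Built-in min preserves the input type: a Double column yields a Double,
// a timestamp column yields a timestamp.
dataframe.agg(min(col("volume")))

// Hypothetical: what I'd like MinN usage to look like, with no type
// parameter at the call site.
dataframe.agg(minN(col("volume"), 5))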

Edit

The way I use the MinN aggregator looks like this:

  val minVolume = new MinN[Double]("volume").toColumn
  val p = dataframe.agg(minVolume.name("minVolume"))

I believe Spark can't handle that level of abstraction. You can transform the aggregator into something like this:

import scala.collection.mutable

import org.apache.spark.sql.Encoder
import org.apache.spark.sql.expressions.Aggregator

case class MinN[T : Ordering](cutoff: Int = 5)(
  implicit arrEnc: Encoder[mutable.ArrayBuffer[T]])
  extends Aggregator[T, mutable.ArrayBuffer[T], mutable.ArrayBuffer[T]] with Serializable {

  def zero = mutable.ArrayBuffer[T]()

  // The bodies are stubbed out: the point here is the encoder wiring,
  // not the MinN logic.
  def reduce(acc: mutable.ArrayBuffer[T], x: T) = {
    mutable.ArrayBuffer.empty
  }

  def merge(acc1: mutable.ArrayBuffer[T], acc2: mutable.ArrayBuffer[T]) = {
    mutable.ArrayBuffer.empty
  }

  def finish(acc: mutable.ArrayBuffer[T]) = acc

  override def bufferEncoder: Encoder[mutable.ArrayBuffer[T]] = implicitly
  override def outputEncoder: Encoder[mutable.ArrayBuffer[T]] = implicitly
}
This compiles: you were missing the encoders, so here they are pulled in through the constructor as implicits. But using it in the following example:

val spark = SparkSession.builder().appName("jander").master("local[1]").getOrCreate()

throws an exception at runtime:

Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate [col1#10], [col1#10, minn(MinN(2), None, None, None, newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS value#1, mapobjects(MapObjects_loopValue0, false, DoubleType, assertnotnull(lambdavariable(MapObjects_loopValue0, false, DoubleType, false)), input[0, array<double>, false], Some(class scala.collection.immutable.List)), newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS value#0, StructField(value,ArrayType(DoubleType,false),false), true, 0, 0)[col2] AS a#16];;

Can you give an example of how you create an instance of the MinN[T] version? @Alfilercio I've added an example. Your answer gives the same compile error as before.
import spark.implicits._

val custom = MinN[Double](2).toColumn

val d: Double = 1.1

val df = List(
  ("A", 1.1),
  ("A", 1.2),
  ("A", 1.3)
).toDF("col1", "col2")

df.groupBy("col1").agg(custom("col2") as "a").show()
Exception in thread "main" org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate [col1#10], [col1#10, minn(MinN(2), None, None, None, newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS value#1, mapobjects(MapObjects_loopValue0, false, DoubleType, assertnotnull(lambdavariable(MapObjects_loopValue0, false, DoubleType, false)), input[0, array<double>, false], Some(class scala.collection.immutable.List)), newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS value#0, StructField(value,ArrayType(DoubleType,false),false), true, 0, 0)[col2] AS a#16];;
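One workaround, as a minimal sketch assuming Spark 3.x: functions.udaf can wrap a concrete instance of the aggregator so it can be used on untyped DataFrame columns, which avoids the unresolved-operator error above. The element type still has to be fixed per instance (Double here); this does not give a single instance that adapts to the column's type the way the built-in min does.

import org.apache.spark.sql.functions.{col, udaf}

// Wrap a concrete (Double) instance as a UDAF usable on plain columns.
// The implicit Encoder[ArrayBuffer[Double]] still comes from
// spark.implicits._, as before.
val minNDouble = udaf(MinN[Double](2))

df.groupBy("col1").agg(minNDouble(col("col2")) as "a").show()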