Spark custom aggregator >= 2.0 (Scala)
I have the following Dataset:
val myDS = List(("a",1,1.1), ("b",2,1.2), ("a",3,3.1), ("b",4,1.4), ("a",5,5.1)).toDS
// and aggregation
// myDS.groupByKey(t2 => t2._1).agg(myAvg).collect()
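I want to write a custom aggregate function myAvg that takes a Tuple3 argument and returns sum(u._2)/sum(u._3).
I know this can be computed in other ways, but I want to write a custom aggregator.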
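For reference, the value myAvg is meant to produce per key can be sketched in plain Scala collections, without Spark. This is only a minimal sketch for checking results: the names MyAvgReference and ratioByKey are hypothetical and are not part of the question.

```scala
object MyAvgReference {
  // Same sample rows as the Dataset above: (key, Int value, Double value).
  val rows: List[(String, Int, Double)] =
    List(("a", 1, 1.1), ("b", 2, 1.2), ("a", 3, 3.1), ("b", 4, 1.4), ("a", 5, 5.1))

  // Per key: sum of the Int field divided by sum of the Double field.
  def ratioByKey(data: List[(String, Int, Double)]): Map[String, Double] =
    data.groupBy(_._1).map { case (key, group) =>
      // Fold the group into a running (Int sum, Double sum) pair,
      // which is the same shape as the aggregator's buffer.
      val (intSum, doubleSum) = group.foldLeft((0, 0.0)) {
        case ((i, d), (_, x, y)) => (i + x, d + y)
      }
      (key, intSum / doubleSum)
    }

  def main(args: Array[String]): Unit =
    ratioByKey(rows).toList.sortBy(_._1).foreach(println)
}
```

The fold starts from (0, 0.0) and adds one row at a time, which corresponds to the zero and reduce steps a custom Aggregator has to provide.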
I wrote something like this:
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}
val myAvg = new Aggregator[Tuple3[String, Integer, Double],
Tuple2[Integer,Double],
Double] {
def zero: Tuple2[Integer,Double] = Tuple2(0,0.0)
def reduce(agg: Tuple2[Integer,Double],
a: Tuple3[String, Integer,Double]): Tuple2[Integer,Double] =
Tuple2(agg._1 + a._2, agg._2 + a._3)
def merge(agg1: Tuple2[Integer,Double],
agg2: Tuple2[Integer,Double]): Tuple2[Integer,Double] =
Tuple2(agg1._1 + agg2._1, agg1._2 + agg2._2)
def finish(res: Tuple2[Integer,Double]): Double = res._1/res._2
def bufferEncoder: Encoder[(Integer, Double)] =
Encoders.tuple(Encoders.INT, Encoders.scalaDouble)
def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}.toColumn()
Unfortunately, I get the following error:
java.lang.RuntimeException: Unsupported literal type class scala.runtime.BoxedUnit ()
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:75)
at org.apache.spark.sql.functions$.lit(functions.scala:101)
at org.apache.spark.sql.Column.apply(Column.scala:217)
What is wrong?
On my local Spark 2.1 I also get a warning:
warning: there was one deprecation warning; re-run with -deprecation for details
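What in my code is deprecated?
Thanks for any advice.
The problem here seems to be that you are using Java's Integer rather than Scala's Int. If you replace every use of Integer in the aggregator implementation with Int (and Encoders.INT with Encoders.scalaInt), it works as expected: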
import org.apache.spark.sql.TypedColumn
val myAvg: TypedColumn[(String, Int, Double), Double] = new Aggregator[(String, Int, Double), (Int, Double), Double] {
def zero: (Int, Double) = Tuple2(0,0.0)
def reduce(agg: (Int, Double), a: (String, Int, Double)): (Int, Double) =
(agg._1 + a._2, agg._2 + a._3)
def merge(agg1: (Int, Double), agg2: (Int, Double)): (Int, Double) =
(agg1._1 + agg2._1, agg1._2 + agg2._2)
def finish(res: (Int, Double)): Double = res._1/res._2
def bufferEncoder: Encoder[(Int, Double)] =
Encoders.tuple(Encoders.scalaInt, Encoders.scalaDouble)
def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}.toColumn
(I also applied some syntactic cleanup, removing the explicit Tuple references.)
A minute later... the error message was right, though: the problem was toColumn() => toColumn
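The last remark fits the stack trace in the question, which goes through Column.apply and functions.lit before failing on a BoxedUnit literal. Aggregator.toColumn is declared without a parameter list, so a trailing () is applied to the returned column rather than to toColumn itself. The sketch below reproduces that shape in plain Scala. FakeColumn and FakeAggregator are hypothetical stand-ins for illustration only, not Spark classes.

```scala
// Hypothetical stand-in for a column type whose apply accepts any value,
// mirroring the shape of Column.apply(extraction: Any).
final class FakeColumn(val name: String) {
  def apply(extraction: Any): String = s"${name}[${extraction}]"
}

object FakeAggregator {
  // Declared without a parameter list, like Aggregator.toColumn.
  def toColumn: FakeColumn = new FakeColumn("myAvg")
}

object ToColumnPitfall {
  def main(args: Array[String]): Unit = {
    // Intended call: no parentheses, yields the column itself.
    val column: FakeColumn = FakeAggregator.toColumn

    // What FakeAggregator.toColumn() amounts to on Scala 2.11: the empty
    // argument list is adapted to the Unit value () and passed to apply.
    // It is written out explicitly here so the sketch compiles on newer
    // compilers that reject the adaptation.
    val accidental: String = FakeAggregator.toColumn.apply(())

    println(column.name)
    println(accidental)
  }
}
```

On Scala 2.11 the compiler reports this kind of argument-list adaptation as deprecated, which is likely also where the deprecation warning in the question comes from. In Spark the Unit value then reaches functions.lit, which has no literal type for it.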