Scala: How to add/change a Map object in a MutableAggregationBuffer in a UDAF?

I'm using Spark 2.0.1 and Scala 2.11.

This is a question about user-defined aggregate functions (UDAFs) in Spark. I'm using the example provided in an answer to ask my question:

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.{Row, Column}

object DummyUDAF extends UserDefinedAggregateFunction {
  def inputSchema = new StructType().add("x", StringType)
  def bufferSchema = new StructType()
    .add("buff", ArrayType(LongType))
    .add("buff2", ArrayType(DoubleType))
  def dataType = new StructType()
    .add("xs", ArrayType(LongType))
    .add("ys", ArrayType(DoubleType))
  def deterministic = true 
  def initialize(buffer: MutableAggregationBuffer) = {}
  def update(buffer: MutableAggregationBuffer, input: Row) = {}
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {}
  def evaluate(buffer: Row) = (Array(1L, 2L, 3L), Array(1.0, 2.0, 3.0))
}
I can easily return multiple Maps instead of Arrays, but I haven't been able to mutate the Map in the update method.

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.{Row, Column}
import scala.collection.mutable.Map

object DummyUDAF extends UserDefinedAggregateFunction {
  def inputSchema = new StructType().add("x", DoubleType).add("y", IntegerType)
  def bufferSchema = new StructType()
    .add("buff", MapType(DoubleType, IntegerType))
    .add("buff2", MapType(DoubleType, IntegerType))
  def dataType = new StructType()
    .add("xs", MapType(DoubleType, IntegerType))
    .add("ys", MapType(DoubleType, IntegerType))
  def deterministic = true
  def initialize(buffer: MutableAggregationBuffer) = {
    buffer(0) = scala.collection.mutable.Map[Double, Int]()
    buffer(1) = scala.collection.mutable.Map[Double, Int]()
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0).asInstanceOf[Map[Double, Int]](input.getDouble(0)) = input.getInt(1)
    buffer(1).asInstanceOf[Map[Double, Int]](input.getDouble(0) * 10) = input.getInt(1) * 10
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    buffer1(0).asInstanceOf[Map[Double, Int]] ++= buffer2(0).asInstanceOf[Map[Double, Int]]
    buffer1(1).asInstanceOf[Map[Double, Int]] ++= buffer2(1).asInstanceOf[Map[Double, Int]]
  }
  //def evaluate(buffer: Row) = (Map(1.0 -> 10, 2.0 -> 20), Map(10.0 -> 100, 11.0 -> 110))
  def evaluate(buffer: Row) = (buffer(0).asInstanceOf[Map[Double, Int]], buffer(1).asInstanceOf[Map[Double, Int]])
}
This compiles fine, but fails with a runtime error:

val df = Seq((1.0, 1), (2.0, 2)).toDF("k", "v")
df.select(DummyUDAF($"k", $"v")).show(1, false)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 70.0 failed 4 times, most recent failure: Lost task 1.3 in stage 70.0 (TID 204, 10.91.252.25): java.lang.ClassCastException: scala.collection.immutable.Map$EmptyMap$ cannot be cast to scala.collection.mutable.Map
Another solution discussed elsewhere indicates that this may be a problem caused by a MapType inside the StructType (as the exception shows, Spark hands the buffer back as a scala.collection.immutable.Map, so the cast to mutable.Map fails). However, when I tried the solution mentioned there, I still get the same error:

val distudaf = new DistinctValues
val df = Seq(("a", "a1"), ("a", "a1"), ("a", "a2"), ("b", "b1"), ("b", "b2"), ("b", "b3"), ("b", "b1"), ("b", "b1")).toDF("col1", "col2")
df.groupBy("col1").agg(distudaf($"col2").as("DV")).show
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 22.0 failed 4 times, most recent failure: Lost task 1.3 in stage 22.0 (TID 100, 10.91.252.25): java.lang.ClassCastException: scala.collection.immutable.Map$EmptyMap$ cannot be cast to scala.collection.mutable.Map
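
(The DistinctValues UDAF itself isn't shown in the question. Purely for illustration, here is a hypothetical sketch of what it might look like; the class name, schema, and logic are all assumptions, chosen so that it casts the buffer to mutable.Map the same way and hits the same error:)

import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Hypothetical reconstruction, not the actual class from the linked discussion:
// collects the distinct values of a string column, keyed in a map buffer.
class DistinctValues extends UserDefinedAggregateFunction {
  def inputSchema = new StructType().add("value", StringType)
  def bufferSchema = new StructType().add("items", MapType(StringType, LongType))
  def dataType = ArrayType(StringType)
  def deterministic = true
  def initialize(buffer: MutableAggregationBuffer) = {
    buffer(0) = mutable.Map[String, Long]()
  }
  def update(buffer: MutableAggregationBuffer, input: Row) = {
    // Throws the ClassCastException above: Spark hands back an immutable Map
    buffer(0).asInstanceOf[mutable.Map[String, Long]](input.getString(0)) = 1L
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    buffer1(0).asInstanceOf[mutable.Map[String, Long]] ++=
      buffer2(0).asInstanceOf[mutable.Map[String, Long]]
  }
  def evaluate(buffer: Row) = buffer.getAs[Map[String, Long]](0).keys.toArray
}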

My preference would be to mutate the map, since I expect the maps to be huge, and copying and re-assigning them could cause performance/memory bottlenecks.

My limited understanding of UDAFs is that you should only set what you want to (semantically) update, i.e. take what is already set in the MutableAggregationBuffer, combine it with what you want to add, and =-assign the result (which calls update(i: Int, value: Any): Unit under the covers).
Your code could look as follows:

def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
  val newBuffer0 = buffer(0).asInstanceOf[Map[Double, Int]]
  buffer(0) = newBuffer0 + (input.getDouble(0) -> input.getInt(1))

  val newBuffer1 = buffer(1).asInstanceOf[Map[Double, Int]]
  buffer(1) = newBuffer1 + (input.getDouble(0) * 10 -> input.getInt(1) * 10)
}
The complete DummyUDAF could look as follows:

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.{Row, Column}

object DummyUDAF extends UserDefinedAggregateFunction {
  def inputSchema = new StructType().add("x", DoubleType).add("y", IntegerType)
  def bufferSchema = new StructType()
    .add("buff", MapType(DoubleType, IntegerType))
    .add("buff2", MapType(DoubleType, IntegerType))

  def dataType = new StructType()
    .add("xs", MapType(DoubleType, IntegerType))
    .add("ys", MapType(DoubleType, IntegerType))

  def deterministic = true 

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer(0) = Map[Double,Int]()
    buffer(1) = Map[Double,Int]()
  }

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val newBuffer0 = buffer(0).asInstanceOf[Map[Double, Int]]
    buffer(0) = newBuffer0 + (input.getDouble(0) -> input.getInt(1))

    val newBuffer1 = buffer(1).asInstanceOf[Map[Double, Int]]
    buffer(1) = newBuffer1 + (input.getDouble(0) * 10 -> input.getInt(1) * 10)
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    buffer1(0) = buffer1(0).asInstanceOf[Map[Double,Int]] ++ buffer2(0).asInstanceOf[Map[Double,Int]]
    buffer1(1) = buffer1(1).asInstanceOf[Map[Double,Int]] ++ buffer2(1).asInstanceOf[Map[Double,Int]]
  }

  //def evaluate(buffer: Row) = (Map(1.0->10,2.0->20), Map(10.0->100,11.0->110))
  def evaluate(buffer: Row) = (buffer(0).asInstanceOf[Map[Double,Int]], buffer(1).asInstanceOf[Map[Double,Int]])
}
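
With that, the example from the question should run cleanly. A quick sanity check, assuming a spark-shell session (so spark.implicits._ is in scope):

val df = Seq((1.0, 1), (2.0, 2)).toDF("k", "v")
// expected: xs = Map(1.0 -> 1, 2.0 -> 2), ys = Map(10.0 -> 10, 20.0 -> 20)
df.select(DummyUDAF($"k", $"v")).show(1, false)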

Late to the party. I just discovered that one can use

override def bufferSchema: StructType = StructType(List(
    StructField("map", ObjectType(classOf[mutable.Map[String, Long]]))
))

to get a mutable.Map in the buffer.
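
For illustration, here is a minimal sketch built on that trick. The CountByKey name, schema, and counting logic are my own assumptions, and note that ObjectType is an internal Spark type, so (as the comment below points out) this can fail with "Unsupported data type" on some Spark versions:

import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Hypothetical example: counts occurrences per key, mutating the buffer map in place.
object CountByKey extends UserDefinedAggregateFunction {
  def inputSchema: StructType = new StructType().add("k", StringType)
  def bufferSchema: StructType = StructType(List(
    StructField("map", ObjectType(classOf[mutable.Map[String, Long]]))
  ))
  def dataType: DataType = MapType(StringType, LongType)
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = mutable.Map.empty[String, Long]
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val m = buffer.getAs[mutable.Map[String, Long]](0)
    val k = input.getString(0)
    m(k) = m.getOrElse(k, 0L) + 1L  // in-place mutation, no re-assignment
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val m = buffer1.getAs[mutable.Map[String, Long]](0)
    buffer2.getAs[mutable.Map[String, Long]](0).foreach { case (k, v) =>
      m(k) = m.getOrElse(k, 0L) + v
    }
  }
  def evaluate(buffer: Row): Any = buffer.getAs[mutable.Map[String, Long]](0).toMap
}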

It's worth pointing out that this copies the data twice on every call to update, so it doesn't really solve the problem.

Could you give a complete example? I imported scala.collection.mutable and tried your solution, but I get an error at runtime:
org.apache.spark.SparkException: Unsupported data type ObjectType(interface scala.collection.mutable.Map)