
Apache Spark: UDAF mutable aggregation buffer with MapType(IntegerType -> Tuple2(DoubleType, LongType))


I am implementing a custom reduceByKey on a DataFrame using a UDAF, based on the links below. The goal is to obtain an (accumulator, count) pair per key.

The data is a DataFrame of key/value pairs:

+----+------+
| key| value|
+----+------+
|   1| 500.0|
|   2| 250.0|
|   3| 350.0|
|   1| 250.0|
|   2| 150.0|
+----+------+
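For reference, the intended per-key (sum, count) result can be computed on plain Scala collections. This is only an illustrative sketch, not part of the original post:

```scala
// The sample rows from the question, as plain (key, value) tuples.
val data = Seq((1, 500.0), (2, 250.0), (3, 350.0), (1, 250.0), (2, 150.0))

// Per-key (accumulated sum, count), computed without Spark.
val expected: Map[Int, (Double, Long)] =
  data.groupBy(_._1).map { case (k, vs) => k -> (vs.map(_._2).sum, vs.size.toLong) }
// → Map(1 -> (750.0, 2), 2 -> (400.0, 2), 3 -> (350.0, 1))
```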

Some of the code was taken from here:

The following code is the implementation, using two maps:

MapType(IntegerType -> DoubleType) for the accumulator

MapType(IntegerType -> LongType) for the counter

Now I would like to store both values in a single map, or in any structure that can hold two numbers:

1) MapType(IntegerType -> Tuple2(DoubleType, LongType)), but Tuple2 is not a SQL type

2) A map with a case class acuCount(acu: Double, count: Long), but acuCount is not a SQL type either

3) ArrayType(DoubleType)

4) Or any other structure that can store two numbers
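The combined-buffer bookkeeping itself is straightforward. The sketch below models it with a hypothetical case class on plain Scala maps (no Spark); inside an actual UDAF buffer, the value would have to be declared as a StructType rather than a case class, since case classes are not SQL types:

```scala
// Hypothetical pair type: accumulated sum plus row count for one key.
case class AcuCount(acu: Double, count: Long)

// Fold one (key, value) row into the buffer (the "update" step).
def updateBuf(buf: Map[Int, AcuCount], key: Int, value: Double): Map[Int, AcuCount] = {
  val cur = buf.getOrElse(key, AcuCount(0.0, 0L))
  buf + (key -> AcuCount(cur.acu + value, cur.count + 1L))
}

// Merge two partial buffers (the "merge" step of the aggregation).
def mergeBuf(a: Map[Int, AcuCount], b: Map[Int, AcuCount]): Map[Int, AcuCount] =
  b.foldLeft(a) { case (acc, (k, v)) =>
    val cur = acc.getOrElse(k, AcuCount(0.0, 0L))
    acc + (k -> AcuCount(cur.acu + v.acu, cur.count + v.count))
  }
```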

Then I want to return the map, or if possible, another DataFrame:

+----+-------+-------+
| key|    acc|  count|
+----+-------+-------+
|   1|  750.0|      2|
|   2|  400.0|      2|
|   3|  350.0|      1|
+----+-------+-------+

The following code uses the two maps, but it is incomplete because evaluate only returns one of them:

class GroupByAccCount extends org.apache.spark.sql.expressions.UserDefinedAggregateFunction {

 // Input Data Type Schema: key,value
  def inputSchema = new org.apache.spark.sql.types.StructType().add("k", org.apache.spark.sql.types.IntegerType).add("v", org.apache.spark.sql.types.DoubleType)

  // Intermediate Schema: map(key: Integer -> Double), map(key: Integer -> Long)
  // Note: the two fields need distinct names.
  def bufferSchema: org.apache.spark.sql.types.StructType = org.apache.spark.sql.types.StructType(
      org.apache.spark.sql.types.StructField("sums",   org.apache.spark.sql.types.MapType(org.apache.spark.sql.types.IntegerType, org.apache.spark.sql.types.DoubleType)) ::
      org.apache.spark.sql.types.StructField("counts", org.apache.spark.sql.types.MapType(org.apache.spark.sql.types.IntegerType, org.apache.spark.sql.types.LongType)) :: Nil)



  def deterministic: Boolean = true

  def initialize(buffer: org.apache.spark.sql.expressions.MutableAggregationBuffer): Unit = {
    buffer(0) = Map.empty[Int, Double] // key -> accumulator
    buffer(1) = Map.empty[Int, Long]   // key -> count
  }

  //Sequence OP
  def update(buffer: org.apache.spark.sql.expressions.MutableAggregationBuffer, row:  org.apache.spark.sql.Row) : Unit = {
    //Row
        val key = row.getAs[Int](0)
        val value = row.getAs[Double](1)
    //Buffer(0) Map key->Acummulator
        var mpAccum = buffer.getAs[Map[Int,Double]](0)
        var v:Double = mpAccum.getOrElse(key, 0.0)
        v= v + value
        mpAccum = mpAccum  + (key -> v)
        buffer(0) = mpAccum
    //Buffer(1) Map key->Counter
        var mpCount = buffer.getAs[Map[Int,Long]](1)
        var c:Long = mpCount.getOrElse(key, 0)
        mpCount = mpCount  + (key -> (c + 1L))
        buffer(1) = mpCount


  }

  //Combine Op
  // Merge two partial aggregates
  def merge(buffer1: org.apache.spark.sql.expressions.MutableAggregationBuffer, buffer2:  org.apache.spark.sql.Row) : Unit = {
    //Buffer(0) Map key->Acummulator
    var mpAccum1 = buffer1.getAs[Map[Int,Double]](0)
    var mpAccum2 = buffer2.getAs[Map[Int,Double]](0)
    mpAccum2 foreach {
        case (k ,v) => {
            var c:Double = mpAccum1.getOrElse(k, 0.0)
            //c = c + v
            mpAccum1 = mpAccum1 + (k -> (c + v))
        }
    }
    buffer1(0) = mpAccum1
    //Buffer(1) Map key->Counter 
    var mpCounter1 = buffer1.getAs[Map[Int,Long]](1)
    var mpCounter2 = buffer2.getAs[Map[Int,Long]](1)
    mpCounter2 foreach {
        case (k ,v) => {
            var c:Long = mpCounter1.getOrElse(k, 0)
            //c = c + v
            mpCounter1 = mpCounter1 + (k -> (c + v))
        }
    }
    buffer1(1) = mpCounter1
   }


   // Returned Data Type: 
  def dataType: org.apache.spark.sql.types.DataType = org.apache.spark.sql.types.MapType(org.apache.spark.sql.types.IntegerType, org.apache.spark.sql.types.DoubleType)//, org.apache.spark.sql.types.MapType(org.apache.spark.sql.types.IntegerType, org.apache.spark.sql.types.LongType) 


  def evaluate(buffer: org.apache.spark.sql.Row): Any = {
      buffer.getAs[Map[Int,Double]](0)
      //buffer.getAs[Map[Int,Long]](1))
      //Here want to return one map : key->(acc,count) or another dataframe


  }
}
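If the two-map buffer is kept, evaluate could still emit a single combined map key -> (acu, count). The zipping logic is shown below on plain Scala maps as an illustrative sketch; in the actual UDAF, dataType would have to declare a StructType value (e.g. MapType(IntegerType, StructType(acu, count))) and each pair would be emitted as a Row, since tuples are not SQL types:

```scala
// Combine the accumulator map and the counter map into one map of pairs.
// Keys absent from the counter map default to a count of 0.
def combineMaps(acu: Map[Int, Double], cnt: Map[Int, Long]): Map[Int, (Double, Long)] =
  acu.map { case (k, v) => k -> (v, cnt.getOrElse(k, 0L)) }
```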