Custom Spark Aggregator returning a Row

I'm trying to modify the example to work with arbitrary rows. The goal is to return the "latest" row of a group.

The aggregator keeps a (date, row) pair as its buffer and is implemented like this:

import org.apache.spark.sql.{Encoder, Encoders, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.types.StructType

// f extracts the ordering key (a date string) from a row; schema describes the rows being aggregated.
class Latest(val f: Row => String, val schema: StructType) extends Aggregator[Row, (String, Row), Row] {
  override def zero: (String, Row) = ("0000-00-00", null)
  override def reduce(b: (String, Row), a: Row): (String, Row) = merge(b, (f(a), a))
  override def merge(b1: (String, Row), b2: (String, Row)): (String, Row) = Seq(b1, b2).maxBy(_._1)
  override def finish(reduction: (String, Row)): Row = reduction._2

  override def bufferEncoder: Encoder[(String, Row)] = Encoders.product[(String, Row)]
  override def outputEncoder: Encoder[Row] = RowEncoder(schema)
}

I'm testing the aggregator with the following code:

import org.apache.spark.sql.SparkSession
import org.scalatest.FunSpec
// DataFrameComparer is from the spark-fast-tests library; SparkSessionTestWrapper
// is a local test trait providing the `spark` session used by the implicits import.
import com.github.mrpowers.spark.fast.tests.DataFrameComparer

class AggregatorSpec
    extends FunSpec
    with DataFrameComparer
    with SparkSessionTestWrapper {

  import spark.implicits._

  describe("main") {

    it("works") {

        val spark = SparkSession
          .builder
          .master("local")
          .appName("common typed aggregator implementations")
          .getOrCreate()

        val df = Seq(
          ("ham", "2019-01-01", 3L, "Yah"),
          ("cheese", "2018-12-31", 4L, "Woo"),       
          ("fish", "2019-01-02", 5L, "Hah"),
          ("grain", "2019-01-01", 6L, "Community"),
          ("grain", "2019-01-02", 7L, "Community"),
          ("ham", "2019-01-04", 3L, "jamón")
        ).toDF("Key", "Date", "Numeric", "Text")

        println("input data:")
        df.show()

        println("running latest:")
        df.groupByKey(_.getString(0)).agg(new Latest(_.getString(1), df.schema).toColumn).show()

        spark.stop()
    }
  }
}
Running the code above produces the following error:

[info] - runs *** FAILED ***
[info]   java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
[info] - field (class: "org.apache.spark.sql.Row", name: "_2")
[info] - root class: "scala.Tuple2"
[info]   at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:625)
[info]   at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:619)
[info]   at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
[info]   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
[info]   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
[info]   at scala.collection.immutable.List.foreach(List.scala:381)
[info]   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
[info]   at scala.collection.immutable.List.flatMap(List.scala:344)
[info]   at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
[info]   at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)

I'm fairly new to both Spark and Scala, and I'm not even sure whether what I'm trying to achieve is possible.

The problem is in the creation of bufferEncoder: Encoders.product derives an encoder through compile-time reflection, and it cannot derive one for the untyped Row field of the tuple, which is exactly what the error reports. Change it to combine the built-in string encoder with an explicit RowEncoder:

override def bufferEncoder: Encoder[(String, Row)] = Encoders.tuple(Encoders.STRING, RowEncoder(schema))
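
With that change, the grouping call from the test runs as intended. A minimal sketch of the call site, reusing the question's df and the corrected Latest:

// Group by the first column (Key) and keep, per group, the row whose Date is greatest.
df.groupByKey(_.getString(0))
  .agg(new Latest(_.getString(1), df.schema).toColumn)
  .show(false)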
I hope this is just a simple example that you wanted to try out with an Aggregator. If not, there is an alternative way to achieve the same result without an Aggregator:

import org.apache.spark.sql.functions.{max, struct}

df.groupBy("Key").agg(max(struct("Date", "Numeric", "Text", "Key")))
  .show(false)
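
This works because Spark compares structs field by field from left to right, so with Date as the first field, max selects the latest row in each group. To get flat columns back instead of a nested struct, the struct can be aliased and expanded. A sketch reusing df and the imports above (the latest alias is illustrative, and Key is left out of the struct because it is already the grouping column):

df.groupBy("Key")
  .agg(max(struct("Date", "Numeric", "Text")).as("latest"))
  .select("Key", "latest.*")
  .show(false)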