How to store custom objects in a Dataset in Scala?


According to:

As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically: ... Custom encoders – while we currently autogenerate encoders for a wide variety of types, we'd like to open up an API for custom objects.

and attempting to store a custom type in a Dataset leads to an error like:

Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases

or:

java.lang.UnsupportedOperationException: No Encoder found for ...

Are there any existing workarounds?


Note: this question exists only as an entry point for a Community Wiki answer. Feel free to update/improve both the question and the answers.

  • Using generic encoders

    There are currently two generic encoders available, kryo and javaSerialization, the latter of which is explicitly described as:

    extremely inefficient and should only be used as the last resort.

    Given the following class:

    class Bar(i: Int) {
      override def toString = s"bar $i"
      def bar = i
    }
    
    you can use these encoders by adding an implicit encoder:

    object BarEncoders {
      implicit def barEncoder: org.apache.spark.sql.Encoder[Bar] = 
      org.apache.spark.sql.Encoders.kryo[Bar]
    }
    
    which can then be used as follows:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object Main {
      def main(args: Array[String]) {
        val sc = new SparkContext("local",  "test", new SparkConf())
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._
        import BarEncoders._
    
        val ds = Seq(new Bar(1)).toDS
        ds.show
    
        sc.stop()
      }
    }
    
    It stores objects as a binary column, so when converted to a DataFrame you get the following schema:

    root
     |-- value: binary (nullable = true)
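
    For completeness, the other generic encoder, Encoders.javaSerialization, is used in exactly the same way and also produces a single binary column. A minimal sketch (not part of the original answer), assuming Bar is changed to extend java.io.Serializable:

    object BarJavaEncoders {
      // The "last resort" generic encoder: plain Java serialization.
      // Bar must extend java.io.Serializable for this to work at runtime.
      implicit def barEncoder: org.apache.spark.sql.Encoder[Bar] =
        org.apache.spark.sql.Encoders.javaSerialization[Bar]
    }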
    
    It is also possible to encode tuples using the kryo encoder for a specific field:

    val longBarEncoder = Encoders.tuple(Encoders.scalaLong, Encoders.kryo[Bar])
    
    spark.createDataset(Seq((1L, new Bar(1))))(longBarEncoder)
    // org.apache.spark.sql.Dataset[(Long, Bar)] = [_1: bigint, _2: binary]
    
    Note that here we do not rely on implicit encoders but pass the encoder explicitly, so this will most likely not work with the toDS method (see the sketch just below for one possible way around that).
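
    A possible workaround (a sketch, not part of the original answer): declare the tuple encoder as an implicit val so that toDS can resolve it. This assumes a SparkSession named spark (as in the snippet above) and that the locally declared implicit takes precedence over the imported Product encoder, which matches the behaviour of the kryo spark-shell example further down:

    import org.apache.spark.sql.{Encoder, Encoders}
    import spark.implicits._

    // Local implicit encoder for the tuple; kryo handles the Bar field.
    implicit val longBarEncoder: Encoder[(Long, Bar)] =
      Encoders.tuple(Encoders.scalaLong, Encoders.kryo[Bar])

    val ds = Seq((1L, new Bar(1))).toDS  // Dataset[(Long, Bar)] = [_1: bigint, _2: binary]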

  • Using implicit conversions:

    Provide implicit conversions between a representation that can be encoded and the custom class, for example:

    object BarConversions {
      implicit def toInt(bar: Bar): Int = bar.bar
      implicit def toBar(i: Int): Bar = new Bar(i)
    }
    
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext

    object Main {
      def main(args: Array[String]) {
        val sc = new SparkContext("local",  "test", new SparkConf())
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._
        import BarConversions._
    
        type EncodedBar = Int
    
        val bars: RDD[EncodedBar]  = sc.parallelize(Seq(new Bar(1)))
        val barsDS = bars.toDS
    
        barsDS.show
        barsDS.map(_.bar).show
    
        sc.stop()
      }
    }
    
  • Related questions:

    Encoders work more or less the same way in Spark 2.0, and Kryo is still the recommended serialization choice.

    You can look at the following example with spark-shell:

    scala> import spark.implicits._
    import spark.implicits._
    
    scala> import org.apache.spark.sql.Encoders
    import org.apache.spark.sql.Encoders
    
    scala> case class NormalPerson(name: String, age: Int) {
     |   def aboutMe = s"I am ${name}. I am ${age} years old."
     | }
    defined class NormalPerson
    
    scala> case class ReversePerson(name: Int, age: String) {
     |   def aboutMe = s"I am ${name}. I am ${age} years old."
     | }
    defined class ReversePerson
    
    scala> val normalPersons = Seq(
     |   NormalPerson("Superman", 25),
     |   NormalPerson("Spiderman", 17),
     |   NormalPerson("Ironman", 29)
     | )
    normalPersons: Seq[NormalPerson] = List(NormalPerson(Superman,25), NormalPerson(Spiderman,17), NormalPerson(Ironman,29))
    
    scala> val ds1 = sc.parallelize(normalPersons).toDS
    ds1: org.apache.spark.sql.Dataset[NormalPerson] = [name: string, age: int]
    
    scala> val ds2 = ds1.map(np => ReversePerson(np.age, np.name))
    ds2: org.apache.spark.sql.Dataset[ReversePerson] = [name: int, age: string]
    
    scala> ds1.show()
    +---------+---+
    |     name|age|
    +---------+---+
    | Superman| 25|
    |Spiderman| 17|
    |  Ironman| 29|
    +---------+---+
    
    scala> ds2.show()
    +----+---------+
    |name|      age|
    +----+---------+
    |  25| Superman|
    |  17|Spiderman|
    |  29|  Ironman|
    +----+---------+
    
    scala> ds1.foreach(p => println(p.aboutMe))
    I am Ironman. I am 29 years old.
    I am Superman. I am 25 years old.
    I am Spiderman. I am 17 years old.
    
    scala> val ds2 = ds1.map(np => ReversePerson(np.age, np.name))
    ds2: org.apache.spark.sql.Dataset[ReversePerson] = [name: int, age: string]
    
    scala> ds2.foreach(p => println(p.aboutMe))
    I am 17. I am Spiderman years old.
    I am 25. I am Superman years old.
    I am 29. I am Ironman years old.
    
    So far there was no suitable encoder in scope, so our persons were not encoded as binary values. But that will change once we provide some implicit encoders using Kryo serialization:

    // Provide Encoders
    
    scala> implicit val normalPersonKryoEncoder = Encoders.kryo[NormalPerson]
    normalPersonKryoEncoder: org.apache.spark.sql.Encoder[NormalPerson] = class[value[0]: binary]
    
    scala> implicit val reversePersonKryoEncoder = Encoders.kryo[ReversePerson]
    reversePersonKryoEncoder: org.apache.spark.sql.Encoder[ReversePerson] = class[value[0]: binary]
    
    // Encoders will be used since they are now present in scope
    
    scala> val ds3 = sc.parallelize(normalPersons).toDS
    ds3: org.apache.spark.sql.Dataset[NormalPerson] = [value: binary]
    
    scala> val ds4 = ds3.map(np => ReversePerson(np.age, np.name))
    ds4: org.apache.spark.sql.Dataset[ReversePerson] = [value: binary]
    
    // now all our persons show up as binary values
    scala> ds3.show()
    +--------------------+
    |               value|
    +--------------------+
    |[01 00 24 6C 69 6...|
    |[01 00 24 6C 69 6...|
    |[01 00 24 6C 69 6...|
    +--------------------+
    
    scala> ds4.show()
    +--------------------+
    |               value|
    +--------------------+
    |[01 00 24 6C 69 6...|
    |[01 00 24 6C 69 6...|
    |[01 00 24 6C 69 6...|
    +--------------------+
    
    // Our instances still work as expected    
    
    scala> ds3.foreach(p => println(p.aboutMe))
    I am Ironman. I am 29 years old.
    I am Spiderman. I am 17 years old.
    I am Superman. I am 25 years old.
    
    scala> ds4.foreach(p => println(p.aboutMe))
    I am 25. I am Superman years old.
    I am 29. I am Ironman years old.
    I am 17. I am Spiderman years old.
    

    Update: this answer is still valid and informative, although things have improved since 2.2/2.3, which added built-in encoder support for Set, Seq, Map, Date, Timestamp and BigDecimal. If you stick to making your types out of case classes and the usual Scala types, you should be fine with just the implicits in SQLImplicits.
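
    For example, a minimal sketch (not part of the original answer) of a class that needs no custom encoder on those versions, because every field type is covered by the built-in encoders; Event is a hypothetical class invented for illustration:

    import java.sql.Timestamp
    import org.apache.spark.sql.SparkSession

    // All field types below have built-in encoders on 2.2/2.3+ (per the update above)
    case class Event(id: Int, tags: Set[String], counts: Map[String, Long], at: Timestamp, price: BigDecimal)

    val spark = SparkSession.builder().master("local[*]").appName("encoders").getOrCreate()
    import spark.implicits._

    val ds = Seq(Event(1, Set("a", "b"), Map("clicks" -> 10L), new Timestamp(0L), BigDecimal("9.99"))).toDS()
    ds.printSchema()  // id: integer, tags: array<string>, counts: map, at: timestamp, price: decimal(38,18)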


    Unfortunately, virtually nothing has been added to help with this. Searching for @since 2.0.0 in the encoder sources turns up mostly primitive types (and some tweaking of case classes). So the first thing to say is: there is currently no really good support for custom class encoders. With that out of the way, what follows are some tricks that do as good a job as we can hope for with what is currently at our disposal. As an up-front disclaimer: this will not work perfectly, and I will do my best to make all limitations clear and explicit.

    What exactly is the problem?

    When you want to create a Dataset, Spark "requires an encoder (to convert a JVM object of type T to and from the internal Spark SQL representation) that is generally created automatically through implicits from a SparkSession, or can be created explicitly by calling static methods on Encoders" (taken from the docs). An encoder takes the form Encoder[T], where T is the type you are encoding. The first suggestion is to add import spark.implicits._ (which gives you these implicit encoders), and the second is to pass the encoder in explicitly using the set of encoder-related functions.

    There is no encoder available for regular classes, so

    import spark.implicits._
    class MyObj(val i: Int)
    // ...
    val d = spark.createDataset(Seq(new MyObj(1),new MyObj(2),new MyObj(3)))
    
    will give you the following implicit-related compile-time error:

    Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases
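
    One way around exactly this error, sketching the second suggestion above (this is not part of the original answer): pass a generic encoder explicitly instead of relying on implicits, at the cost of the object being stored as a single binary column:

    import org.apache.spark.sql.{Encoders, SparkSession}

    class MyObj(val i: Int)

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Pass a kryo-based encoder explicitly; MyObj is serialized into one binary column.
    val d = spark.createDataset(Seq(new MyObj(1), new MyObj(2), new MyObj(3)))(Encoders.kryo[MyObj])
    // d: org.apache.spark.sql.Dataset[MyObj] = [value: binary]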

    However, if you wrap whatever type you just used to get the above error in some class that extends Product, the error is confusingly delayed until runtime (see the Wrap example further below). One partial workaround is to convert between the custom class and a tuple of types that can be encoded, for example:
    class MyObj(val i: Int, val u: java.util.UUID, val s: Set[String])
    
    // alias for the type to convert to and from
    type MyObjEncoded = (Int, String, Set[String])
    
    // implicit conversions
    implicit def toEncoded(o: MyObj): MyObjEncoded = (o.i, o.u.toString, o.s)
    implicit def fromEncoded(e: MyObjEncoded): MyObj =
      new MyObj(e._1, java.util.UUID.fromString(e._2), e._3)
    
    val d = spark.createDataset(Seq[MyObjEncoded](
      new MyObj(1, java.util.UUID.randomUUID, Set("foo")),
      new MyObj(2, java.util.UUID.randomUUID, Set("bar"))
    )).toDF("i","u","s").as[MyObjEncoded]
    
    d.printSchema
    // root
    //  |-- i: integer (nullable = false)
    //  |-- u: string (nullable = true)
    //  |-- s: binary (nullable = true)
    
    // If the target class is a Java bean (MyClass here), Encoders.bean can build an encoder for it,
    // which can then be used to map an existing DataFrame onto that class:
    import spark.sqlContext.implicits._
    import org.apache.spark.sql.Encoders
    implicit val encoder = Encoders.bean[MyClass](classOf[MyClass])
    
    dataFrame.as[MyClass]
    
    public class Fruit implements Serializable {
        private String name  = "default-fruit";
        private String color = "default-color";
    
        // AllArgsConstructor
        public Fruit(String name, String color) {
            this.name  = name;
            this.color = color;
        }
    
        // NoArgsConstructor
        public Fruit() {
            this("default-fruit", "default-color");
        }
    
        // ...create getters and setters for above fields
        // you figure it out
    }
    
    SparkSession spark = SparkSession.builder().getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext();
    
    List<Fruit> fruitList = ImmutableList.of(
        new Fruit("apple", "red"),
        new Fruit("orange", "orange"),
        new Fruit("grape", "purple"));
    JavaRDD<Fruit> fruitJavaRDD = jsc.parallelize(fruitList);
    
    
    RDD<Fruit> fruitRDD = fruitJavaRDD.rdd();
    Encoder<Fruit> fruitBean = Encoders.bean(Fruit.class);
    Dataset<Fruit> fruitDataset = spark.createDataset(fruitRDD, fruitBean);
    
    trait CustomEnum { def value:String }
    case object Foo extends CustomEnum  { val value = "F" }
    case object Bar extends CustomEnum  { val value = "B" }
    object CustomEnum {
      def fromString(str:String) = Seq(Foo, Bar).find(_.value == str).get
    }
    
    // First define a UDT class for it:
    class CustomEnumUDT extends UserDefinedType[CustomEnum] {
      override def sqlType: DataType = org.apache.spark.sql.types.StringType
      override def serialize(obj: CustomEnum): Any = org.apache.spark.unsafe.types.UTF8String.fromString(obj.value)
      // Note that this will be a UTF8String type
      override def deserialize(datum: Any): CustomEnum = CustomEnum.fromString(datum.toString)
      override def userClass: Class[CustomEnum] = classOf[CustomEnum]
    }
    
    // Then Register the UDT Class!
    // NOTE: you have to put this file into the org.apache.spark package!
    UDTRegistration.register(classOf[CustomEnum].getName, classOf[CustomEnumUDT].getName)
    
    case class UsingCustomEnum(id:Int, en:CustomEnum)
    
    val seq = Seq(
      UsingCustomEnum(1, Foo),
      UsingCustomEnum(2, Bar),
      UsingCustomEnum(3, Foo)
    ).toDS()
    seq.filter(_.en == Foo).show()
    println(seq.collect())
    
    trait CustomPoly
    case class FooPoly(id:Int) extends CustomPoly
    case class BarPoly(value:String, secondValue:Long) extends CustomPoly
    
    case class UsingPoly(id:Int, poly:CustomPoly)
    
    val polySeq = Seq(
      UsingPoly(1, new FooPoly(1)),
      UsingPoly(2, new BarPoly("Blah", 123)),
      UsingPoly(3, new FooPoly(1))
    ).toDS
    
    polySeq.filter(_.poly match {
      case FooPoly(value) => value == 1
      case _ => false
    }).show()
    
    import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.sql.types._
    
    class CustomPolyUDT extends UserDefinedType[CustomPoly] {
      val kryo = new Kryo()  // note: not actually used below; serialize/deserialize use plain Java serialization
    
      override def sqlType: DataType = org.apache.spark.sql.types.BinaryType
      override def serialize(obj: CustomPoly): Any = {
        val bos = new ByteArrayOutputStream()
        val oos = new ObjectOutputStream(bos)
        oos.writeObject(obj)
    
        bos.toByteArray
      }
      override def deserialize(datum: Any): CustomPoly = {
        val bis = new ByteArrayInputStream(datum.asInstanceOf[Array[Byte]])
        val ois = new ObjectInputStream(bis)
        val obj = ois.readObject()
        obj.asInstanceOf[CustomPoly]
      }
    
      override def userClass: Class[CustomPoly] = classOf[CustomPoly]
    }
    
    // NOTE: The file you do this in has to be inside of the org.apache.spark package!
    UDTRegistration.register(classOf[CustomPoly].getName, classOf[CustomPolyUDT].getName)
    
    // As shown above:
    case class UsingPoly(id:Int, poly:CustomPoly)
    
    val polySeq = Seq(
      UsingPoly(1, new FooPoly(1)),
      UsingPoly(2, new BarPoly("Blah", 123)),
      UsingPoly(3, new FooPoly(1))
    ).toDS
    
    polySeq.filter(_.poly match {
      case FooPoly(value) => value == 1
      case _ => false
    }).show()
    
    // Expose the Breeze vector's constructor parameters so that Spark's reflection-based
    // encoder machinery can handle it (DefinedByConstructorParams lives in org.apache.spark.sql.catalyst)
    import org.apache.spark.sql.catalyst.DefinedByConstructorParams
    
    class SerializableDenseVector(values: Array[Double]) extends breeze.linalg.DenseVector[Double](values) with DefinedByConstructorParams
    implicit def BreezeVectorToSerializable(bv: breeze.linalg.DenseVector[Double]): SerializableDenseVector = bv.asInstanceOf[SerializableDenseVector]
    
    import spark.implicits._
    case class Wrap[T](unwrap: T)
    class MyObj(val i: Int)
    // ...
    // This compiles (Wrap is a case class, hence a Product), but the encoder is only derived at runtime...
    val d = spark.createDataset(Seq(Wrap(new MyObj(1)),Wrap(new MyObj(2)),Wrap(new MyObj(3))))
    
    // ...and providing a kryo encoder for the field type alone does not help:
    implicit val myEncoder = org.apache.spark.sql.Encoders.kryo[MyObj]
    
    // The Wrap dataset still fails at runtime, because the derived Product encoder ignores the implicit for nested fields:
    java.lang.UnsupportedOperationException: No Encoder found for MyObj
    - field (class: "MyObj", name: "unwrap")
    - root class: "Wrap"
      at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:643)
    
    // Encoding the whole wrapper with kryo, however, does work (it is stored as a single binary column):
    implicit val myWrapperEncoder = org.apache.spark.sql.Encoders.kryo[Wrap[MyObj]]