Scala: How to store custom objects in a Dataset?
According to the Spark 2.0 announcement:

As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically: ... Custom encoders – while we currently autogenerate encoders for a wide variety of types, we'd like to open up an API for custom objects

and trying to store a custom type in a Dataset leads to an error like:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases
or:
java.lang.UnsupportedOperationException: No Encoder found for ...
Are there any existing workarounds?
Note: this question exists only as an entry point for a Community Wiki answer. Feel free to update / improve both the question and the answers.
Using generic encoders
There are two generic encoders available for now, the latter of which is explicitly described as:

extremely inefficient and should only be used as the last resort
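The two encoders in question are Encoders.kryo and Encoders.javaSerialization (the latter is the one quoted above). As a minimal sketch, with Foo standing in for an arbitrary class:

import org.apache.spark.sql.{Encoder, Encoders}

class Foo(val i: Int)

val kryoFooEncoder: Encoder[Foo] = Encoders.kryo[Foo]               // kryo-serialized binary
val javaFooEncoder: Encoder[Foo] = Encoders.javaSerialization[Foo]  // Java-serialized binary (slow)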
Given the following class:
class Bar(i: Int) {
override def toString = s"bar $i"
def bar = i
}
you can use these encoders by adding an implicit encoder:
object BarEncoders {
implicit def barEncoder: org.apache.spark.sql.Encoder[Bar] =
org.apache.spark.sql.Encoders.kryo[Bar]
}
which can be used together as follows:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object Main {
def main(args: Array[String]) {
val sc = new SparkContext("local", "test", new SparkConf())
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
import BarEncoders._
val ds = Seq(new Bar(1)).toDS
ds.show
sc.stop()
}
}
It stores objects as a binary column, so when converted to a DataFrame you get the following schema:
root
|-- value: binary (nullable = true)
For a specific field it is also possible to encode tuples using the kryo encoder:
val longBarEncoder = Encoders.tuple(Encoders.scalaLong, Encoders.kryo[Bar])
spark.createDataset(Seq((1L, new Bar(1))))(longBarEncoder)
// org.apache.spark.sql.Dataset[(Long, Bar)] = [_1: bigint, _2: binary]
Please note that we don't depend on implicit encoders here but pass the encoder explicitly, so this most likely won't work with the toDS method.
Using implicit conversions:
Provide implicit conversions between a representation that can be encoded and the custom class, for example:
object BarConversions {
implicit def toInt(bar: Bar): Int = bar.bar
implicit def toBar(i: Int): Bar = new Bar(i)
}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

object Main {
def main(args: Array[String]) {
val sc = new SparkContext("local", "test", new SparkConf())
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
import BarConversions._
type EncodedBar = Int
val bars: RDD[EncodedBar] = sc.parallelize(Seq(new Bar(1)))
val barsDS = bars.toDS
barsDS.show
barsDS.map(_.bar).show
sc.stop()
}
}
Encoders work pretty much the same in Spark 2.0, and kryo is still the recommended serialization choice.

You can look at the following example with the spark-shell:
scala> import spark.implicits._
import spark.implicits._
scala> import org.apache.spark.sql.Encoders
import org.apache.spark.sql.Encoders
scala> case class NormalPerson(name: String, age: Int) {
| def aboutMe = s"I am ${name}. I am ${age} years old."
| }
defined class NormalPerson
scala> case class ReversePerson(name: Int, age: String) {
| def aboutMe = s"I am ${name}. I am ${age} years old."
| }
defined class ReversePerson
scala> val normalPersons = Seq(
| NormalPerson("Superman", 25),
| NormalPerson("Spiderman", 17),
| NormalPerson("Ironman", 29)
| )
normalPersons: Seq[NormalPerson] = List(NormalPerson(Superman,25), NormalPerson(Spiderman,17), NormalPerson(Ironman,29))
scala> val ds1 = sc.parallelize(normalPersons).toDS
ds1: org.apache.spark.sql.Dataset[NormalPerson] = [name: string, age: int]
scala> val ds2 = ds1.map(np => ReversePerson(np.age, np.name))
ds2: org.apache.spark.sql.Dataset[ReversePerson] = [name: int, age: string]
scala> ds1.show()
+---------+---+
| name|age|
+---------+---+
| Superman| 25|
|Spiderman| 17|
| Ironman| 29|
+---------+---+
scala> ds2.show()
+----+---------+
|name| age|
+----+---------+
| 25| Superman|
| 17|Spiderman|
| 29| Ironman|
+----+---------+
scala> ds1.foreach(p => println(p.aboutMe))
I am Ironman. I am 29 years old.
I am Superman. I am 25 years old.
I am Spiderman. I am 17 years old.
scala> val ds2 = ds1.map(np => ReversePerson(np.age, np.name))
ds2: org.apache.spark.sql.Dataset[ReversePerson] = [name: int, age: string]
scala> ds2.foreach(p => println(p.aboutMe))
I am 17. I am Spiderman years old.
I am 25. I am Superman years old.
I am 29. I am Ironman years old.
So far there were no appropriate encoders in the present scope, so our persons were not encoded as binary values. But that will change once we provide some implicit encoders using kryo serialization:
// Provide Encoders
scala> implicit val normalPersonKryoEncoder = Encoders.kryo[NormalPerson]
normalPersonKryoEncoder: org.apache.spark.sql.Encoder[NormalPerson] = class[value[0]: binary]
scala> implicit val reversePersonKryoEncoder = Encoders.kryo[ReversePerson]
reversePersonKryoEncoder: org.apache.spark.sql.Encoder[ReversePerson] = class[value[0]: binary]
// Encoders will be used since they are now present in scope
scala> val ds3 = sc.parallelize(normalPersons).toDS
ds3: org.apache.spark.sql.Dataset[NormalPerson] = [value: binary]
scala> val ds4 = ds3.map(np => ReversePerson(np.age, np.name))
ds4: org.apache.spark.sql.Dataset[ReversePerson] = [value: binary]
// now all our persons show up as binary values
scala> ds3.show()
+--------------------+
| value|
+--------------------+
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
+--------------------+
scala> ds4.show()
+--------------------+
| value|
+--------------------+
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
+--------------------+
// Our instances still work as expected
scala> ds3.foreach(p => println(p.aboutMe))
I am Ironman. I am 29 years old.
I am Spiderman. I am 17 years old.
I am Superman. I am 25 years old.
scala> ds4.foreach(p => println(p.aboutMe))
I am 25. I am Superman years old.
I am 29. I am Ironman years old.
I am 17. I am Spiderman years old.
Update
This answer is still valid and informative, although things have improved since 2.2/2.3, which added built-in encoder support for Set, Seq, Map, Date, Timestamp, and BigDecimal. If you stick to making types with only case classes and the usual Scala types, you should be fine with just the implicits in SQLImplicits.
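A minimal sketch of what that looks like on Spark 2.3+, assuming a SparkSession named spark (the Event case class is made up for illustration):

import spark.implicits._

case class Event(name: String, tags: Set[String], counts: Map[String, Long])

val events = Seq(
  Event("a", Set("x"), Map("clicks" -> 1L)),
  Event("b", Set("y", "z"), Map("clicks" -> 2L))
).toDS()

events.printSchema()
// name: string, tags: array<string>, counts: map<string,bigint>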
Unfortunately, almost nothing has been added to help with this. Searching for @since 2.0.0 in the encoder-related sources turns up mostly things to do with primitive types (and some tweaking of case classes). So, the first thing to say: there currently is no really good support for custom class encoders. With that out of the way, what follows are some tricks that do as good a job as we can hope for, given what we currently have at our disposal. As an upfront disclaimer: this won't work perfectly, and I'll do my best to make all limitations clear and upfront.
What exactly is the problem?
When you want to make a Dataset, Spark "requires an encoder (to convert a JVM object of type T to and from the internal Spark SQL representation) that is generally created automatically through implicits from a SparkSession, or can be created explicitly by calling static methods on Encoders" (quoted from the API docs). An encoder takes the form Encoder[T], where T is the type you are encoding. The first suggestion is to add import spark.implicits._ (which gives you these implicit encoders), and the second suggestion is to explicitly pass in the implicit encoder using the encoder-related factory functions.
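A quick sketch of those two routes (the values are arbitrary and only serve as illustration):

import org.apache.spark.sql.Encoders
import spark.implicits._

// 1. implicitly: Encoder[Int] comes from spark.implicits._
val ints = spark.createDataset(Seq(1, 2, 3))

// 2. explicitly: pass an encoder built from the static factory methods on Encoders
val strings = spark.createDataset(Seq("a", "b", "c"))(Encoders.STRING)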
There is no encoder available for regular classes, so
import spark.implicits._
class MyObj(val i: Int)
// ...
val d = spark.createDataset(Seq(new MyObj(1),new MyObj(2),new MyObj(3)))
will give you the following implicit-related compile-time error:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases
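A minimal sketch of the most direct way out, which is the same kryo trick shown in the first answer: bring an implicit kryo encoder for MyObj into scope and the Dataset can be created, stored as a single binary column.

import org.apache.spark.sql.Encoders

implicit val myObjEncoder = Encoders.kryo[MyObj]
val d = spark.createDataset(Seq(new MyObj(1), new MyObj(2), new MyObj(3)))
// d: org.apache.spark.sql.Dataset[MyObj] = [value: binary]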
If, on the other hand, you wrap whatever type you just used to get the above error in some class that extends Product, the error confusingly gets delayed to runtime; the Wrap snippet near the end of this page shows exactly that. A partially working workaround is to convert the custom class to and from a representation Spark can already encode, for example:
class MyObj(val i: Int, val u: java.util.UUID, val s: Set[String])
// alias for the type to convert to and from
type MyObjEncoded = (Int, String, Set[String])
// implicit conversions
implicit def toEncoded(o: MyObj): MyObjEncoded = (o.i, o.u.toString, o.s)
implicit def fromEncoded(e: MyObjEncoded): MyObj =
new MyObj(e._1, java.util.UUID.fromString(e._2), e._3)
val d = spark.createDataset(Seq[MyObjEncoded](
new MyObj(1, java.util.UUID.randomUUID, Set("foo")),
new MyObj(2, java.util.UUID.randomUUID, Set("bar"))
)).toDF("i","u","s").as[MyObjEncoded]
d.printSchema
// root
// |-- i: integer (nullable = false)
// |-- u: string (nullable = true)
// |-- s: binary (nullable = true)
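// (the binary type of the s column assumes an implicit kryo fallback encoder for
//  Set[String] is in scope; with the built-in collection encoders of newer Spark
//  versions it would show up as an array column instead)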
// For classes following the Java-bean conventions, Encoders.bean can be used instead:
import spark.sqlContext.implicits._
import org.apache.spark.sql.Encoders

implicit val encoder = Encoders.bean[MyClass](classOf[MyClass])
dataFrame.as[MyClass]
public class Fruit implements Serializable {
private String name = "default-fruit";
private String color = "default-color";
// AllArgsConstructor
public Fruit(String name, String color) {
this.name = name;
this.color = color;
}
// NoArgsConstructor
public Fruit() {
this("default-fruit", "default-color");
}
// ...create getters and setters for above fields
// you figure it out
}
SparkSession spark = SparkSession.builder().getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
List<Fruit> fruitList = ImmutableList.of(
new Fruit("apple", "red"),
new Fruit("orange", "orange"),
new Fruit("grape", "purple"));
JavaRDD<Fruit> fruitJavaRDD = jsc.parallelize(fruitList);
RDD<Fruit> fruitRDD = fruitJavaRDD.rdd();
Encoder<Fruit> fruitBean = Encoders.bean(Fruit.class);
Dataset<Fruit> fruitDataset = spark.createDataset(fruitRDD, fruitBean);
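// From a different answer: for enum-like types you can define a UserDefinedType (UDT)
// and register it with UDTRegistration. UDTRegistration is an internal API, which is why
// the note below says the registration code has to live in the org.apache.spark package.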
trait CustomEnum { def value:String }
case object Foo extends CustomEnum { val value = "F" }
case object Bar extends CustomEnum { val value = "B" }
object CustomEnum {
def fromString(str:String) = Seq(Foo, Bar).find(_.value == str).get
}
// First define a UDT class for it:
class CustomEnumUDT extends UserDefinedType[CustomEnum] {
override def sqlType: DataType = org.apache.spark.sql.types.StringType
override def serialize(obj: CustomEnum): Any = org.apache.spark.unsafe.types.UTF8String.fromString(obj.value)
// Note that this will be a UTF8String type
override def deserialize(datum: Any): CustomEnum = CustomEnum.fromString(datum.toString)
override def userClass: Class[CustomEnum] = classOf[CustomEnum]
}
// Then Register the UDT Class!
// NOTE: you have to put this file into the org.apache.spark package!
UDTRegistration.register(classOf[CustomEnum].getName, classOf[CustomEnumUDT].getName)
case class UsingCustomEnum(id:Int, en:CustomEnum)
val seq = Seq(
UsingCustomEnum(1, Foo),
UsingCustomEnum(2, Bar),
UsingCustomEnum(3, Foo)
).toDS()
seq.filter(_.en == Foo).show()
println(seq.collect())
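// The same UDT trick extends to polymorphic types, i.e. a trait with several case class
// implementations. The goal is for the following to work: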
trait CustomPoly
case class FooPoly(id:Int) extends CustomPoly
case class BarPoly(value:String, secondValue:Long) extends CustomPoly
case class UsingPoly(id:Int, poly:CustomPoly)
val polySeq = Seq(
UsingPoly(1, new FooPoly(1)),
UsingPoly(2, new BarPoly("Blah", 123)),
UsingPoly(3, new FooPoly(1))
).toDS
polySeq.filter(_.poly match {
case FooPoly(value) => value == 1
case _ => false
}).show()
// The UDT that makes this work. Despite the Kryo field below, this implementation
// actually serializes values with plain Java serialization (ObjectOutputStream):
class CustomPolyUDT extends UserDefinedType[CustomPoly] {
val kryo = new Kryo() // unused below
override def sqlType: DataType = org.apache.spark.sql.types.BinaryType
override def serialize(obj: CustomPoly): Any = {
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(obj)
bos.toByteArray
}
override def deserialize(datum: Any): CustomPoly = {
val bis = new ByteArrayInputStream(datum.asInstanceOf[Array[Byte]])
val ois = new ObjectInputStream(bis)
val obj = ois.readObject()
obj.asInstanceOf[CustomPoly]
}
override def userClass: Class[CustomPoly] = classOf[CustomPoly]
}
// NOTE: The file you do this in has to be inside of the org.apache.spark package!
UDTRegistration.register(classOf[CustomPoly].getName, classOf[CustomPolyUDT].getName)
// As shown above:
case class UsingPoly(id:Int, poly:CustomPoly)
val polySeq = Seq(
UsingPoly(1, new FooPoly(1)),
UsingPoly(2, new BarPoly("Blah", 123)),
UsingPoly(3, new FooPoly(1))
).toDS
polySeq.filter(_.poly match {
case FooPoly(value) => value == 1
case _ => false
}).show()
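// Another answer's trick, specific to breeze.linalg.DenseVector: expose it through a
// subclass that mixes in DefinedByConstructorParams, so Spark's reflection-based encoder
// machinery accepts it, together with an implicit conversion (a cast) into that subclass: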
class SerializableDenseVector(values: Array[Double]) extends breeze.linalg.DenseVector[Double](values) with DefinedByConstructorParams
implicit def BreezeVectorToSerializable(bv: breeze.linalg.DenseVector[Double]): SerializableDenseVector = bv.asInstanceOf[SerializableDenseVector]
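// Finally, the runtime failure mentioned under "What exactly is the problem?": wrapping the
// un-encodable class in a generic Product (case class Wrap) compiles, but blows up once the
// encoder is actually derived at runtime: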
import spark.implicits._
case class Wrap[T](unwrap: T)
class MyObj(val i: Int)
// ...
val d = spark.createDataset(Seq(Wrap(new MyObj(1)),Wrap(new MyObj(2)),Wrap(new MyObj(3))))
// an implicit kryo encoder for the field type alone does not help here:
implicit val myEncoder = org.apache.spark.sql.Encoders.kryo[MyObj]
java.lang.UnsupportedOperationException: No Encoder found for MyObj
- field (class: "MyObj", name: "unwrap")
- root class: "Wrap"
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:643)
// it is the wrapper as a whole that needs its own encoder, e.g. a kryo-based one:
implicit val myWrapperEncoder = org.apache.spark.sql.Encoders.kryo[Wrap[MyObj]]