How to convert a complex Java object to a Spark DataFrame
I am using Spark with Java, and below is my code:
JavaRDD<MyComplexEntity> myObjectJavaRDD = resultJavaRDD.flatMap(result -> result.getMyObjects());
DataFrame df = sqlContext.createDataFrame(myObjectJavaRDD, MyComplexEntity.class);
df.saveAsParquetFile("s3a://mybucket/test.parquet");
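For context: createDataFrame(rdd, MyComplexEntity.class) infers the schema via JavaBean reflection, so a frequent cause of failures at that step is a class that is not a proper Serializable bean. A minimal sketch of the shape Spark expects (the field names here are invented for illustration, since MyComplexEntity's definition is not shown):

```java
import java.io.Serializable;
import java.util.ArrayList;

// Hypothetical sketch of MyComplexEntity as a JavaBean; the real fields are
// not shown in the question, so "notes" and "secondaryId" are assumptions.
// In real code the class must be public and top-level so Spark's bean
// reflection can see it, and every field needs a getter/setter pair.
class MyComplexEntity implements Serializable {
    private String notes;
    private ArrayList<String> secondaryId;

    public MyComplexEntity() {}  // no-arg constructor required for bean reflection

    public String getNotes() { return notes; }
    public void setNotes(String notes) { this.notes = notes; }

    public ArrayList<String> getSecondaryId() { return secondaryId; }
    public void setSecondaryId(ArrayList<String> secondaryId) { this.secondaryId = secondaryId; }
}
```

Note that any nested complex field must itself be a bean Spark can reflect over, which is exactly where the answer's Scala case classes come in.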
The problem is that I fail at step 2, when creating the DataFrame from myObjectJavaRDD. How can I convert a list of complex Java objects to a DataFrame? Thanks.

Anyway, can you convert it to Scala? Scala supports this case with case classes. For your case the challenge is that you have a Seq/Array of an inner case class, e.g. private java.util.ArrayList secondaryId, so it can be done as below:
// inner case class Identifier
case class Identifier(Id : Integer , uuid : String)
val innerVal = Seq(Identifier(1,"gsgsg"),Identifier(2,"dvggwgwg"))
// Outer case class MyComplexEntity
case class MyComplexEntity(notes : String, identifierArray : Seq[Identifier])
val outerVal = MyComplexEntity("Hello", innerVal)
Please note => outerVal: MyComplexEntity contains a list of Identifier objects, like below:
outerVal: MyComplexEntity = MyComplexEntity(Hello,List(Identifier(1,gsgsg), Identifier(2,dvggwgwg)))
Now, using Datasets:
import spark.implicits._
// Convert Our Input Data in Same Structure as your MyComplexEntity
// Only Trick is To 'Reflect' A Seq[(Int,String)] => Seq[Identifier]
// Hence we have to do 2 Mapping once for Outer Case class (MyComplexEntity) And Once For Inner Seq of Identifier
// If We Just Take this Input Data and Convert To DataSet ( without any Schema Inference)
// This is How It looks
val inputData = Seq(("Some DAY",Seq((210,"wert67"),(310,"bill123"))),
("I WILL BE", Seq((420,"henry678"),(1000,"baba123"))),
("Saturday Night",Seq((1000,"Roger123"),(2000,"God345")))
)
val unMappedDs = inputData.toDS
which gives us =>
// See how it is Infered
// unMappedDs: org.apache.spark.sql.Dataset[(String, Seq[(Int, String)])] = [_1: string, _2: array<struct<_1:int,_2:string>>]
Mapping the tuples into the case classes:

val resultDs = unMappedDs.map(x => MyComplexEntity(x._1, x._2.map(y => Identifier(y._1, y._2))))

gives us a structure like =>

resultDs: org.apache.spark.sql.Dataset[MyComplexEntity] = [notes: string, identifierArray: array<struct<Id:int,uuid:string>>]
And the data looks like:
+--------------+--------------------------------+
|notes |identifierArray |
+--------------+--------------------------------+
|Some DAY |[[210,wert67], [310,bill123]] |
|I WILL BE |[[420,henry678], [1000,baba123]]|
|Saturday Night|[[1000,Roger123], [2000,God345]]|
+--------------+--------------------------------+
Using Scala it is easy.
Thanks. Thank you very much for your detailed answer, @SanBan. However, I don't get to choose which route to take; I am building on top of legacy code written in Java. Is there any way I can do this in Java, or is there no way as far as you know? Thanks again. Just use map(x => MyComplexEntity(x._1, x._2.map(y => Identifier(y._1, y._2)))) in Java syntax.
If you like the answer, please upvote :-)
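That map can be sketched in plain Java to show the shape of the transformation, without needing a Spark cluster. The POJOs Identifier and MyComplexEntity below are hypothetical stand-ins for the Scala case classes above; in real Spark code the same lambda would go into a Dataset.map with a bean encoder:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.stream.Collectors;

class MapToComplexEntity {
    // Hypothetical POJO stand-ins for the Scala case classes above.
    static class Identifier {
        final int id;
        final String uuid;
        Identifier(int id, String uuid) { this.id = id; this.uuid = uuid; }
    }

    static class MyComplexEntity {
        final String notes;
        final List<Identifier> identifierArray;
        MyComplexEntity(String notes, List<Identifier> identifierArray) {
            this.notes = notes;
            this.identifierArray = identifierArray;
        }
    }

    // Java version of: map(x => MyComplexEntity(x._1, x._2.map(y => Identifier(y._1, y._2))))
    static MyComplexEntity toEntity(String notes, List<SimpleEntry<Integer, String>> rawIds) {
        List<Identifier> ids = rawIds.stream()
                .map(y -> new Identifier(y.getKey(), y.getValue()))
                .collect(Collectors.toList());
        return new MyComplexEntity(notes, ids);
    }
}
```

The outer map runs per row and the inner stream plays the role of the inner Scala map over the Seq of pairs.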