Scala Spark SQL DataFrame

scala apache-spark apache-spark-sql

Data structure:

{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}

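Now I want to load the data into a DataFrame and append Zip to loc. The loc column name should stay the same (loc). The transformed data should look like this: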

{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}

No RDDs. I need a DataFrame operation to achieve this, preferably using the withColumn function. How can I do this?

Given the data structure as

val jsonString = """{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}"""

you can convert it to a DataFrame as

val df = spark.read.json(sc.parallelize(jsonString::Nil))

which would give you

+-----------------------------------------------------+
|Emp                                                  |
+-----------------------------------------------------+
|[WrappedArray([222,Sanjose], [333,dayton]),John,2000]|
+-----------------------------------------------------+

//root
// |-- Emp: struct (nullable = true)
// |    |-- Address: array (nullable = true)
// |    |    |-- element: struct (containsNull = true)
// |    |    |    |-- Zip: string (nullable = true)
// |    |    |    |-- loc: string (nullable = true)
// |    |-- Name: string (nullable = true)
// |    |-- Sal: string (nullable = true)
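
Since the question asks to avoid RDDs, the same DataFrame can also be built from a Dataset[String]. The sketch below assumes Spark 2.2 or later (where the RDD[String] overload of json is deprecated) and a spark-shell session in which spark and jsonString from above are in scope; the name dfFromDs is only illustrative.

```scala
import spark.implicits._ // provides .toDS on local collections

// Wrap the single JSON document in a Dataset[String] instead of an RDD[String].
val dfFromDs = spark.read.json(Seq(jsonString).toDS)

// The inferred schema should match the one printed above.
dfFromDs.printSchema()
```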

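Now, to get the desired output, you need to separate the struct column Emp into separate columns and pass the Address array column to a udf function that builds the desired result.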
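
A minimal sketch of that step, assuming a spark-shell session (Spark 2.x or 3.x built for Scala 2.12) where spark and df from the snippets above are in scope. The names attachZipWithLoc and result are illustrative choices, and the address case class is repeated here only so the snippet can be pasted on its own.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{struct, udf}
import spark.implicits._ // enables the $"column" syntax

// One element of the rebuilt Address array; field order here fixes the output order.
case class address(loc: String, Zip: String)

// For every struct in the array, append Zip to loc and keep Zip unchanged.
val attachZipWithLoc = udf((addresses: Seq[Row]) =>
  addresses.map { row =>
    val zip = row.getAs[String]("Zip")
    address(row.getAs[String]("loc") + zip, zip)
  })

val result = df
  .select($"Emp.*")                                       // flatten Emp into Address, Name, Sal
  .withColumn("Address", attachZipWithLoc($"Address"))    // rewrite the array with the udf
  .select(struct($"Name", $"Sal", $"Address").as("Emp"))  // pack the columns back into Emp

result.show(false)
result.printSchema()
```

The final select rebuilds the struct explicitly, which is why the field order of Emp changes to Name, Sal, Address.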

where address, used inside the udf, is a case class:

case class address(loc: String, Zip: String)

which should give you

+-----------------------------------------------------------+
|Emp                                                        |
+-----------------------------------------------------------+
|[John,2000,WrappedArray([Sanjose222,222], [dayton333,333])]|
+-----------------------------------------------------------+

//root
// |-- Emp: struct (nullable = false)
// |    |-- Name: string (nullable = true)
// |    |-- Sal: string (nullable = true)
// |    |-- Address: array (nullable = true)
// |    |    |-- element: struct (containsNull = true)
// |    |    |    |-- loc: string (nullable = true)
// |    |    |    |-- Zip: string (nullable = true)

Now, to get the JSON, just call .toJSON on the transformed DataFrame, and you should get

+-----------------------------------------------------------------------------------------------------------------+
|value                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------+
|{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}|
+-----------------------------------------------------------------------------------------------------------------+
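
As an alternative to the udf, newer Spark versions can express the same change with built-in functions inside a single withColumn call. This is only a sketch under the assumption of Spark 3.1 or later (transform with a Scala lambda was added in 3.0, Column.withField in 3.1) and a spark-shell session where df from above is in scope; resultNoUdf is an illustrative name. Because withField replaces fields in place, the struct fields keep the order inferred from the JSON rather than the order shown above.

```scala
import org.apache.spark.sql.functions.{col, concat, transform}

val resultNoUdf = df.withColumn(
  "Emp",
  col("Emp").withField(
    "Address",
    // For each element of Emp.Address, overwrite loc with loc followed by Zip.
    transform(
      col("Emp.Address"),
      a => a.withField("loc", concat(a.getField("loc"), a.getField("Zip")))
    )
  )
)

resultNoUdf.toJSON.show(false)
```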

Please post what you have tried.