Scala sparksql数据帧
数据结构:Scala sparksql数据帧,scala,apache-spark,apache-spark-sql,Scala,Apache Spark,Apache Spark Sql,数据结构: {"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}} 现在,我想将数据加载到一个数据帧中,并想将zip附加到loc。loc列名称应相同(loc)。转换后的数据应如下所示: {"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose222","Zip":
{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}
现在,我想将数据加载到一个数据帧中,并想将zip附加到loc。loc列名称应相同(loc)。转换后的数据应如下所示:
{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}
没有RDD。我需要一个数据帧操作来实现这一点,最好使用
withColumn
功能。如何执行此操作?给定如下数据结构
val jsonString = """{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}"""
您可以根据需要将其转换为数据帧
val df = spark.read.json(sc.parallelize(jsonString::Nil))
那会给你什么
+-----------------------------------------------------+
|Emp |
+-----------------------------------------------------+
|[WrappedArray([222,Sanjose], [333,dayton]),John,2000]|
+-----------------------------------------------------+
//root
// |-- Emp: struct (nullable = true)
// | |-- Address: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- Zip: string (nullable = true)
// | | | |-- loc: string (nullable = true)
// | |-- Name: string (nullable = true)
// | |-- Sal: string (nullable = true)
+-----------------------------------------------------------+
|Emp |
+-----------------------------------------------------------+
|[John,2000,WrappedArray([Sanjose222,222], [dayton333,333])]|
+-----------------------------------------------------------+
//root
// |-- Emp: struct (nullable = false)
// | |-- Name: string (nullable = true)
// | |-- Sal: string (nullable = true)
// | |-- Address: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- loc: string (nullable = true)
// | | | |-- Zip: string (nullable = true)
现在,要获得所需的输出,您需要将struct Emp column分隔为单独的列,并在udf函数中使用Address array column来获得所需的结果
其中udf
类中的address
是一个案例类
case class address(loc: String, Zip: String)
应该给你什么
+-----------------------------------------------------+
|Emp |
+-----------------------------------------------------+
|[WrappedArray([222,Sanjose], [333,dayton]),John,2000]|
+-----------------------------------------------------+
//root
// |-- Emp: struct (nullable = true)
// | |-- Address: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- Zip: string (nullable = true)
// | | | |-- loc: string (nullable = true)
// | |-- Name: string (nullable = true)
// | |-- Sal: string (nullable = true)
+-----------------------------------------------------------+
|Emp |
+-----------------------------------------------------------+
|[John,2000,WrappedArray([Sanjose222,222], [dayton333,333])]|
+-----------------------------------------------------------+
//root
// |-- Emp: struct (nullable = false)
// | |-- Name: string (nullable = true)
// | |-- Sal: string (nullable = true)
// | |-- Address: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- loc: string (nullable = true)
// | | | |-- Zip: string (nullable = true)
现在,要获取json,只需使用.toJSON
,您应该
+-----------------------------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------------------------+
|{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}|
+-----------------------------------------------------------------------------------------------------------------+
请张贴您尝试过的内容