Spark: mapping a flat DataFrame to a configurable nested JSON schema

json scala apache-spark case-class

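I have a flat DataFrame with 5-6 columns. I want to nest them and convert it into a nested DataFrame so that I can write it out in Parquet format.

However, I don't want to use case classes, because I am trying to keep the code as configurable as possible. I am stuck on this part and need some help.

My input: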

ID ID-2 Count(apple) Count(banana) Count(potato) Count(Onion)

 1  23    1             0             2             0

 2  23    0             1             0             1

 2  29    1             0             1             0

My expected output:

Row 1:

{
  "id": 1,
  "ID-2": 23,
  "fruits": {
    "count of apple": 1,
    "count of banana": 0
  },
  "vegetables": {
    "count of potato": 2,
    "count of onion": 0
  }
}

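I have tried using the "map" function on the Spark DataFrame to map my values onto a case class. However, I will be experimenting with the field names and may well change them.

I don't want to maintain a case class and map the rows to SQL column names, because that would require a code change every time.

I was thinking of maintaining a HashMap keyed by the DataFrame column names I want to keep. In this example, I would map "Count(apple)" to "count of apple". However, I can't think of a simple way to pass my schema in as configuration and then apply it in my code.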

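:: (double colon) is the "cons" operator on Scala lists. It is how you build a Scala list or prepend an element to an existing list. Since List is immutable, prepending returns a new list rather than modifying the old one.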

scala> val aList = 24 :: 34 :: 56 :: Nil
aList: List[Int] = List(24, 34, 56)

scala> 99 :: aList
res3: List[Int] = List(99, 24, 34, 56)

In the first example, Nil is the empty list and serves as the tail of the rightmost cons operation.

However,

scala> val anotherList = 23 :: 34
<console>:12: error: value :: is not a member of Int
       val anotherList = 23 :: 34

This raises an error because the right-hand operand, 34, is an Int rather than a list, so there is no existing list to prepend to.
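To make the reason for that error concrete: operators whose names end in a colon are right-associative in Scala, so `a :: list` is a method call on the list, not on the element. The following is a small illustrative sketch; the object name `ConsDemo` and its fields are made up for this example.

```scala
object ConsDemo {
  // `a :: list` is sugar for `list.::(a)`: the method belongs to the right operand.
  val viaOperator: List[Int] = 24 :: 34 :: 56 :: Nil
  val viaMethod: List[Int] = Nil.::(56).::(34).::(24)

  // Prepending returns a new list; the original list is left untouched.
  val extended: List[Int] = 99 :: viaOperator

  def main(args: Array[String]): Unit = {
    println(viaOperator == viaMethod) // both spellings build the same list
    println(extended)
    println(viaOperator)
  }
}
```

Because `23 :: 34` would have to be resolved as `34.::(23)`, and Int has no `::` method, the compiler rejects it.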


Here is one approach that uses a Scala Map to define the column mapping, based on the following dataset:

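// toDF needs the session implicits; assumes a SparkSession named spark (as in spark-shell)
import spark.implicits._

val data = Seq(
  (1, 23, 1, 0, 2, 0),
  (2, 23, 0, 1, 0, 1),
  (2, 29, 1, 0, 1, 0)).toDF("ID", "ID-2", "count(apple)", "count(banana)", "count(potato)", "count(onion)")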

First, we declare the mapping as a scala.collection.immutable.Map, together with the function responsible for applying it:

import org.apache.spark.sql.{Column, DataFrame}

// Source column name -> new column name
val colMapping = Map(
  "count(banana)" -> "no of banana",
  "count(apple)" -> "no of apples",
  "count(potato)" -> "no of potatos",
  "count(onion)" -> "no of onions")

// Aliases every column that has an entry in colsMapping; all other columns pass through unchanged
def mapColumns(colsMapping: Map[String, String], df: DataFrame): DataFrame = {
  val mapping = df.columns
    .map { c => if (colsMapping.contains(c)) df(c).alias(colsMapping(c)) else df(c) }
    .toList

  df.select(mapping: _*)
}

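The function iterates over the columns of the given DataFrame and uses the mapping to identify the columns whose names appear as keys. It then returns a DataFrame in which those columns are renamed (aliased) according to the mapping.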
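Since the goal is to keep the renaming configurable, the map itself does not have to be hard-coded. The following is a minimal sketch, assuming a hypothetical line-based "source = target" text format; the object `MappingConfig`, the `parse` function and the sample text are illustrative names and not part of the original answer.

```scala
object MappingConfig {
  // Hypothetical config text: one "source column = target name" pair per line.
  val sample: String =
    """count(apple) = no of apples
      |count(banana) = no of banana
      |# lines starting with '#' and blank lines are skipped
      |
      |count(potato) = no of potatos""".stripMargin

  // Turns the text into a Map usable as the colsMapping argument.
  def parse(text: String): Map[String, String] =
    text.linesIterator
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .map(_.split("=", 2))
      // Lines without an '=' do not match the pattern and are dropped in this sketch.
      .collect { case Array(src, dst) => src.trim -> dst.trim }
      .toMap
}
```

The result of `MappingConfig.parse(...)` could then be passed as the `colsMapping` argument of `mapColumns`, so a rename only needs a change to the config text.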
Calling mapColumns(colMapping, data).show(false) displays the same rows under the renamed columns.
Finally, we build the fruits and vegetables fields using the struct type:
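import org.apache.spark.sql.functions.{col, struct}

// df1 is assumed to be the renamed DataFrame returned by mapColumns
val df1 = mapColumns(colMapping, data)
df1.withColumn("fruits", struct(col(colMapping("count(banana)")), col(colMapping("count(apple)"))))
  .withColumn("vegetables", struct(col(colMapping("count(potato)")), col(colMapping("count(onion)"))))
  .drop(colMapping.values.toList: _*)
  .toJSON
  .show(false)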

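Note that once the transformation is done, we drop all the columns listed in the colMapping collection, since their values now live inside the structs.
Output: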
+-----------------------------------------------------------------------------------------------------------------+
|value                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------+
|{"ID":1,"ID-2":23,"fruits":{"no of banana":0,"no of apples":1},"vegetables":{"no of potatos":2,"no of onions":0}}|
|{"ID":2,"ID-2":23,"fruits":{"no of banana":1,"no of apples":0},"vegetables":{"no of potatos":0,"no of onions":1}}|
|{"ID":2,"ID-2":29,"fruits":{"no of banana":0,"no of apples":1},"vegetables":{"no of potatos":1,"no of onions":0}}|
+-----------------------------------------------------------------------------------------------------------------+
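The struct step above still names the two groups in code. One way to make the nesting itself configurable is sketched below, assuming a hypothetical `nesting` map from each struct name to its (source column, nested field name) pairs; `NestByConfig`, `nesting` and `nest` are illustrative names, and the local SparkSession is created only for the demo.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.struct

object NestByConfig {
  // Hypothetical config: struct name -> ordered (source column -> nested field name) pairs.
  val nesting: Map[String, Seq[(String, String)]] = Map(
    "fruits" -> Seq("count(apple)" -> "count of apple", "count(banana)" -> "count of banana"),
    "vegetables" -> Seq("count(potato)" -> "count of potato", "count(onion)" -> "count of onion"))

  def nest(df: DataFrame, cfg: Map[String, Seq[(String, String)]]): DataFrame = {
    // Columns that end up inside a struct are not repeated at the top level.
    val grouped: Set[String] = cfg.values.flatten.map(_._1).toSet
    val passThrough: Seq[Column] = df.columns.filterNot(grouped.contains).map(c => df(c)).toSeq
    // Sort by struct name so the output column order is deterministic.
    val structs: Seq[Column] = cfg.toSeq.sortBy(_._1).map { case (name, fields) =>
      struct(fields.map { case (src, dst) => df(src).alias(dst) }: _*).alias(name)
    }
    df.select(passThrough ++ structs: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("nest-by-config").getOrCreate()
    import spark.implicits._
    val data = Seq((1, 23, 1, 0, 2, 0), (2, 23, 0, 1, 0, 1), (2, 29, 1, 0, 1, 0))
      .toDF("ID", "ID-2", "count(apple)", "count(banana)", "count(potato)", "count(onion)")
    nest(data, nesting).toJSON.show(false)
    spark.stop()
  }
}
```

Adding a group or renaming a field then only touches `nesting`, which could itself be loaded from a file. If the nested DataFrame is written with `write.parquet`, note that depending on the Spark version the Parquet writer may reject field names containing spaces or parentheses, so names such as `count_of_apple` may be the safer choice there.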

{"ID":"2","ID-2":"23","fruits":[{"Count(apple)":"0","Count(banana)":"1"}],"vegetables":[{"Count(potato)":"0","Count(Onion)":"1"}]}


Hi, this time I added the mapping functionality that was missing.

Thank you, I found a way to fit it into my use case!

Great, glad to hear it.