Scala select and withColumn both not working inside foldLeft


I am trying to explode a given column from a nested schema. I am trying to achieve this with a foldLeft over the DataFrame.

I only handle two cases here:

  • If the column type is struct, I try to flatten it with a select clause
  • If the column type is array, I try to explode the data with withColumn and then a select clause
  • Here is my schema:

    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._
    
    val schema = StructType(Array(
        StructField("RootData", StructType(Seq(
            StructField("Rates",ArrayType(StructType(Array(
                StructField("Code",StringType,true), 
                StructField("Rate",StringType,true), 
                StructField("Type",StringType,true), 
                StructField("TargetValue",StringType,true)))), true), 
            StructField("RecordCount",LongType,true))),true), 
        StructField("CreationDate",StringType,true), 
        StructField("SysID",StringType,true), 
        StructField("ImportID",StringType,true)))
    
    
    |-- RootData: struct (nullable = true)
    |    |-- Rates: array (nullable = true)
    |    |    |-- element: struct (containsNull = true)
    |    |    |    |-- Code: string (nullable = true)
    |    |    |    |-- Rate: string (nullable = true)
    |    |    |    |-- Type: string (nullable = true)
    |    |    |    |-- TargetValue: string (nullable = true)
    |    |-- RecordCount: long (nullable = true)
    |-- CreationDate: string (nullable = true)
    |-- SysID: string (nullable = true)
    |-- ImportID: string (nullable = true)
    
    Below is the code snippet:

     // sourceDf is a DataFrame with the nested schema above
     // exp_Cols is the list of nested columns to flatten
    def execute(sourceDf: DataFrame, exp_Cols : Array[String]) = {
        var list = Array[String]()
        val df = exp_Cols.foldLeft(sourceDf){(df, colName) =>
            if ( df.columns.contains(colName) ) {
                val typeName = df.schema( colName ).dataType.typeName
                println("typeName " + typeName)
                if ( typeName == "struct" || typeName == "array") list = list :+ colName
                if (typeName == "struct") df.selectExpr("*",  colName + ".*")
                else if (typeName == "array") df.withColumn(colName, explode(col(colName))).selectExpr("*",  colName + ".*")
                else df 
            }
            df
        }
        println(list.toList)
        df.drop(list:_*)
    }
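To see why this fold is a no-op, note that in Scala the value of a block is its last expression. In the code above, the if/else chain inside the `contains` branch computes a new DataFrame, but that result is discarded because the trailing `df` is what the block actually returns. A minimal plain-Scala sketch of the same mistake (no Spark; `accumulate` mirrors the buggy shape, `accumulateFixed` the intended one):

```scala
// Buggy: the if-expression's value is computed and thrown away,
// because the block's LAST expression is the unchanged `acc`.
def accumulate(xs: List[Int]): List[Int] =
  xs.foldLeft(List.empty[Int]) { (acc, x) =>
    if (x % 2 == 0) acc :+ x // result discarded
    acc                      // <- this is what the block returns
  }

// Fixed: the if/else chain IS the last expression of the block.
def accumulateFixed(xs: List[Int]): List[Int] =
  xs.foldLeft(List.empty[Int]) { (acc, x) =>
    if (x % 2 == 0) acc :+ x
    else acc
  }

println(accumulate(List(1, 2, 3, 4)))      // List()
println(accumulateFixed(List(1, 2, 3, 4))) // List(2, 4)
```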
    
    But when I try the following statement, it works as expected. It does the same thing I wrote with foldLeft:

     nestedDf.selectExpr("*", "RootData.*").withColumn("Rates",explode($"Rates")).selectExpr("*","Rates.*").drop("RootData", "Rates")
    
    Am I making a mistake in the approach above, or can we achieve this in a better way?

    I am using Spark 2.3.0 and Scala 2.11.

    Edit:

    Please find sample data below:

    val jsonStr = """{"RootData":{"Rates":[{"Code":"USD","Rate":"2.007500000","Type":"Common","TargetValue":"BYR"},
    {"Code":"USD","Rate":"357.300000000","Type":"Common","TargetValue":"MRO"},
    {"Code":"USD","Rate":"21005.000000000","Type":"Common","TargetValue":"STD"},
    {"Code":"USD","Rate":"248520.960000000","Type":"Common","TargetValue":"VEF"},
    {"Code":"USD","Rate":"77.850000000","Type":"Common","TargetValue":"AFN"},
    {"Code":"USD","Rate":"475.150000000","Type":"Common","TargetValue":"AMD"},
    {"Code":"USD","Rate":"250.000000000","Type":"Common","TargetValue":"YER"},
    {"Code":"USD","Rate":"15.063500000","Type":"Common","TargetValue":"ZAR"},
    {"Code":"USD","Rate":"13.291500000","Type":"Common","TargetValue":"ZMW"},
    {"Code":"USD","Rate":"1.000000000","Type":"Common","TargetValue":"USD"}
    ],"RecordCount":10}, "CreationDate":"2020-01-01","SysID":"987654321","ImportID":"123456789"}"""
    
    val nestedDf = spark.read.json(Seq(jsonStr).toDS)
    val exp_cols = Array("RootData", "Rates")
    execute(nestedDf, exp_cols)
    
    The temporary workaround I am using is below:

    def forStructTypeCol(df : DataFrame, colName: String) = df.selectExpr("*", colName +".*")
    def forArrayTypeCol(df : DataFrame, colName: String) = df.withColumn(colName, explode(col(colName))).selectExpr("*", colName +".*")
    var t_nestedDf = nestedDf
    exp_cols.foreach { colName =>
        t_nestedDf =
            if (t_nestedDf.columns.contains(colName)) {
                val typeName = t_nestedDf.schema(colName).dataType.typeName
                if (typeName == "struct") forStructTypeCol(t_nestedDf, colName)
                else if (typeName == "array") forArrayTypeCol(t_nestedDf, colName)
                else t_nestedDf
            } else t_nestedDf
    }
    val finaldf = t_nestedDf.drop(exp_cols:_*)
    

    I think your code is wrong, because you always return df rather than the df with the additional columns (you are probably missing an else clause):


    Could you provide some sample data so that we can reproduce your problem?

    @Oli: I have updated the question with sample data. I was missing the
    else
    clause; instead of keeping the else, I returned df directly after the if block, which stopped it from working. Thanks.

    The corrected version:
    def execute(sourceDf: DataFrame, exp_Cols : Array[String]) = {
        var list = Array[String]()
        val df = exp_Cols.foldLeft(sourceDf){(df, colName) =>
            if ( df.columns.contains(colName) ) {
                val typeName = df.schema( colName ).dataType.typeName
                println("typeName " + typeName)
                if ( typeName == "struct" || typeName == "array") list = list :+ colName
                if (typeName == "struct") df.selectExpr("*",  colName + ".*")
                else if (typeName == "array") df.withColumn(colName, explode(col(colName))).selectExpr("*",  colName + ".*")
                else df 
            } else {
                df
            }
        }
        println(list.toList)
        df.drop(list:_*)
    }
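As a side note, the mutable `var list` can also be avoided by threading the columns-to-drop through the fold's accumulator as a tuple. A minimal plain-Scala sketch of that pattern (a `String` stands in for the DataFrame and the `nested` set for the struct/array type check; the names are illustrative, only the accumulator shape is the point):

```scala
// Carry (result, colsToDrop) through the fold instead of mutating a var.
val cols   = List("RootData", "Rates", "CreationDate")
val nested = Set("RootData", "Rates") // pretend these are struct/array columns

val (result, toDrop) =
  cols.foldLeft(("df", List.empty[String])) { case ((df, drop), colName) =>
    if (nested(colName)) (s"$df.flatten($colName)", colName :: drop)
    else (df, drop)
  }

println(toDrop.reverse) // List(RootData, Rates)
```

In the Spark version, the first tuple element would be the DataFrame being transformed, and the final `drop` list would be passed to `df.drop(toDrop: _*)`.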