Renaming and optimizing multiple pivoted columns in Scala Spark

scala hadoop apache-spark pyspark

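My input data has a set of columns, and I pivot the data on each of those columns.

Once the pivot is done, I run into a problem with the column headers.

Input data:

Output generated by my approach:

Expected output headers:

I need the output headers to look like:

Steps taken so far to produce the output I currently get: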

// *Load the data*

scala> val input_data =spark.read.option("header","true").option("inferschema","true").option("delimiter","\t").csv("s3://mybucket/data.tsv")

// *Filter the data where residentFlag column = T*

scala> val filtered_data = input_data.select("numericID","age","salary","gender","residentFlag").filter($"residentFlag".contains("T"))

// *Now we will pivot the filtered data by each column, starting with "age"*

scala> val pivotByAge = filtered_data.groupBy("age","numericID").pivot("age").agg(expr("coalesce(first(numericID),'-')")).drop("age")

// *Pivot the data by the second column named "salary"*

scala> val pivotBySalary = filtered_data.groupBy("salary","numericID").pivot("salary").agg(expr("coalesce(first(numericID),'-')")).drop("salary")

// *Join the above two dataframes based on the numericID*

scala> val intermediateDf = pivotByAge.join(pivotBySalary,"numericID")

// *Now pivot the filtered data from Step 2 on the third column, named "gender"*

scala> val pivotByGender = filtered_data.groupBy("gender","numericID").pivot("gender").agg(expr("coalesce(first(numericID),'-')")).drop("gender")

// *Join the above dataframe with the intermediateDf*

scala> val outputDF= pivotByGender.join(intermediateDf ,"numericID")

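How can I rename the columns generated by the pivot?

Is there a different approach for pivoting a dataset on many columns (close to 300)?

Are there any optimizations or suggestions to improve performance?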

You can do it like this, using a regular expression to simplify the matching:

val joined = pivotByGender.join(intermediateDf, "numericID")

// Fold over the column names, prefixing each generated pivot column
// with the name of the column it was pivoted from
val outputDF = joined.columns.foldLeft(joined) { (df, cl) =>
  cl match {
    case "F" | "M"                   => df.withColumnRenamed(cl, s"gender_$cl")
    case c if c.matches("""\d{2}""") => df.withColumnRenamed(c, s"age_$c")
    case _                           => df // numericID and unmatched columns stay as they are
  }
}
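The renaming rule in the snippet above can be pulled out into a plain function on column names, which makes it easy to check without a Spark session. This is only a sketch: `PivotRename` and `renameColumn` are hypothetical names, and the patterns cover just the sample values (gender `F`/`M`, two-digit ages), so salary headers pass through unchanged.

```scala
object PivotRename {
  // Hypothetical helper: maps one generated pivot header to a prefixed name.
  // The patterns mirror the sample data only; real data needs its own rules.
  def renameColumn(name: String): String = name match {
    case "F" | "M"                   => s"gender_$name"
    case n if n.matches("""\d{2}""") => s"age_$n" // matches() requires a full match
    case other                       => other     // numericID and anything unmatched
  }
}
```

With a function like this, all headers could be renamed in one call instead of a chain of `withColumnRenamed`, for example `outputDF.toDF(outputDF.columns.map(PivotRename.renameColumn): _*)`.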

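You could consider traversing a list of the pivot columns to create the pivoted DataFrames one after another, renaming the generated pivot columns of each, and then joining them cumulatively: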

val data = Seq(
  (1, 30, 50000, "M"),
  (1, 25, 70000, "F"),
  (1, 40, 70000, "M"),
  (1, 30, 80000, "M"),
  (2, 30, 80000, "M"),
  (2, 40, 50000, "F"),
  (2, 25, 70000, "F")
).toDF("numericID", "age", "salary", "gender")

// Create list pivotCols, which consists of the columns to pivot
val id = data.columns.head
val pivotCols = data.columns.filter(_ != "numericID")

// Create the first pivot dataframe from the first column in list pivotCols and
// rename each of the generated pivot columns
val c1 = pivotCols.head
val df1 = data.groupBy(c1, id).pivot(c1).agg(expr(s"coalesce(first($id),'-')")).drop(c1)
val df1Renamed = df1.columns.tail.foldLeft( df1 )( (acc, x) =>
      acc.withColumnRenamed(x, c1 + "_" + x)
    )

// Using the first pivot dataframe as the initial dataframe, process each of the
// remaining columns in list pivotCols similar to how the first column is processed,
// and cumulatively join each of them with the previously joined dataframe
pivotCols.tail.foldLeft( df1Renamed )(
  (accDF, c) => {
    val df = data.groupBy(c, id).pivot(c).agg(expr(s"coalesce(first($id),'-')")).drop(c)
    val dfRenamed = df.columns.tail.foldLeft( df )( (acc, x) =>
      acc.withColumnRenamed(x, c + "_" + x)
    )
    dfRenamed.join(accDF, Seq(id))
  }
)

// +---------+--------+--------+------------+------------+------------+------+------+------+
// |numericID|gender_F|gender_M|salary_50000|salary_70000|salary_80000|age_25|age_30|age_40|
// +---------+--------+--------+------------+------------+------------+------+------+------+
// |2        |2       |-       |2           |-           |-           |-     |2     |-     |
// |2        |2       |-       |2           |-           |-           |2     |-     |-     |
// |2        |2       |-       |2           |-           |-           |-     |-     |2     |
// |2        |2       |-       |-           |2           |-           |-     |2     |-     |
// |2        |2       |-       |-           |2           |-           |2     |-     |-     |
// |2        |2       |-       |-           |2           |-           |-     |-     |2     |
// |2        |2       |-       |-           |-           |2           |-     |2     |-     |
// |2        |2       |-       |-           |-           |2           |2     |-     |-     |
// |2        |2       |-       |-           |-           |2           |-     |-     |2     |
// |2        |-       |2       |2           |-           |-           |-     |2     |-     |
// |2        |-       |2       |2           |-           |-           |2     |-     |-     |
// |2        |-       |2       |2           |-           |-           |-     |-     |2     |
// |2        |-       |2       |-           |2           |-           |-     |2     |-     |
// |2        |-       |2       |-           |2           |-           |2     |-     |-     |
// |2        |-       |2       |-           |2           |-           |-     |-     |2     |
// |2        |-       |2       |-           |-           |2           |-     |2     |-     |
// |2        |-       |2       |-           |-           |2           |2     |-     |-     |
// |2        |-       |2       |-           |-           |2           |-     |-     |2     |
// |1        |-       |1       |-           |1           |-           |1     |-     |-     |
// |1        |-       |1       |-           |1           |-           |-     |-     |1     |
// ...
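If one row per `numericID` is acceptable (a different shape from the joined output above, which keeps one row per combination), the joins might be avoided altogether by building one conditional aggregate per (column, value) pair and running a single `groupBy`. The sketch below only generates the SQL expression strings, in the same `expr(...)` style as the answer. `PivotExprs` and `pivotExprs` are hypothetical names, and the code assumes the distinct values are already known and contain no quotes or backticks.

```scala
object PivotExprs {
  // Hypothetical helper: one SQL aggregate expression per (column, value) pair.
  // Assumes values contain no quotes or backticks; real data may need escaping.
  def pivotExprs(id: String, valuesByColumn: Seq[(String, Seq[String])]): Seq[String] =
    for {
      (column, values) <- valuesByColumn
      value            <- values
    } yield s"coalesce(first(CASE WHEN $column = '$value' THEN $id END, true), '-') AS `${column}_$value`"
}
```

In a Spark shell the strings could then be applied as `data.groupBy(id).agg(expr(exprs.head), exprs.tail.map(expr): _*)`, where `exprs` is the result of `pivotExprs`. The values per column would have to be collected first, for example with `data.select(c).distinct.collect`, which costs one Spark job per pivot column.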

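Is there a reason this is tagged pyspark when you are using Scala?

That is because someone using pyspark may run into a similar problem. This is a Spark question, not a language-specific one. Also, the second part of the question is about optimization, so it applies to every Spark execution environment.

OK, have you tried df.withColumnRenamed?

As you can see, the current final output has about 10 columns, so withColumnRenamed would work here. However, it is not workable for two reasons: 1. I don't want to rename columns manually by looking at the generated headers. 2. In practice the input file will have 300 columns that get pivoted, so withColumnRenamed is not feasible because I don't know the headers in advance. I am looking for a way to pivot using the input column name and then somehow attach that name to the headers generated from that column.

Can you help me understand what exactly you are doing here: pivotCols.tail.foldLeft(df1Renamed)((accDF, c) => { val df = data.groupBy(c, id).pivot(c).agg(expr(s"coalesce(first($id),'-')")).drop(c); val dfRenamed = df.columns.tail.foldLeft(df)((acc, x) => acc.withColumnRenamed(x, c + "_" + x)); dfRenamed.join(accDF, Seq(id)) })? Will this work for any number of columns? (In the real scenario I have about 300 columns.) What happens if this scenario has two more columns (for example, country and city)?

See the comments in the updated answer. As long as the groupBy/pivot/agg structure stays the same, the same code will handle any number of columns assembled in the list pivotCols. Keep in mind that the size of the cumulative join of the pivoted DataFrames will grow exponentially.

With this approach I would need to write a case for every possible value of every pivot column.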
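To make the growth of the cumulative join concrete, the row count can be estimated in plain Scala from the sample data in the answer. This is a sketch under one assumption about the approach: each pivoted frame holds one row per distinct (value, numericID) pair, so joining on `numericID` multiplies the per-column distinct counts for each ID. `JoinGrowth` and `joinedRowCount` are hypothetical names.

```scala
object JoinGrowth {
  // Rows copied from the sample data in the answer: (numericID, age, salary, gender).
  val rows: Seq[(Int, Int, Int, String)] = Seq(
    (1, 30, 50000, "M"), (1, 25, 70000, "F"), (1, 40, 70000, "M"), (1, 30, 80000, "M"),
    (2, 30, 80000, "M"), (2, 40, 50000, "F"), (2, 25, 70000, "F")
  )

  // For each ID, multiply the number of distinct values seen in each pivot column,
  // then sum over IDs to get the size of the fully joined result.
  def joinedRowCount(data: Seq[(Int, Int, Int, String)]): Long =
    data.groupBy(_._1).values.map { g =>
      g.map(_._2).distinct.size.toLong *
        g.map(_._3).distinct.size.toLong *
        g.map(_._4).distinct.size.toLong
    }.sum
}
```

For the seven sample rows this gives 3 × 3 × 2 = 18 joined rows per ID, 36 in total, and every additional pivot column multiplies that figure again.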