How to convert DataFrame column names into values in Spark Scala


Hi everyone, I need some advice on this problem. I have this DataFrame:

+------------------------+--------------------+---+---+----------+-----------------------+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+----------------------+-----------------------+----------------------+-----------------------+
|_id                     |h                   |inc|op |ts        |webhooks__0__failed_at |webhooks__0__status|webhooks__0__updated_at|webhooks__1__failed_at |webhooks__1__updated_at|webhooks__2__failed_at |webhooks__2__updated_at|webhooks__3__failed_at|webhooks__3__updated_at|webhooks__5__failed_at|webhooks__5__updated_at|
+------------------------+--------------------+---+---+----------+-----------------------+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+----------------------+-----------------------+----------------------+-----------------------+
|5926115bffecf947d9fdf965|-3783513890158363801|148|u  |1564077339|null                   |null               |null                   |2019-07-25 17:55:39.813|2019-07-25 17:55:39.819|null                   |null                   |null                  |null                   |null                  |null                   |
|5926115bffecf947d9fdf965|-6421919050082865687|151|u  |1564077339|null                   |null               |null                   |2019-07-25 17:55:39.822|2019-07-25 17:55:39.845|null                   |null                   |null                  |null                   |null                  |null                   |
|5926115bffecf947d9fdf965|-1953717027542703837|155|u  |1564077339|null                   |null               |null                   |2019-07-25 17:55:39.873|2019-07-25 17:55:39.878|null                   |null                   |null                  |null                   |null                  |null                   |
|5926115bffecf947d9fdf965|7260191374440479618 |159|u  |1564077339|null                   |null               |null                   |2019-07-25 17:55:39.945|2019-07-25 17:55:39.951|null                   |null                   |null                  |null                   |null                  |null                   |
|57d17de901cc6a6c9e0000ab|-2430099739381353477|131|u  |1564077339|2019-07-25 17:55:39.722|error              |2019-07-25 17:55:39.731|null                   |null                   |null                   |null                   |null                  |null                   |null                  |null                   |
|5b9bf21bffecf966c2878b11|4122669520839049341 |30 |u  |1564077341|null                   |listening          |2019-07-25 17:55:41.453|null                   |null                   |null                   |null                   |null                  |null                   |null                  |null                   |
|5b9bf21bffecf966c2878b11|4122669520839049341 |30 |u  |1564077341|null                   |listening          |2019-07-25 17:55:41.453|null                   |null                   |null                   |null                   |null                  |null                   |null                  |null                   |
|5b9bf21bffecf966c2878b11|-7191334145177061427|60 |u  |1564077341|null                   |null               |2019-07-25 17:55:41.768|null                   |null                   |null                   |null                   |null                  |null                   |null                  |null                   |
|5b9bf21bffecf966c2878b11|1897433358396319399 |58 |u  |1564077341|null                   |null               |2019-07-25 17:55:41.767|null                   |null                   |null                   |null                   |null                  |null                   |null                  |null                   |
|5b9bf21bffecf966c2878b11|1897433358396319399 |58 |u  |1564077341|null                   |null               |2019-07-25 17:55:41.767|null                   |null                   |null                   |null                   |null                  |null                   |null                  |null                   |
|58c6d048edbb6e09eb177639|8363076784039152000 |23 |u  |1564077342|null                   |null               |2019-07-25 17:55:42.216|null                   |null                   |null                   |null                   |null                  |null                   |null                  |null                   |
|5b9bf21bffecf966c2878b11|-7191334145177061427|60 |u  |1564077341|null                   |null               |2019-07-25 17:55:41.768|null                   |null                   |null                   |null                   |null                  |null                   |null                  |null                   |
|58c6d048edbb6e09eb177639|8363076784039152000 |23 |u  |1564077342|null                   |null               |2019-07-25 17:55:42.216|null                   |null                   |null                   |null                   |null                  |null                   |null                  |null                   |
|5ac6a0d3b795b013a5a73a43|-3790832816225805697|36 |u  |1564077346|null                   |null               |null                   |null                   |null                   |2019-07-25 17:55:46.384|2019-07-25 17:55:46.400|null                  |null                   |null                  |null                   |
|5ac6a0d3b795b013a5a73a43|-1747137668935062717|34 |u  |1564077346|null                   |null               |null                   |null                   |null                   |2019-07-25 17:55:46.385|2019-07-25 17:55:46.398|null                  |null                   |null                  |null                   |
|5ac6a0d3b795b013a5a73a43|-1747137668935062717|34 |u  |1564077346|null                   |null               |null                   |null                   |null                   |2019-07-25 17:55:46.385|2019-07-25 17:55:46.398|null                  |null                   |null                  |null                   |
|5ac6a0d3b795b013a5a73a43|-3790832816225805697|36 |u  |1564077346|null                   |null               |null                   |null                   |null                   |2019-07-25 17:55:46.384|2019-07-25 17:55:46.400|null                  |null                   |null                  |null                   |
|5ac6a0d3b795b013a5a73a43|6060575882395080442 |63 |u  |1564077346|null                   |null               |null                   |null                   |null                   |2019-07-25 17:55:46.506|2019-07-25 17:55:46.529|null                  |null                   |null                  |null                   |
|5ac6a0d3b795b013a5a73a43|6060575882395080442 |63 |u  |1564077346|null                   |null               |null                   |null                   |null                   |2019-07-25 17:55:46.506|2019-07-25 17:55:46.529|null                  |null                   |null                  |null                   |
|594e88f1ffecf918a14c143e|736029767610412482  |58 |u  |1564077346|2019-07-25 17:55:46.503|null               |2019-07-25 17:55:46.513|null                   |null                   |null                   |null                   |null                  |null                   |null                  |null                   |
+------------------------+--------------------+---+---+----------+-----------------------+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+----------------------+-----------------------+----------------------+-----------------------+
The column names grow in a format like this:

webhooks__0__failed_at, webhooks__0__failed_at

Is it possible to create a new DataFrame that takes the number in the column name as an index and groups the results like this?

Index | webhooks__failed_at        | webhooks__status
0     | null                       | null
0     | null                       | null
0     | 2019-07-25 17:55:39.722    | error
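
For reference, one way to produce exactly this shape is Spark SQL's stack generator. This is only a minimal sketch, assuming the frame is called df, all webhook columns are strings, and only index 0 has a status column (the other indices get a null placeholder):

//Unpivot the webhook columns with stack: one output row per (input row, index) pair.
//stack(5, k, a, b, c, ...) emits 5 rows of 4 columns each.
val unpivoted = df.selectExpr(
  "_id",
  """stack(5,
       '0', webhooks__0__failed_at, webhooks__0__status,  webhooks__0__updated_at,
       '1', webhooks__1__failed_at, cast(null as string), webhooks__1__updated_at,
       '2', webhooks__2__failed_at, cast(null as string), webhooks__2__updated_at,
       '3', webhooks__3__failed_at, cast(null as string), webhooks__3__updated_at,
       '5', webhooks__5__failed_at, cast(null as string), webhooks__5__updated_at
     ) as (Index, webhooks__failed_at, webhooks__status, webhooks__updated_at)"""
)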

I'd suggest a loop. The example below is basic, but it should point you in the right direction. It is built to search for one column rather than two, but it can be extended to factor in multiple different columns (a sketch of such an extension follows the code) and split into sub-processes where needed.

//Needed for the Seq(...).toDF call further down
import spark.implicits._

//Build the DataFrame
val inputDF = spark.sql("select 'a' as Column_1, 'value_1' as test_0_value, 'value_2' as test_1_value, 'value_3' as test_2_value, 'value_4' as test_3_value")

//Make my TempDFs
var interimDF = spark.sql("select 'at-at' as column_1")
var actionDF = interimDF
var finalDF = interimDF

//This would be your search and replacement characteristics
val lookForValue = "test"
val replacementName = "test_check"

//Holds the constants
var constantArray = Array("Column_1")
//Based on above makes an array based on the columns you need to hit
var changeArray = Seq(inputDF.columns:_*).toDF("Columns").where("Columns rlike '" + lookForValue + "'").rdd.map(x=>x.mkString).collect

//Iterator
var iterator = 1

//Need this for below to run commands
var runStatement = Array("")

//Runs until all columns are hit
while(iterator <= changeArray.length) {
  //Adds constants
  runStatement = constantArray
  //Adds the current iteration columns
  runStatement = runStatement ++ Array(changeArray(iterator - 1) + " as " + replacementName)
  //Adds the iteration event
  runStatement = runStatement ++ Array("'" + iterator + "' as Iteration_Number")

  //Runs all the prebuilt commands
  actionDF = inputDF.selectExpr(runStatement:_*)

  //The reason for this is going from input -> action -> interim <-> final allows for interim and final to be semi-dynamic and allows vertical and horizontal catalogue keeping in spark
  interimDF = if(iterator == 1) {
    actionDF
  } else {
    finalDF.unionAll(actionDF)
  }
  finalDF = interimDF
  iterator = iterator + 1
}
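
A hypothetical extension of that idea (the inputDF2 columns and the *_check names below are made up for illustration, not part of the original example) pulls two column families in the same pass, so each output row carries both values for one index:

//Demo frame with two column families per index
val inputDF2 = spark.sql("select 'a' as Column_1, 'f0' as test_0_failed, 's0' as test_0_status, 'f1' as test_1_failed, 's1' as test_1_status")

//Collect the distinct index values embedded in the column names
val indices2 = inputDF2.columns.filter(_.matches("test_\\d+_.*")).map(_.split("_")(1)).distinct

var iter = 1
var outDF = inputDF2 //placeholder, replaced on the first pass
while(iter <= indices2.length) {
  val idx = indices2(iter - 1)
  //Select both families for this index plus the constants
  val stmt = Array("Column_1",
                   s"test_${idx}_failed as failed_check",
                   s"test_${idx}_status as status_check",
                   s"'$idx' as Iteration_Number")
  outDF = if(iter == 1) inputDF2.selectExpr(stmt:_*) else outDF.union(inputDF2.selectExpr(stmt:_*))
  iter = iter + 1
}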


If your initial DataFrame is referenced as df and has the following schema:

df.printSchema
root
 |-- _id: string (nullable = true)
 |-- h: string (nullable = true)
 |-- inc: string (nullable = true)
 |-- op: string (nullable = true)
 |-- ts: string (nullable = true)
 |-- webhooks__0__failed_at: string (nullable = true)
 |-- webhooks__0__status: string (nullable = true)
 |-- webhooks__0__updated_at: string (nullable = true)
 |-- webhooks__1__failed_at: string (nullable = true)
 |-- webhooks__1__updated_at: string (nullable = true)
 |-- webhooks__2__failed_at: string (nullable = true)
 |-- webhooks__2__updated_at: string (nullable = true)
 |-- webhooks__3__failed_at: string (nullable = true)
 |-- webhooks__3__updated_at: string (nullable = true)
 |-- webhooks__5__failed_at: string (nullable = true)
 |-- webhooks__5__updated_at: string (nullable = true)
Simply by manipulating the column name expressions, you can regroup all the webhook data into an array of structs, and you can use the lit Spark function to insert the column name's index as a value into the resulting dataset:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import df.sparkSession.implicits._

//Split the columns into the webhook-specific ones and the base columns
val (webhooks_columns, base_columns) = df.columns.partition(_.startsWith("webhooks"))

//Parse "webhooks__<idx>__<field>" into (idx, field) pairs
val parsed_webhooks_columns = webhooks_columns
     .map(_.split("__"))
     .map { case Array(_: String, idx: String, f: String) => (idx, f) }

//Every field name that appears for at least one index
val all_fields = parsed_webhooks_columns.map(_._2).toSet

//One struct per webhook index, padding missing fields with null strings
//so that every struct has exactly the same shape
val webhooks_structs = parsed_webhooks_columns
    .groupBy(_._1)
    .map(t => {
      val fields = t._2.map(_._2)
      val all_struct_fields = 
          Seq(lit(t._1).as("index")) ++ 
          all_fields.map { f =>
            if (fields.contains(f))
                col(s"webhooks__${t._1}__${f}").as(f)
            else
                lit(null).cast(StringType).as(f)
          }
      struct(all_struct_fields:_*)
    }).toArray

//Keep the base columns and collect all the structs into a single array column
val df_step1 = df.select(base_columns.map(col) ++
    Seq(array(webhooks_structs:_*).as("webhooks")):_*)
Most of the complexity in the code above comes from the fact that the number of fields varies with the webhook index (index 0 has a status field that the other indices don't), and from the need to make sure that every struct ends up with exactly the same columns, with the same types, in the same order.

You end up with the following schema:

df_step1.printSchema
root
 |-- _id: string (nullable = true)
 |-- h: string (nullable = true)
 |-- inc: string (nullable = true)
 |-- op: string (nullable = true)
 |-- ts: string (nullable = true)
 |-- webhooks: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- index: string (nullable = false)
 |    |    |-- failed_at: string (nullable = true)
 |    |    |-- status: string (nullable = true)
 |    |    |-- updated_at: string (nullable = true)
Now you can explode the dataset to split the different webhooks into separate rows:

val df_step2 = df_step1.withColumn("webhook", explode('webhooks)).drop("webhooks")
You end up with the following schema:

df_step2.printSchema
root
 |-- _id: string (nullable = true)
 |-- h: string (nullable = true)
 |-- inc: string (nullable = true)
 |-- op: string (nullable = true)
 |-- ts: string (nullable = true)
 |-- webhook: struct (nullable = false)
 |    |-- index: string (nullable = false)
 |    |-- failed_at: string (nullable = true)
 |    |-- status: string (nullable = true)
 |    |-- updated_at: string (nullable = true)
You can then optionally flatten the dataset to simplify the final schema:

val df_step2_flattened = df_step2.schema
       .filter(_.name == "webhook")
       .flatMap(_.dataType.asInstanceOf[StructType])
       .map(f => (s"webhook_${f.name}", 'webhook(f.name)))
       .foldLeft(df_step2) { case (df, (colname, colspec)) => df.withColumn(colname, colspec) }
       .drop("webhook")
At this point you probably want to filter out the rows where webhook_updated_at is null, and then run whatever aggregation you need.
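
For example, a minimal sketch of that filtering plus a simple count per webhook index and status (the grouping and aggregation are placeholders for whatever you actually need):

import org.apache.spark.sql.functions.{col, count}

//Drop webhook slots that never carried data, then aggregate as needed
val df_step3 = df_step2_flattened
    .filter(col("webhook_updated_at").isNotNull)
    .groupBy(col("webhook_index"), col("webhook_status"))
    .agg(count("*").as("events"))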

Your final schema is now:

df_step2_flattened.printSchema
root
 |-- _id: string (nullable = true)
 |-- h: string (nullable = true)
 |-- inc: string (nullable = true)
 |-- op: string (nullable = true)
 |-- ts: string (nullable = true)
 |-- webhook_index: string (nullable = false)
 |-- webhook_failed_at: string (nullable = true)
 |-- webhook_status: string (nullable = true)
 |-- webhook_updated_at: string (nullable = true)

This is not the only way to achieve what you want, but the main advantage of the approach above is that it uses only built-in Spark expressions and functions, so it benefits fully from all the Catalyst engine optimizations.


What have you tried so far? This is definitely possible: you can generate separate DataFrames containing only the required columns for each index, union them all together at the end, and then group the result. It is easy if the columns are static. For a dynamic number of columns…
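
A minimal sketch of that idea for a dynamic number of columns, assuming the webhooks__<index>__<field> naming from the question (the helper names here are illustrative only):

import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StringType

//Discover the webhook indices from the column names
val indices = df.columns.filter(_.startsWith("webhooks__")).map(_.split("__")(1)).distinct
val fields = Seq("failed_at", "status", "updated_at")

//One small DataFrame per index, padding missing fields with nulls
val perIndex = indices.map { idx =>
  val fieldCols = fields.map { f =>
    val name = s"webhooks__${idx}__${f}"
    if (df.columns.contains(name)) col(name).as(s"webhooks__$f")
    else lit(null).cast(StringType).as(s"webhooks__$f")
  }
  val allCols = col("_id") +: lit(idx).as("Index") +: fieldCols
  df.select(allCols: _*)
}

//Union them all and group
val combined = perIndex.reduce(_ union _)
val grouped = combined.groupBy("Index").count() //example grouping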