How to convert dataframe column names to values in Spark Scala

Hi everyone, I need some advice on this problem. I have this dataframe:
+------------------------+--------------------+---+---+----------+-----------------------+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+----------------------+-----------------------+----------------------+-----------------------+
|_id |h |inc|op |ts |webhooks__0__failed_at |webhooks__0__status|webhooks__0__updated_at|webhooks__1__failed_at |webhooks__1__updated_at|webhooks__2__failed_at |webhooks__2__updated_at|webhooks__3__failed_at|webhooks__3__updated_at|webhooks__5__failed_at|webhooks__5__updated_at|
+------------------------+--------------------+---+---+----------+-----------------------+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+----------------------+-----------------------+----------------------+-----------------------+
|5926115bffecf947d9fdf965|-3783513890158363801|148|u |1564077339|null |null |null |2019-07-25 17:55:39.813|2019-07-25 17:55:39.819|null |null |null |null |null |null |
|5926115bffecf947d9fdf965|-6421919050082865687|151|u |1564077339|null |null |null |2019-07-25 17:55:39.822|2019-07-25 17:55:39.845|null |null |null |null |null |null |
|5926115bffecf947d9fdf965|-1953717027542703837|155|u |1564077339|null |null |null |2019-07-25 17:55:39.873|2019-07-25 17:55:39.878|null |null |null |null |null |null |
|5926115bffecf947d9fdf965|7260191374440479618 |159|u |1564077339|null |null |null |2019-07-25 17:55:39.945|2019-07-25 17:55:39.951|null |null |null |null |null |null |
|57d17de901cc6a6c9e0000ab|-2430099739381353477|131|u |1564077339|2019-07-25 17:55:39.722|error |2019-07-25 17:55:39.731|null |null |null |null |null |null |null |null |
|5b9bf21bffecf966c2878b11|4122669520839049341 |30 |u |1564077341|null |listening |2019-07-25 17:55:41.453|null |null |null |null |null |null |null |null |
|5b9bf21bffecf966c2878b11|4122669520839049341 |30 |u |1564077341|null |listening |2019-07-25 17:55:41.453|null |null |null |null |null |null |null |null |
|5b9bf21bffecf966c2878b11|-7191334145177061427|60 |u |1564077341|null |null |2019-07-25 17:55:41.768|null |null |null |null |null |null |null |null |
|5b9bf21bffecf966c2878b11|1897433358396319399 |58 |u |1564077341|null |null |2019-07-25 17:55:41.767|null |null |null |null |null |null |null |null |
|5b9bf21bffecf966c2878b11|1897433358396319399 |58 |u |1564077341|null |null |2019-07-25 17:55:41.767|null |null |null |null |null |null |null |null |
|58c6d048edbb6e09eb177639|8363076784039152000 |23 |u |1564077342|null |null |2019-07-25 17:55:42.216|null |null |null |null |null |null |null |null |
|5b9bf21bffecf966c2878b11|-7191334145177061427|60 |u |1564077341|null |null |2019-07-25 17:55:41.768|null |null |null |null |null |null |null |null |
|58c6d048edbb6e09eb177639|8363076784039152000 |23 |u |1564077342|null |null |2019-07-25 17:55:42.216|null |null |null |null |null |null |null |null |
|5ac6a0d3b795b013a5a73a43|-3790832816225805697|36 |u |1564077346|null |null |null |null |null |2019-07-25 17:55:46.384|2019-07-25 17:55:46.400|null |null |null |null |
|5ac6a0d3b795b013a5a73a43|-1747137668935062717|34 |u |1564077346|null |null |null |null |null |2019-07-25 17:55:46.385|2019-07-25 17:55:46.398|null |null |null |null |
|5ac6a0d3b795b013a5a73a43|-1747137668935062717|34 |u |1564077346|null |null |null |null |null |2019-07-25 17:55:46.385|2019-07-25 17:55:46.398|null |null |null |null |
|5ac6a0d3b795b013a5a73a43|-3790832816225805697|36 |u |1564077346|null |null |null |null |null |2019-07-25 17:55:46.384|2019-07-25 17:55:46.400|null |null |null |null |
|5ac6a0d3b795b013a5a73a43|6060575882395080442 |63 |u |1564077346|null |null |null |null |null |2019-07-25 17:55:46.506|2019-07-25 17:55:46.529|null |null |null |null |
|5ac6a0d3b795b013a5a73a43|6060575882395080442 |63 |u |1564077346|null |null |null |null |null |2019-07-25 17:55:46.506|2019-07-25 17:55:46.529|null |null |null |null |
|594e88f1ffecf918a14c143e|736029767610412482 |58 |u |1564077346|2019-07-25 17:55:46.503|null |2019-07-25 17:55:46.513|null |null |null |null |null |null |null |null |
+------------------------+--------------------+---+---+----------+-----------------------+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+----------------------+-----------------------+----------------------+-----------------------+
The column names grow in the following format:
webhooks__0__failed_at, webhooks__1__failed_at, ...
Is it possible to create a new dataframe that takes the number in the column name as an index, and groups the results like this?
Index | webhooks__failed_at | webhooks__status
0 | null | null
0 | null | null
0 | 2019-07-25 17:55:39.722 | error
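The transformation being asked for hinges on splitting each column name on the `__` separator to recover the index and the field name. A minimal, Spark-free sketch of just that parsing step (plain Scala, column names taken from the table above):

```scala
// Parse "webhooks__<index>__<field>" column names into (index, field) pairs.
val columns = Seq(
  "webhooks__0__failed_at", "webhooks__0__status",
  "webhooks__1__failed_at", "webhooks__1__updated_at"
)

val parsed: Seq[(String, String)] = columns.map { name =>
  val Array(_, idx, field) = name.split("__")
  (idx, field)
}

println(parsed)
// List((0,failed_at), (0,status), (1,failed_at), (1,updated_at))
```

Once the names are parsed like this, both answers below reduce to grouping the pairs by index.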
I'd suggest a loop. The example below is basic, but it should help point you in the right direction. It's written to search for one column pattern rather than two, but it can be extended to account for multiple different columns and built into a sub-process if needed.
//Build the DataFrame
val inputDF = spark.sql("select 'a' as Column_1, 'value_1' as test_0_value, 'value_2' as test_1_value, 'value_3' as test_2_value, 'value_4' as test_3_value")
//Make my temp DataFrames
var interimDF = spark.sql("select 'at-at' as column_1")
var actionDF = interimDF
var finalDF = interimDF
//This would be your search and replacement characteristics
val lookForValue = "test"
val replacementName = "test_check"
//Holds the constant columns carried into every iteration
val constantArray = Array("Column_1")
//Based on the above, builds an array of the columns you need to hit
val changeArray = Seq(inputDF.columns:_*).toDF("Columns").where("Columns rlike '" + lookForValue + "'").rdd.map(x => x.mkString).collect
//Iterator
var iterator = 1
//Holds the select expressions built on each iteration
var runStatement = Array("")
//Runs until all columns are hit
while(iterator <= changeArray.length) {
  //Adds constants
  runStatement = constantArray
  //Adds the current iteration's column
  runStatement = runStatement ++ Array(changeArray(iterator - 1) + " as " + replacementName)
  //Adds the iteration number
  runStatement = runStatement ++ Array("'" + iterator + "' as Iteration_Number")
  //Runs all the prebuilt commands
  actionDF = inputDF.selectExpr(runStatement:_*)
  //Going input -> action -> interim <-> final allows interim and final to stay semi-dynamic, and allows vertical and horizontal catalogue keeping in Spark
  interimDF = if(iterator == 1) {
    actionDF
  } else {
    finalDF.union(actionDF) //unionAll is deprecated in recent Spark versions; union is equivalent
  }
  finalDF = interimDF
  iterator = iterator + 1
}
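To see what this loop actually feeds into `selectExpr`, here is the string-building part isolated in plain Scala (no Spark session needed; the column names mirror the `inputDF` above):

```scala
// Reproduce the selectExpr argument list the loop builds, one per matched column.
val inputColumns = Array("Column_1", "test_0_value", "test_1_value", "test_2_value", "test_3_value")
val lookForValue = "test"
val replacementName = "test_check"
val constantArray = Array("Column_1")

// Stand-in for the rlike filter over inputDF.columns
val changeArray = inputColumns.filter(_.contains(lookForValue))

val statements = changeArray.zipWithIndex.map { case (c, i) =>
  constantArray ++
    Array(s"$c as $replacementName") ++      // the renamed column
    Array(s"'${i + 1}' as Iteration_Number") // the index becomes a value
}

statements.foreach(s => println(s.mkString(", ")))
// Column_1, test_0_value as test_check, '1' as Iteration_Number
// Column_1, test_1_value as test_check, '2' as Iteration_Number
// ...
```

Each of these argument lists produces one slice of the final unioned dataframe, which is how the column-name index ends up as a row value.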
If your initial dataframe is referenced as df and has the following schema:
df.printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks__0__failed_at: string (nullable = true)
|-- webhooks__0__status: string (nullable = true)
|-- webhooks__0__updated_at: string (nullable = true)
|-- webhooks__1__failed_at: string (nullable = true)
|-- webhooks__1__updated_at: string (nullable = true)
|-- webhooks__2__failed_at: string (nullable = true)
|-- webhooks__2__updated_at: string (nullable = true)
|-- webhooks__3__failed_at: string (nullable = true)
|-- webhooks__3__updated_at: string (nullable = true)
|-- webhooks__5__failed_at: string (nullable = true)
|-- webhooks__5__updated_at: string (nullable = true)
then by just manipulating the column name expressions, you can regroup all the webhook data into a single array of structs, and you can use the lit Spark function to insert the column names as values in the resulting dataset:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import df.sparkSession.implicits._

// Separate the webhook columns from the base columns
val (webhooks_columns, base_columns) = df.columns.partition(_.startsWith("webhooks"))
// Parse "webhooks__<idx>__<field>" names into (idx, field) pairs
val parsed_webhooks_columns = webhooks_columns
  .map(_.split("__"))
  .map { case Array(_: String, idx: String, f: String) => (idx, f) }
// All field names seen across every index (index 0 has an extra status field)
val all_fields = parsed_webhooks_columns.map(_._2).toSet
// One struct per index, padding missing fields with a typed null
val webhooks_structs = parsed_webhooks_columns
  .groupBy(_._1)
  .map(t => {
    val fields = t._2.map(_._2)
    val all_struct_fields =
      Seq(lit(t._1).as("index")) ++
        all_fields.map { f =>
          if (fields.contains(f))
            col(s"webhooks__${t._1}__${f}").as(f)
          else
            lit(null).cast(StringType).as(f)
        }
    struct(all_struct_fields:_*)
  }).toArray
val df_step1 = df.select(base_columns.map(col) ++
  Seq(array(webhooks_structs:_*).as("webhooks")):_*)
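The core of the struct-building step — grouping the parsed (index, field) pairs and padding missing fields — can be checked without a Spark session. A plain-Scala sketch of just that grouping logic (using "NULL" as a stand-in for the `lit(null)` column expression):

```scala
// Group parsed (index, field) pairs and pad each index to the full field set.
val parsed = Seq(
  ("0", "failed_at"), ("0", "status"), ("0", "updated_at"),
  ("1", "failed_at"), ("1", "updated_at")
)
// Every field name seen across all indices, in a stable order
val allFields = parsed.map(_._2).distinct.sorted

val perIndex: Map[String, Seq[String]] = parsed.groupBy(_._1).map { case (idx, pairs) =>
  val present = pairs.map(_._2).toSet
  // For each field: the source column if it exists for this index, else a null slot
  idx -> allFields.map(f => if (present(f)) s"webhooks__${idx}__$f" else "NULL")
}

// Index 1 has no status column, so its status slot is padded with NULL.
println(perIndex("1"))
// List(webhooks__1__failed_at, NULL, webhooks__1__updated_at)
```

This is exactly why every struct ends up with the same fields in the same order, which `array(...)` requires.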
Most of the complexity in the code above deals with the fact that the number of fields varies depending on the webhook index (index 0 has a status field not found in the other indices), and with making sure that all structs have exactly the same columns, with the same types, in the same order.
You'll end up with the following schema:
df_step1.printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- index: string (nullable = false)
| | |-- failed_at: string (nullable = true)
| | |-- status: string (nullable = true)
| | |-- updated_at: string (nullable = true)
Now you can explode the dataset to split the different webhooks into separate rows:
val df_step2 = df_step1.withColumn("webhook", explode('webhooks)).drop("webhooks")
You'll get the following schema:
df_step2.printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhook: struct (nullable = false)
| |-- index: string (nullable = false)
| |-- failed_at: string (nullable = true)
| |-- status: string (nullable = true)
| |-- updated_at: string (nullable = true)
You can then optionally flatten the dataset to simplify the final schema:
val df_step2_flattened = df_step2.schema
.filter(_.name == "webhook")
.flatMap(_.dataType.asInstanceOf[StructType])
.map(f => (s"webhook_${f.name}", 'webhook(f.name)))
.foldLeft(df_step2) { case (df, (colname, colspec)) => df.withColumn(colname, colspec) }
.drop("webhook")
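The rename performed by the fold above is just a `webhook_` prefix applied to each struct field name; isolated in plain Scala, the name mapping alone looks like this (field names copied from the schema):

```scala
// The flattening step renames each struct field f to webhook_<f>.
val structFields = Seq("index", "failed_at", "status", "updated_at")
val flattenedNames = structFields.map(f => s"webhook_$f")
println(flattenedNames.mkString(", "))
// webhook_index, webhook_failed_at, webhook_status, webhook_updated_at
```

Deriving the names from `df_step2.schema` rather than hard-coding them means the flattening keeps working if the struct gains fields later.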
At this point you'll probably want to filter out rows with a null webhook_updated_at, and run whatever aggregations you need.
Your final schema is now:
df_step2_flattened.printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhook_index: string (nullable = false)
|-- webhook_failed_at: string (nullable = true)
|-- webhook_status: string (nullable = true)
|-- webhook_updated_at: string (nullable = true)
This isn't the only way to achieve what you want, but the main advantage of the approach above is that it uses only built-in Spark expressions and functions, so it can take full advantage of all the Catalyst engine optimizations.
What have you tried so far? This is entirely possible. You can generate different dataframes using only the needed columns based on the index, finally union them all together, and then group. If the columns are static it's easy. For a dynamic number of columns…