How to convert dataframe column names to values in Spark Scala

Hi everyone, I need some advice on this problem. I have this dataframe:
+------------------------+--------------------+---+---+----------+-----------------------+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+----------------------+-----------------------+----------------------+-----------------------+
|_id |h |inc|op |ts |webhooks__0__failed_at |webhooks__0__status|webhooks__0__updated_at|webhooks__1__failed_at |webhooks__1__updated_at|webhooks__2__failed_at |webhooks__2__updated_at|webhooks__3__failed_at|webhooks__3__updated_at|webhooks__5__failed_at|webhooks__5__updated_at|
+------------------------+--------------------+---+---+----------+-----------------------+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+----------------------+-----------------------+----------------------+-----------------------+
|5926115bffecf947d9fdf965|-3783513890158363801|148|u |1564077339|null |null |null |2019-07-25 17:55:39.813|2019-07-25 17:55:39.819|null |null |null |null |null |null |
|5926115bffecf947d9fdf965|-6421919050082865687|151|u |1564077339|null |null |null |2019-07-25 17:55:39.822|2019-07-25 17:55:39.845|null |null |null |null |null |null |
|5926115bffecf947d9fdf965|-1953717027542703837|155|u |1564077339|null |null |null |2019-07-25 17:55:39.873|2019-07-25 17:55:39.878|null |null |null |null |null |null |
|5926115bffecf947d9fdf965|7260191374440479618 |159|u |1564077339|null |null |null |2019-07-25 17:55:39.945|2019-07-25 17:55:39.951|null |null |null |null |null |null |
|57d17de901cc6a6c9e0000ab|-2430099739381353477|131|u |1564077339|2019-07-25 17:55:39.722|error |2019-07-25 17:55:39.731|null |null |null |null |null |null |null |null |
|5b9bf21bffecf966c2878b11|4122669520839049341 |30 |u |1564077341|null |listening |2019-07-25 17:55:41.453|null |null |null |null |null |null |null |null |
|5b9bf21bffecf966c2878b11|4122669520839049341 |30 |u |1564077341|null |listening |2019-07-25 17:55:41.453|null |null |null |null |null |null |null |null |
|5b9bf21bffecf966c2878b11|-7191334145177061427|60 |u |1564077341|null |null |2019-07-25 17:55:41.768|null |null |null |null |null |null |null |null |
|5b9bf21bffecf966c2878b11|1897433358396319399 |58 |u |1564077341|null |null |2019-07-25 17:55:41.767|null |null |null |null |null |null |null |null |
|5b9bf21bffecf966c2878b11|1897433358396319399 |58 |u |1564077341|null |null |2019-07-25 17:55:41.767|null |null |null |null |null |null |null |null |
|58c6d048edbb6e09eb177639|8363076784039152000 |23 |u |1564077342|null |null |2019-07-25 17:55:42.216|null |null |null |null |null |null |null |null |
|5b9bf21bffecf966c2878b11|-7191334145177061427|60 |u |1564077341|null |null |2019-07-25 17:55:41.768|null |null |null |null |null |null |null |null |
|58c6d048edbb6e09eb177639|8363076784039152000 |23 |u |1564077342|null |null |2019-07-25 17:55:42.216|null |null |null |null |null |null |null |null |
|5ac6a0d3b795b013a5a73a43|-3790832816225805697|36 |u |1564077346|null |null |null |null |null |2019-07-25 17:55:46.384|2019-07-25 17:55:46.400|null |null |null |null |
|5ac6a0d3b795b013a5a73a43|-1747137668935062717|34 |u |1564077346|null |null |null |null |null |2019-07-25 17:55:46.385|2019-07-25 17:55:46.398|null |null |null |null |
|5ac6a0d3b795b013a5a73a43|-1747137668935062717|34 |u |1564077346|null |null |null |null |null |2019-07-25 17:55:46.385|2019-07-25 17:55:46.398|null |null |null |null |
|5ac6a0d3b795b013a5a73a43|-3790832816225805697|36 |u |1564077346|null |null |null |null |null |2019-07-25 17:55:46.384|2019-07-25 17:55:46.400|null |null |null |null |
|5ac6a0d3b795b013a5a73a43|6060575882395080442 |63 |u |1564077346|null |null |null |null |null |2019-07-25 17:55:46.506|2019-07-25 17:55:46.529|null |null |null |null |
|5ac6a0d3b795b013a5a73a43|6060575882395080442 |63 |u |1564077346|null |null |null |null |null |2019-07-25 17:55:46.506|2019-07-25 17:55:46.529|null |null |null |null |
|594e88f1ffecf918a14c143e|736029767610412482 |58 |u |1564077346|2019-07-25 17:55:46.503|null |2019-07-25 17:55:46.513|null |null |null |null |null |null |null |null |
+------------------------+--------------------+---+---+----------+-----------------------+-------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+----------------------+-----------------------+----------------------+-----------------------+
The column names grow in the following format:
webhooks__0__failed_at, webhooks__1__failed_at, ...
Is it possible to create a new dataframe that takes the number in the column name as an index, and groups the results like this?
Index | webhooks__failed_at | webhooks__status
0 | null | null
0 | null | null
0 | 2019-07-25 17:55:39.722 | error
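The transformation being asked for hinges on splitting each column name on the `__` separator to recover the index and the field name. A minimal, Spark-free sketch of just that parsing step (plain Scala, column names taken from the table above):

```scala
// Parse "webhooks__<index>__<field>" column names into (index, field) pairs.
val columns = Seq(
  "webhooks__0__failed_at", "webhooks__0__status",
  "webhooks__1__failed_at", "webhooks__1__updated_at"
)

val parsed: Seq[(String, String)] = columns.map { name =>
  val Array(_, idx, field) = name.split("__")
  (idx, field)
}

println(parsed)
// List((0,failed_at), (0,status), (1,failed_at), (1,updated_at))
```

Once the names are parsed like this, both answers below reduce to grouping the pairs by index.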
I'd suggest a loop. The example below is basic, but it should help point you in the right direction. It's written to search for one column pattern rather than two, but it can be extended to account for multiple different columns and built into a sub-process if needed.
//Build the DataFrame
val inputDF = spark.sql("select 'a' as Column_1, 'value_1' as test_0_value, 'value_2' as test_1_value, 'value_3' as test_2_value, 'value_4' as test_3_value")
//Make my temp DataFrames
var interimDF = spark.sql("select 'at-at' as column_1")
var actionDF = interimDF
var finalDF = interimDF
//This would be your search and replacement characteristics
val lookForValue = "test"
val replacementName = "test_check"
//Holds the constant columns carried into every iteration
val constantArray = Array("Column_1")
//Based on the above, builds an array of the columns you need to hit
val changeArray = Seq(inputDF.columns:_*).toDF("Columns").where("Columns rlike '" + lookForValue + "'").rdd.map(x => x.mkString).collect
//Iterator
var iterator = 1
//Holds the select expressions built on each iteration
var runStatement = Array("")
//Runs until all columns are hit
while(iterator <= changeArray.length) {
  //Adds constants
  runStatement = constantArray
  //Adds the current iteration's column
  runStatement = runStatement ++ Array(changeArray(iterator - 1) + " as " + replacementName)
  //Adds the iteration number
  runStatement = runStatement ++ Array("'" + iterator + "' as Iteration_Number")
  //Runs all the prebuilt commands
  actionDF = inputDF.selectExpr(runStatement:_*)
  //Going input -> action -> interim <-> final allows interim and final to stay semi-dynamic, and allows vertical and horizontal catalogue keeping in Spark
  interimDF = if(iterator == 1) {
    actionDF
  } else {
    finalDF.union(actionDF) //unionAll is deprecated in recent Spark versions; union is equivalent
  }
  finalDF = interimDF
  iterator = iterator + 1
}
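To see what this loop actually feeds into `selectExpr`, here is the string-building part isolated in plain Scala (no Spark session needed; the column names mirror the `inputDF` above):

```scala
// Reproduce the selectExpr argument list the loop builds, one per matched column.
val inputColumns = Array("Column_1", "test_0_value", "test_1_value", "test_2_value", "test_3_value")
val lookForValue = "test"
val replacementName = "test_check"
val constantArray = Array("Column_1")

// Stand-in for the rlike filter over inputDF.columns
val changeArray = inputColumns.filter(_.contains(lookForValue))

val statements = changeArray.zipWithIndex.map { case (c, i) =>
  constantArray ++
    Array(s"$c as $replacementName") ++      // the renamed column
    Array(s"'${i + 1}' as Iteration_Number") // the index becomes a value
}

statements.foreach(s => println(s.mkString(", ")))
// Column_1, test_0_value as test_check, '1' as Iteration_Number
// Column_1, test_1_value as test_check, '2' as Iteration_Number
// ...
```

Each of these argument lists produces one slice of the final unioned dataframe, which is how the column-name index ends up as a row value.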
If your initial dataframe is referenced as df and has the following schema:
df.printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks__0__failed_at: string (nullable = true)
|-- webhooks__0__status: string (nullable = true)
|-- webhooks__0__updated_at: string (nullable = true)
|-- webhooks__1__failed_at: string (nullable = true)
|-- webhooks__1__updated_at: string (nullable = true)
|-- webhooks__2__failed_at: string (nullable = true)
|-- webhooks__2__updated_at: string (nullable = true)
|-- webhooks__3__failed_at: string (nullable = true)
|-- webhooks__3__updated_at: string (nullable = true)
|-- webhooks__5__failed_at: string (nullable = true)
|-- webhooks__5__updated_at: string (nullable = true)
then by just manipulating the column name expressions, you can regroup all the webhook data into a single array of structs, and you can use the lit Spark function to insert the column names as values in the resulting dataset:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import df.sparkSession.implicits._

// Separate the webhook columns from the base columns
val (webhooks_columns, base_columns) = df.columns.partition(_.startsWith("webhooks"))
// Parse "webhooks__<idx>__<field>" names into (idx, field) pairs
val parsed_webhooks_columns = webhooks_columns
  .map(_.split("__"))
  .map { case Array(_: String, idx: String, f: String) => (idx, f) }
// All field names seen across every index (index 0 has an extra status field)
val all_fields = parsed_webhooks_columns.map(_._2).toSet
// One struct per index, padding missing fields with a typed null
val webhooks_structs = parsed_webhooks_columns
  .groupBy(_._1)
  .map(t => {
    val fields = t._2.map(_._2)
    val all_struct_fields =
      Seq(lit(t._1).as("index")) ++
        all_fields.map { f =>
          if (fields.contains(f))
            col(s"webhooks__${t._1}__${f}").as(f)
          else
            lit(null).cast(StringType).as(f)
        }
    struct(all_struct_fields:_*)
  }).toArray
val df_step1 = df.select(base_columns.map(col) ++
  Seq(array(webhooks_structs:_*).as("webhooks")):_*)
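The core of the struct-building step — grouping the parsed (index, field) pairs and padding missing fields — can be checked without a Spark session. A plain-Scala sketch of just that grouping logic (using "NULL" as a stand-in for the `lit(null)` column expression):

```scala
// Group parsed (index, field) pairs and pad each index to the full field set.
val parsed = Seq(
  ("0", "failed_at"), ("0", "status"), ("0", "updated_at"),
  ("1", "failed_at"), ("1", "updated_at")
)
// Every field name seen across all indices, in a stable order
val allFields = parsed.map(_._2).distinct.sorted

val perIndex: Map[String, Seq[String]] = parsed.groupBy(_._1).map { case (idx, pairs) =>
  val present = pairs.map(_._2).toSet
  // For each field: the source column if it exists for this index, else a null slot
  idx -> allFields.map(f => if (present(f)) s"webhooks__${idx}__$f" else "NULL")
}

// Index 1 has no status column, so its status slot is padded with NULL.
println(perIndex("1"))
// List(webhooks__1__failed_at, NULL, webhooks__1__updated_at)
```

This is exactly why every struct ends up with the same fields in the same order, which `array(...)` requires.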
Most of the complexity in the code above deals with the fact that the number of fields varies depending on the webhook index (index 0 has a status field not found in the other indices), and with making sure that all structs have exactly the same columns, with the same types, in the same order.
You'll end up with the following schema:
df_step1.printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- index: string (nullable = false)
| | |-- failed_at: string (nullable = true)
| | |-- status: string (nullable = true)
| | |-- updated_at: string (nullable = true)
Now you can explode the dataset to split the different webhooks into separate rows:
val df_step2 = df_step1.withColumn("webhook", explode('webhooks)).drop("webhooks")
You'll get the following schema:
df_step2.printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhook: struct (nullable = false)
| |-- index: string (nullable = false)
| |-- failed_at: string (nullable = true)
| |-- status: string (nullable = true)
| |-- updated_at: string (nullable = true)
You can then optionally flatten the dataset to simplify the final schema:
val df_step2_flattened = df_step2.schema
.filter(_.name == "webhook")
.flatMap(_.dataType.asInstanceOf[StructType])
.map(f => (s"webhook_${f.name}", 'webhook(f.name)))
.foldLeft(df_step2) { case (df, (colname, colspec)) => df.withColumn(colname, colspec) }
.drop("webhook")
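The rename performed by the fold above is just a `webhook_` prefix applied to each struct field name; isolated in plain Scala, the name mapping alone looks like this (field names copied from the schema):

```scala
// The flattening step renames each struct field f to webhook_<f>.
val structFields = Seq("index", "failed_at", "status", "updated_at")
val flattenedNames = structFields.map(f => s"webhook_$f")
println(flattenedNames.mkString(", "))
// webhook_index, webhook_failed_at, webhook_status, webhook_updated_at
```

Deriving the names from `df_step2.schema` rather than hard-coding them means the flattening keeps working if the struct gains fields later.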
At this point you'll probably want to filter out rows with a null webhook_updated_at, and run whatever aggregations you need.
Your final schema is now:
df_step2_flattened.printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhook_index: string (nullable = false)
|-- webhook_failed_at: string (nullable = true)
|-- webhook_status: string (nullable = true)
|-- webhook_updated_at: string (nullable = true)
This isn't the only way to achieve what you want, but the main advantage of the approach above is that it uses only built-in Spark expressions and functions, so it can take full advantage of all the Catalyst engine optimizations.
What have you tried so far? This is entirely possible. You can generate different dataframes using only the needed columns based on the index, finally union them all together, and then group. If the columns are static it's easy. For a dynamic number of columns…