Apache Spark: add static header and trailer information in Spark
I have a DataFrame that is written out to an output folder location with a pipe (`|`) as the field delimiter. Before writing, I need to append a header and a trailer to the existing DataFrame.

Actual payload:
+--------------------+---+---+---+---+----+----+----------+---+
| _1| _2| _3| _4| _5| _6| _7| _8| _9|
+--------------------+---+---+---+---+----+----+----------+---+
|chevrolet chevell...| 18| 8|307|130|3504|12.0|1970-01-01|USA|
| buick skylark 320| 15| 8|350|165|3693|11.5|1970-01-01|USA|
| plymouth satellite| 18| 8|318|150|3436|11.0|1970-01-01|USA|
| amc rebel sst| 16| 8|304|150|3433|12.0|1970-01-01|USA|
| ford torino| 17| 8|302|140|3449|10.5|1970-01-01|USA|
| ford galaxie 500| 15| 8|429|198|4341|10.0|1970-01-01|USA|
| chevrolet impala| 14| 8|454|220|4354| 9.0|1970-01-01|USA|
| plymouth fury iii| 14| 8|440|215|4312| 8.5|1970-01-01|USA|
| pontiac catalina| 14| 8|455|225|4425|10.0|1970-01-01|USA|
| amc ambassador dpl| 15| 8|390|190|3850| 8.5|1970-01-01|USA|
+--------------------+---+---+---+---+----+----+----------+---+
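The pipe-delimited output format mentioned above can be sketched by joining a row's values with `|`. This is plain Scala (no Spark needed), using values taken from the first payload row; the variable names are illustrative only:

```scala
// Join one payload row's values with the pipe delimiter.
// Seq[Any] stands in for a Spark Row here.
val row = Seq[Any]("chevrolet chevelle", 18, 8, 307, 130, 3504, 12.0, "1970-01-01", "USA")
val line = row.mkString("|")

println(line)
```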
Header:
+-------+----------+-------+---+
| _1| _2| _3| _4|
+-------+----------+-------+---+
|Samsung|Galaxy S10|Android| 12|
+-------+----------+-------+---+
Trailer:
+----+---+----------+---+
| _1| _2| _3| _4|
+----+---+----------+---+
|alex| 25|California| US|
+----+---+----------+---+
The payload's column count does not necessarily match that of the header or the trailer. I converted all of the DataFrames to RDDs as below:
val payloadRDD = payload.rdd
val headerRDD = header.rdd
val trailerRDD = trailer.rdd
Then I performed the union of all three RDDs as below:
val resultRDD = spark.sparkContext.union(headerRDD,payloadRDD,trailerRDD).collect()
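The RDD union succeeds even though the row widths differ, because an RDD of rows carries no schema; the column-count check only applies when building a DataFrame. A plain-Scala analogue of this (rows modeled as `Seq[Any]` rather than the actual Spark `Row` type) is:

```scala
// Plain-Scala analogue: an RDD-style union imposes no column-count constraint.
// Rows of width 4 and width 9 coexist without any error.
val headerRows  = Seq(Seq[Any]("Samsung", "Galaxy S10", "Android", 12))
val payloadRows = Seq(Seq[Any]("ford torino", 17, 8, 302, 140, 3449, 10.5, "1970-01-01", "USA"))
val unioned = headerRows ++ payloadRows

unioned.foreach(r => println(r.length)) // differing widths, no failure
```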
I am unable to convert it back to a DataFrame before writing it to disk.

Union can only be performed on tables with the same number of columns. You can append the missing columns, as NullType, before the union:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

def unionFrames(dfs: Seq[DataFrame]): DataFrame = {
  dfs match {
    case Nil => spark.emptyDataFrame // or throw an exception?
    case x :: Nil => x
    case _ =>
      // Preserve column order, left to right across the input DataFrames
      val allColumns = dfs.foldLeft(collection.mutable.ArrayBuffer.empty[String])((a, b) => a ++ b.columns).distinct
      val appendMissingColumns = (df: DataFrame) => {
        val columns = df.columns.toSet
        df.select(allColumns.map(c => if (columns.contains(c)) col(c) else lit(null).as(c)): _*)
      }
      dfs.tail.foldLeft(appendMissingColumns(dfs.head))((a, b) => a.union(appendMissingColumns(b)))
  }
}
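To see what the column-alignment step computes, here is the foldLeft/distinct logic from unionFrames run on plain column lists, with the `_1.._9` payload columns and `_1.._4` header columns taken from the tables above (no Spark required):

```scala
// Order-preserving, de-duplicated merge of column names, left to right,
// mirroring the foldLeft over df.columns in unionFrames.
val payloadCols = Seq("_1", "_2", "_3", "_4", "_5", "_6", "_7", "_8", "_9")
val headerCols  = Seq("_1", "_2", "_3", "_4")

val allColumns = Seq(payloadCols, headerCols)
  .foldLeft(collection.mutable.ArrayBuffer.empty[String])((a, b) => a ++ b)
  .distinct

// The header's columns are a prefix of the payload's, so nothing new is added.
println(allColumns.mkString(","))
```

Each frame is then selected against `allColumns`, with `lit(null)` filling any column it lacks, so every frame ends up nine columns wide.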
Note: you do not need to convert the DataFrames to RDDs; perform the union directly on the DataFrames.
If the row sizes differ, it cannot be converted to a DataFrame.