Scala: How do I move selected columns of a DataFrame to the end (rearrange column positions)?
I am trying to ingest an RDBMS (Greenplum) table into Hive. I read the table and obtained a DataFrame from it as follows:
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
                   .option("dbtable", "(select * from schema.table where source_system_name='DB2' and period_year='2017') as year2017")
                   .option("user", devUserName)
                   .option("password", devPassword)
                   .option("numPartitions", 15)
                   .load()
The schema of the above DataFrame is:
forecast_id:bigint
period_year:numeric(15,0)
period_num:numeric(15,0)
period_name:character varying(15)
source_system_name:character varying(30)
source_record_type:character varying(30)
ptd_balance:numeric
xx_data_hash_id:bigint
xx_pk_id:bigint
To ingest the above DataFrame into Hive, I put its schema into a list and converted all the Greenplum data types into Hive-compatible data types. I have a map, dataMapper, that tells me which Hive data type each Greenplum data type should be converted to.
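The post does not show what dataMapper contains; here is a minimal sketch of what it might look like, assuming (as the dMap method below suggests) that the keys are regex patterns matched against the full Greenplum type string:
// Hypothetical contents of dataMapper, not from the original post: each key is
// a regex over the Greenplum type name, each value the target Hive type
val dataMapper: Map[String, String] = Map(
  "bigint"                      -> "bigint",
  "numeric\\(\\d+,\\s*\\d+\\)"  -> "bigint",
  "numeric"                     -> "double",
  "character varying\\(\\d+\\)" -> "String"
)
The class that performs the conversion: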
class ChangeDataTypes(val gpColumnDetails: List[String], val dataMapper: Map[String, String]) {
  val dataMap: Map[String, String] = dataMapper

  // Turn each "colName:gpType" entry into "colName hiveType" and join them
  // into a comma-separated column list
  def gpDetails(): String = {
    val hiveDataTypes = gpColumnDetails.map(_.split(":\\s*")).map(s => s(0) + " " + dMap(s(1))).mkString(",")
    hiveDataTypes
  }

  // Find the Hive type for a Greenplum type: treat every key of dataMap as a
  // regex and pick the first pattern whose match covers the whole type string
  def dMap(gpColType: String): String = {
    val patterns = dataMap.keySet
    val mkey = patterns.dropWhile {
      p => gpColType != p.r.findFirstIn(gpColType).getOrElse("")
    }.headOption match {
      case Some(p) => p
      case None => ""
    }
    dataMap.getOrElse(mkey, "n/a")
  }
}
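For context, a hypothetical invocation, assuming gpColumnDetails holds the "name:type" strings of the schema shown earlier and dataMapper is shaped like the sketch above:
// Hypothetical usage; neither value is shown in the original post
val gpColumnDetails = List("forecast_id:bigint", "period_year:numeric(15,0)", "ptd_balance:numeric")
val changer = new ChangeDataTypes(gpColumnDetails, dataMapper)
val hiveColumns = changer.gpDetails()
// e.g. "forecast_id bigint,period_year bigint,ptd_balance double"
One caveat worth noting: dataMap.keySet has no guaranteed iteration order, so if two patterns can match the same type string, which one wins is not deterministic.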
These are the data types after running the above code:
forecast_id:bigint
period_year:bigint
period_num:bigint
period_name:String
source_system_name:String
source_record_type:String
ptd_balance:double
xx_data_hash_id:bigint
xx_pk_id:bigint
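Presumably this column list feeds a Hive CREATE TABLE statement. A minimal sketch, with a hypothetical database/table name, assuming the two partition columns have been excluded from hiveColumns first (Hive declares partition columns only in the PARTITIONED BY clause):
// Hypothetical DDL; mydb.forecast and STORED AS PARQUET are assumptions.
// hiveColumns must not repeat the partition columns declared below.
spark.sql(
  s"""CREATE TABLE IF NOT EXISTS mydb.forecast ($hiveColumns)
     |PARTITIONED BY (source_system_name String, period_year bigint)
     |STORED AS PARQUET""".stripMargin)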
Since my Hive table is dynamically partitioned on source_system_name and period_year, I need to rearrange the DataFrame's columns by moving source_system_name and period_year to the end, because when inserting data into a Hive table the partition columns must be the last columns of the DataFrame.
Could anyone tell me how to move the columns source_system_name and period_year of the DataFrame yearDF from their current positions to the end (essentially rearranging the columns)?

Extract the columns from the main list, then append them at the end and perform a select on the DataFrame:
// The columns that must come last
val lastCols = Seq("col1", "col2")
// All other columns first, followed by the ones that must be last
val allColOrdered = df.columns.diff(lastCols) ++ lastCols
val allCols = allColOrdered.map(cn => org.apache.spark.sql.functions.col(cn))
val result = df.select(allCols: _*)
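Applied to the question's concrete case, the same pattern would look like this (a sketch; the table name and the dynamic-partition setting are assumptions carried over from above):
// Move the two partition columns of yearDF to the end
val lastCols = Seq("source_system_name", "period_year")
val ordered = yearDF.columns.diff(lastCols) ++ lastCols
val reorderedDF = yearDF.select(ordered.map(org.apache.spark.sql.functions.col): _*)

// Hypothetical follow-up: with dynamic partitioning enabled, the partition
// columns now sit last, as Hive expects on insert
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
reorderedDF.write.mode("append").insertInto("mydb.forecast")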
It worked... Thank you very much!