Scala: How to parameterize writing a DataFrame to a Hive table
Tags: scala, apache-spark, apache-spark-sql, spark-jdbc
I have a list of tables in an RDBMS (across different categories) that I want to extract and save in Hive, and I want to parameterize the process so that the category name is included in the output location in Hive. For example, for the category "employee", I want to save the tables extracted from the RDBMS as "hive_db.employee_some_other_random_name".
I have the following code:
val category = "employee"
val tableList = List("schema.table_1", "schema.table_2", "schema.table_3")
val tableMap = Map("schema.table_1" -> "table_1",
"schema.table_2" -> "table_2",
"schema.table_3" -> "table_3")
val queryMap = Map("table_1" -> "(select * from table_1) tble",
"table_2" -> "(select * from table_2) tble",
"table_3" -> "(select * from table_3) tble")
val tableBucketMap = Map("table_1" -> "bucketBy(80,\"EMPLOY_ID\",\"EMPLOYE_ST\").sortBy(\"EMPLOY_ST\").format(\"parquet\")",
"table_2" -> "bucketBy(80, \"EMPLOY_ID\").sortBy(\"EMPLOY_ID\").format(\"parquet\")",
"table_3" -> "bucketBy(80, \"EMPLOY_ID\", \"SAL_ID\", \"DEPTS_ID\").sortBy(\"EMPLOY_ID\").format(\"parquet\")")
for (table <- tableList){
  val tableName = tableMap(table)
  val print_start = "STARTING THE EXTRACTION PROCESSING FOR TABLE: %s"
  val print_statement = print_start.format(tableName)
  println(print_statement)
  // queryMap is keyed by the short table name, not by "schema.table"
  val extract_query = queryMap(tableName)
  val query_statement_non = "Query to extract table %s is: "
  val query_statement = query_statement_non.format(tableName)
  println(query_statement + extract_query)
  val extracted_table = spark.read.format("jdbc")
    .option("url", jdbcURL)
    .option("driver", driver_type)
    .option("dbtable", extract_query)
    .option("user", username)
    .option("password", password)
    .option("fetchsize", "20000")
    .option("queryTimeout", "0")
    .load()
  extracted_table.show(5, false)
  //saving extracted table in hive
  val tableBucket = tableBucketMap(tableName)
  val output_loc = "hive_db.%s_table_extracted_for_%s"
  val hive_location = output_loc.format(category, tableName)
  println(hive_location)
  val saving_table = "%s.write.%s.saveAsTable(\"%s\")"
  saving_table.format(extracted_table, tableBucket, hive_location)
  println(saving_table.format(extracted_table, tableBucket, hive_location))
  val print_end = "COMPLETED EXTRACTION PROCESS FOR TABLE: %s"
  val print_end_statement = print_end.format(tableName)
  println(print_end_statement)
}
It only prints the column names instead of writing the extracted DataFrame to Hive:
[EMPLOY_ID: int, EMPLOYE_NM: String].write............saveAsTable("hive_db.employee_table_extracted_for_table_1")
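How can I write the DataFrame to a Hive table?
Could you try this approach: change your bucket map like this (I have done it for table_1; please do the same for table_2 and table_3), and call df.bucketBy() with real arguments that match its signature (numBuckets: Int, colName: String, colNames: String*).
This approach will solve the problem above.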
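The loop in the question only prints text because String.format builds a string; Scala does not evaluate that string as code, so no write is ever triggered. One way to keep the per-table settings configurable is to store them as data and pass them to the real writer methods. The following is a minimal sketch, assuming Spark is on the classpath; the names BucketedWrite, Bucketing, bucketingByTable and bucketedWriter are hypothetical and not from the original post, while the column names are copied from the question's tableBucketMap.

```scala
import org.apache.spark.sql.{DataFrame, DataFrameWriter, Row}

object BucketedWrite {
  // Hypothetical config type: holds bucketing settings as data, not as a code string.
  final case class Bucketing(numBuckets: Int, bucketCols: Seq[String], sortCols: Seq[String])

  // Column names copied from the question's tableBucketMap.
  val bucketingByTable: Map[String, Bucketing] = Map(
    "table_1" -> Bucketing(80, Seq("EMPLOY_ID", "EMPLOYE_ST"), Seq("EMPLOY_ST")),
    "table_2" -> Bucketing(80, Seq("EMPLOY_ID"), Seq("EMPLOY_ID")),
    "table_3" -> Bucketing(80, Seq("EMPLOY_ID", "SAL_ID", "DEPTS_ID"), Seq("EMPLOY_ID"))
  )

  // Builds a real DataFrameWriter; bucketCols and sortCols must each be non-empty.
  def bucketedWriter(df: DataFrame, b: Bucketing): DataFrameWriter[Row] =
    df.write
      .bucketBy(b.numBuckets, b.bucketCols.head, b.bucketCols.tail: _*)
      .sortBy(b.sortCols.head, b.sortCols.tail: _*)
      .format("parquet")
}
```

Inside the loop, the write could then be BucketedWrite.bucketedWriter(extracted_table, BucketedWrite.bucketingByTable(tableName)).saveAsTable(hive_location), which replaces the formatted saving_table string with an actual call.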
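bucketBy(80, "EMPLOY_ID", "ACP_ID", "DEPTS_ID") satisfies the bucketBy(numBuckets: Int, colName: String, colNames: String*) method.
Thanks, Smart_coder. I will try the approach you suggested and report back.
Smart_coder, I tried the approach you suggested. I am still facing the same issue.
Smart_coder, thank you. I was finally able to solve this using the approach you suggested, with some modifications for my requirements. Thanks a lot.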
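// Store only the bucket count and the column name, not a code string
val tableBucketMap = Map("table_1" -> "80,employe_st")
val stringArr = tableBucket.split(",")
val numBuckets = stringArr(0).toInt
val colName = stringArr(1).trim
// Call the real DataFrameWriter methods instead of formatting a string
extracted_table.write.mode("append").bucketBy(numBuckets, colName).format("parquet").saveAsTable(hive_location)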
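The split-based snippet above handles a single bucket column. The question also buckets by several columns, so the same idea can be extended to a comma-separated list. This is only a sketch in plain Scala with no Spark dependency; BucketingParser and ParsedBucketing are hypothetical names, and the "numBuckets,col[,col...]" string layout is an assumption carried over from the answer.

```scala
// Hypothetical helper type; not part of the original answer.
final case class ParsedBucketing(numBuckets: Int, firstCol: String, otherCols: Seq[String])

object BucketingParser {
  // Parses "80,EMPLOY_ID,SAL_ID" into a bucket count plus one or more column names.
  // Surrounding whitespace and literal double quotes around each part are dropped.
  def parse(spec: String): ParsedBucketing = {
    val parts = spec
      .split(",")
      .map(_.trim.stripPrefix("\"").stripSuffix("\""))
      .filter(_.nonEmpty)
      .toList
    parts match {
      case count :: first :: rest => ParsedBucketing(count.toInt, first, rest)
      case _ =>
        throw new IllegalArgumentException(s"Expected 'numBuckets,col[,col...]' but got: $spec")
    }
  }
}
```

With a parsed value p, the write could be extracted_table.write.bucketBy(p.numBuckets, p.firstCol, p.otherCols: _*).format("parquet").saveAsTable(hive_location). Splitting the first column from the rest mirrors the bucketBy signature, which requires at least one column name.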