Scala Spark: loading DataFrame contents into a table in a loop

scala apache-spark hive

I am using Scala/Spark to insert data into a Hive Parquet table, as follows:

for(*lots of current_Period_Id*){//This loop is on a result of another query that returns multiple rows of current_Period_Id
  val myDf = hiveContext.sql(s"""SELECT columns FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""")
  val count: Int = myDf.count().toInt
  if(count>0){
    hiveContext.sql(s"""INSERT INTO destinationtable PARTITION(period_id=$current_Period_Id) SELECT columns FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""")
  }
}

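Because the SELECT statement is executed twice (once for the count and once for the insert), this approach takes a long time to complete.

I am trying to avoid selecting the data twice, and one approach I thought of is to write the DataFrame myDf directly to the table.

This is the gist of the code I am trying to use: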

val sparkConf = new SparkConf().setAppName("myApp")
                             .set("spark.yarn.executor.memoryOverhead","4096")
val sc = new SparkContext(sparkConf)
val hiveContext = new HiveContext(sc)

hiveContext.setConf("hive.exec.dynamic.partition","true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
for(*lots of current_Period_Id*){//This loop is on a result of another query
  val myDf = hiveContext.sql(s"""SELECT COLUMNS FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""") // s-interpolator so the id is substituted
  val count: Int = myDf.count().toInt
  if(count>0){
    myDf.write.mode("append").format("parquet").partitionBy("PERIOD_ID").saveAsTable("destinationtable")
  }
}

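But I get an error at the myDf.write part.

The destination table is partitioned by period_id.

Can someone help me with this?

The Spark version I am using is 1.5.0-cdh5.5.2.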

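The DataFrame schema and the table description differ from each other: PERIOD_ID != period_id. The column name is uppercase in your DataFrame but lowercase in the table. Try using lowercase period_id in the SQL.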
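To make that comparison concrete, here is a minimal sketch in plain Scala that lists names differing only by case, and required names with no exact match. The object and method names (ColumnNameCheck, caseOnlyMismatches, missingExact) are made up for illustration and are not part of Spark.

```scala
object ColumnNameCheck {
  // Pairs (dataFrameName, tableName) that are equal ignoring case
  // but not equal as written, e.g. ("PERIOD_ID", "period_id").
  def caseOnlyMismatches(dfCols: Seq[String], tableCols: Seq[String]): Seq[(String, String)] =
    for {
      d <- dfCols
      t <- tableCols
      if d.equalsIgnoreCase(t) && d != t
    } yield (d, t)

  // Required names (for example the partition columns) that have
  // no exact, case-sensitive match among the DataFrame columns.
  def missingExact(dfCols: Seq[String], required: Seq[String]): Seq[String] =
    required.filterNot(r => dfCols.contains(r))
}
```

With Spark, the inputs could come from myDf.columns and hiveContext.table("destinationtable").columns. If a case-only mismatch shows up, one option is to rename the columns before writing, for example with myDf.toDF(myDf.columns.map(_.toLowerCase): _*), and to pass the lowercase name to partitionBy. Whether that alone resolves the error on 1.5.0-cdh5.5.2 is not confirmed in this thread.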

What is the output of myDf.schema and, in Hive, of describe destinationtable?

@FaigB Sorry for the late reply. In the spark shell, myDf.schema gives me org.apache.spark.sql.types.StructType = StructType(StructField(ac_name,StringType,true), StructField(ac_time,StringType,true), StructField(ac_hhold,StringType,true), StructField(ac_barcode,StringType,true), StructField(nc_nan,DoubleType,true)... For describe destinationtable I get ac_name string, ac_time string, ac_hhold string, ac_shop string, ac_barcode string, nc_nan_key double

@FaigB contd.

# Partition Information
# col_name data_type comment
period_id bigint

As for myDF.schema, it shows that period_id does not exist in the DataFrame schema, which is why the exception is raised. The exact result is:

myDF.schema StructType(StructField(ac_name,StringType,true), StructField(ac_time,StringType,true), StructField(ac_hhold,StringType,true), StructField(ac_shop,StringType,true), StructField(barcode,StringType,true), StructField(man_key,DoubleType,true), StructField(trip_no,LongType,true), StructField(DT_PUR_DATE,StringType,true), StructField(_c141,DoubleType,false), StructField(period,LongType,true))
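If the partition value really is the last column of the SELECT, as this schema suggests, another way to avoid running the SELECT twice might be to cache the DataFrame, register it as a temporary table, and do a dynamic-partition INSERT from it. Hive's dynamic-partition insert takes the partition value from the trailing SELECT column by position, so the column name in the DataFrame matters less. The following is only a sketch under those assumptions: the object SinglePassLoad, the helper names, and the temp table name period_rows are hypothetical, and it has not been checked against 1.5.0-cdh5.5.2.

```scala
import org.apache.spark.sql.hive.HiveContext

object SinglePassLoad {
  // Builds a dynamic-partition INSERT. The partition value is read from the
  // last column of the SELECT, so that column must come last in the temp table.
  def insertSql(targetTable: String, partitionCol: String, tempTable: String): String =
    s"INSERT INTO TABLE $targetTable PARTITION ($partitionCol) SELECT * FROM $tempTable"

  // Hypothetical helper: runs selectSql, caches the result, and inserts it
  // only when at least one row came back.
  def loadOnce(hiveContext: HiveContext, selectSql: String): Unit = {
    val df = hiveContext.sql(selectSql).cache()
    try {
      if (df.count() > 0) {                  // materializes the cache
        df.registerTempTable("period_rows")  // temp table name is an assumption
        hiveContext.sql(insertSql("destinationtable", "period_id", "period_rows"))
      }
    } finally {
      df.unpersist()                         // release cached blocks before the next period
    }
  }
}
```

This relies on the hive.exec.dynamic.partition settings already shown in the question. If the cached rows do not fit in memory, Spark may recompute part of the SELECT, so the saving is not guaranteed.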

java.util.NoSuchElementException: key not found: period_id