
Scala Spark - loading the contents of a dataframe into a table in a loop

Tags: scala, apache-spark, hive

I am inserting data into a Hive parquet table using Scala/Spark, like this:

for(*lots of current_Period_Id*){//This loop is on a result of another query that returns multiple rows of current_Period_Id
  val myDf = hiveContext.sql(s"""SELECT columns FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""")
  val count: Int = myDf.count().toInt
  if(count>0){
    hiveContext.sql(s"""INSERT INTO destinationtable PARTITION(period_id=$current_Period_Id) SELECT columns FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""")
  }
}
Because the SELECT statement is effectively executed twice (once for the count and once again inside the INSERT), this approach takes a very long time to complete.
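One way to read the source tables only once per period, while keeping the INSERT-based write path that already works, would be to cache the dataframe and insert from a temporary table. A rough sketch only: my_df_tmp is just an illustrative name, and it assumes the column order of myDf matches the destination table.

for(*lots of current_Period_Id*){
  val myDf = hiveContext.sql(s"""SELECT columns FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""")
  myDf.persist()                          // keep the result around so the source tables are scanned only once
  val count = myDf.count()                // materialises the cache with that single scan
  if(count > 0){
    myDf.registerTempTable("my_df_tmp")   // expose the cached result to SQL
    hiveContext.sql(s"""INSERT INTO destinationtable PARTITION(period_id=$current_Period_Id) SELECT * FROM my_df_tmp""")
  }
  myDf.unpersist()
}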

To avoid selecting the data twice, the approach I came up with is to write the dataframe myDf directly into the table.

Here is the gist of the code I am trying to use:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sparkConf = new SparkConf().setAppName("myApp")
                               .set("spark.yarn.executor.memoryOverhead","4096")
val sc = new SparkContext(sparkConf)
val hiveContext = new HiveContext(sc)

hiveContext.setConf("hive.exec.dynamic.partition","true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
for(*lots of current_Period_Id*){ // This loop is on a result of another query
  val myDf = hiveContext.sql(s"SELECT COLUMNS FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id")
  val count: Int = myDf.count().toInt
  if(count>0){
    myDf.write.mode("append").format("parquet").partitionBy("PERIOD_ID").saveAsTable("destinationtable")
  }
}
But I get an error at the myDf.write part.

The destination table is partitioned by period_id.
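For context, the table's definition would look roughly like the following; this is only an illustrative reconstruction from the column names that come up later in the discussion, not the actual DDL:

// Hypothetical DDL, reconstructed for illustration only
hiveContext.sql("""CREATE TABLE IF NOT EXISTS destinationtable (
  ac_name    STRING,
  ac_time    STRING,
  ac_hhold   STRING,
  ac_shop    STRING,
  ac_barcode STRING,
  nc_nan_key DOUBLE
)
PARTITIONED BY (period_id BIGINT)
STORED AS PARQUET""")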

Can anyone help me out?


The Spark version I am using is 1.5.0-cdh5.5.2.

The dataframe schema and the table description do not match each other: PERIOD_ID != period_id. The column name is uppercase in the DF but lowercase in the table. Try using lowercase period_id in the SQL.
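A minimal sketch of that suggestion, assuming the source query currently returns the column as PERIOD_ID (columns is the question's own placeholder): alias it to the lowercase name the table uses, so partitionBy can resolve it.

val myDf = hiveContext.sql(s"""SELECT columns, PERIOD_ID AS period_id FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""")
// partitionBy now refers to a field that actually exists in myDf.schema
myDf.write.mode("append").format("parquet").partitionBy("period_id").saveAsTable("destinationtable")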

What do myDf.schema and, in Hive, describe destinationtable output?
@FaigB Sorry for the late reply. In spark-shell I get org.apache.spark.sql.types.StructType = StructType(StructField(ac_name,StringType,true), StructField(ac_time,StringType,true), StructField(ac_hhold,StringType,true), StructField(ac_barcode,StringType,true), StructField(nc_nan,DoubleType,true)… For describe destinationtable I get ac_name string, ac_time string, ac_hhold string, ac_shop string, ac_barcode string, nc_nan_key double @FaigB contd.
# Partition Information  # col_name  data_type  comment  period_id  bigint
As for myDF.schema, it shows that period_id is not present in the dataframe schema, which is what produces the exception. The exact output is:
myDF.schema StructType(StructField(ac_name,StringType,true), StructField(ac_time,StringType,true), StructField(ac_hhold,StringType,true), StructField(ac_shop,StringType,true), StructField(ac_barcode,StringType,true), StructField(man_key,DoubleType,true), StructField(trip_no,LongType,true), StructField(DT_PUR_DATE,StringType,true), StructField(_c141,DoubleType,false), StructField(PERIOD_ID,LongType,true))
contd..
java.util.NoSuchElementException: key not found: period_id
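Given that diagnosis, an alternative to changing the SQL would be to rename the column on the dataframe itself before writing; a sketch, assuming the existing column really is the uppercase PERIOD_ID:

val fixedDf = myDf.withColumnRenamed("PERIOD_ID", "period_id") // give partitionBy a name that exists in the schema
fixedDf.write.mode("append").format("parquet").partitionBy("period_id").saveAsTable("destinationtable")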