
Scala: adding unknown columns as null using a case class


I'm creating a new dataframe, shaped by a case class, from an input dataframe that may have fewer/different columns than the existing one. I'm trying to use the case class to set the values that don't exist to null.

I use this case class to drive the new dataframe that gets created.

The input dataframe incomingDf may not have all of the variable fields that are set to null above.

import java.sql.Date

case class existingSchema(source_key: Int
                        , sequence_number: Int
                        , subscriber_id: String
                        , subscriber_ssn: String
                        , last_name: String
                        , first_name: String
                        , variable1: String = null
                        , variable2: String = null
                        , variable3: String = null
                        , variable4: String = null
                        , variable5: String = null
                        , source_date: Date
                        , load_date: Date
                        , file_name_String: String)

val incomingDf = spark.table("raw.incoming")

val formattedDf = incomingDf.as[existingSchema].toDF()
This throws an error at compile time.

The new schema of formattedDf should be the same as the case class existingSchema.

incomingDf.printSchema
Compile error:

Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.
    val formattedDf = incomingDf.as[existingSchema].toDF()
                                                     ^
one error found
FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':compileScala'.
> Compilation failed
Update: I added the line:

import incomingDf.sparkSession.implicits._
and it now compiles fine.

But I now get the following error at runtime:

19/04/17 14:37:56 ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve '`variable2`' given input columns: [variable1, variable3, sequence_number, last_name, first_name, file_name_string, subscriber_id, load_date, source_key];
org.apache.spark.sql.AnalysisException: cannot resolve '`variable2`' given input columns: [variable1, variable3, sequence_number, last_name, first_name, file_name_string, subscriber_id, load_date, source_key];
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)

You probably need to define the DF schema explicitly. For example:

import org.apache.spark.sql.types._

val newSchema: StructType = StructType(Array(
  StructField("nested_array", ArrayType(ArrayType(StringType)), true),
  StructField("numbers", IntegerType, true),
  StructField("text", StringType, true)
))

// Given a DataFrame df...
val combinedSchema = StructType(df.schema ++ newSchema)
val resultRDD = ... // here, process df to add rows or whatever and get the result as an RDD
                    // you can get an RDD as simply as df.rdd
val outDf = sparkSession.createDataFrame(resultRDD, combinedSchema)

The third member of the StructField arguments ensures that the newly created field is nullable. It defaults to true, so you don't strictly have to add it, but I've included it for clarity, since the whole point of using this method is to create a specifically null field.
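As an illustration, here is a minimal sketch (the field names and sample value are hypothetical; a SparkSession named spark is assumed, as elsewhere in this post) showing the nullable flag carrying through to the resulting schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Hypothetical two-field schema: only the second field is nullable.
val demoSchema = StructType(Array(
  StructField("subscriber_id", StringType, false),
  StructField("variable2", StringType, true)
))

// A row carrying a null for the nullable field is accepted.
val demoDf = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("abc123", null))),
  demoSchema
)

demoDf.printSchema()
// root
//  |-- subscriber_id: string (nullable = false)
//  |-- variable2: string (nullable = true)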

The existing schema is missing some of the case class's String fields. You can just add them explicitly:

import org.apache.spark.sql.functions.lit

val formattedDf = Seq("variable2", "variable4", "variable5")
  .foldLeft(incomingDf)((df, col) => {
    // Add each missing column as a null literal before applying the encoder.
    df.withColumn(col, lit(null.asInstanceOf[String]))
  }).as[existingSchema].toDF()
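One caveat: null.asInstanceOf[String] is erased to a plain null at runtime, so lit still produces an untyped null literal. Casting the literal pins each column to a string type explicitly, which is the change the asker ultimately made (see the comments below):

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Same fold, but each missing column is added as an explicitly typed null.
val formattedDf = Seq("variable2", "variable4", "variable5")
  .foldLeft(incomingDf)((df, col) => df.withColumn(col, lit(null).cast(StringType)))
  .as[existingSchema]
  .toDF()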

A more general solution is to infer the missing fields, as sketched below.
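Based on what the asker describes in the comments below, one way to infer them is to derive the target schema from the case class by reflection and diff it against the input's column names. This is a sketch, assuming (as there) that every missing field is a String:

import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

// Derive the target schema from the case class via reflection.
val targetSchema = ScalaReflection.schemaFor[existingSchema].dataType.asInstanceOf[StructType]

// Field names the case class expects but the input dataframe lacks.
val missingFields = targetSchema.fieldNames.diff(incomingDf.schema.fieldNames).toSeq

Folding missingFields over incomingDf exactly as above then adds every missing column without hard-coding the names.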

Thanks for the reply. I'd like to be able to keep the schema unchanged, since I'll be adding partitions to an existing Hive table. The combinedSchema you mention above would be a new schema every time I run it.
In many cases, pulling an RDD out of a DataFrame is not acceptable, mainly in infrastructure code, when filters get pushed down to the data source layer.
@shay In my experience, modifications to the schema break easily once you start adding new features or functionality. If you accept an indeterminate input, I find it's best to coerce it to a definite schema immediately and to explicitly repartition the input as you consume it.
We have no argument there, but that has nothing to do with going back to RDDs.
@shay Well, my view mostly comes from code I recently converted from 1.6, where you had to go back to RDDs to do this sort of thing; so it's just one approach that works. I don't have another one.
Please run incomingDf.printSchema and post it here, and also add the exception, including the stack trace, to the OP.
Added it.
Did you add import spark.implicits._ as the error suggests?
Thanks for the reply. I added the implicits import; I now get a runtime error, updated in the OP.
You can see from the printed schema that the raw.incoming table does not have your case class's schema.
Thanks. I'll give it a try.
I inferred the missing field names, and luckily all of the missing fields were Strings. I added val targetSchema = ScalaReflection.schemaFor[existingSchema].dataType.asInstanceOf[StructType] and val missingFields = targetSchema.fieldNames.diff(incomingDf.schema.fieldNames).toSeq to add the missing fields. Also changed to lit(null).cast(StringType) to set the null values to string type.