
Scala: Unable to convert an RDD[Row] to a DataFrame


For the following code, where a DataFrame is converted to an RDD[Row] and the data for a new column is appended via mapPartitions:

 // df is a DataFrame
val dfRdd = df.rdd.mapPartitions {
  val bfMap = df.rdd.sparkContext.broadcast(factorsMap)
  iter =>
    val locMap = bfMap.value
    iter.map { r =>
      val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
      Row(newseq)
    }
}
For the resulting RDD[Row] with the extra column, the output is correct:

println("**dfrdd\n" + dfRdd.take(5).mkString("\n"))

**dfrdd
[ArrayBuffer(0021BEC286CC, 4, Series, series, bc514da3e0d534da8207e3aab231d1cb, livetv, 148818)]
[ArrayBuffer(0021BEE7C556, 4, Series, series, bc514da3e0d534da8207e3aab231d1cb, livetv, 26908)]
[ArrayBuffer(8C7F3BFD4B82, 4, Series, series, bc514da3e0d534da8207e3aab231d1cb, livetv, 99942)]
[ArrayBuffer(0021BEC8F8B8, 1, Series, series, 0d2debc63efa3790a444c7959249712b, livetv, 53994)]
[ArrayBuffer(10EA59F10C8B, 1, Series, series, 0d2debc63efa3790a444c7959249712b, livetv, 1427)]
Let us try converting the RDD[Row] back to a DataFrame. First, extend the schema with the new column:

val newSchema = df.schema.add(StructField("userf",IntegerType))

Now, let us create the updated DataFrame:

val df2 = df.sqlContext.createDataFrame(dfRdd,newSchema)
Does the new schema look correct?

newSchema.printTreeString()

root
 |-- user: string (nullable = true)
 |-- score: long (nullable = true)
 |-- programType: string (nullable = true)
 |-- source: string (nullable = true)
 |-- item: string (nullable = true)
 |-- playType: string (nullable = true)
 |-- userf: integer (nullable = true)
Note that we do see the new userf column.

However, it does not work:

println("df2: " + df2.take(1))

Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, 
most recent failure: Lost task 0.0 in stage 9.0 (TID 9, localhost, executor driver): java.lang.RuntimeException: Error while encoding: 

java.lang.RuntimeException: scala.collection.mutable.ArrayBuffer is not a  
 valid external type for schema of string
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, user), StringType), true) AS user#28
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, user), StringType), true)
   :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
   :  :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
   :  :  +- input[0, org.apache.spark.sql.Row, true]
   :  +- 0
   :- null
So: what detail is missing here?

Note: I am not interested in different approaches (e.g. withColumn or Datasets). Let us only consider this approach:

  • Convert to an RDD
  • Add the new data element to each Row
  • Update the schema for the new column
  • Convert the new RDD + schema back to a DataFrame

There appears to be a small error when invoking the Row constructor:

val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
Row(newseq)
      
The signature of this "constructor" (the apply method, in fact) is:

def apply(values: Any*): Row

When you pass a Seq[Any], it is treated as one single value of type Seq[Any]. You want to pass the elements of this sequence instead, so you should use:

val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
Row(newseq: _*)
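
As a quick illustration of the difference, here is a standalone sketch that just constructs Rows locally with made-up values (not the asker's data):

import org.apache.spark.sql.Row

val values: Seq[Any] = Seq("someUser", 4L, "Series")   // hypothetical row values
Row(values).length        // 1  -- a single field holding the whole Seq
Row(values: _*).length    // 3  -- one field per element, which is what the schema expects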
      

After that fix, the rows will match the schema you built and you will get the expected result.
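
For reference, a minimal self-contained sketch of the whole round trip with that fix applied (convert to RDD, append the looked-up value to each Row, extend the schema, convert back to a DataFrame). The SparkSession, the toy DataFrame, factorsMap and inColName below are assumptions for illustration only, not the asker's actual data:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField}

val spark = SparkSession.builder().master("local[*]").appName("row-roundtrip").getOrCreate()
import spark.implicits._

// hypothetical stand-ins for df, factorsMap and inColName
val df = Seq(("0021BEC286CC", 4L), ("0021BEE7C556", 4L)).toDF("user", "score")
val factorsMap = Map("0021BEC286CC" -> 148818, "0021BEE7C556" -> 26908)
val inColName = "user"

// broadcast the lookup map once from the driver
val bfMap = spark.sparkContext.broadcast(factorsMap)

// convert to an RDD and append the looked-up value to each Row (note the : _*)
val dfRdd = df.rdd.mapPartitions { iter =>
  val locMap = bfMap.value
  iter.map { r =>
    val newseq = r.toSeq :+ locMap(r.getAs[String](inColName))
    Row(newseq: _*)
  }
}

// extend the schema and convert back to a DataFrame
val newSchema = df.schema.add(StructField("userf", IntegerType))
val df2 = spark.createDataFrame(dfRdd, newSchema)
df2.show()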

You are right! Now you can pull ahead of me in the rep race! Thank you so much! I am quite new to Scala and Spark, and it took me a long time to track this down. Also, for some more info about the : _* notation: I just came across this Q&A, upvoted it .. and then realized .. I am the asker! Sorry, I can not upvote it again.