Scala: creating a DataFrame with null values in a few columns
I am trying to create a DataFrame from an RDD. First, I create an RDD with the code below:
val account = sc.parallelize(Seq(
(1, null, 2,"F"),
(2, 2, 4, "F"),
(3, 3, 6, "N"),
(4,null,8,"F")))
This works fine:
account: org.apache.spark.rdd.RDD[(Int, Any, Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:27
But when I try to create a DataFrame from the RDD with the code below:
account.toDF("ACCT_ID", "M_CD", "C_CD","IND")
I get this error:
java.lang.UnsupportedOperationException: Schema for type Any is not supported
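From my analysis, I get the error only when I put a null value in the Seq.
Is there any way to add null values?

The problem is that Any is too general a type, and Spark simply does not know how to serialize it. You should explicitly provide a specific type, in your case Integer. Because null cannot be assigned to primitive types in Scala, you can use java.lang.Integer instead. So try this: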
val account = sc.parallelize(Seq(
(1, null.asInstanceOf[Integer], 2,"F"),
(2, new Integer(2), 4, "F"),
(3, new Integer(3), 6, "N"),
(4, null.asInstanceOf[Integer],8,"F")))
Here is the output:
rdd: org.apache.spark.rdd.RDD[(Int, Integer, Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:24
You may also consider a cleaner way to declare a null Integer value, for example:
object Constants {
val NullInteger: java.lang.Integer = null
}
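The same idea, giving Spark a concrete type for every column, can also be expressed with an explicit schema. The following is a minimal sketch rather than part of the original answer: it assumes a local SparkSession (in spark-shell, spark already exists and getOrCreate simply returns it) and uses Row together with StructType, so that plain null values are accepted in the nullable M_CD column.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumption: a local session for illustration; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("null-columns").getOrCreate()

// Declare each column's type and nullability up front, so nothing is inferred as Any.
val schema = StructType(Seq(
  StructField("ACCT_ID", IntegerType, nullable = false),
  StructField("M_CD", IntegerType, nullable = true),
  StructField("C_CD", IntegerType, nullable = false),
  StructField("IND", StringType, nullable = true)
))

// Row takes untyped values, so a bare null is allowed here.
val rows = spark.sparkContext.parallelize(Seq(
  Row(1, null, 2, "F"),
  Row(2, 2, 4, "F"),
  Row(3, 3, 6, "N"),
  Row(4, null, 8, "F")
))

val df = spark.createDataFrame(rows, schema)
df.show()
```

Because the schema is supplied by hand, Spark does not need to derive it from the tuple's element types, so the Any problem does not arise.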
An alternative that does not use an RDD:
import spark.implicits._
val df = spark.createDataFrame(Seq(
(1, None, 2, "F"),
(2, Some(2), 4, "F"),
(3, Some(3), 6, "N"),
(4, None, 8, "F")
)).toDF("ACCT_ID", "M_CD", "C_CD","IND")
df.show
+-------+----+----+---+
|ACCT_ID|M_CD|C_CD|IND|
+-------+----+----+---+
|      1|null|   2|  F|
|      2|   2|   4|  F|
|      3|   3|   6|  N|
|      4|null|   8|  F|
+-------+----+----+---+
df.printSchema
root
|-- ACCT_ID: integer (nullable = false)
|-- M_CD: integer (nullable = true)
|-- C_CD: integer (nullable = false)
|-- IND: string (nullable = true)
Use (1, null: Integer, 2, "F") instead.
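What if I create the DataFrame from a case class, that is, with spark.sparkContext.parallelize(Seq(A(...))), where I have case class A(...)? I have tried the techniques above, but null.asInstanceOf[T] gives me a NullPointerException, and null: T (as suggested in a comment on the question) gives me "an expression of type Null is ineligible for implicit conversion".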
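For the case class variant, one option is to declare the nullable field as Option[Int], in the same way as the tuple-based alternative above. This is a sketch under assumptions: the case class Account and its field names are hypothetical (the comment does not show the real class), and a local SparkSession is created for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class; the field names mirror the columns used in the question.
// Option[Int] marks M_CD as nullable without putting null into a primitive Int.
case class Account(ACCT_ID: Int, M_CD: Option[Int], C_CD: Int, IND: String)

// Assumption: a local session for illustration; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("case-class-nulls").getOrCreate()
import spark.implicits._

val accountsDf = spark.sparkContext.parallelize(Seq(
  Account(1, None, 2, "F"),
  Account(2, Some(2), 4, "F"),
  Account(3, Some(3), 6, "N"),
  Account(4, None, 8, "F")
)).toDF()

accountsDf.show()
accountsDf.printSchema()
```

With this approach None is stored as null in the M_CD column, and the column names come from the case class fields, so no arguments to toDF are needed.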