Scala: How can I optimize a Spark function that replaces null values with zero?
scala, apache-spark, apache-spark-sql
Below is my Spark function, which handles null values in DataFrame columns regardless of their data type:
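import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

// Replaces nulls with 0 in every column named in nullsToZeroColsList.
def nullsToZero(df:DataFrame,nullsToZeroColsList:Array[String]): DataFrame ={
  var y:DataFrame = df
  for(colDF <- y.columns){
    if(nullsToZeroColsList.contains(colDF)){
      // each withColumn call returns a new DataFrame, hence the reassignment
      y = y.withColumn(colDF,expr("case when "+colDF+" IS NULL THEN 0 ELSE "+colDF+" end"))
    }
  }
  return y
}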
import spark.implicits._
val personDF = Seq(
  ("miguel", Some(12),100,110,120),
  (null, Some(22),200,210,220),
  ("blu", None,300,310,320)
).toDF("name", "age","number1","number2","number3")
println("Print Schema")
personDF.printSchema()
println("Show Original DF")
personDF.show(false)
val myColsList:Array[String] = Array("name","age","age")
println("NULLS TO ZERO")
println("Show NullsToZeroDF")
val fixedDF = nullsToZero(personDF,myColsList)
Is there a more optimized way to write this function, and what is the significance of calling .withColumn() and reassigning the DF over and over again?
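Thanks in advance.
I would suggest assembling a valueMap for na.fill that fills the null columns with specific values according to their data type, as follows: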
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
  (Some(1), Some("a"), Some("x"), None),
  (None, Some("b"), Some("y"), Some(20.0)),
  (Some(3), None, Some("z"), Some(30.0))
).toDF("c1", "c2", "c3", "c4")
val nullColList = List("c1", "c2", "c4")
// Pick a replacement value per data type, only for the columns in nullColList.
// A listed column of any other type makes the inner match throw a MatchError.
val valueMap = df.dtypes.filter(x => nullColList.contains(x._1)).
  collect{ case (c, t) => t match {
    case "StringType" => (c, "n/a")
    case "IntegerType" => (c, 0)
    case "DoubleType" => (c, Double.MinValue)
  } }.toMap
// valueMap: scala.collection.immutable.Map[String,Any] =
// Map(c1 -> 0, c2 -> n/a, c4 -> -1.7976931348623157E308)
df.na.fill(valueMap).show
// +---+---+---+--------------------+
// | c1| c2| c3|                  c4|
// +---+---+---+--------------------+
// |  1|  a|  x|-1.79769313486231...|
// |  0|  b|  y|                20.0|
// |  3|n/a|  z|                30.0|
// +---+---+---+--------------------+
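As a rough sketch of the same idea, the value map can be built by a small helper that only looks at the (name, type) pairs returned by df.dtypes, so that a listed column whose type has no default is skipped instead of raising a MatchError. The names ZeroFill, zeroFor and zeroValueMap are hypothetical, and using "0" for string columns is an assumption that mirrors the question's goal of replacing nulls with zero.

```scala
// Hypothetical helper, not part of the original answer.
object ZeroFill {
  // Default "zero" for a type name as reported by df.dtypes; None means "leave the column alone".
  def zeroFor(typeName: String): Option[Any] = typeName match {
    case "IntegerType" => Some(0)
    case "LongType"    => Some(0L)
    case "FloatType"   => Some(0.0f)
    case "DoubleType"  => Some(0.0)
    case "StringType"  => Some("0") // assumption: "0" stands in for zero in string columns
    case _             => None
  }

  // Builds a map suitable for na.fill, restricted to the columns in targetCols.
  def zeroValueMap(dtypes: Seq[(String, String)], targetCols: Set[String]): Map[String, Any] =
    dtypes
      .filter { case (name, _) => targetCols.contains(name) }
      .flatMap { case (name, typeName) => zeroFor(typeName).map(zero => name -> zero) }
      .toMap
}
```

With a DataFrame in scope, this would be applied as df.na.fill(ZeroFill.zeroValueMap(df.dtypes.toSeq, Set("c1", "c2", "c4"))).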
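This is great, Leo. It is a perfect example for a DF in which every column needs handling, but in my case I don't need to process all the columns of a given DF, only a list of columns I care about. I could still use the approach above: create val x = DF diff listOfColumns, apply the above to listOfColumns, and add the result back to the original DF. Does the approach in my question hurt performance?
@Pavan_Obj, please see the revised solution, which handles selected columns. I would expect a single transformation via na.fill to be more efficient than multiple transformations via withColumn.
Beautifully written. I have been testing code that ran for 40-50 minutes; I will run it with this change --conf spark.
Thanks, Leo, this also helped me do something similar, as follows: val myNewMap: Map[String, Any] = Map("someStringTypeCol" -> null, "someIntTypeCol" -> null, "someStringTypeCol" -> 0) for na.fill, similar to the above.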
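On the performance question, the Spark API documentation for withColumn notes that each call introduces a projection internally and recommends a single select when many columns are changed in a loop. Below is a minimal sketch of that alternative, assuming the listed columns are numeric so that a literal 0 is a sensible replacement. The object and method names are hypothetical.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Hypothetical helper, not part of the original question or answer.
object NullsToZero {
  // Replaces nulls with 0 in the listed columns using one projection
  // instead of one withColumn call per column.
  def nullsToZeroSelect(df: DataFrame, targetCols: Set[String]): DataFrame = {
    val projected: Seq[Column] = df.columns.toSeq.map { name =>
      if (targetCols.contains(name)) coalesce(col(name), lit(0)).as(name) // null -> 0
      else col(name)                                                      // untouched
    }
    df.select(projected: _*)
  }
}
```

Column order and names are preserved, so NullsToZero.nullsToZeroSelect(personDF, Set("age")) could stand in for the loop in the question for that column.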