Scala: how do I replace null values in a vector column?
I have a column of type [vector] that contains null values which I cannot get rid of. Here is an example:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
val sv1: Vector = Vectors.sparse(58, Array(8, 45), Array(1.0, 1.0))
val df_1 = sc.parallelize(List(("id_1", sv1))).toDF("id", "feature_vector")
val df_2 = sc.parallelize(List(("id_1", 10.0), ("id_2", 10.0))).toDF("id", "numeric_feature")
val df_joined = df_1.join(df_2, Seq("id"), "right")
df_joined.show()
+----+--------------------+---------------+
| id| feature_vector|numeric_feature|
+----+--------------------+---------------+
|id_1|(58,[8,45],[1.0,1...| 10.0|
|id_2| null| 10.0|
+----+--------------------+---------------+
What I would like to do is:
val map = Map("feature_vector" -> sv1)
val result = df_joined.na.fill(map)
But this raises an error:
Message: Unsupported value type org.apache.spark.mllib.linalg.SparseVector ((58,[8,45],[1.0,1.0])).
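For context, DataFrameNaFunctions.fill only accepts primitive replacement values (numeric and string types in 1.6), which is why the same kind of call works for the numeric column but not for the vector one. A small illustration, with 0.0 as a made-up default:

// this works: numeric_feature is a plain Double column
df_joined.na.fill(Map("numeric_feature" -> 0.0)).show()

// this is the unsupported case from above
// df_joined.na.fill(Map("feature_vector" -> sv1))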
Other things I have tried:
df_joined.withColumn("feature_vector", when(col("feature_vector").isNull, sv1).otherwise(col("feature_vector"))).show
I am struggling to find a solution that works on Spark 1.6.

coalesce combined with a join should do the trick:
import org.apache.spark.sql.functions.{coalesce, broadcast}

// single-row DataFrame holding the default vector to fall back to
val fill = Seq(
  Tuple1(Vectors.sparse(58, Array(8, 45), Array(1.0, 1.0)))
).toDF("fill")

df_joined
  .join(broadcast(fill))                                              // broadcast join attaches the fill column to every row
  .withColumn("feature_vector", coalesce($"feature_vector", $"fill")) // keep the original vector, fall back to the default
  .drop("fill")
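If you need this for more than one DataFrame or column, the same broadcast-plus-coalesce trick can be wrapped in a small helper. This is only a sketch under the same Spark 1.6 / spark-shell assumptions; the name fillVectorNulls and the temporary __fill column are made up for illustration:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{broadcast, coalesce, col}

// Hypothetical helper (not part of the original answer): fill nulls in a
// vector column with a given default, using the same broadcast + coalesce trick.
// Assumes the spark-shell sqlContext is in scope.
def fillVectorNulls(df: DataFrame, colName: String, default: Vector): DataFrame = {
  // one-row DataFrame carrying the default vector under a temporary column name
  val fillDF = sqlContext.createDataFrame(Seq(Tuple1(default))).toDF("__fill")
  df.join(broadcast(fillDF))
    .withColumn(colName, coalesce(col(colName), col("__fill")))
    .drop("__fill")
}

// e.g. fillVectorNulls(df_joined, "feature_vector", sv1)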
If you prefer, you can do the fill with an RDD instead:
import org.apache.spark.sql.Row

// replace null vectors with sv1, row by row
val naFillRDD = df_joined.map { r => r match {
  case Row(id, feature_vector: Vector, numeric_feature) => Row(id, feature_vector, numeric_feature)
  case Row(id, _, numeric_feature)                      => Row(id, sv1, numeric_feature)
}}
Then switch back to a DataFrame:
val naFillDF = sqlContext.createDataFrame(naFillRDD, df_joined.schema)
naFillDF.show(false)
//+----+---------------------+---------------+
//|id |feature_vector |numeric_feature|
//+----+---------------------+---------------+
//|id_1|(58,[8,45],[1.0,1.0])|10.0 |
//|id_2|(58,[8,45],[1.0,1.0])|10.0 |
//+----+---------------------+---------------+
To add insult to injury, I don't think you can return a vector from a UDF in 1.6.
@philantrovert I think I ran into that wall in one of my attempts as well. Luckily, user8371915's suggestion worked!
@user8371915's answer is definitely better, there is no need to switch between RDD and DataFrame. Please accept that one instead.
@philantrovert My bad, for some reason I thought you could accept multiple solutions. Thanks a lot.
In Spark 2.x and later you need to use crossJoin instead of join.
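For reference, a minimal sketch of the same fix on Spark 2.x. This is an illustration rather than part of the original thread: it reuses df_joined and sv1 from above and assumes spark.implicits._ is in scope:

import org.apache.spark.sql.functions.{broadcast, coalesce}

// reuse sv1 from the question as the default value
val fill = Seq(Tuple1(sv1)).toDF("fill")

df_joined
  .crossJoin(broadcast(fill))   // Spark 2.x refuses an implicit cartesian product from a plain join
  .withColumn("feature_vector", coalesce($"feature_vector", $"fill"))
  .drop("fill")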