Scala Spark: converting Spark SQL to the RDD API
Spark SQL is pretty clear to me. However, I am just getting started with Spark's RDD API. As has been pointed out, this should allow me to get rid of the slow shuffles.
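import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, count, lit, mean, min}

// `this.target` is a field of the enclosing class (not shown).
// The column `cnt_foo_eq_1` is read from `df`; it is not created in this method.
def handleBias(df: DataFrame, colName: String, target: String = this.target) = {
  val w1 = Window.partitionBy(colName)
  val w2 = Window.partitionBy(colName, target)
  df.withColumn("cnt_group", count("*").over(w2))
    .withColumn("pre2_" + colName, mean(target).over(w1))
    .withColumn("pre_" + colName, coalesce(min(col("cnt_group") / col("cnt_foo_eq_1")).over(w1), lit(0D)))
    .drop("cnt_group")
}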
In pseudocode: df foreach column (handleBias(column))
So I loaded a minimal DataFrame:
val input = Seq(
(0, "A", "B", "C", "D"),
(1, "A", "B", "C", "D"),
(0, "d", "a", "jkl", "d"),
(0, "d", "g", "C", "D"),
(1, "A", "d", "t", "k"),
(1, "d", "c", "C", "D"),
(1, "c", "B", "C", "D")
)
val inputDf = input.toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
but I failed to map it properly:
val rdd1_inputDf = inputDf.rdd.flatMap { x => {(0 until x.size).map(idx => (idx, x(idx)))}}
rdd1_inputDf.toDF.show
It fails with:
java.lang.ClassNotFoundException: scala.Any
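Examples of the problems outlined in this question are available separately.

When you call .rdd on a DataFrame, you get an RDD[Row], which is not strongly typed. If you want to be able to map over the elements, you need to pattern match on Row: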
scala> val input = Seq(
| (0, "A", "B", "C", "D"),
| (1, "A", "B", "C", "D"),
| (0, "d", "a", "jkl", "d"),
| (0, "d", "g", "C", "D"),
| (1, "A", "d", "t", "k"),
| (1, "d", "c", "C", "D"),
| (1, "c", "B", "C", "D")
| )
input: Seq[(Int, String, String, String, String)] = List((0,A,B,C,D), (1,A,B,C,D), (0,d,a,jkl,d), (0,d,g,C,D), (1,A,d,t,k), (1,d,c,C,D), (1,c,B,C,D))
scala> val inputDf = input.toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
inputDf: org.apache.spark.sql.DataFrame = [TARGET: int, col1: string ... 3 more fields]
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val rowRDD = inputDf.rdd
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at rdd at <console>:27
scala> val typedRDD = rowRDD.map{case Row(a: Int, b: String, c: String, d: String, e: String) => (a,b,c,d,e)}
typedRDD: org.apache.spark.rdd.RDD[(Int, String, String, String, String)] = MapPartitionsRDD[20] at map at <console>:29
scala> typedRDD.keyBy(_._1).groupByKey.foreach{println}
[Stage 7:> (0 + 0) / 4]
(0,CompactBuffer((A,B,C,D), (d,a,jkl,d), (d,g,C,D)))
(1,CompactBuffer((A,B,C,D), (A,d,t,k), (d,c,C,D), (c,B,C,D)))
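Applied to the flatMap from the question, the same idea might look like the sketch below. It assumes the ClassNotFoundException comes from toDF being asked to encode an RDD[(Int, Any)], since indexing into a Row yields Any; converting every cell to String is one way around that. The second half is an illustration only and not part of the original answer: it computes the mean of TARGET per (column index, value) with reduceByKey, roughly what mean(target).over(w1) does for a single column. It still shuffles, but reduceByKey combines values on the map side, so less data tends to move than with groupByKey. The names cells, targetIdx and means are made up for this example.

```scala
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("rdd-sketch").getOrCreate()
import spark.implicits._

val inputDf = Seq(
  (0, "A", "B", "C", "D"),
  (1, "A", "B", "C", "D"),
  (0, "d", "a", "jkl", "d"),
  (0, "d", "g", "C", "D"),
  (1, "A", "d", "t", "k"),
  (1, "d", "c", "C", "D"),
  (1, "c", "B", "C", "D")
).toDF("TARGET", "col1", "col2", "col3TooMany", "col4")

// row.get(idx) is typed as Any; turning each cell into a String gives
// an RDD[(Int, String)], for which an encoder exists.
val cells = inputDf.rdd.flatMap { row =>
  (0 until row.size).map(idx => (idx, String.valueOf(row.get(idx))))
}
cells.toDF("idx", "value").show()

// Illustration: mean of TARGET per (column index, value) for all
// non-target columns, computed in a single pass.
val targetIdx = inputDf.schema.fieldIndex("TARGET")
val means = inputDf.rdd.flatMap { row =>
  val t = row.getInt(targetIdx).toDouble
  (0 until row.size).filter(_ != targetIdx).map { idx =>
    ((idx, String.valueOf(row.get(idx))), (1L, t)) // (count, sum of target)
  }
}.reduceByKey { case ((c1, s1), (c2, s2)) => (c1 + c2, s1 + s2) }
  .mapValues { case (cnt, sum) => sum / cnt }

means.collect().foreach(println)
```

Keying by column index rather than column name keeps the records small; the name can be recovered afterwards from inputDf.columns(idx).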
Comments:

Since I want to use this in an ml.Pipeline and the output of each step is a DataFrame, the "schema is lost", i.e. I need to use pattern matching? Is that correct? But there are quite a lot of columns. Is there some way to "infer" them (a partial schema)?

Yes, unfortunately the DF => RDD conversion does not use the schema at all (and I don't think there is a good way to force it to). However, have a look at my new Dataset example: there is no need to go through an intermediate DataFrame, and it looks like a Dataset can infer the types just fine (in Spark 2.0, I think anything you can do with a DF you can also do with a DS).

@GeorgHeiler (not sure whether you were notified ^^)

Thanks. Indeed, you are right. However, a Spark transformer in an ml pipeline will only output DataFrames ;) even when given a Dataset as input, so I think the schema will be lost in the subsequent transformer steps. I posted a follow-up question here; maybe you have a suggestion for that as well.
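The Dataset example mentioned in the comments is not reproduced here. As a rough sketch of the idea, and not the answerer's original code, a typed Dataset can be built from a case class, and a DataFrame coming out of another step can be converted back with .as[...] as long as its column names and types match. The case class Record and the value names below are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class; field names mirror the DataFrame columns.
case class Record(TARGET: Int, col1: String, col2: String, col3TooMany: String, col4: String)

// Local session for the sketch; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("ds-sketch").getOrCreate()
import spark.implicits._

val ds = Seq(
  Record(0, "A", "B", "C", "D"),
  Record(1, "A", "B", "C", "D"),
  Record(0, "d", "a", "jkl", "d")
).toDS()

// The RDD behind a Dataset[Record] is an RDD[Record]: no Row matching needed.
val typedRDD = ds.rdd
typedRDD.keyBy(_.TARGET).groupByKey.collect().foreach(println)

// A DataFrame (for instance the output of an ml transformer) can be turned
// back into a typed Dataset, provided the columns still match the case class.
val df = ds.toDF()
val back = df.as[Record]
```

If a transformer adds columns, the case class would need matching fields, or the extra columns would have to be dropped or selected away before calling .as[Record].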