Scala: Spark SQL DataFrame transformation involving partitioning and lagging
I want to transform a Spark SQL DataFrame like this:
animal value
------------
cat    8
cat    5
cat    6
dog    2
dog    4
dog    3
rat    7
rat    4
rat    9
into a DataFrame like this:
animal value previous-value
-----------------------------
cat    8     0
cat    5     8
cat    6     5
dog    2     0
dog    4     2
dog    3     4
rat    7     0
rat    4     7
rat    9     4
I sort of want to partition by animal and then, for each animal, have previous-value lag one row behind value (with a default of 0), and then put the partitions back together.

This code will do it:
import org.apache.spark.sql.functions.collect_list
import spark.implicits._ // for toDF on the RDD (auto-imported in spark-shell)

val df = spark.read.format("CSV").option("header", "true").load("/home/shivansh/Desktop/foo.csv")

// Collect each animal's values into one list.
val df2 = df.groupBy("animal").agg(collect_list("value") as "listValue")

val desiredDF = df2.rdd.flatMap { row =>
  val animal = row.getAs[String]("animal")
  val valueList = row.getAs[Seq[String]]("listValue").toList
  // Prepend the default "0" and zip: each value pairs with its predecessor.
  val newlist = valueList zip ("0" :: valueList)
  newlist.map(a => (animal, a._1, a._2))
}.toDF("animal", "value", "previousValue")
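To see why the zip line yields (value, previous) pairs, here is the pairing step in isolation (plain Scala, no Spark required; the sample list is taken from the cat rows above):

val values = List("8", "5", "6")
// Prepending the default "0" shifts the right-hand list by one element,
// so each value is paired with its predecessor; zip drops the unmatched tail.
val pairs = values zip ("0" :: values)
// pairs: List(("8","0"), ("5","8"), ("6","5"))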
On the spark shell:
scala> val df=spark.read.format("CSV").option("header","true").load("/home/shivansh/Desktop/foo.csv")
df: org.apache.spark.sql.DataFrame = [animal: string, value: string]
scala> df.show()
+------+-----+
|animal|value|
+------+-----+
| cat| 8|
| cat| 5|
| cat| 6|
| dog| 2|
| dog| 4|
| dog| 3|
| rat| 7|
| rat| 4 |
| rat| 9|
+------+-----+
scala> val df2=df.groupBy("animal").agg(collect_list("value") as "listValue")
df2: org.apache.spark.sql.DataFrame = [animal: string, listValue: array<string>]
scala> df2.show()
+------+----------+
|animal| listValue|
+------+----------+
| rat|[7, 4 , 9]|
| dog| [2, 4, 3]|
| cat| [8, 5, 6]|
+------+----------+
scala> val desiredDF=df2.rdd.flatMap{row=>
| val animal=row.getAs[String]("animal")
| val valueList=row.getAs[Seq[String]]("listValue").toList
| val newlist=valueList zip "0"::valueList
| newlist.map(a=>(animal,a._1,a._2))
| }.toDF("animal","value","previousValue")
desiredDF: org.apache.spark.sql.DataFrame = [animal: string, value: string ... 1 more field]
scala> desiredDF.show()
+------+-----+-------------+
|animal|value|previousValue|
+------+-----+-------------+
| rat| 7| 0|
| rat| 4 | 7|
| rat| 9| 4 |
| dog| 2| 0|
| dog| 4| 2|
| dog| 3| 4|
| cat| 8| 0|
| cat| 5| 8|
| cat| 6| 5|
+------+-----+-------------+
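One caveat, raised in the comments at the end: groupBy shuffles the data, and collect_list gives no guarantee about the order of the collected elements, so the lists may not reflect the original file order. A minimal sketch of the fix the commenters suggest, recording an explicit index before grouping (the idx column and the variable names here are my own):

import org.apache.spark.sql.functions.{collect_list, monotonically_increasing_id, sort_array, struct}

// Record the read order in an explicit column before any shuffle happens.
val indexed = df.withColumn("idx", monotonically_increasing_id())

// Structs compare field by field, so sorting the collected (idx, value)
// structs restores the original order inside each list.
val grouped = indexed
  .groupBy("animal")
  .agg(sort_array(collect_list(struct($"idx", $"value"))) as "ordered")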
This can be done using a window function. I added a time field to illustrate the orderBy:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

// Previous row's value within each animal's partition, ordered by time.
val w1 = Window.partitionBy($"animal").orderBy($"time")
val previous_value = lag($"value", 1).over(w1)
val df1 = df.withColumn("previous", previous_value)
df1.show
+------+-----+-----+--------+
|animal|value| time|previous|
+------+-----+-----+--------+
| dog| 2|02:00| null|
| dog| 4|04:00| 2|
| dog| 3|06:00| 4|
| cat| 8|01:00| null|
| cat| 5|02:00| 8|
| cat| 6|03:00| 5|
| rat| 7|01:00| null|
| rat| 4|03:00| 7|
| rat| 9|05:00| 4|
+------+-----+-----+--------+
If you want to replace the nulls with 0:
val df2 = df1.na.fill(0)
df2.show
+------+-----+-----+--------+
|animal|value| time|previous|
+------+-----+-----+--------+
| dog| 2|02:00| 0|
| dog| 4|04:00| 2|
| dog| 3|06:00| 4|
| cat| 8|01:00| 0|
| cat| 5|02:00| 8|
| cat| 6|03:00| 5|
| rat| 7|01:00| 0|
| rat| 4|03:00| 7|
| rat| 9|05:00| 4|
+------+-----+-----+--------+
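As a side note, the two steps can be collapsed: lag also accepts a default value as a third argument, so the separate na.fill pass is not needed. A small variant sketch, assuming the same df with a time column:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val w1 = Window.partitionBy($"animal").orderBy($"time")
// The third argument is used when no previous row exists, so the first
// row of each partition gets 0 directly instead of null.
val df3 = df.withColumn("previous", lag($"value", 1, 0).over(w1))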
I don't have time to try it right now, but: 1) don't rely on the DataFrame's ordering, add an explicit index column; 2) try repartitioning by animal and then doing the row offset with mapPartitions. It may not be pretty.

I'd be careful ordering by strings: I'm fairly sure Spark orders them lexicographically, so "12" sorts before "2", which is not what you want.

scala> "12" < "2"
res0: Boolean = true

Good point, evan058. In real life I'd use timestamps.
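Following up on the lexicographic-ordering concern: if the ordering column arrives as a string (as it does when the CSV is read without schema inference), one option is to convert it inside the window specification. A hedged sketch; the right conversion depends on the actual format of the column:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.to_timestamp

// Numeric strings: cast so 12 > 2 numerically, not "12" < "2" lexicographically.
val wNum = Window.partitionBy($"animal").orderBy($"value".cast("int"))
// Time-of-day strings like "02:00": parse them into a timestamp first.
val wTime = Window.partitionBy($"animal").orderBy(to_timestamp($"time", "HH:mm"))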