Scala: compare values across different rows of a DataFrame and build a new DataFrame from the rows that satisfy a condition
I need to apply some logic across different rows of a DataFrame and create a new DataFrame containing only the rows that satisfy that logic. The input DataFrame looks like this:
+------------+-------------+-----+-----+-----+-----+
|      NUM_ID|            E|SG1_V|SG2_V|SG3_V|SG4_V|
+------------+-------------+-----+-----+-----+-----+
|     XXXXX01|1570167499000|     |     | 89.0|     |
|     XXXXX01|1570167502000|     | 88.0|     |     |
|     XXXXX01|1570167503000|     | 99.0|     |     |
|     XXXXX01|1570179810000| 81.0| 81.0| 81.0| 81.0|
|     XXXXX01|1570179811000| 92.0|     | 95.0|     |
|     XXXXX01|1570179833000|     |     | 88.0|     |
|     XXXXX02|1570179840000|     | 81.0|     | 81.0|
|     XXXXX02|1570179841000| 81.0|     | 81.0| 81.0|
|     XXXXX02|1570179841000|     |     |     |     |
|     XXXXX02|1570179842000| 81.0|     |     |     |
|     XXXXX02|1570179843000| 87.0| 98.0| 97.0| 88.0|
|     XXXXX02|1570179849000|     |     |     |     |
|     XXXXX03|1570179850000|     |     |     |     |
|     XXXXX03|1570179852000| 88.0|     |     |     |
|     XXXXX03|1570179857000|     |     |     | 88.0|
|     XXXXX03|1570179858000|     |     |     | 88.0|
+------------+-------------+-----+-----+-----+-----+
I have to check the values of each SG_V column: within a NUM_ID, a row qualifies when the difference between consecutive values of an SG_V column is greater than 10. Whether one or several SG_V columns in the same row exceed a difference of 10, it still counts as a single output row.
This will become clear once you see the expected output, which is as follows:
+------------+-------------+------------+-----+------------+-----+------------+-----+------------+-----+
|      NUM_ID|            E|PREVIOUS_SG1|SG1_V|PREVIOUS_SG2|SG2_V|PREVIOUS_SG3|SG3_V|PREVIOUS_SG4|SG4_V|
+------------+-------------+------------+-----+------------+-----+------------+-----+------------+-----+
|     XXXXX01|1570167503000|            |     |        88.0| 99.0|            |     |            |     |
|     XXXXX01|1570179811000|        81.0| 92.0|            |     |        81.0| 95.0|            |     |
|     XXXXX02|1570179843000|            |     |        81.0| 98.0|        81.0| 97.0|            |     |
+------------+-------------+------------+-----+------------+-----+------------+-----+------------+-----+
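The rule can be sketched outside Spark first, on a plain sequence of one SG column's non-null values. `jumps` is a hypothetical helper, not from the question; it pairs each reading with its predecessor and keeps the pairs whose difference exceeds 10:

```scala
// Sketch of the "difference greater than 10" rule on a single SG column,
// applied to the sequence of its non-null values (hypothetical helper).
def jumps(readings: Seq[Double]): Seq[(Double, Double)] =
  readings.sliding(2).collect {
    case Seq(prev, cur) if cur - prev > 10 => (prev, cur)
  }.toSeq

// SG2_V values for NUM_ID XXXXX01 above are 88.0, 99.0, 81.0;
// only the 88.0 -> 99.0 step exceeds 10.
jumps(Seq(88.0, 99.0, 81.0))  // => Seq((88.0, 99.0))
```

This matches the first expected output row, where PREVIOUS_SG2 is 88.0 and SG2_V is 99.0.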
Thanks in advance! Any leads are appreciated.

It could be something like this: I compute the difference between each pair of adjacent columns, check whether it is greater than 10, collect those checks into a Boolean array, and finally use array_contains to test whether the array holds a false value.
import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq(
  (10, 21, 32, 43),
  (10, 20, 30, 40),
  (1, 2, 3, 4),
  (1, 100, 200, 300)
).toDF().withColumn("id", monotonically_increasing_id())
df.show()

// Pair each value column with the next one: (_1,_2), (_2,_3), (_3,_4)
val cols = df.columns.dropRight(1)
val pairs: Array[(String, String)] = new Array[(String, String)](cols.length - 1)
for (i <- 0 to cols.length - 2) {
  pairs(i) = (cols(i), cols(i + 1))
}
println("pairs:")
pairs.foreach(print(_))

// For every pair, check whether the difference exceeds 10; array_contains
// tells us whether any of those checks came out false.
val calcDiff = array_contains(
  array(
    pairs.map(s => (df(s._2) - df(s._1)) > 10): _*
  ), false
)

// Keep only the rows where no check failed, i.e. every difference is > 10
df.filter(!calcDiff).show()
+---+---+---+---+---+
| _1| _2| _3| _4| id|
+---+---+---+---+---+
| 10| 21| 32| 43| 0|
| 10| 20| 30| 40| 1|
| 1| 2| 3| 4| 2|
| 1|100|200|300| 3|
+---+---+---+---+---+
pairs:
(_1,_2)(_2,_3)(_3,_4)
+---+---+---+---+---+
| _1| _2| _3| _4| id|
+---+---+---+---+---+
| 10| 21| 32| 43| 0|
| 1|100|200|300| 3|
+---+---+---+---+---+
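The toy example above compares adjacent columns within one row, while the question compares consecutive rows per NUM_ID; that is usually done with a window function. A minimal sketch against the question's schema, assuming the "previous" value means the previous non-null value within the NUM_ID (as the expected output suggests), with only two SG columns for brevity:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[1]").appName("sg-jumps").getOrCreate()
import spark.implicits._

// A tiny stand-in for the question's DataFrame (two SG columns for brevity).
val df = Seq(
  ("XXXXX01", 1570167499000L, None,       Some(88.0)),
  ("XXXXX01", 1570167502000L, None,       Some(99.0)),
  ("XXXXX01", 1570179810000L, Some(81.0), Some(81.0)),
  ("XXXXX01", 1570179811000L, Some(92.0), None)
).toDF("NUM_ID", "E", "SG1_V", "SG2_V")

val sgCols = Seq("SG1_V", "SG2_V")

// For each SG column, the last non-null value in earlier rows of the
// same NUM_ID, ordered by the timestamp column E.
val w = Window.partitionBy("NUM_ID").orderBy("E")
  .rowsBetween(Window.unboundedPreceding, -1)

val withPrev = sgCols.foldLeft(df) { (acc, c) =>
  acc.withColumn(s"PREVIOUS_${c.stripSuffix("_V")}",
    last(col(c), ignoreNulls = true).over(w))
}

// Keep rows where at least one SG column jumped by more than 10 over its
// previous non-null value. Comparisons involving null evaluate to null,
// so rows without a usable previous value are dropped by the filter.
val jumped = sgCols
  .map(c => col(c) - col(s"PREVIOUS_${c.stripSuffix("_V")}") > 10)
  .reduce(_ || _)

withPrev.filter(jumped).show()
```

On this sample it keeps the E = 1570167502000 row (SG2 jumped 88.0 to 99.0) and the E = 1570179811000 row (SG1 jumped 81.0 to 92.0). Blanking out the PREVIOUS_* columns that did not jump, as in the expected output, would need an extra `when`/`otherwise` pass per column.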