Apache Spark: maintaining order after partitioning by key with groupByKey or aggregateByKey

I have data like this:

Machine, date, hours
123,2014-06-15,15.4 
123,2014-06-16,20.3
123,2014-06-18,11.4 
131,2014-06-15,12.2 
131,2014-06-16,11.5
131,2014-06-17,18.2 
131,2014-06-18,19.2
134,2014-06-15,11.1
134,2014-06-16,16.2
I want to partition by the key machine and find the lag of hours with an offset of 1 and a default value of 0:

Machine, date, hours, lag
123,2014-06-15,15.4,0
123,2014-06-16,20.3,15.4
123,2014-06-18,11.4,20.3
131,2014-06-15,12.2,0
131,2014-06-16,11.5,12.2
131,2014-06-17,18.2,11.5
131,2014-06-18,19.2,18.2
134,2014-06-15,11.1,0
134,2014-06-16,16.2,11.1

I am using the PairedRDD groupByKey method, but it does not produce the groups in the expected order.

That's because there really is no given order here. With a few exceptions, an RDD should be considered unordered after any transformation that requires a shuffle.
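
To illustrate (a minimal sketch, not from the original question; sc is an existing SparkContext):

val pairs = sc.parallelize(Seq((123, 15.4), (123, 20.3), (123, 11.4)))
// The Iterable produced per key carries no ordering guarantee; the shuffle
// may deliver values in any order, regardless of the input order.
pairs.groupByKey().mapValues(_.toList).collect()
// e.g. Array((123, List(20.3, 15.4, 11.4))) -- the order is arbitrary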

If you need a specific order, you have to sort the data manually:

case class Record(machine: Long, date: java.sql.Date, hours: Double)
case class RecordWithLag(
    machine: Long, date: java.sql.Date, hours: Double, lag: Double
)

def getLag(xs: Seq[Record]): Seq[RecordWithLag] = ???

val rdd = sc.parallelize(List(
    Record(123, java.sql.Date.valueOf("2014-06-15"), 15.4), 
    Record(123, java.sql.Date.valueOf("2014-06-16"), 20.3),
    Record(123, java.sql.Date.valueOf("2014-06-18"), 11.4), 
    Record(131, java.sql.Date.valueOf("2014-06-15"), 12.2), 
    Record(131, java.sql.Date.valueOf("2014-06-16"), 11.5),
    Record(131, java.sql.Date.valueOf("2014-06-17"), 18.2), 
    Record(131, java.sql.Date.valueOf("2014-06-18"), 19.2),
    Record(134, java.sql.Date.valueOf("2014-06-15"), 11.1),
    Record(134, java.sql.Date.valueOf("2014-06-16"), 16.2)
))

rdd
  .groupBy(_.machine)
  .mapValues(_.toSeq.sortWith((x, y) => x.date.compareTo(y.date) < 0))
  .mapValues(getLag)
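
The answer leaves getLag undefined (???). One possible implementation, as a sketch: zip the date-sorted records with their hours shifted down by one position, using 0.0 as the default for the first row.

def getLag(xs: Seq[Record]): Seq[RecordWithLag] =
  // Pair each record with the previous record's hours (0.0 for the first).
  xs.zip(0.0 +: xs.map(_.hours)).map {
    case (r, prev) => RecordWithLag(r.machine, r.date, r.hours, prev)
  }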

A much cleaner option, though, is a DataFrame with a window function:

val df = sqlContext.createDataFrame(rdd)
df.registerTempTable("df")
sqlContext.sql(
  """SELECT *, lag(hours, 1, 0) OVER (
        PARTITION BY machine ORDER BY date
      ) lag FROM df"""
)

+-------+----------+-----+----+
|machine|      date|hours| lag|
+-------+----------+-----+----+
|    123|2014-06-15| 15.4| 0.0|
|    123|2014-06-16| 20.3|15.4|
|    123|2014-06-18| 11.4|20.3|
|    131|2014-06-15| 12.2| 0.0|
|    131|2014-06-16| 11.5|12.2|
|    131|2014-06-17| 18.2|11.5|
|    131|2014-06-18| 19.2|18.2|
|    134|2014-06-15| 11.1| 0.0|
|    134|2014-06-16| 16.2|11.1|
+-------+----------+-----+----+
Or the same thing with the DataFrame DSL:

// Imports needed for Window, lag and the $ column syntax.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import sqlContext.implicits._

df.select(
  $"*",
  lag($"hours", 1, 0).over(
    Window.partitionBy($"machine").orderBy($"date")
  ).alias("lag")
)
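
Both versions produce the table shown above. The window-function approach also scales better than the groupBy/sortWith version, which materializes each group as a local Scala collection before sorting it.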