Scala program to search for the latest value


I want to create a DataFrame based on the Hive SQL below:

WITH FILTERED_table1 AS (select *
, row_number() over (partition by key_timestamp order by datime DESC) rn
FROM table1)

Scala function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val table1 = Window.partitionBy($"key_timestamp").orderBy($"datime".desc)
I looked at window functions and this is as far as I got; I don't know how to write the rest as a Scala function because I am very new to Scala. How can I return a DataFrame from this SQL using a Scala function?
Any suggestions would be appreciated.

Your window specification is correct. Using a dummy dataset, first load the original Hive table into a DataFrame:

val df = spark.sql("""select * from table1""")

df.show
// +-------------+-------------------+
// |key_timestamp|             datime|
// +-------------+-------------------+
// |            1|2018-06-01 00:00:00|
// |            1|2018-07-01 00:00:00|
// |            2|2018-05-01 00:00:00|
// |            2|2018-07-01 00:00:00|
// |            2|2018-06-01 00:00:00|
// +-------------+-------------------+
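(If you want to try this without a real Hive table, a minimal sketch could register an equivalent temporary view from in-memory rows; the sample values below are hypothetical and only mirror the dummy output above.)

import org.apache.spark.sql.functions.to_timestamp
import spark.implicits._

// hypothetical in-memory rows matching the dummy data shown above
Seq(
  (1, "2018-06-01 00:00:00"),
  (1, "2018-07-01 00:00:00"),
  (2, "2018-05-01 00:00:00"),
  (2, "2018-07-01 00:00:00"),
  (2, "2018-06-01 00:00:00")
).toDF("key_timestamp", "datime")
  .withColumn("datime", to_timestamp($"datime"))  // cast the string column to a timestamp
  .createOrReplaceTempView("table1")               // so spark.sql("select * from table1") works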
To apply the window function row_number over the window spec to the DataFrame, use withColumn to generate a new column that captures the function's result:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._  // for the $"colName" column syntax

val window = Window.partitionBy($"key_timestamp").orderBy($"datime".desc)

val resultDF = df.withColumn("rn", row_number().over(window))

resultDF.show
// +-------------+-------------------+---+
// |key_timestamp|             datime| rn|
// +-------------+-------------------+---+
// |            1|2018-07-01 00:00:00|  1|
// |            1|2018-06-01 00:00:00|  2|
// |            2|2018-07-01 00:00:00|  1|
// |            2|2018-06-01 00:00:00|  2|
// |            2|2018-05-01 00:00:00|  3|
// +-------------+-------------------+---+
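Since the original goal is to pick up the latest value per key, a natural next step (a sketch, not part of the original answer) is to keep only the rn = 1 rows and drop the helper column:

// keep only the most recent datime per key_timestamp, then drop the helper column
val latestDF = resultDF.filter($"rn" === 1).drop("rn")

latestDF.show
// with the dummy data above this leaves one row per key:
// +-------------+-------------------+
// |key_timestamp|             datime|
// +-------------+-------------------+
// |            1|2018-07-01 00:00:00|
// |            2|2018-07-01 00:00:00|
// +-------------+-------------------+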
To verify, run the equivalent SQL against table1; you should get the same result:

spark.sql("""
    select *, row_number() over
      (partition by key_timestamp order by datime desc) rn
    from table1
  """).show