
Spark Scala DataFrame: take the max (min) value per group

Tags: scala, apache-spark, apache-spark-sql

I have a DataFrame coming out of a processing step that looks like this:

+---------+------+-----------+
|Time     |group |value      |
+---------+------+-----------+
|    28371|    94|        906|
|    28372|    94|        864|
|    28373|    94|        682|
|    28374|    94|        574|
|    28383|    95|        630|
|    28384|    95|        716|
|    28385|    95|        913|
+---------+------+-----------+
For each group, I want to take the value at the maximum Time minus the value at the minimum Time, to get the following result:

+------+-----------+
|group |  value    |
+------+-----------+
|    94|       -332|
|    95|        283|
+------+-----------+
Thanks in advance for your help.

df.groupBy("groupCol").agg(max("value")-min("value"))
Based on the question as edited by the OP, here is one way to do it in PySpark. The idea is to compute row numbers within each group ordered by time ascending and descending, and use those rows to perform the subtraction.

from pyspark.sql import Window
from pyspark.sql import functions as func

# Row numbers within each group, ordered by time ascending and descending.
# Column names follow this answer's generic naming (groupCol, time, value).
w_asc = Window.partitionBy(df.groupCol).orderBy(df.time)
w_desc = Window.partitionBy(df.groupCol).orderBy(df.time.desc())
df = df.withColumn('rnum_asc', func.row_number().over(w_asc)) \
       .withColumn('rnum_desc', func.row_number().over(w_desc))

# Value on the last row (max time) minus value on the first row (min time) per group.
df.groupBy(df.groupCol) \
  .agg((func.max(func.when(df.rnum_desc == 1, df.value))
        - func.max(func.when(df.rnum_asc == 1, df.value))).alias('diff')) \
  .show()
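Since the question is tagged Scala, here is a minimal sketch of the same windowing idea in the Scala DataFrame API, using first/last over an explicit frame instead of row_number, and assuming the df with columns Time, group and value built above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{first, last}

// Window over each group ordered by Time; the explicit frame lets last() see the whole group.
val w = Window.partitionBy("group")
              .orderBy("Time")
              .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

// Value on the last row (max Time) minus value on the first row (min Time) per group.
df.withColumn("diff", last("value").over(w) - first("value").over(w))
  .select("group", "diff")
  .distinct()
  .show()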
If the window function first_value is available in Spark SQL, this becomes even easier. A general way to solve it in SQL is:

select distinct groupCol, diff
from (
  select t.*,
         first_value(val) over (partition by groupCol order by time desc) -
         first_value(val) over (partition by groupCol order by time) as diff
  from tbl t
) t
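A minimal Scala sketch of running this query through Spark SQL, assuming the df defined earlier has been registered as a temporary view named tbl; group is a reserved word, so it is backtick-quoted, and the question's column name value is used in place of the generic val:

// Register the DataFrame so the SQL query can run against it.
df.createOrReplaceTempView("tbl")

spark.sql("""
  SELECT DISTINCT `group`, diff
  FROM (
    SELECT t.*,
           first_value(value) OVER (PARTITION BY `group` ORDER BY Time DESC) -
           first_value(value) OVER (PARTITION BY `group` ORDER BY Time) AS diff
    FROM tbl t
  ) t
""").show()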

It isn't fair to change the question after it has already been answered. In any case, I've provided an answer.
You're right, and I'm sorry about that.
Thank you very much, this is very clear and I know how to proceed. Thank you.