Apache spark 带地板和天花板的运行和/累计和
我是spark的新手,我正在尝试计算一个窗口运行总和,它的下限是0,上限是8 下面给出了一个玩具示例(请注意,实际数据接近数百万行): 这将创建表Apache spark 带地板和天花板的运行和/累计和,apache-spark,pyspark,pyspark-sql,pyspark-dataframes,Apache Spark,Pyspark,Pyspark Sql,Pyspark Dataframes,我是spark的新手,我正在尝试计算一个窗口运行总和,它的下限是0,上限是8 下面给出了一个玩具示例(请注意,实际数据接近数百万行): 这将创建表 +----+---+-------+ |aIds|day|eCounts| +----+---+-------+ | 1| 1| -3| | 1| 2| 3| | 1| 3| -6| | 1| 4| 3| | 2| 1| 3| | 2| 2| 6| | 2
+----+---+-------+
|aIds|day|eCounts|
+----+---+-------+
| 1| 1| -3|
| 1| 2| 3|
| 1| 3| -6|
| 1| 4| 3|
| 2| 1| 3|
| 2| 2| 6|
| 2| 3| -3|
| 2| 4| -6|
| 3| 1| 3|
| 3| 2| 3|
| 3| 3| 3|
| 3| 4| -3|
+----+---+-------+
下面是执行运行求和的结果示例,以及预期输出runSumCap
+----+---+-------+------+---------+
|aIds|day|eCounts|runSum|runSumCap|
+----+---+-------+------+---------+
| 1| 1| -3| -3| 0| <-- reset to 0
| 1| 2| 3| 0| 3|
| 1| 3| -6| -6| 0| <-- reset to 0
| 1| 4| 3| -3| 3|
| 2| 1| 3| 3| 3|
| 2| 2| 6| 9| 8| <-- reset to 8
| 2| 3| -3| 6| 5|
| 2| 4| -6| 0| 0| <-- reset to 0
| 3| 1| 3| 3| 3|
| 3| 2| 3| 6| 6|
| 3| 3| 3| 9| 8| <-- reset to 8
| 3| 4| -3| 6| 5|
+----+---+-------+------+---------+
为了达到预期效果,我已尝试查看@pandas\u udf以修改总和:
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def runSumCap(counts):
#counts columns is passed as a pandas series
floor = 0
cap = 8
runSum = 0
runSumList = []
for count in counts.tolist():
runSum = runSum + count
if(runSum > cap):
runSum = 8
elif(runSum < floor ):
runSum = 0
runSumList += [runSum]
return pd.Series(runSumList)
partition = Window.partitionBy('aIds').orderBy('aIds','day').rowsBetween(Window.unboundedPreceding, Window.currentRow)
sdf1 = sdf.withColumn('runSum',runSumCap(sdf['counts']).over(partition))
不幸的是,类型为
GROUPED\u AGG
的pandas\u udf
窗口函数不适用于有界窗口函数(.rowsbeween(window.unboundedreceiding,window.currentRow)
)。它当前仅适用于无界窗口,即.rowsBetween(Window.unbounddpreceiding,Window.unboundedFollowing)
。此外,输入为熊猫系列,但输出应为所提供类型的常量。因此,您将无法使用它实现部分聚合
相反,您可以使用GROUPED\u MAP
pandas\u udf
,它与df.groupBy().apply()一起工作。
下面是一些代码:
@pandas_udf('ids integer, day integer, counts integer, runSum integer', PandasUDFType.GROUPED_MAP)
def runSumCap(pdf):
def _apply_on_series(counts):
floor = 0
cap = 8
runSum = 0
runSumList = []
for count in counts.tolist():
runSum = runSum + count
if(runSum > cap):
runSum = 8
elif(runSum < floor ):
runSum = 0
runSumList += [runSum]
return pd.Series(runSumList)
pdf.sort_values(by=['day'], inplace=True)
pdf['runSum'] = _apply_on_series(pdf['counts'])
return pdf
sdf1 = sdf.groupBy('ids').apply(runSumCap)
@pandas\u udf('ids integer,day integer,counts integer,runSum integer',PandasUDFType.GROUPED\u MAP)
def runSumCap(pdf):
def_应用于_系列(计数):
地板=0
上限=8
运行和=0
runSumList=[]
对于counts.tolist()中的count:
运行和=运行和+计数
如果(运行总和>上限):
runSum=8
elif(运行金额<楼层):
运行和=0
runSumList+=[runSum]
返回pd.系列(runSumList)
pdf.sort_值(by=['day'],inplace=True)
pdf['runSum']=\u应用于\u系列(pdf['counts'])
返回pdf
sdf1=sdf.groupBy('id').apply(runSumCap)
我找到了一种方法,首先在每一行中创建一个数组(使用collect\u list作为窗口函数),其中包含用于对该点进行汇总的值。
然后,我定义了一个udf(不能用pandas_udf实现这一点),这就成功了。
以下是完整的可复制示例:
import pyspark.sql.functions as F
from pyspark.sql import Window
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import numpy as np
def accumalate(iterable):
total = 0
ceil = 8
floor = 0
for element in iterable:
total = total + element
if (total > ceil):
total = ceil
elif (total < floor):
total = floor
return total
pdf = pd.DataFrame({'aIds': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
'day': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
'eCounts': [-3, 3, -6, 3, 3, 6, -3, -6, 3, 3, 3, -3]})
sdf = spark.createDataFrame(pdf)
sdf = sdf.orderBy(sdf.aIds,sdf.day)
runSumCap = F.udf(accumalate,LongType())
partition = Window.partitionBy('aIds').orderBy('aIds','day').rowsBetween(Window.unboundedPreceding, Window.currentRow)
sdf1 = sdf.withColumn('splitWindow',F.collect_list(sdf.eCounts).over(partition))
sdf2 = sdf1.withColumn('runSumCap',runSumCap(sdf1.splitWindow))
sdf2.orderBy('aIds','day').show()
您有允许您订购数据帧的列吗?默认情况下,它们是无序的(.Yes,我可以在aIds和day上排序。检查更新的问题!此解决方案返回正确的序列,但问题是它没有排序。即id 1 day将为0,它的实际值将随机映射到day 4。如果需要,您也可以按id
排序。您可以只执行sdf1。排序('id',day'))
最后。我已经尝试过了。它仍然没有将结果映射到行,即使您可以看到运行总和的顺序是正确的。我实际上做了一个小的编辑。我忘记了在熊猫df中执行就地
排序。这意味着总和是在天
和pdf.排序值的任意顺序上完成的(按=['day'])
只是什么都没做。我也这么做,但没用。无论如何,我会奖励你的赏金:)
+----+---+-------+------+
|aIds|day|eCounts|runSum|
+----+---+-------+------+
| 1| 1| -3| 0|
| 1| 2| 3| 0|
| 1| 3| -6| 3|
| 1| 4| 3| 3|
| 2| 1| 3| 3|
| 2| 2| 6| 8|
| 2| 3| -3| 0|
| 2| 4| -6| 5|
| 3| 1| 3| 6|
| 3| 2| 3| 3|
| 3| 3| 3| 8|
| 3| 4| -3| 5|
+----+---+-------+------+
@pandas_udf('ids integer, day integer, counts integer, runSum integer', PandasUDFType.GROUPED_MAP)
def runSumCap(pdf):
def _apply_on_series(counts):
floor = 0
cap = 8
runSum = 0
runSumList = []
for count in counts.tolist():
runSum = runSum + count
if(runSum > cap):
runSum = 8
elif(runSum < floor ):
runSum = 0
runSumList += [runSum]
return pd.Series(runSumList)
pdf.sort_values(by=['day'], inplace=True)
pdf['runSum'] = _apply_on_series(pdf['counts'])
return pdf
sdf1 = sdf.groupBy('ids').apply(runSumCap)
import pyspark.sql.functions as F
from pyspark.sql import Window
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import numpy as np
def accumalate(iterable):
total = 0
ceil = 8
floor = 0
for element in iterable:
total = total + element
if (total > ceil):
total = ceil
elif (total < floor):
total = floor
return total
pdf = pd.DataFrame({'aIds': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
'day': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
'eCounts': [-3, 3, -6, 3, 3, 6, -3, -6, 3, 3, 3, -3]})
sdf = spark.createDataFrame(pdf)
sdf = sdf.orderBy(sdf.aIds,sdf.day)
runSumCap = F.udf(accumalate,LongType())
partition = Window.partitionBy('aIds').orderBy('aIds','day').rowsBetween(Window.unboundedPreceding, Window.currentRow)
sdf1 = sdf.withColumn('splitWindow',F.collect_list(sdf.eCounts).over(partition))
sdf2 = sdf1.withColumn('runSumCap',runSumCap(sdf1.splitWindow))
sdf2.orderBy('aIds','day').show()
+----+---+-------+--------------+---------+
|aIds|day|eCounts| splitWindow|runSumCap|
+----+---+-------+--------------+---------+
| 1| 1| -3| [-3]| 0|
| 1| 2| 3| [-3, 3]| 3|
| 1| 3| -6| [-3, 3, -6]| 0|
| 1| 4| 3|[-3, 3, -6, 3]| 3|
| 2| 1| 3| [3]| 3|
| 2| 2| 6| [3, 6]| 8|
| 2| 3| -3| [3, 6, -3]| 5|
| 2| 4| -6|[3, 6, -3, -6]| 0|
| 3| 1| 3| [3]| 3|
| 3| 2| 3| [3, 3]| 6|
| 3| 3| 3| [3, 3, 3]| 8|
| 3| 4| -3| [3, 3, 3, -3]| 5|
+----+---+-------+--------------+---------+