Apache Spark running sum / cumulative sum with a floor and a ceiling

Tags: apache-spark, pyspark, pyspark-sql, pyspark-dataframes

I am new to Spark and I am trying to compute a windowed running sum that has a lower bound of 0 and an upper bound of 8.

A toy example is given below (note that the actual data is closer to millions of rows):
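A minimal sketch of the setup, reusing the values from the reproducible example further down and assuming an existing SparkSession bound to the name spark:

import pandas as pd

# toy data matching the tables below
pdf = pd.DataFrame({'aIds':    [1,  1,  1,  1, 2, 2,  2,  2, 3, 3, 3,  3],
                    'day':     [1,  2,  3,  4, 1, 2,  3,  4, 1, 2, 3,  4],
                    'eCounts': [-3, 3, -6,  3, 3, 6, -3, -6, 3, 3, 3, -3]})

# create the Spark DataFrame from the pandas one
sdf = spark.createDataFrame(pdf)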

This creates the following table:

+----+---+-------+
|aIds|day|eCounts|
+----+---+-------+
|   1|  1|     -3|
|   1|  2|      3|
|   1|  3|     -6|
|   1|  4|      3|
|   2|  1|      3|
|   2|  2|      6|
|   2|  3|     -3|
|   2|  4|     -6|
|   3|  1|      3|
|   3|  2|      3|
|   3|  3|      3|
|   3|  4|     -3|
+----+---+-------+
Below is an example of the result of doing the running sum, along with the expected output runSumCap:

+----+---+-------+------+---------+
|aIds|day|eCounts|runSum|runSumCap|
+----+---+-------+------+---------+
|   1|  1|     -3|    -3|        0| <-- reset to 0
|   1|  2|      3|     0|        3|
|   1|  3|     -6|    -6|        0| <-- reset to 0
|   1|  4|      3|    -3|        3|
|   2|  1|      3|     3|        3|
|   2|  2|      6|     9|        8| <-- reset to 8
|   2|  3|     -3|     6|        5| 
|   2|  4|     -6|     0|        0| <-- reset to 0
|   3|  1|      3|     3|        3|
|   3|  2|      3|     6|        6|
|   3|  3|      3|     9|        8| <-- reset to 8
|   3|  4|     -3|     6|        5|
+----+---+-------+------+---------+
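In other words, the running sum is clamped into [0, 8] at every step. A quick plain-Python sketch of that logic for aIds = 1 (clamp_step is a hypothetical helper; the initial argument of accumulate needs Python 3.8+):

from itertools import accumulate

def clamp_step(total, x, floor=0, cap=8):
    # add the next count, then clamp the running total into [floor, cap]
    return min(max(total + x, floor), cap)

eCounts_a1 = [-3, 3, -6, 3]  # the eCounts of aIds = 1, in day order
print(list(accumulate(eCounts_a1, clamp_step, initial=0))[1:])  # [0, 3, 0, 3]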
To achieve the expected result, I have tried looking at pandas_udf to modify the sum:

@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def runSumCap(counts):
    # the eCounts column is passed in as a pandas Series
    floor = 0
    cap = 8
    runSum = 0
    runSumList = []
    for count in counts.tolist():
      runSum = runSum + count
      if(runSum > cap):
        runSum = 8
      elif(runSum < floor ):
        runSum = 0
      runSumList += [runSum]
    return pd.Series(runSumList)


partition = Window.partitionBy('aIds').orderBy('aIds','day').rowsBetween(Window.unboundedPreceding, Window.currentRow)
sdf1 = sdf.withColumn('runSum', runSumCap(sdf['eCounts']).over(partition))

Unfortunately, a pandas_udf of type GROUPED_AGG does not work with bounded window functions (.rowsBetween(Window.unboundedPreceding, Window.currentRow)). It currently only works with unbounded windows, i.e. .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing). In addition, the input is a pandas Series, but the output has to be a single constant of the supplied type, so you will not be able to achieve a partial aggregation with it.
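A minimal illustration (not from the original answer) of what a GROUPED_AGG pandas_udf can do: it reduces the whole window to a single scalar, for example a plain sum over an unbounded window, rather than the per-row partial sums needed here:

from pyspark.sql import Window
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def group_total(counts):
    # receives the window's eCounts as a pandas Series, must return a single value
    return float(counts.sum())

w = Window.partitionBy('aIds').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
sdf.withColumn('groupTotal', group_total(sdf['eCounts']).over(w)).show()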

Instead, you can use a GROUPED_MAP pandas_udf, which works with df.groupBy().apply(). Here is some code:

import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('ids integer, day integer, counts integer, runSum integer', PandasUDFType.GROUPED_MAP)
def runSumCap(pdf):
    def _apply_on_series(counts):
        floor = 0
        cap = 8
        runSum = 0
        runSumList = []
        for count in counts.tolist():
            runSum = runSum + count
            if(runSum > cap):
                runSum = 8
            elif(runSum < floor ):
                runSum = 0
            runSumList += [runSum]
        return pd.Series(runSumList)
    pdf.sort_values(by=['day'], inplace=True)
    pdf['runSum'] = _apply_on_series(pdf['counts'])
    return pdf


sdf1 = sdf.groupBy('ids').apply(runSumCap)
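Note that the schema above uses ids/counts while the toy data uses aIds/eCounts; one way to bridge that (a sketch, not part of the original answer) is to rename the columns before applying the UDF:

# hypothetical glue code: rename the toy columns to match the UDF's schema
sdf_renamed = (sdf.withColumnRenamed('aIds', 'ids')
                  .withColumnRenamed('eCounts', 'counts'))

sdf1 = sdf_renamed.groupBy('ids').apply(runSumCap)
sdf1.orderBy('ids', 'day').show()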

I found a way to do it by first creating, in each row, an array (using collect_list as a window function) that contains the values to sum up to that point. I then defined a udf (I could not get this to work with a pandas_udf), and that did the trick. Below is a fully reproducible example:

import pyspark.sql.functions as F
from pyspark.sql import Window
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import numpy as np

# fold over the collected list, clamping the running total to [0, 8]; return the final value
def accumalate(iterable):
    total = 0
    ceil = 8
    floor = 0
    for element in iterable:
        total = total + element
        if (total > ceil):
          total = ceil
        elif (total < floor):
          total = floor
    return total

pdf = pd.DataFrame({'aIds':    [1,  1,  1,  1, 2, 2,  2,  2, 3, 3, 3,  3],
                    'day':    [1,  2,  3,  4, 1, 2,  3,  4, 1, 2, 3,  4],
                    'eCounts': [-3, 3, -6,  3, 3, 6, -3, -6, 3, 3, 3, -3]})

sdf = spark.createDataFrame(pdf)
sdf = sdf.orderBy(sdf.aIds,sdf.day)

runSumCap = F.udf(accumalate, LongType())  # a plain udf, not a pandas_udf
partition = Window.partitionBy('aIds').orderBy('aIds', 'day').rowsBetween(Window.unboundedPreceding, Window.currentRow)
# collect_list over the running window gives, for each row, the array of eCounts up to and including that row
sdf1 = sdf.withColumn('splitWindow', F.collect_list(sdf.eCounts).over(partition))
sdf2 = sdf1.withColumn('runSumCap',runSumCap(sdf1.splitWindow))
sdf2.orderBy('aIds','day').show()
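As a side note, on Spark 2.4+ the same clamped fold over splitWindow could likely be expressed with the aggregate higher-order function instead of a Python udf (a sketch, not part of the original solution):

# sketch: fold splitWindow in SQL, clamping the accumulator to [0, 8]
sdf2_hof = sdf1.withColumn(
    'runSumCap',
    F.expr("aggregate(splitWindow, 0L, (acc, x) -> least(greatest(acc + x, 0L), 8L))")
)
sdf2_hof.orderBy('aIds', 'day').show()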

Comments from the thread:

Do you have a column that lets you order the DataFrame? By default they are unordered.
Yes, I can order on aIds and day. Check the updated question!
This solution returns the correct sequence of values, but the problem is that they are not mapped to the right rows, i.e. aIds 1, day 1 should be 0 but its actual value gets randomly mapped to day 4.
You can also sort by the id column if you want; you can just do sdf1.sort('ids', 'day') at the end.
I already tried that. It still does not map the results to the rows, even though you can see that the order of the running sum is correct.
I actually made a small edit; I had forgotten to do the sort in the pandas df in place, which meant the sum was computed on an arbitrary order and pdf.sort_values(by=['day']) simply did nothing.
I did that too, but it did not help. I will award you the bounty anyway :)

The output I was getting looked like this (the running-sum values are correct, but they are not mapped to the right rows):
+----+---+-------+------+
|aIds|day|eCounts|runSum|
+----+---+-------+------+
|   1|  1|     -3|     0|
|   1|  2|      3|     0|
|   1|  3|     -6|     3|
|   1|  4|      3|     3|
|   2|  1|      3|     3|
|   2|  2|      6|     8|
|   2|  3|     -3|     0|
|   2|  4|     -6|     5|
|   3|  1|      3|     6|
|   3|  2|      3|     3|
|   3|  3|      3|     8|
|   3|  4|     -3|     5|
+----+---+-------+------+
Running the collect_list + udf example above gives the desired result:
+----+---+-------+--------------+---------+
|aIds|day|eCounts|   splitWindow|runSumCap|
+----+---+-------+--------------+---------+
|   1|  1|     -3|          [-3]|        0|
|   1|  2|      3|       [-3, 3]|        3|
|   1|  3|     -6|   [-3, 3, -6]|        0|
|   1|  4|      3|[-3, 3, -6, 3]|        3|
|   2|  1|      3|           [3]|        3|
|   2|  2|      6|        [3, 6]|        8|
|   2|  3|     -3|    [3, 6, -3]|        5|
|   2|  4|     -6|[3, 6, -3, -6]|        0|
|   3|  1|      3|           [3]|        3|
|   3|  2|      3|        [3, 3]|        6|
|   3|  3|      3|     [3, 3, 3]|        8|
|   3|  4|     -3| [3, 3, 3, -3]|        5|
+----+---+-------+--------------+---------+