Calculate a running total in a PySpark DataFrame and reset it when a condition occurs
I have a Spark DataFrame in which I need to compute a running total of the amount, summing the current row with the previous rows for each Col_x value. When a negative number appears in the column, I should break the running total of the earlier records and start the running total again from the current row.

Sample dataset:

Expected output should look like this:
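How can I achieve this with a DataFrame in PySpark?

In a real scenario I would expect you to have a timestamp column to order the data by. Here I order the data by a row index from zipWithIndex, as the basis for this explanation.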
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, StringType, FloatType

spark = SparkSession.builder.getOrCreate()

data = [
    ("ID1", -17.9),
    ("ID1", 21.9),
    ("ID1", 236.9),
    ("ID1", 4.99),
    ("ID1", 610.2),
    ("ID1", -35.8),
    ("ID1", 21.9),
    ("ID1", 17.9),
]
schema = StructType([
    StructField('Col_x', StringType(), True),
    StructField('Col_y', FloatType(), True),
])
df = spark.createDataFrame(data=data, schema=schema)

# zipWithIndex attaches a 0-based position that serves as the ordering column
df_1 = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
df_1.createOrReplaceTempView("valuewithorder")

w = Window.partitionBy('Col_x').orderBy('index')
w1 = Window.partitionBy('Col_x', 'group').orderBy('index')

df_final = spark.sql("select value.Col_x, round(value.Col_y, 1) as Col_y, index from valuewithorder")

# Split the rows into groups: every negative value starts a new group
df_final = df_final.withColumn("valueChange", (f.col('Col_y') < 0).cast("int")) \
    .fillna(0, subset=["valueChange"]) \
    .withColumn("indicator", (~(f.col("valueChange") == 0)).cast("int")) \
    .withColumn("group", f.sum(f.col("indicator")).over(w.rangeBetween(Window.unboundedPreceding, 0)))

# Cumulative sum within each (Col_x, group) partition
# (f.sum is used explicitly so the Python builtin sum is not picked up by mistake)
df_cum_sum = df_final.withColumn("Col_z", f.sum('Col_y').over(w1))
df_cum_sum.createOrReplaceTempView("FinalCumSum")
df_cum_sum = spark.sql("select Col_x, Col_y, round(Col_z, 1) as Col_z from FinalCumSum")
df_cum_sum.show()
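As a cross-check on what the window logic above is meant to produce, the reset rule can be sketched in plain Python, without Spark. The helper name `running_total_with_reset` is hypothetical, and the sketch assumes the rows are already in the intended order and all belong to a single Col_x value.

```python
def running_total_with_reset(values):
    """Running total that restarts at every negative value (hypothetical helper)."""
    totals = []
    current = 0.0
    for v in values:
        # A negative value discards the previous total and starts a new run.
        current = v if v < 0 else current + v
        # Round only the reported value, as the final select does with Col_z.
        totals.append(round(current, 1))
    return totals


col_y = [-17.9, 21.9, 236.9, 4.99, 610.2, -35.8, 21.9, 17.9]
print(running_total_with_reset(col_y))
```

Comparing this list against the Col_z column is a quick way to confirm that the grouping step split the data where intended.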
Another way

Create an index
df = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
Flatten the struct back into columns
df = df.select('index', 'value.*')
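Create groups bounded by the negative values
w = Window.partitionBy().orderBy('index').rowsBetween(Window.unboundedPreceding, 0)
# The running minimum changes only when a new lowest value appears, which is enough for this sample data
df = df.withColumn('cat', f.min('Col_y').over(w))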
Sum within each group
y=Window.partitionBy('cat').orderBy(f.asc('index')).rowsBetween(Window.unboundedPreceding,0)
df.withColumn('cumsum', f.round(f.sum('Col_y').over(y),2)).sort('index').drop('cat','index').show()
Result
+-----+-------------------+------+
|Col_x|              Col_y|cumsum|
+-----+-------------------+------+
|  ID1|-17.899999618530273| -17.9|
|  ID1| 21.899999618530273|   4.0|
|  ID1| 236.89999389648438| 240.9|
|  ID1|  4.989999771118164|245.89|
|  ID1|  610.2000122070312|856.09|
|  ID1| -35.79999923706055| -35.8|
|  ID1| 21.899999618530273| -13.9|
|  ID1| 17.899999618530273|   4.0|
+-----+-------------------+------+
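One caveat with using the running minimum as the group key: it only changes when a value lower than every earlier value appears. That holds for the sample data, because -35.8 is lower than -17.9, but a later, less negative value would not start a new group. The plain-Python sketch below compares the two group keys on a small made-up list. The data and function names are illustrative assumptions, not part of the original answer.

```python
def groups_by_running_min(values):
    """Group key = minimum of all values seen so far (mirrors f.min over the window)."""
    keys, lowest = [], float("inf")
    for v in values:
        lowest = min(lowest, v)
        keys.append(lowest)
    return keys


def groups_by_negative_count(values):
    """Group key = number of negative values seen so far, current row included."""
    keys, count = [], 0
    for v in values:
        count += v < 0  # True counts as 1, False as 0
        keys.append(count)
    return keys


# Made-up data: the second negative (-17.9) is larger than the first (-35.8).
values = [-35.8, 21.9, -17.9, 17.9]
print(groups_by_running_min(values))     # [-35.8, -35.8, -35.8, -35.8]
print(groups_by_negative_count(values))  # [1, 1, 2, 2]
```

The second key is the same idea as the `group` column in the first approach, which sums a 0/1 negative-value indicator over the ordered window.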