
Rolling average and sum by days over a timestamp in PySpark

Tags: python, pandas, pyspark, pyspark-sql, pyspark-dataframes

I have a PySpark dataframe whose timestamps are at day granularity. Below is a sample of the dataframe (let's call it df):

In this dataframe, I want to compute, for each distinct name, the rolling average and the rolling sum of the score over a three-day rolling time window. That means, for any given date in the dataframe, find the sum of the scores of name1 for that day, the day before, and two days before. Do this for every day on which name1 appears, and then repeat the same exercise for every other name, i.e. name2 and so on. How can I do this?
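
For reference, here is a minimal pandas sketch of the three-day window I am after (pdf is a hypothetical pandas DataFrame holding a few of the rows of df; this only illustrates the intended result, not the PySpark solution I need):

import pandas as pd

# Hypothetical pandas copy of a few rows of the data (only the relevant columns)
pdf = pd.DataFrame({
    "name":      ["name1", "name1", "name1", "name2"],
    "timestamp": ["2012-01-10", "2012-01-11", "2012-01-12", "2012-01-10"],
    "score":     [11, 14, 2, 2],
})
pdf["timestamp"] = pd.to_datetime(pdf["timestamp"])

# Per name, a '3D' window covers the current day plus the two previous days
rolled = (
    pdf.sort_values("timestamp")
       .set_index("timestamp")
       .groupby("name")["score"]
       .rolling("3D")
       .agg(["sum", "mean"])
)
print(rolled)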

I looked at this post and tried the following approach:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

days = lambda i: i*1

w_rolling = Window.orderBy(F.col("timestamp").cast("long")).rangeBetween(-days(3), 0)
df_agg = df.withColumn("rolling_average", F.avg("score").over(w_rolling)).withColumn(
    "rolling_sum", F.sum("score").over(w_rolling)
)
df_agg.show()

+-----+-----+----------+-----+------------------+-----------+
| name| type| timestamp|score|   rolling_average|rolling_sum|
+-----+-----+----------+-----+------------------+-----------+
|name1|type1|2012-01-10|   11|18.214285714285715|        255|
|name1|type1|2012-01-11|   14|18.214285714285715|        255|
|name1|type1|2012-01-12|    2|18.214285714285715|        255|
|name1|type3|2012-01-12|    3|18.214285714285715|        255|
|name1|type3|2012-01-11|   55|18.214285714285715|        255|
|name1|type1|2012-01-13|   10|18.214285714285715|        255|
|name1|type2|2012-01-14|   11|18.214285714285715|        255|
|name1|type2|2012-01-15|   14|18.214285714285715|        255|
|name2|type2|2012-01-10|    2|18.214285714285715|        255|
|name2|type2|2012-01-11|    3|18.214285714285715|        255|
|name2|type2|2012-01-12|   55|18.214285714285715|        255|
|name2|type1|2012-01-10|   10|18.214285714285715|        255|
|name2|type1|2012-01-13|   55|18.214285714285715|        255|
|name2|type1|2012-01-14|   10|18.214285714285715|        255|
+-----+-----+----------+-----+------------------+-----------+
As you can see, I always get the same rolling average and rolling sum, which is simply the average and the sum of the score column over all days. This is not what I want.
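
A quick check (a diagnostic sketch I added while debugging, assuming the df built below) suggests why every row gets the whole-table aggregate: timestamp is a string column, so casting it to long yields null for every row, which makes all rows peers of a single range frame. On top of that, days = lambda i: i*1 would give a range of 3 seconds rather than 3 days even if the cast worked.

# Diagnostic: the string dates cast to long are expected to come back null
# (default non-ANSI cast), so the range frame degenerates to the whole table
df.select(
    "timestamp",
    F.col("timestamp").cast("long").alias("timestamp_as_long"),
).show(3)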

You can create the above dataframe with the following snippet:


from pyspark.sql import Row, SparkSession

# A SparkSession is needed to build the dataframe below
spark = SparkSession.builder.getOrCreate()

df_Stats = Row("name", "type", "timestamp", "score")

df_stat1 = df_Stats("name1", "type1", "2012-01-10", 11)
df_stat2 = df_Stats("name1", "type1", "2012-01-11", 14)
df_stat3 = df_Stats("name1", "type1", "2012-01-12", 2)
df_stat4 = df_Stats("name1", "type3", "2012-01-12", 3)
df_stat5 = df_Stats("name1", "type3", "2012-01-11", 55)
df_stat6 = df_Stats("name1", "type1", "2012-01-13", 10)
df_stat7 = df_Stats("name1", "type2", "2012-01-14", 11)
df_stat8 = df_Stats("name1", "type2", "2012-01-15", 14)
df_stat9 = df_Stats("name2", "type2", "2012-01-10", 2)
df_stat10 = df_Stats("name2", "type2", "2012-01-11", 3)
df_stat11 = df_Stats("name2", "type2", "2012-01-12", 55)
df_stat12 = df_Stats("name2", "type1", "2012-01-10", 10)
df_stat13 = df_Stats("name2", "type1", "2012-01-13", 55)
df_stat14 = df_Stats("name2", "type1", "2012-01-14", 10)

df_stat_lst = [
    df_stat1,
    df_stat2,
    df_stat3,
    df_stat4,
    df_stat5,
    df_stat6,
    df_stat7,
    df_stat8,
    df_stat9,
    df_stat10,
    df_stat11,
    df_stat12,
    df_stat13,
    df_stat14
]

df = spark.createDataFrame(df_stat_lst)

You can calculate the rolling sum and rolling average of the score over the last three days (including the current day) with the code below:

# Considering the dataframe already created using code provided in question
df = df.withColumn('unix_time', F.unix_timestamp('timestamp', 'yyyy-MM-dd'))

# Window per name, ordered by epoch seconds; -2*86400 covers the current day
# plus the previous two days (86400 seconds in a day)
winSpec = Window.partitionBy('name').orderBy('unix_time').rangeBetween(-2*86400, 0)

df = df.withColumn('rolling_sum', F.sum('score').over(winSpec))
df = df.withColumn('rolling_avg', F.avg('score').over(winSpec))

df.orderBy('name', 'timestamp').show(20, False)

+-----+-----+----------+-----+----------+-----------+------------------+
|name |type |timestamp |score|unix_time |rolling_sum|rolling_avg       |
+-----+-----+----------+-----+----------+-----------+------------------+
|name1|type1|2012-01-10|11   |1326153600|11         |11.0              |
|name1|type3|2012-01-11|55   |1326240000|80         |26.666666666666668|
|name1|type1|2012-01-11|14   |1326240000|80         |26.666666666666668|
|name1|type1|2012-01-12|2    |1326326400|85         |17.0              |
|name1|type3|2012-01-12|3    |1326326400|85         |17.0              |
|name1|type1|2012-01-13|10   |1326412800|84         |16.8              |
|name1|type2|2012-01-14|11   |1326499200|26         |6.5               |
|name1|type2|2012-01-15|14   |1326585600|35         |11.666666666666666|
|name2|type1|2012-01-10|10   |1326153600|12         |6.0               |
|name2|type2|2012-01-10|2    |1326153600|12         |6.0               |
+-----+-----+----------+-----+----------+-----------+------------------+
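
The same window can also be written with the days helper from the question, corrected to count 86400 seconds per day; this is just a more readable variant of the window specification above, not a different method:

# Equivalent window expressed with a corrected days() helper
days = lambda i: i * 86400  # i days in seconds

winSpec = (
    Window.partitionBy("name")
    .orderBy("unix_time")
    .rangeBetween(-days(2), 0)  # current day plus the two previous days
)

df = df.withColumn("rolling_sum", F.sum("score").over(winSpec)) \
       .withColumn("rolling_avg", F.avg("score").over(winSpec))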