PySpark rolling window time frame


I'm trying to implement a 30-minute rolling window, grouped by source IP, with the idea of getting a rolling average of packets per source IP. I'm not sure I'm doing this correctly. The problem I'm seeing is with IP 192.168.1.3: its average seems to span more than 30 minutes, even though the row with packets 25 is several days later.

import pyspark.sql.functions as F
from pyspark.sql import Window

df = sqlContext.createDataFrame([('192.168.1.1', 17, "2017-03-10T15:27:18+00:00"),
                        ('192.168.1.2', 1, "2017-03-15T12:27:18+00:00"),
                        ('192.168.1.2', 2, "2017-03-15T12:28:18+00:00"),
                        ('192.168.1.2', 3, "2017-03-15T12:29:18+00:00"),
                        ('192.168.1.3', 4, "2017-03-15T12:28:18+00:00"),
                        ('192.168.1.3', 5, "2017-03-15T12:29:18+00:00"),
                        ('192.168.1.3', 25, "2017-03-18T11:27:18+00:00")],
                        ["source_ip", "packets", "timestampGMT"])

w = (Window()
     .partitionBy("source_ip")
     .orderBy(F.col("timestampGMT").cast('long'))
     .rangeBetween(-1800, 0))

df = df.withColumn('rolling_average', F.avg("packets").over(w))

df.show(100, False)
This is the result I'm getting. I expected the first two 192.168.1.3 rows to be 4.5 and the third to be 25:

+-----------+-------+-------------------------+------------------+
|source_ip  |packets|timestampGMT             |rolling_average   |
+-----------+-------+-------------------------+------------------+
|192.168.1.3|4      |2017-03-15T12:28:18+00:00|11.333333333333334|
|192.168.1.3|5      |2017-03-15T12:29:18+00:00|11.333333333333334|
|192.168.1.3|25     |2017-03-18T11:27:18+00:00|11.333333333333334|
|192.168.1.2|1      |2017-03-15T12:27:18+00:00|2.0               |
|192.168.1.2|2      |2017-03-15T12:28:18+00:00|2.0               |
|192.168.1.2|3      |2017-03-15T12:29:18+00:00|2.0               |
|192.168.1.1|17     |2017-03-10T15:27:18+00:00|17.0              |
+-----------+-------+-------------------------+------------------+

First convert the string to a timestamp in epoch seconds, and order the window by that, so that rangeBetween(-1800, 0) covers the 30 minutes preceding each row:
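The reason the original window misbehaves: an ISO-8601 string like "2017-03-15T12:27:18+00:00" is not numeric, so casting it directly to long yields null. Every row in a partition then shares the same null ordering value, and the range frame degenerates to the whole partition, which is why all three 192.168.1.3 rows average to 11.33. A minimal check against the same df:

# The raw string cast to long comes back null for every row,
# so rangeBetween(-1800, 0) has nothing meaningful to order by.
df.select("timestampGMT",
          F.col("timestampGMT").cast("long").alias("ts_long")).show(truncate=False)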

import pyspark.sql.functions as F
from pyspark.sql import Window

# Order the window by epoch seconds so that rangeBetween(-1800, 0)
# spans the 30 minutes preceding each row, inclusive of the row itself.
w = (Window()
     .partitionBy("source_ip")
     .orderBy(F.col("timestamp"))
     .rangeBetween(-1800, 0))

# Parse the ISO-8601 string into a timestamp, then convert to epoch seconds.
df = df.withColumn("timestamp", F.unix_timestamp(F.to_timestamp("timestampGMT"))) \
    .withColumn('rolling_average', F.avg("packets").over(w))

df.printSchema()
df.show(100, False)


root
 |-- source_ip: string (nullable = true)
 |-- packets: long (nullable = true)
 |-- timestampGMT: string (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- rolling_average: double (nullable = true)

+-----------+-------+-------------------------+----------+---------------+
|source_ip  |packets|timestampGMT             |timestamp |rolling_average|
+-----------+-------+-------------------------+----------+---------------+
|192.168.1.2|1      |2017-03-15T12:27:18+00:00|1489580838|1.0            |
|192.168.1.2|2      |2017-03-15T12:28:18+00:00|1489580898|1.5            |
|192.168.1.2|3      |2017-03-15T12:29:18+00:00|1489580958|2.0            |
|192.168.1.1|17     |2017-03-10T15:27:18+00:00|1489159638|17.0           |
|192.168.1.3|4      |2017-03-15T12:28:18+00:00|1489580898|4.0            |
|192.168.1.3|5      |2017-03-15T12:29:18+00:00|1489580958|4.5            |
|192.168.1.3|25     |2017-03-18T11:27:18+00:00|1489836438|25.0           |
+-----------+-------+-------------------------+----------+---------------+
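If you'd rather not materialize the extra epoch-seconds column, an equivalent sketch is to do the conversion inline inside the window spec (same logic, just without the intermediate "timestamp" column):

# Convert to timestamp and cast to epoch seconds directly in orderBy.
w = (Window()
     .partitionBy("source_ip")
     .orderBy(F.to_timestamp("timestampGMT").cast("long"))
     .rangeBetween(-1800, 0))

df = df.withColumn("rolling_average", F.avg("packets").over(w))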