PySpark rolling window over a time frame
I am trying to implement a 30-minute rolling window, partitioned by source IP, to get the rolling average of packets for each source IP. I am not sure whether this is the right way to do it. The problem is IP 192.168.1.3: its average appears to span more than 30 minutes, because the row with 25 packets arrives several days later.
import pyspark.sql.functions as F
from pyspark.sql import Window

df = sqlContext.createDataFrame(
    [('192.168.1.1', 17, "2017-03-10T15:27:18+00:00"),
     ('192.168.1.2', 1, "2017-03-15T12:27:18+00:00"),
     ('192.168.1.2', 2, "2017-03-15T12:28:18+00:00"),
     ('192.168.1.2', 3, "2017-03-15T12:29:18+00:00"),
     ('192.168.1.3', 4, "2017-03-15T12:28:18+00:00"),
     ('192.168.1.3', 5, "2017-03-15T12:29:18+00:00"),
     ('192.168.1.3', 25, "2017-03-18T11:27:18+00:00")],
    ["source_ip", "packets", "timestampGMT"])

# 30-minute (1800 s) range frame per source IP.
# Note: timestampGMT is still a string column at this point.
w = (Window()
     .partitionBy("source_ip")
     .orderBy(F.col("timestampGMT").cast('long'))
     .rangeBetween(-1800, 0))

df = df.withColumn('rolling_average', F.avg("packets").over(w))
df.show(100, False)
This is the result I get. I expected 4.5 for the first two rows of 192.168.1.3 and 25 for the third.
+-----------+-------+-------------------------+------------------+
|source_ip |packets|timestampGMT |rolling_average |
+-----------+-------+-------------------------+------------------+
|192.168.1.3|4 |2017-03-15T12:28:18+00:00|11.333333333333334|
|192.168.1.3|5 |2017-03-15T12:29:18+00:00|11.333333333333334|
|192.168.1.3|25 |2017-03-18T11:27:18+00:00|11.333333333333334|
|192.168.1.2|1 |2017-03-15T12:27:18+00:00|2.0 |
|192.168.1.2|2 |2017-03-15T12:28:18+00:00|2.0 |
|192.168.1.2|3 |2017-03-15T12:29:18+00:00|2.0 |
|192.168.1.1|17 |2017-03-10T15:27:18+00:00|17.0 |
+-----------+-------+-------------------------+------------------+
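First convert the string to a timestamp, then order by it. In the question's code the ISO 8601 string is cast straight to long, which presumably yields null for every row (under Spark's default, non-ANSI cast behavior). All rows in a partition then compare as equal, so the range frame covers the whole partition, which matches the 11.33 = (4 + 5 + 25) / 3 shown for every row of 192.168.1.3.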
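import pyspark.sql.functions as F
from pyspark.sql import Window

# Order by the numeric "timestamp" column (epoch seconds) created below,
# so that rangeBetween(-1800, 0) means "the previous 30 minutes".
w = (Window()
     .partitionBy("source_ip")
     .orderBy(F.col("timestamp"))
     .rangeBetween(-1800, 0))

# Parse the string into a timestamp first, then convert it to epoch seconds.
df = df.withColumn("timestamp", F.unix_timestamp(F.to_timestamp("timestampGMT"))) \
       .withColumn('rolling_average', F.avg("packets").over(w))

df.printSchema()
df.show(100, False)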
root
|-- source_ip: string (nullable = true)
|-- packets: long (nullable = true)
|-- timestampGMT: string (nullable = true)
|-- timestamp: long (nullable = true)
|-- rolling_average: double (nullable = true)
+-----------+-------+-------------------------+----------+---------------+
|source_ip |packets|timestampGMT |timestamp |rolling_average|
+-----------+-------+-------------------------+----------+---------------+
|192.168.1.2|1 |2017-03-15T12:27:18+00:00|1489580838|1.0 |
|192.168.1.2|2 |2017-03-15T12:28:18+00:00|1489580898|1.5 |
|192.168.1.2|3 |2017-03-15T12:29:18+00:00|1489580958|2.0 |
|192.168.1.1|17 |2017-03-10T15:27:18+00:00|1489159638|17.0 |
|192.168.1.3|4 |2017-03-15T12:28:18+00:00|1489580898|4.0 |
|192.168.1.3|5 |2017-03-15T12:29:18+00:00|1489580958|4.5 |
|192.168.1.3|25 |2017-03-18T11:27:18+00:00|1489836438|25.0 |
+-----------+-------+-------------------------+----------+---------------+
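To sanity-check the numbers above without a Spark session, the same frame logic can be reproduced in plain Python. This is only a sketch of what `rangeBetween(-1800, 0)` does over epoch seconds, not how Spark computes it internally; the helper name `rolling_average` and the constant `WINDOW_SECONDS` are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Same rows as in the question.
rows = [
    ("192.168.1.1", 17, "2017-03-10T15:27:18+00:00"),
    ("192.168.1.2", 1, "2017-03-15T12:27:18+00:00"),
    ("192.168.1.2", 2, "2017-03-15T12:28:18+00:00"),
    ("192.168.1.2", 3, "2017-03-15T12:29:18+00:00"),
    ("192.168.1.3", 4, "2017-03-15T12:28:18+00:00"),
    ("192.168.1.3", 5, "2017-03-15T12:29:18+00:00"),
    ("192.168.1.3", 25, "2017-03-18T11:27:18+00:00"),
]

WINDOW_SECONDS = 1800  # 30 minutes (hypothetical constant name)


def rolling_average(rows, window=WINDOW_SECONDS):
    """Return (source_ip, packets, epoch_seconds, average) for every input row."""
    # Group by source IP, parsing each ISO 8601 string into epoch seconds.
    by_ip = defaultdict(list)
    for ip, packets, ts in rows:
        epoch = int(datetime.fromisoformat(ts).timestamp())
        by_ip[ip].append((epoch, packets))

    result = []
    for ip, events in by_ip.items():
        events.sort()  # order by epoch seconds within the partition
        for epoch, packets in events:
            # Frame: rows whose epoch lies in [epoch - window, epoch].
            in_frame = [p for e, p in events if epoch - window <= e <= epoch]
            result.append((ip, packets, epoch, sum(in_frame) / len(in_frame)))
    return result


results = rolling_average(rows)
for row in results:
    print(row)
```

For 192.168.1.3 this gives 4.0, 4.5 and 25.0, the same values as the Spark output: the row with 25 packets is days away from the other two, so it is alone in its 30-minute frame.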