Python PySpark: 5-minute sliding window sum


I have data like this:

('2017-02-03', '22:57:00')
('2017-02-03', '23:02:00')
('2017-02-04', '09:56:00')
('2017-02-04', '10:01:00')
('2017-02-04', '10:06:00')
('2017-02-04', '10:11:00')
('2017-02-04', '10:16:00')
('2017-02-04', '10:21:00')
('2017-02-04', '10:26:00')
('2017-02-04', '10:31:00')
('2017-02-04', '10:36:00')
('2017-02-04', '16:57:00')
('2017-02-04', '17:12:00')

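What I want to do is compare consecutive times and check whether they are 5 minutes apart. Whenever they are, I count how many such entries occur in a row. This would produce a result like the following: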

('2017-02-03', '22:57:00') <- 1
('2017-02-03', '23:02:00') <- 2

('2017-02-04', '09:56:00') <- 1
('2017-02-04', '10:01:00') <- 2
('2017-02-04', '10:06:00') <- 3
('2017-02-04', '10:11:00') <- 4
('2017-02-04', '10:16:00') <- 5
('2017-02-04', '10:21:00') <- 6
('2017-02-04', '10:26:00') <- 7
('2017-02-04', '10:31:00') <- 8
('2017-02-04', '10:36:00') <- 9

('2017-02-04', '16:57:00') <- 1
('2017-02-04', '17:12:00') <- 1

Here is my code so far:

from datetime import timedelta

# get_date_time, subtract_time, process_line and split_date_time are helper
# functions defined elsewhere in my script (not shown here).


def check_interval(values, measurement):
    start_date = ""
    start_time = ""
    counter = 1

    for index, val in enumerate(values):
        date1, time1 = get_date_time(val)

        # counter == 1 means this element starts a new run
        if counter == 1:
            start_date = date1
            start_time = time1

        # The last element has no successor: print the final run and stop
        if index + 1 == len(values):
            print(start_date + "\t(" + start_time + ", " + str(counter) + ")\n")
            break

        date_time1 = ' '.join(val)
        date_time2 = ' '.join(values[index + 1])

        time_diff = subtract_time(date_time1, date_time2)

        if time_diff > timedelta(minutes=measurement):
            # Gap too large: close the current run and start a new one
            print(start_date + "\t(" + start_time + ", " + str(counter) + ")\n")
            counter = 1
        else:
            counter += 1


# ------------------------------------------
# FUNCTION my_main
# ------------------------------------------
def my_main(sc, my_dataset_dir, station_name, measurement_time):
    inputRDD = sc.textFile(my_dataset_dir)

    # collect() returns a plain Python list of (date, time) pairs, sorted by date
    stationRDD = inputRDD \
        .map(process_line) \
        .filter(lambda line: (line[0] == '0' and line[1] == station_name and line[5] == '0')) \
        .map(lambda date_time: date_time[4]) \
        .map(split_date_time) \
        .sortByKey() \
        .collect()

    check_interval(stationRDD, measurement_time)

I get the result I want, but I would like to know whether this can be done with PySpark functions instead, producing the output:

('2017-02-03', ('22:57:00', 2))
('2017-02-04', ('09:56:00', 9))
('2017-02-04', ('16:57:00', 1))
('2017-02-04', ('17:12:00', 1))

You can use the DataFrame API with Window functions:

import pyspark.sql.functions as psf
from pyspark.sql import Window

w = Window.orderBy('datetime')
df = df \
    .withColumn('datetime', psf.unix_timestamp(psf.concat('date', psf.lit(' '), 'time').cast('timestamp'))) \
    .withColumn('5min_delta', (psf.col('datetime') - psf.lag('datetime').over(w)) / 60 > 5) \
    .fillna(True) \
    .withColumn('group_id', psf.sum(psf.col('5min_delta').cast('int')).over(w))
df.show()
+----------+--------+----------+----------+--------+
|      date|    time|  datetime|5min_delta|group_id|
+----------+--------+----------+----------+--------+
|2017-02-03|22:57:00|1486159020|      true|       1|
|2017-02-03|23:02:00|1486159320|     false|       1|
|2017-02-04|09:56:00|1486198560|      true|       2|
|2017-02-04|10:01:00|1486198860|     false|       2|
|2017-02-04|10:06:00|1486199160|     false|       2|
|2017-02-04|10:11:00|1486199460|     false|       2|
|2017-02-04|10:16:00|1486199760|     false|       2|
|2017-02-04|10:21:00|1486200060|     false|       2|
|2017-02-04|10:26:00|1486200360|     false|       2|
|2017-02-04|10:31:00|1486200660|     false|       2|
|2017-02-04|10:36:00|1486200960|     false|       2|
|2017-02-04|16:57:00|1486223820|      true|       3|
|2017-02-04|17:12:00|1486224720|      true|       4|
+----------+--------+----------+----------+--------+

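The first window function computes the time delta, in minutes, between two consecutive timestamps. The first row has no predecessor, so its delta is null, and fillna(True) marks it as the start of a group.
The second one lets us create a unique group identifier by computing a cumulative sum: it increases by 1 every time the gap is greater than 5 minutes.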
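
To make the two steps easier to follow, here is a minimal plain-Python sketch of the same lag-then-cumulative-sum idea, without Spark. The function name `assign_group_ids`, its `max_gap_minutes` parameter, and the shortened sample `rows` are made up for illustration and are not part of the answer's code.

```python
from datetime import datetime
from itertools import accumulate

# A shortened sample of the (date, time) pairs from the question
rows = [
    ("2017-02-03", "22:57:00"),
    ("2017-02-03", "23:02:00"),
    ("2017-02-04", "09:56:00"),
    ("2017-02-04", "10:01:00"),
    ("2017-02-04", "16:57:00"),
    ("2017-02-04", "17:12:00"),
]


def assign_group_ids(rows, max_gap_minutes=5):
    """Return one group id per row; a new id starts after each gap > max_gap_minutes."""
    if not rows:
        return []
    stamps = [datetime.strptime(d + " " + t, "%Y-%m-%d %H:%M:%S") for d, t in rows]
    # "lag" step: the first row always starts a group (like fillna(True));
    # every other row is flagged when the gap to its predecessor is too large
    flags = [True] + [
        (cur - prev).total_seconds() / 60 > max_gap_minutes
        for prev, cur in zip(stamps, stamps[1:])
    ]
    # cumulative-sum step: each flagged row bumps the running group id by one
    return list(accumulate(int(flag) for flag in flags))


print(assign_group_ids(rows))  # [1, 1, 2, 2, 3, 4]
```

Rows that are exactly 5 minutes apart stay in the same group because the comparison is strictly "greater than", which mirrors the `> 5` in the DataFrame version.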

You can then count the number of elements in each group:

df \
    .groupBy('group_id') \
    .agg(psf.first('date').alias('date'), psf.count('*').alias('nb')) \
    .show()
+--------+----------+---+
|group_id|      date| nb|
+--------+----------+---+
|       1|2017-02-03|  2|
|       2|2017-02-04|  9|
|       3|2017-02-04|  1|
|       4|2017-02-04|  1|
+--------+----------+---+
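
The question asks for tuples of the form `('2017-02-03', ('22:57:00', 2))`, which also carry the start time of each run. As a rough sketch of that last step, again in plain Python rather than Spark, the rows can be collapsed by group id as follows. The name `summarise` and the abbreviated `labelled` list are hypothetical, and the input is assumed to be already sorted by timestamp, so that equal group ids are adjacent.

```python
from itertools import groupby

# (group_id, date, time) triples, shaped like the table with group ids (abbreviated)
labelled = [
    (1, "2017-02-03", "22:57:00"),
    (1, "2017-02-03", "23:02:00"),
    (2, "2017-02-04", "09:56:00"),
    (2, "2017-02-04", "10:01:00"),
    (3, "2017-02-04", "16:57:00"),
    (4, "2017-02-04", "17:12:00"),
]


def summarise(labelled):
    """Collapse each run into (date, (start_time, count))."""
    result = []
    # groupby only merges adjacent items, hence the sorted-input assumption
    for _, members in groupby(labelled, key=lambda row: row[0]):
        members = list(members)
        _, date, start_time = members[0]  # first row of the run
        result.append((date, (start_time, len(members))))
    return result


for entry in summarise(labelled):
    print(entry)
```

With the abbreviated sample, the second run contains 2 rows rather than the 9 in the full data set.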

Thanks for the reply. My lecture showed me how to do it without the DataFrame API. I will try to recreate it and post it here.

The DataFrame API performs better because it uses Catalyst. It is all about operations on structured objects, so it is less flexible but faster. You can convince yourself by using %%timeit in a Jupyter notebook, or by reading this.