PySpark: window functions for stddev and quantiles produce NaN and null

I am trying to compute the stddev and the 25th and 75th percentiles over a window, but the results come out as NaN and null.

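from functools import partial, reduce

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window Time = 30min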
window_time = 1800

# Stats fields for window
stat_fields = ['source_packets', 'destination_packets']

df = sqlContext.createDataFrame([('192.168.1.1','10.0.0.1',22,51000, 17, 1, "2017-03-10T15:27:18+00:00"),
                        ('192.168.1.2','10.0.0.2',51000,22, 1,2, "2017-03-15T12:27:18+00:00"),
                        ('192.168.1.2','10.0.0.2',53,51000, 2,3, "2017-03-15T12:28:18+00:00"),
                        ('192.168.1.2','10.0.0.2',51000,53, 3,4, "2017-03-15T12:29:18+00:00"),
                        ('192.168.1.3','10.0.0.3',80,51000, 4,5, "2017-03-15T12:28:18+00:00"),
                        ('192.168.1.3','10.0.0.3',51000,80, 5,6, "2017-03-15T12:29:18+00:00"),
                      ('192.168.1.3','10.0.0.3',22,51000, 25,7, "2017-03-18T11:27:18+00:00")],
                        ["source_ip","destination_ip","source_port","destination_port", "source_packets", "destination_packets", "timestampGMT"])

def add_stats_column(r_df, field, window):
    '''
    Input:
        r_df: dataframe
        field: field to generate stats with
        window: pyspark window to be used
    '''

    r_df = r_df \
        .withColumn('{}_sum_30m'.format(field), F.sum(field).over(window))\
        .withColumn('{}_avg_30m'.format(field), F.avg(field).over(window))\
        .withColumn('{}_std_30m'.format(field), F.stddev(field).over(window))\
        .withColumn('{}_min_30m'.format(field), F.min(field).over(window))\
        .withColumn('{}_max_30m'.format(field), F.max(field).over(window))\
        .withColumn('{}_q25_30m'.format(field), F.expr("percentile_approx('{}', 0.25)".format(field)).over(window))\
        .withColumn('{}_q75_30m'.format(field), F.expr("percentile_approx('{}', 0.75)".format(field)).over(window))

    return r_df

w_s = (Window()
     .partitionBy("ip")
     .orderBy(F.col("timestamp"))
     .rangeBetween(-window_time, 0))

df2 = df.withColumn("timestamp", F.unix_timestamp(F.to_timestamp("timestampGMT"))) \
    .withColumn("arr",F.array(F.col("source_ip"),F.col("destination_ip")))\
    .selectExpr("explode(arr) as ip","*")\
    .drop(*['arr','source_ip','destination_ip'])

df2 = (reduce(partial(add_stats_column,window=w_s),
    stat_fields,
    df2
))

#print(df2.explain())
df2.show(100)

Output

+-----------+-----------+----------------+--------------+-------------------+--------------------+----------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
|         ip|source_port|destination_port|source_packets|destination_packets|        timestampGMT| timestamp|source_packets_sum_30m|source_packets_avg_30m|source_packets_std_30m|source_packets_min_30m|source_packets_max_30m|source_packets_q25_30m|source_packets_q75_30m|destination_packets_sum_30m|destination_packets_avg_30m|destination_packets_std_30m|destination_packets_min_30m|destination_packets_max_30m|destination_packets_q25_30m|destination_packets_q75_30m|
+-----------+-----------+----------------+--------------+-------------------+--------------------+----------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
|192.168.1.3|         80|           51000|             4|                  5|2017-03-15T12:28:...|1489580898|                     4|                   4.0|                   NaN|                     4|                     4|                  null|                  null|                          5|                        5.0|                        NaN|                          5|                          5|                       null|                       null|
|192.168.1.3|      51000|              80|             5|                  6|2017-03-15T12:29:...|1489580958|                     9|                   4.5|    0.7071067811865476|                     4|                     5|                  null|                  null|                         11|                        5.5|         0.7071067811865476|                          5|                          6|                       null|                       null|
|192.168.1.3|         22|           51000|            25|                  7|2017-03-18T11:27:...|1489836438|                    25|                  25.0|                   NaN|                    25|                    25|                  null|                  null|                          7|                        7.0|                        NaN|                          7|                          7|                       null|                       null|
|   10.0.0.1|         22|           51000|            17|                  1|2017-03-10T15:27:...|1489159638|                    17|                  17.0|                   NaN|                    17|                    17|                  null|                  null|                          1|                        1.0|                        NaN|                          1|                          1|                       null|                       null|
|   10.0.0.2|      51000|              22|             1|                  2|2017-03-15T12:27:...|1489580838|                     1|                   1.0|                   NaN|                     1|                     1|                  null|                  null|                          2|                        2.0|                        NaN|                          2|                          2|                       null|                       null|
|   10.0.0.2|         53|           51000|             2|                  3|2017-03-15T12:28:...|1489580898|                     3|                   1.5|    0.7071067811865476|                     1|                     2|                  null|                  null|                          5|                        2.5|         0.7071067811865476|                          2|                          3|                       null|                       null|
|   10.0.0.2|      51000|              53|             3|                  4|2017-03-15T12:29:...|1489580958|                     6|                   2.0|                   1.0|                     1|                     3|                  null|                  null|                          9|                        3.0|                        1.0|                          2|                          4|                       null|                       null|
|   10.0.0.3|         80|           51000|             4|                  5|2017-03-15T12:28:...|1489580898|                     4|                   4.0|                   NaN|                     4|                     4|                  null|                  null|                          5|                        5.0|                        NaN|                          5|                          5|                       null|                       null|
|   10.0.0.3|      51000|              80|             5|                  6|2017-03-15T12:29:...|1489580958|                     9|                   4.5|    0.7071067811865476|                     4|                     5|                  null|                  null|                         11|                        5.5|         0.7071067811865476|                          5|                          6|                       null|                       null|
|   10.0.0.3|         22|           51000|            25|                  7|2017-03-18T11:27:...|1489836438|                    25|                  25.0|                   NaN|                    25|                    25|                  null|                  null|                          7|                        7.0|                        NaN|                          7|                          7|                       null|                       null|
|192.168.1.2|      51000|              22|             1|                  2|2017-03-15T12:27:...|1489580838|                     1|                   1.0|                   NaN|                     1|                     1|                  null|                  null|                          2|                        2.0|                        NaN|                          2|                          2|                       null|                       null|
|192.168.1.2|         53|           51000|             2|                  3|2017-03-15T12:28:...|1489580898|                     3|                   1.5|    0.7071067811865476|                     1|                     2|                  null|                  null|                          5|                        2.5|         0.7071067811865476|                          2|                          3|                       null|                       null|
|192.168.1.2|      51000|              53|             3|                  4|2017-03-15T12:29:...|1489580958|                     6|                   2.0|                   1.0|                     1|                     3|                  null|                  null|                          9|                        3.0|                        1.0|                          2|                          4|                       null|                       null|
|192.168.1.1|         22|           51000|            17|                  1|2017-03-10T15:27:...|1489159638|                    17|                  17.0|                   NaN|                    17|                    17|                  null|                  null|                          1|                        1.0|                        NaN|                          1|                          1|                       null|                       null|
+-----------+-----------+----------------+--------------+-------------------+--------------------+----------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+

From the PySpark API documentation:

pyspark.sql.functions.stddev(col)
Aggregate function: returns the unbiased sample standard deviation of the expression in a group.

New in version 1.6.

pyspark.sql.functions.stddev_pop(col)
Aggregate function: returns population standard deviation of the expression in a group.

New in version 1.6.

pyspark.sql.functions.stddev_samp(col)
Aggregate function: returns the unbiased sample standard deviation of the expression in a group.

New in version 1.6.

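So you could try stddev_pop (the population standard deviation) instead of stddev (the unbiased sample standard deviation). When a window contains only one sample, the unbiased sample standard deviation divides by n - 1 = 0, which produces NaN.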
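
To illustrate why single-row windows come out as NaN, here is a minimal pure-Python sketch of the two estimators. The helper `std` and its `ddof` parameter are hypothetical names for this illustration and are not part of the PySpark API; the sketch only mirrors the arithmetic, assuming the sample estimator divides by n - 1 and the population estimator divides by n.

```python
import math


def std(values, ddof):
    """Standard deviation with divisor n - ddof.

    ddof=1 corresponds to the sample estimator (like stddev / stddev_samp),
    ddof=0 to the population estimator (like stddev_pop).
    Returns NaN when the divisor is zero (the 0 / 0 case).
    """
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    divisor = n - ddof
    if divisor == 0:
        return float("nan")
    return math.sqrt(squared_dev / divisor)


single = [4]   # a window that holds only one row
pair = [4, 5]  # a window that holds two rows

sample_single = std(single, ddof=1)  # NaN: divisor is 1 - 1 = 0
pop_single = std(single, ddof=0)     # 0.0: divisor is 1
sample_pair = std(pair, ddof=1)      # sqrt(0.5 / 1)
pop_pair = std(pair, ddof=0)         # sqrt(0.5 / 2)

print(sample_single, pop_single, sample_pair, pop_pair)
```

The two-row case matches the 0.7071067811865476 shown in the output table for the sample estimator. Note that switching to stddev_pop changes the values for every window, not only the single-row ones, so it is a change of definition rather than a pure fix. The null quantiles are probably a separate matter: in the question's code the column name is wrapped in single quotes inside percentile_approx, and Spark SQL reads a single-quoted token as a string literal rather than a column reference, so dropping those quotes is likely what that part needs.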