PySpark: window functions for stddev and quantiles produce NaN and null
I am trying to compute stddev and the 25th and 75th percentiles over a window, but they come out as NaN and null.
from functools import partial, reduce

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Window time = 30 min, expressed in seconds
window_time = 1800
# Stats fields for window
stat_fields = ['source_packets', 'destination_packets']
# sqlContext is assumed to be the one provided by the PySpark shell
df = sqlContext.createDataFrame(
    [('192.168.1.1', '10.0.0.1', 22, 51000, 17, 1, "2017-03-10T15:27:18+00:00"),
     ('192.168.1.2', '10.0.0.2', 51000, 22, 1, 2, "2017-03-15T12:27:18+00:00"),
     ('192.168.1.2', '10.0.0.2', 53, 51000, 2, 3, "2017-03-15T12:28:18+00:00"),
     ('192.168.1.2', '10.0.0.2', 51000, 53, 3, 4, "2017-03-15T12:29:18+00:00"),
     ('192.168.1.3', '10.0.0.3', 80, 51000, 4, 5, "2017-03-15T12:28:18+00:00"),
     ('192.168.1.3', '10.0.0.3', 51000, 80, 5, 6, "2017-03-15T12:29:18+00:00"),
     ('192.168.1.3', '10.0.0.3', 22, 51000, 25, 7, "2017-03-18T11:27:18+00:00")],
    ["source_ip", "destination_ip", "source_port", "destination_port",
     "source_packets", "destination_packets", "timestampGMT"])

def add_stats_column(r_df, field, window):
    '''
    Input:
        r_df: dataframe
        field: field to generate stats with
        window: pyspark window to be used
    '''
    r_df = r_df \
        .withColumn('{}_sum_30m'.format(field), F.sum(field).over(window))\
        .withColumn('{}_avg_30m'.format(field), F.avg(field).over(window))\
        .withColumn('{}_std_30m'.format(field), F.stddev(field).over(window))\
        .withColumn('{}_min_30m'.format(field), F.min(field).over(window))\
        .withColumn('{}_max_30m'.format(field), F.max(field).over(window))\
        .withColumn('{}_q25_30m'.format(field), F.expr("percentile_approx('{}', 0.25)".format(field)).over(window))\
        .withColumn('{}_q75_30m'.format(field), F.expr("percentile_approx('{}', 0.75)".format(field)).over(window))
    return r_df

w_s = (Window
       .partitionBy("ip")
       .orderBy(F.col("timestamp"))
       .rangeBetween(-window_time, 0))

df2 = df.withColumn("timestamp", F.unix_timestamp(F.to_timestamp("timestampGMT"))) \
    .withColumn("arr", F.array(F.col("source_ip"), F.col("destination_ip")))\
    .selectExpr("explode(arr) as ip", "*")\
    .drop(*['arr', 'source_ip', 'destination_ip'])

df2 = (reduce(partial(add_stats_column, window=w_s),
              stat_fields,
              df2
              ))
#print(df2.explain())
df2.show(100)

Output
+-----------+-----------+----------------+--------------+-------------------+--------------------+----------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
| ip|source_port|destination_port|source_packets|destination_packets| timestampGMT| timestamp|source_packets_sum_30m|source_packets_avg_30m|source_packets_std_30m|source_packets_min_30m|source_packets_max_30m|source_packets_q25_30m|source_packets_q75_30m|destination_packets_sum_30m|destination_packets_avg_30m|destination_packets_std_30m|destination_packets_min_30m|destination_packets_max_30m|destination_packets_q25_30m|destination_packets_q75_30m|
+-----------+-----------+----------------+--------------+-------------------+--------------------+----------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
|192.168.1.3| 80| 51000| 4| 5|2017-03-15T12:28:...|1489580898| 4| 4.0| NaN| 4| 4| null| null| 5| 5.0| NaN| 5| 5| null| null|
|192.168.1.3| 51000| 80| 5| 6|2017-03-15T12:29:...|1489580958| 9| 4.5| 0.7071067811865476| 4| 5| null| null| 11| 5.5| 0.7071067811865476| 5| 6| null| null|
|192.168.1.3| 22| 51000| 25| 7|2017-03-18T11:27:...|1489836438| 25| 25.0| NaN| 25| 25| null| null| 7| 7.0| NaN| 7| 7| null| null|
| 10.0.0.1| 22| 51000| 17| 1|2017-03-10T15:27:...|1489159638| 17| 17.0| NaN| 17| 17| null| null| 1| 1.0| NaN| 1| 1| null| null|
| 10.0.0.2| 51000| 22| 1| 2|2017-03-15T12:27:...|1489580838| 1| 1.0| NaN| 1| 1| null| null| 2| 2.0| NaN| 2| 2| null| null|
| 10.0.0.2| 53| 51000| 2| 3|2017-03-15T12:28:...|1489580898| 3| 1.5| 0.7071067811865476| 1| 2| null| null| 5| 2.5| 0.7071067811865476| 2| 3| null| null|
| 10.0.0.2| 51000| 53| 3| 4|2017-03-15T12:29:...|1489580958| 6| 2.0| 1.0| 1| 3| null| null| 9| 3.0| 1.0| 2| 4| null| null|
| 10.0.0.3| 80| 51000| 4| 5|2017-03-15T12:28:...|1489580898| 4| 4.0| NaN| 4| 4| null| null| 5| 5.0| NaN| 5| 5| null| null|
| 10.0.0.3| 51000| 80| 5| 6|2017-03-15T12:29:...|1489580958| 9| 4.5| 0.7071067811865476| 4| 5| null| null| 11| 5.5| 0.7071067811865476| 5| 6| null| null|
| 10.0.0.3| 22| 51000| 25| 7|2017-03-18T11:27:...|1489836438| 25| 25.0| NaN| 25| 25| null| null| 7| 7.0| NaN| 7| 7| null| null|
|192.168.1.2| 51000| 22| 1| 2|2017-03-15T12:27:...|1489580838| 1| 1.0| NaN| 1| 1| null| null| 2| 2.0| NaN| 2| 2| null| null|
|192.168.1.2| 53| 51000| 2| 3|2017-03-15T12:28:...|1489580898| 3| 1.5| 0.7071067811865476| 1| 2| null| null| 5| 2.5| 0.7071067811865476| 2| 3| null| null|
|192.168.1.2| 51000| 53| 3| 4|2017-03-15T12:29:...|1489580958| 6| 2.0| 1.0| 1| 3| null| null| 9| 3.0| 1.0| 2| 4| null| null|
|192.168.1.1| 22| 51000| 17| 1|2017-03-10T15:27:...|1489159638| 17| 17.0| NaN| 17| 17| null| null| 1| 1.0| NaN| 1| 1| null| null|
+-----------+-----------+----------------+--------------+-------------------+--------------------+----------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
From the PySpark API documentation:
pyspark.sql.functions.stddev(col)
Aggregate function: returns the unbiased sample standard deviation of the expression in a group.
New in version 1.6.
pyspark.sql.functions.stddev_pop(col)
Aggregate function: returns population standard deviation of the expression in a group.
New in version 1.6.
pyspark.sql.functions.stddev_samp(col)
Aggregate function: returns the unbiased sample standard deviation of the expression in a group.
New in version 1.6.
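So you could try stddev_pop (population standard deviation) instead of stddev (unbiased sample standard deviation).
When a window contains only one sample, the unbiased sample standard deviation divides by zero (n - 1 = 0), which is why you get NaN.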
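
To make the difference concrete without a Spark session, here is a minimal pure-Python sketch of the two formulas. The helper names mirror the Spark functions, but they are written only for illustration and are not Spark's implementation; returning NaN for a single sample is an assumption chosen to match the output shown in the question.

```python
import math


def stddev_samp(values):
    """Unbiased sample standard deviation: divides by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    if n == 1:
        # 0 / 0: mirror the NaN seen in the question's output (assumption)
        return float("nan")
    return math.sqrt(squared_dev / (n - 1))


def stddev_pop(values):
    """Population standard deviation: divides by n."""
    n = len(values)
    mean = sum(values) / n
    squared_dev = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_dev / n)


# A window holding a single row, e.g. source_packets = 4
single = [4]
# A window holding two rows, e.g. source_packets = 4 and 5
pair = [4, 5]

print(stddev_samp(single), stddev_pop(single))
print(stddev_samp(pair), stddev_pop(pair))
```

In the question's add_stats_column, this would mean swapping F.stddev for F.stddev_pop. The null quantile columns are probably a separate issue: the single quotes in percentile_approx('{}', 0.25) appear to make Spark SQL read the field name as a string literal rather than a column reference, so passing the name unquoted (or in backticks) is worth trying.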