PySpark optimization

I am very new to Spark and PySpark and I am trying to run a number of operations on a dataframe with more than 1 billion records. I am looking for some pointers on how to optimize the code below.

# Imports used by the snippet below
from functools import partial, reduce

import pyspark.sql.functions as F
from pyspark.sql import Window

# TCP ports
ports = ['22', '25', '53', '80', '88', '123', '514', '443', '8080', '8443']

# Stats fields for window
stat_fields = ['source_bytes', 'destination_bytes', 'source_packets', 'destination_packets']

def add_port_column(r_df, port, window):
    '''
    Input:
        r_df: dataframe
        port: port
        window: pyspark window to be used

    Output: pyspark dataframe
    '''

    return r_df \
        .withColumn('pkts_src_port_{}_30m'.format(port),
                    F.when(F.col('source_port') == port,
                           F.sum('source_packets').over(window)).otherwise(0)) \
        .withColumn('pkts_dst_port_{}_30m'.format(port),
                    F.when(F.col('destination_port') == port,
                           F.sum('destination_packets').over(window)).otherwise(0))

def add_stats_column(r_df, field, window):
    '''
    Input:
        r_df: dataframe
        field: field to generate stats with
        window: pyspark window to be used

    Output: pyspark dataframe
    '''

    r_df = r_df \
        .withColumn('{}_sum_30m'.format(field), F.sum(field).over(window))\
        .withColumn('{}_avg_30m'.format(field), F.avg(field).over(window))\
        .withColumn('{}_std_30m'.format(field), F.stddev(field).over(window))\
        .withColumn('{}_min_30m'.format(field), F.min(field).over(window))\
        .withColumn('{}_max_30m'.format(field), F.max(field).over(window))\
        .withColumn('{}_q25_30m'.format(field), F.expr("percentile_approx({}, 0.25)".format(field)).over(window))\
        .withColumn('{}_q75_30m'.format(field), F.expr("percentile_approx({}, 0.75)".format(field)).over(window))

    return r_df

# Sliding Window 30 minutes
# window_time is assumed to be 30 minutes in seconds, matching the "_30m" column
# suffix; rangeBetween operates on the unix "timestamp" column created below.
window_time = 30 * 60

w = (Window()
     .partitionBy("source_ip")
     .orderBy(F.col("timestamp"))
     .rangeBetween(-window_time, 0))

w_s = (Window()
     .partitionBy("ip")
     .orderBy(F.col("timestamp"))
     .rangeBetween(-window_time, 0))

flows_filtered_v3_df.printSchema()

# Splitting 1 row into 2 rows: source_ip and destination_ip both become ip.
flows_filtered_v3_df = flows_filtered_v3_df.withColumn("timestamp", F.unix_timestamp(F.to_timestamp("start_time"))) \
    .withColumn("arr",F.array(F.col("source_ip"),F.col("destination_ip")))\
    .selectExpr("explode(arr) as ip","*")\
    .drop(*['arr','source_ip','destination_ip'])

# Add the per-port packet-count columns for every port in `ports`
flows_filtered_v3_df = (reduce(partial(add_port_column, window=w_s),
    ports,
    flows_filtered_v3_df
))

# Add the rolling stats columns for every field in `stat_fields`
flows_filtered_v3_df = (reduce(partial(add_stats_column, window=w_s),
    stat_fields,
    flows_filtered_v3_df
))

I am basically trying to do some aggregation over a 30-minute window and split one row into two rows. Pretty much turning

|source_ip|destination_ip|source_port|destination_port|...
|192.168.1.1|10.10.0.1|5000|22|...

into

|ip|source_port|destination_port|...
|192.168.1.1|5000|22|...
|10.10.0.1|5000|22|...

and then running all the window computations. I am wondering whether I would be better off doing the computations before the split, prefixing everything with src_ or dst_, and then splitting accordingly.