Apache Spark: performance problem with many window functions on a very large dataset


I am generating about 30 window functions and running them over a fairly large dataset (1.5 billion records), which corresponds to 14 days of data. If I run it on just 1 day of data, roughly 100 million records, it takes 27 hours to compute, which makes me think something must be wrong. For comparison, a join I run on the same dataset takes 2 minutes.

from functools import partial, reduce

from pyspark.sql import Window
from pyspark.sql import functions as F

# Window Time = 30min
window_time = 1800

# TCP ports
ports = ['22', '25', '53', '80', '88', '123', '514', '443', '8080', '8443']

# Stats fields for window
stat_fields = ['source_bytes', 'destination_bytes', 'source_packets', 'destination_packets']

def add_port_column(r_df, port, window):
    '''
    Input:
        r_df: dataframe
        port: port
        window: pyspark window to be used

    Output: pyspark dataframe
    '''

    return r_df.withColumn('pkts_src_port_{}_30m'.format(port), F.when(F.col('source_port') == port, F.sum('source_packets').over(window)).otherwise(0))\
        .withColumn('pkts_dst_port_{}_30m'.format(port), F.when(F.col('destination_port') == port, F.sum('destination_packets').over(window)).otherwise(0))

def add_stats_column(r_df, field, window):
    '''
    Input:
        r_df: dataframe
        field: field to generate stats with
        window: pyspark window to be used

    Output: pyspark dataframe
    '''

    r_df = r_df \
        .withColumn('{}_sum_30m'.format(field), F.sum(field).over(window))\
        .withColumn('{}_avg_30m'.format(field), F.avg(field).over(window))\
        .withColumn('{}_std_30m'.format(field), F.stddev(field).over(window))\
        .withColumn('{}_min_30m'.format(field), F.min(field).over(window))\
        .withColumn('{}_max_30m'.format(field), F.max(field).over(window))\
        .withColumn('{}_q25_30m'.format(field), F.expr("percentile_approx({}, 0.25)".format(field)).over(window))\
        .withColumn('{}_q75_30m'.format(field), F.expr("percentile_approx({}, 0.75)".format(field)).over(window))

    return r_df

# 30-minute sliding range window per ip; the ordering column must be numeric
# (e.g. epoch seconds) for a numeric rangeBetween frame
w_s = (Window
     .partitionBy("ip")
     .orderBy(F.col("timestamp"))
     .rangeBetween(-window_time, 0))

# Fold over the port list, adding the per-port packet counters
flows_filtered_v3_df = reduce(partial(add_port_column, window=w_s),
    ports,
    flows_filtered_v3_df
)

# Fold over the stats fields, adding the windowed summary statistics
flows_filtered_v3_df = reduce(partial(add_stats_column, window=w_s),
    stat_fields,
    flows_filtered_v3_df
)
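
(For reference, a made-up two-row sample showing the schema the code above expects; the values are illustrative, and the timestamp is assumed to be epoch seconds, which the numeric rangeBetween frame above relies on:)

# Hypothetical sample (assumes an existing SparkSession named `spark`);
# timestamp is epoch seconds so the -1800..0 range frame is well defined
sample_df = spark.createDataFrame(
    [("10.0.0.1",    0, "443", "50001", 10, 12, 1500, 2300),
     ("10.0.0.1", 1200,  "22", "50002",  4,  5,  400,  600)],
    ["ip", "timestamp", "source_port", "destination_port",
     "source_packets", "destination_packets", "source_bytes", "destination_bytes"])

# The same fold as above can be applied to any DataFrame with this schema
sample_out = reduce(partial(add_port_column, window=w_s), ports, sample_df)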
An aggregated count over ip (the partition key I chose); a sketch of that check is shown below.
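
(A minimal sketch, assuming the same DataFrame as above; the exact query I used may have differed:)

# Records per partition key (ip): large outliers here mean a skewed window partition
ip_counts = flows_filtered_v3_df.groupBy("ip").count()
ip_counts.describe("count").show()            # min / mean / stddev / max records per ip
ip_counts.orderBy(F.desc("count")).show(20)   # the heaviest ips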

I would like to know how I can speed this up, or what I am doing wrong that makes it take so long to compute.

Edit:

Adding some stats:

6 Spark nodes - 1 TB of memory / 252 cores in total. Spark version: 2.4.0-cdh6.3.1

Options specified:

org.apache.spark.deploy.SparkSubmit --conf spark.executor.memory=8g --conf spark.driver.memory=8g --conf spark.local.dir=/pkgs/cdh/tmp/spark --conf spark.yarn.security.tokens.hive.enabled=false --conf spark.yarn.security.credentials.hadoopfs.enabled=false --conf spark.security.credentials.hive.enabled=false --conf spark.app.name=DSS (Py): compute_flows_window_pyspark_2020-04-14 --conf spark.io.compression.codec=snappy --conf spark.sql.shuffle.partitions=40 --conf spark.shuffle.spill.compress=false --conf spark.shuffle.compress=false --conf spark.dku.limitedLogs={"filePartitioner.noMatch":100,"s3.ignoredPath":100,"s3.ignoredFile":100} --conf spark.security.credentials.hadoopfs.enabled=false --conf spark.jars.repositories=https://nexus.bisinfo.org:8443/repository/maven-central --conf spark.yarn.executor.memoryOverhead=600
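
(For readability, a sketch of the same settings expressed as SparkSession config in PySpark; the values are copied from the submit line above, and the DSS- and security-related flags are left out:)

from pyspark.sql import SparkSession

# Sketch: submit-time configuration restated as session config
spark = (SparkSession.builder
         .appName("DSS (Py): compute_flows_window_pyspark_2020-04-14")
         .config("spark.executor.memory", "8g")
         .config("spark.driver.memory", "8g")
         .config("spark.yarn.executor.memoryOverhead", "600")
         .config("spark.sql.shuffle.partitions", "40")
         .config("spark.io.compression.codec", "snappy")
         .config("spark.shuffle.compress", "false")
         .config("spark.shuffle.spill.compress", "false")
         .getOrCreate())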


Please go to the Spark UI and add the DAG graph for analysis; also add the Spark version, cluster size, memory allocation, etc., so the question can get a proper answer. - I have updated my question accordingly.