Amazon S3 PySpark streaming application hangs during batch processing

Tags: amazon-s3, pyspark, spark-streaming, amazon-kinesis

I have a PySpark application that loads data from Kinesis and saves it to S3.

The processing time of each batch is fairly stable, but at some point a batch can get stuck. How can I find out why this happens?

Code example:

# Imports needed by the snippet below: json, sys, datetime plus the PySpark
# SQL, streaming and Kinesis helpers that the code references.
import json
import sys
from datetime import datetime

from pyspark import SparkContext, StorageLevel
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import TimestampType
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

# schema, path_prefix, checkpoint_prefix, db_columns, db_connection_string and
# db_connection_propetries are defined elsewhere in the application.
columns = [x.name for x in schema]
Event = Row(*columns)


# Lazily instantiated singleton SparkSession, following the pattern from the
# Spark Streaming programming guide.
def get_spark_session_instance(sparkConf):
    if ("sparkSessionSingletonInstance" not in globals()):
        globals()["sparkSessionSingletonInstance"] = SparkSession \
            .builder \
            .config(conf=sparkConf) \
            .getOrCreate()
    return globals()["sparkSessionSingletonInstance"]


# Builds a fresh StreamingContext; getActiveOrCreate below calls this when no
# checkpoint is available to recover from.
def creating_func():
    # Prints a timestamped marker so the duration of each stage can be read
    # from the driver output.
    def timing(message):
        print('timing', str(datetime.utcnow()), message)

    # Writes one game's events to S3 (all events, then group 2 only) and loads
    # group 3 into the database over JDBC.
    def process_game(df, game, time_part):

        # s3
        df.write.json("{}/{}/{}/{}".format(path_prefix, game, 'group_1', time_part),
                      compression="gzip", timestampFormat="yyyy-MM-dd'T'HH:mm:ss.SSS")
        timing('{}_group_1'.format(game))

        df[df['group'] == 2] \
            .write.json("{}/{}/{}/{}".format(path_prefix, game, 'group_2', time_part),
                        compression="gzip", timestampFormat="yyyy-MM-dd'T'HH:mm:ss.SSS")
        timing('{}_group_2'.format(game))

        # database
        df[df['group'] == 3].select(*db_columns) \
            .write.jdbc(db_connection_string, table="test.{}group_3".format(game), mode='append',
                        properties=db_connection_propetries)
        timing('{}_db'.format(game))

    # Parses one Kinesis record (a JSON string) into an Event Row; the nested
    # json_data field is kept as a JSON string.
    def event_to_row(event):
        event_dict = json.loads(event)
        event_dict['json_data'] = event_dict.get('json_data') and json.dumps(
            event_dict.get('json_data'))
        return Event(*[event_dict.get(x) for x in columns])

    # Runs once per micro-batch: converts the raw events into a DataFrame and
    # fans the writes out per game.
    def process(rdd):
        if not rdd.isEmpty():

            spark_time = datetime.utcnow().strftime('%Y/%m/%d/%H/%M%S_%f')

            rows_rdd = rdd.map(event_to_row)
            spark = get_spark_session_instance(rdd.context.getConf())
            df = spark.createDataFrame(data=rows_rdd, schema=schema)
            df = df.withColumn("ts", df["ts"].cast(TimestampType())) \
                .withColumn("processing_time", lit(datetime.utcnow()))

            df.cache()

            print('timing -----------------------------')

            process_game(df[df['app_id'] == 1], 'app_1', spark_time)
            process_game(df[df['app_id'] == 2], 'app_2', spark_time)

    sc = SparkContext.getOrCreate()
    ssc = StreamingContext(sc, 240)
    kinesis_stream = KinesisUtils.createStream(
        ssc, sys.argv[2], 'My-stream-name', "kinesis.us-east-1.amazonaws.com",
        'us-east-1', InitialPositionInStream.TRIM_HORIZON, 240, StorageLevel.MEMORY_AND_DISK_2)

    kinesis_stream.repartition(16 * 3).foreachRDD(process)

    ssc.checkpoint(checkpoint_prefix + sys.argv[1])
    return ssc

if __name__ == '__main__':
    print('timing', 'cast ts', str(datetime.utcnow()))

    ssc = StreamingContext.getActiveOrCreate(checkpoint_prefix + sys.argv[1], creating_func)

    ssc.start()
    ssc.awaitTermination()

Work out which process is spending the time, and use kill -QUIT or jstack to get a stack trace. Look through the source for where the delay could be happening, and consider where you could add Log4J logging to get more information.
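
kill -QUIT and jstack work on the JVM side; on the Python driver the same idea can be approximated by logging a timestamp around each blocking write, so the last "start" message without a matching "end" message in the driver log points at the stage that was running when the batch froze. A minimal sketch; the timed_stage helper and the batch_timing logger name are illustrative and not part of the original code:

import logging
from contextlib import contextmanager
from datetime import datetime

log = logging.getLogger("batch_timing")  # illustrative logger name


@contextmanager
def timed_stage(name):
    """Log entry and exit timestamps around a blocking call such as df.write."""
    start = datetime.utcnow()
    log.info("start %s at %s", name, start.isoformat())
    try:
        yield
    finally:
        log.info("end %s after %.1fs", name,
                 (datetime.utcnow() - start).total_seconds())


# Example use inside process_game():
#
#     with timed_stage("{}_group_1".format(game)):
#         df.write.json(...)
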
Does the delay grow with the amount of data being written? If so, it is the usual S3 "rename is really a copy" problem.
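
For context on that problem: the default file output committer writes task output to a temporary directory and then renames it into place at commit time, and on S3 a rename is a full copy followed by a delete, so the commit phase grows with the amount of data written. Not part of the original answer, but a commonly used mitigation on Spark 2.x with Hadoop 2.7+ is the "version 2" commit algorithm, which commits task output directly instead of doing a second rename pass at job commit. A minimal configuration sketch (the app name is illustrative):

from pyspark import SparkConf

conf = (SparkConf()
        .setAppName("kinesis-to-s3")  # illustrative name
        # Hadoop setting passed through Spark: skip the job-level rename pass.
        .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        # Speculative duplicate tasks interact badly with S3 output commits.
        .set("spark.speculation", "false"))
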

> Does the delay grow with the amount of data being written?

The normal execution time is 40 seconds, but when it hangs it can stay frozen for an entire night, and I have never seen it unfreeze.

Then it is something else; I'm afraid it is time to start debugging from the stack traces. It looks as if one of the workers is frozen: my S3 folder contains fewer files than this batch should have produced, and there is no _SUCCESS file.