Apache Spark Structured Streaming job driver runs out of memory
I'm trying to run a file-based Structured Streaming job with S3 as the source. The files are in JSON format. The code reads newly added files from an S3 folder, parses the JSON attributes, and writes the data back to S3 in Parquet format. I'm running the Spark job on an AWS EMR cluster (version 5.29.0).

Streaming code:
from pyspark.sql.functions import input_file_name

def writeToOutput(inputDF, batchId):
    # Replace the previous batch's view with the current micro-batch
    spark.sql("drop table if exists global_temp.source_df")
    inputDF.cache()
    inputDF.createGlobalTempView("source_df")

    # df1_sql / df2_sql / df3_sql select from global_temp.source_df
    df1 = spark.sql(df1_sql)
    df2 = spark.sql(df2_sql)
    df3 = spark.sql(df3_sql)

    df1.repartition(1) \
        .write \
        .partitionBy("col1", "col2") \
        .format("parquet") \
        .mode('append') \
        .save(output_path + 'df1/')

    df2.repartition(1) \
        .write \
        .partitionBy("col1", "col2") \
        .format("parquet") \
        .mode('append') \
        .save(output_path + 'df2/')

    df3.repartition(1) \
        .write \
        .partitionBy("col1", "col2") \
        .format("parquet") \
        .mode('append') \
        .save(output_path + 'df3/')

    inputDF.unpersist()

inputDF = spark \
    .readStream \
    .schema(jsonSchema) \
    .option("latestFirst", "false") \
    .option("badRecordsPath", bad_records_path) \
    .option("maxFilesPerTrigger", "2000") \
    .json(input_path) \
    .withColumn('file_path', input_file_name())

query = inputDF.writeStream \
    .foreachBatch(writeToOutput) \
    .queryName("Stream") \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(processingTime='180 seconds') \
    .start()

query.awaitTermination()
My spark-submit configuration is:
spark-submit --master yarn --deploy-mode cluster \
    --executor-memory 5G --executor-cores 4 --driver-memory 15G --num-executors 40 \
    --conf spark.dynamicAllocation.enabled=false \
    --conf yarn.resourcemanager.am.max-attempts=4 \
    --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
    --conf spark.executor.memoryOverhead=512 \
    --conf spark.driver.memoryOverhead=512 \
    --conf spark.yarn.max.executor.failures=300 \
    --conf spark.yarn.executor.failuresValidityInterval=1h \
    --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.streaming.stopGracefullyOnShutdown=true \
    --conf spark.task.maxFailures=8 \
    --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-yarn.properties \
    --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-yarn.properties
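As a side note, a long list of `--conf` flags like this can also be kept in a properties file and passed with `--properties-file`, which makes the submit command easier to read and diff. A minimal sketch, assuming the same settings as the command above (the file name `spark-app.conf` is hypothetical):

```
# spark-app.conf (hypothetical name) -- pass with:
#   spark-submit --properties-file spark-app.conf ...
# Only spark.* keys belong here; YARN settings such as
# yarn.resourcemanager.am.max-attempts stay on the command line or in yarn-site.xml.
spark.dynamicAllocation.enabled                               false
spark.yarn.am.attemptFailuresValidityInterval                 1h
spark.executor.memoryOverhead                                 512
spark.driver.memoryOverhead                                   512
spark.yarn.max.executor.failures                              300
spark.yarn.executor.failuresValidityInterval                  1h
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2
spark.serializer                                              org.apache.spark.serializer.KryoSerializer
spark.task.maxFailures                                        8
```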
My job runs for 1-2 hours and then fails with the error:

Container is running beyond physical memory limits. Current usage: 15.6 GB of 15.5 GB physical memory used; 19.1 GB of 77.5 GB virtual memory used. Killing container.

I don't understand why the driver needs so much memory. Can anyone help?