Python spark-submit throws GC out of memory exception
I am new to Spark. We have the Hive query below, on top of which we perform a pivot operation using Spark and Python. The PySpark script below performs some pivot operations and writes into a Hive table. The Hive query returns 140 million rows.

Approach 1:
from pyspark import SparkContext
from pyspark.sql import HiveContext  # HiveContext lives in pyspark.sql, not pyspark
from pyspark.sql import functions as F
sc = SparkContext()
hc = HiveContext(sc)
tbl = hc.sql("""
Select Rating.BranchID
, Rating.Vehicle
, Rating.PersonalAutoCov
, Rating.PersonalVehicleCov
, Rating.EffectiveDate
, Rating.ExpirationDate
, attr.name as RatingAttributeName
, Cast(Rating.OutputValue as Int) OutputValue
, Rating.InputValue
From db.dbo_pcx_paratingdata_piext_master rating
Inner Join db.dbo_pctl_ratingattrname_piext_master attr
on rating.RatingAttribute = attr.id
and attr.CurrentRecordIndicator = 'Y'
Where
rating.CurrentRecordIndicator = 'Y'
""")
tbl.cache()
pvttbl1 = tbl.groupby("BranchId","Vehicle","PersonalAutoCov","PersonalVehicleCov","EffectiveDate","ExpirationDate")\
.pivot("RatingAttributeName")\
.agg({"InputValue":"max", "OutputValue":"sum"})
pvttbl1.createOrReplaceTempView("paRatingAttributes")
hc.sql("Create table dev_pekindataaccesslayer.createcount as select * from paRatingAttributes")
When I run the above script with the spark-submit command, the result is
java.lang.OutOfMemoryError: Java heap space
and sometimes
java.lang.OutOfMemoryError: GC overhead limit exceeded
The spark-submit command I used:
spark-submit spark_ex2.py --master yarn-cluster --num-executors 15 --executor-cores 50 --executor-memory 100g --driver-memory 100g --conf "spark.sql.shuffle.partitions=1000" --conf "spark.memory.offHeap.enabled=true" --conf "spark.memory.offHeap.size=100g" --conf "spark.network.timeout=1200" --conf "spark.executor.heartbeatInterval=1201"
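Note that in the command as written, every option comes after spark_ex2.py. spark-submit treats everything following the application file as arguments to the application itself, so --master, --executor-memory, --driver-memory and all the --conf settings above are passed to the Python script as sys.argv and never reach Spark; the job then runs with default memory settings, which by itself can explain the heap errors. A corrected ordering with the same values (yarn-cluster is also expressed as --master yarn --deploy-mode cluster in Spark 2.x) would look like:

```shell
# Options must come BEFORE the application file; anything placed after
# spark_ex2.py is handed to the script as sys.argv, not to Spark.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 15 \
  --executor-cores 50 \
  --executor-memory 100g \
  --driver-memory 100g \
  --conf spark.sql.shuffle.partitions=1000 \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=100g \
  --conf spark.network.timeout=1200 \
  --conf spark.executor.heartbeatInterval=1201 \
  spark_ex2.py
```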
Detailed log:
INFO MemoryStore: Memory use = 1480.9 KB (blocks) + 364.8 MB (scratch space shared across 40 tasks(s)) = 366.2 MB.
Storage limit = 366.3 MB.
WARN BlockManager: Persisting block rdd_11_22 to disk instead.
WARN BlockManager: Putting block rdd_11_0 failed due to an exception
WARN BlockManager: Block rdd_11_0 could not be removed as it was not found on disk or in memory
ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 10)
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
But the above script involves the creation of an intermediate table, which is an additional step.
In approach 2, when I keep the limit keyword and use the same spark-submit command, everything works fine.
What is wrong with my approach 1, and how can I make it work?
Note: I have already followed and tried all of the suggested conf parameters, but still with no success.

In case my last comment was not clear: I believe you may be ending up with something like java -Xmx100G -Xmx2G whateverthesparkmainclassiscalled, which would cause -Xmx2G to override -Xmx100G, and that would explain the out-of-memory errors. This is why it is important to look at the exact java command line; ps -eo args | grep java can be used to find the full command line.

@Tharunkumar Reddy.. Could you show us some sample data of "tbl" before the pivot, and the expected result after applying the pivot operation on it? Could you explain that part? Pivot is an expensive operation; maybe we can try some other way to do the same thing. Just a thought.

@vikrantrana, with the configuration --conf spark.yarn.appMasterEnv.SPARK_HOME=/dev/null everything works fine for me. Do you have any idea what this parameter does?

@Tharunkumar Reddy.. No idea. I will check it, but I found something interesting on pivot. It might be useful.

The parameter --conf spark.yarn.appMasterEnv.SPARK_HOME=/dev/null sets the environment variable SPARK_HOME to /dev/null in the YARN application master. In other words, it unsets any existing SPARK_HOME. I do not know why this fixes the problem. If I had to guess, I would say that either your existing SPARK_HOME contains a bad configuration file, or some JARs under SPARK_HOME are no longer being loaded, which frees up memory.
Approach 2:

from pyspark import SparkContext
from pyspark.sql import HiveContext  # HiveContext lives in pyspark.sql, not pyspark
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
sc = SparkContext()
hc = HiveContext(sc)
sqlContext = SQLContext(sc)
tbl = hc.sql("""
Select Rating.BranchID
, Rating.Vehicle
, Rating.PersonalAutoCov
, Rating.PersonalVehicleCov
, Rating.EffectiveDate
, Rating.ExpirationDate
, attr.name as RatingAttributeName
, Cast(Rating.OutputValue as Int) OutputValue
, Rating.InputValue
From db.dbo_pcx_paratingdata_piext_master rating
Inner Join db.dbo_pctl_ratingattrname_piext_master attr
on rating.RatingAttribute = attr.id
and attr.CurrentRecordIndicator = 'Y'
Where
rating.CurrentRecordIndicator = 'Y'
""")
tbl.createOrReplaceTempView("Ptable")
r=sqlContext.sql("select count(1) from Ptable")
m=r.collect()[0][0]
hc.sql("drop table if exists db.Ptable")
hc.sql("Create table db.Ptable as select * from Ptable")
tb2 = hc.sql("select * from db.Ptable limit "+str(m))
pvttbl1 = tb2.groupby("BranchId","Vehicle","PersonalAutoCov","PersonalVehicleCov","EffectiveDate","ExpirationDate")\
.pivot("RatingAttributeName")\
.agg({"InputValue":"max", "OutputValue":"sum"})
pvttbl1.createOrReplaceTempView("paRatingAttributes")
hc.sql("drop table if exists db.createcount")
hc.sql("Create table db.createcount STORED AS ORC as select * from paRatingAttributes")