Python: Out of memory when trying to persist a DataFrame


python apache-spark pyspark


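I am getting an out-of-memory error when trying to persist a DataFrame, and I really don't understand why. I have a DataFrame of roughly 20 GB with 2.5 million rows and about 20 columns. After filtering it, I am left with 4 columns and 0.5 million rows.

My problem is that when I persist the filtered DataFrame, I get an out-of-memory error (25.4 GB of 20 GB physical memory used). I have tried persisting at different storage levels.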

from pyspark import StorageLevel

df = spark.read.parquet(path)  # 20 GB
df_filter = df.select('a', 'b', 'c', 'd').where(df.a == something)  # a few GB
df_filter.persist(StorageLevel.MEMORY_AND_DISK)
df_filter.count()  # action that triggers the persist

My cluster has 8 nodes with 30 GB of memory each.

Do you have any idea where the OOM could come from?

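Just a few suggestions to help identify the root cause.

You probably have one (or a combination) of the following:

Skewed source data: uneven partition split sizes are hard to process and lead to garbage collection pressure, OOM, and so on (the methods below have helped me, but there may be better approaches for each use case).

Too few or too many executors, too little or too much RAM, or too few or too many cores set in the config.

Wide-transformation shuffles with too few or too many partitions => try the general debugging checks to see which transformations are triggered when persisting, and find the number of output partitions written to disk.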
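To make the skew check in the first point more concrete, here is a minimal sketch, assuming a PySpark DataFrame such as the question's df_filter. The helper names rows_per_partition and skew_ratio are hypothetical and introduced only for illustration; spark_partition_id is a built-in function in pyspark.sql.functions. The "max divided by mean" measure is just one simple heuristic, not something the answer prescribes.

```python
def rows_per_partition(df):
    """Return {partition_id: row_count} for a PySpark DataFrame.

    Triggers a Spark job. `df` is assumed to be something like df_filter.
    Partitions that hold no rows do not appear in the result.
    """
    # Imported lazily so skew_ratio below stays usable without PySpark.
    from pyspark.sql.functions import spark_partition_id

    rows = (
        df.groupBy(spark_partition_id().alias("pid"))
          .count()
          .collect()
    )
    return {row["pid"]: row["count"] for row in rows}


def skew_ratio(counts):
    """Largest partition size divided by the mean partition size.

    A value near 1.0 means evenly sized partitions; a much larger value
    hints at skew (hypothetical heuristic, not a Spark API).
    """
    sizes = list(counts.values())
    if not sizes or sum(sizes) == 0:
        return 0.0
    return max(sizes) / (sum(sizes) / len(sizes))
```

On a cluster this could be used as skew_ratio(rows_per_partition(df_filter)); a high ratio would be one reason to try a repartition before persisting.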

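Wild guess: could it be that df_filter is initially just a view of df, but persist internally calls .copy() (why it would do that I don't know, but it is still possible), which then causes the OOM?

Without persist, do you get the same error?

Thanks for your answer. No, without persist I actually don't get any error.

Please try df_filter.persist(StorageLevel.MEMORY_AND_DISK_SER).count()

Do you have the same problem with .persist(StorageLevel.DISK_ONLY)?

Thanks purplepython. It seems that adjusting the number of partitions did the trick, in particular using the rule of thumb numPartitions = numCPUs * 4, as described in this article.

You're welcome... and thanks for sharing the article.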

# to check num partitions
df_filter.rdd.getNumPartitions()

# to repartition (**does cause shuffle**) to increase parallelism and help with data skew
df_filter.repartition(...) # monitor/debug performance in spark ui after setting

# check via
spark.sparkContext.getConf().getAll()

# these are the ones you want to watch out for
'''
--num-executors
--executor-cores
--executor-memory
'''

# debug the directed acyclic graph (DAG)
df_filter.explain()  # also watch the Spark UI to examine each stage/partition while persisting

# check output partitions if shuffle occurs
spark.conf.get("spark.sql.shuffle.partitions")
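The rule of thumb mentioned in the comments (numPartitions = numCPUs * 4) can be sketched as follows. This is only an illustration under stated assumptions: the function names target_partitions and repartition_and_persist are hypothetical, and spark.sparkContext.defaultParallelism is used as a stand-in for the total number of cores, which may not match the actual core count on every cluster manager.

```python
def target_partitions(total_cores, factor=4):
    """Rule of thumb from the comments: numPartitions = numCPUs * 4."""
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return total_cores * factor


def repartition_and_persist(spark, df, factor=4):
    """Repartition `df` by the rule of thumb, then persist and materialize it."""
    from pyspark import StorageLevel  # lazy import: only needed on a cluster

    # Assumption: defaultParallelism approximates the total core count.
    n = target_partitions(spark.sparkContext.defaultParallelism, factor)
    out = df.repartition(n)  # causes a shuffle
    out.persist(StorageLevel.MEMORY_AND_DISK)
    out.count()  # action that actually triggers the caching
    return out
```

For example, target_partitions(8) gives 32. Whether a factor of 4 is the right choice depends on the data and the cluster, so it is worth checking the resulting task sizes in the Spark UI.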