Python: if I cache a Spark DataFrame and then overwrite the reference, will the original DataFrame still be cached?


Suppose I have a function that produces a (py)spark DataFrame and caches the DataFrame in memory as its last operation:

def gen_func(inputs):
   df = ... do stuff...
   df.cache()
   df.count()
   return df
As far as I understand, Spark's caching works as follows:

  • When cache/persist is called on a data frame, followed by an action (e.g. count()), the frame is computed from its DAG and cached in memory, attached to the object that references it.
  • As long as a reference to that object exists (possibly inside other functions / other scopes), df remains cached, and all DAGs that depend on df use the in-memory cached data as their starting point.
  • If all references to df are dropped, Spark treats the cache as memory to be garbage collected. It may not be garbage collected immediately, causing some short-lived memory pressure (in particular, memory leaks if cached data is generated and discarded too quickly), but it is eventually cleaned up.
  • My question is: suppose I use gen_func to generate a DataFrame, and then overwrite the original DataFrame reference with the result of, say, a filter or a withColumn (i.e. something like df = df.filter(...)).

    In Spark, RDDs/DataFrames are immutable, so the reassigned post-filter df and the pre-filter df refer to two completely different objects. In this case the reference to the original df, the one that was cached and counted, has been overwritten. Does that mean the cached DataFrame is no longer available and will be garbage collected? Does that mean the new post-filter df will compute everything from scratch, even though it was generated from the previously cached DataFrame?


    I am asking because I was recently fixing some out-of-memory problems in our code, and it looks to me like caching may be the issue. However, I don't yet understand all the details of what the safe ways of using cache are, and how one might accidentally invalidate cached memory. What is my understanding missing? Am I deviating from best practice in doing the above?

    I ran a couple of experiments, shown below. Apparently, once a DataFrame is cached it stays cached (as shown by getPersistentRDDs and by the query plan - InMemory etc.), even when all Python references to it are overwritten or deleted entirely with del, and garbage collection is invoked explicitly.

    Experiment 1:

    def func():
        data = spark.createDataFrame([[1],[2],[3]]).toDF('col1')
        data.cache()
        data.count()
        return data
    
    sc._jsc.getPersistentRDDs()
    
    df = func()
    sc._jsc.getPersistentRDDs()
    
    df2 = df.filter('col1 != 2')
    del df
    import gc
    gc.collect()
    sc._jvm.System.gc()
    sc._jsc.getPersistentRDDs()
    
    df2.select('*').explain()
    
    del df2
    gc.collect()
    sc._jvm.System.gc()
    sc._jsc.getPersistentRDDs()
    
    Result:

    >>> def func():
    ...     data = spark.createDataFrame([[1],[2],[3]]).toDF('col1')
    ...     data.cache()
    ...     data.count()
    ...     return data
    ...
    >>> sc._jsc.getPersistentRDDs()
    {}
    
    >>> df = func()
    >>> sc._jsc.getPersistentRDDs()
    {71: JavaObject id=o234}
    
    >>> df2 = df.filter('col1 != 2')
    >>> del df
    >>> import gc
    >>> gc.collect()
    93
    >>> sc._jvm.System.gc()
    >>> sc._jsc.getPersistentRDDs()
    {71: JavaObject id=o240}
    
    >>> df2.select('*').explain()
    == Physical Plan ==
    *(1) Filter (isnotnull(col1#174L) AND NOT (col1#174L = 2))
    +- *(1) ColumnarToRow
       +- InMemoryTableScan [col1#174L], [isnotnull(col1#174L), NOT (col1#174L = 2)]
             +- InMemoryRelation [col1#174L], StorageLevel(disk, memory, deserialized, 1 replicas)
                   +- *(1) Project [_1#172L AS col1#174L]
                      +- *(1) Scan ExistingRDD[_1#172L]
    
    >>> del df2
    >>> gc.collect()
    85
    >>> sc._jvm.System.gc()
    >>> sc._jsc.getPersistentRDDs()
    {71: JavaObject id=o250}
    
    Experiment 2:

    def func():
        data = spark.createDataFrame([[1],[2],[3]]).toDF('col1')
        data.cache()
        data.count()
        return data
    
    sc._jsc.getPersistentRDDs()
    
    df = func()
    sc._jsc.getPersistentRDDs()
    
    df = df.filter('col1 != 2')
    import gc
    gc.collect()
    sc._jvm.System.gc()
    sc._jsc.getPersistentRDDs()
    
    df.select('*').explain()
    
    del df
    gc.collect()
    sc._jvm.System.gc()
    sc._jsc.getPersistentRDDs()
    
    Result:

    >>> def func():
    ...     data = spark.createDataFrame([[1],[2],[3]]).toDF('col1')
    ...     data.cache()
    ...     data.count()
    ...     return data
    ...
    >>> sc._jsc.getPersistentRDDs()
    {}
    
    >>> df = func()
    >>> sc._jsc.getPersistentRDDs()
    {86: JavaObject id=o317}
    
    >>> df = df.filter('col1 != 2')
    >>> import gc
    >>> gc.collect()
    244
    >>> sc._jvm.System.gc()
    >>> sc._jsc.getPersistentRDDs()
    {86: JavaObject id=o323}
    
    >>> df.select('*').explain()
    == Physical Plan ==
    *(1) Filter (isnotnull(col1#220L) AND NOT (col1#220L = 2))
    +- *(1) ColumnarToRow
       +- InMemoryTableScan [col1#220L], [isnotnull(col1#220L), NOT (col1#220L = 2)]
             +- InMemoryRelation [col1#220L], StorageLevel(disk, memory, deserialized, 1 replicas)
                   +- *(1) Project [_1#218L AS col1#220L]
                      +- *(1) Scan ExistingRDD[_1#218L]
    
    >>> del df
    >>> gc.collect()
    85
    >>> sc._jvm.System.gc()
    >>> sc._jsc.getPersistentRDDs()
    {86: JavaObject id=o333}
    
    Experiment 3 (control experiment, showing that unpersist works):

    Result:

    >>> def func():
    ...     data = spark.createDataFrame([[1],[2],[3]]).toDF('col1')
    ...     data.cache()
    ...     data.count()
    ...     return data
    ...
    >>> sc._jsc.getPersistentRDDs()
    {}
    
    >>> df = func()
    >>> sc._jsc.getPersistentRDDs()
    {116: JavaObject id=o398}
    
    >>> df2 = df.filter('col1 != 2')
    >>> df2.select('*').explain()
    == Physical Plan ==
    *(1) Filter (isnotnull(col1#312L) AND NOT (col1#312L = 2))
    +- *(1) ColumnarToRow
       +- InMemoryTableScan [col1#312L], [isnotnull(col1#312L), NOT (col1#312L = 2)]
             +- InMemoryRelation [col1#312L], StorageLevel(disk, memory, deserialized, 1 replicas)
                   +- *(1) Project [_1#310L AS col1#312L]
                      +- *(1) Scan ExistingRDD[_1#310L]
    
    >>> df.unpersist()
    DataFrame[col1: bigint]
    >>> sc._jsc.getPersistentRDDs()
    {}
    
    >>> df2.select('*').explain()
    == Physical Plan ==
    *(1) Project [_1#310L AS col1#312L]
    +- *(1) Filter (isnotnull(_1#310L) AND NOT (_1#310L = 2))
       +- *(1) Scan ExistingRDD[_1#310L]
    
    To answer the OP's questions:

    Does this mean that the cached DataFrame is no longer available and will be garbage collected? Does this mean that the new post-filter df will compute everything from scratch, even though it was generated from a previously cached DataFrame?

    The experiments suggest that neither is the case. The DataFrame remains cached, it is not garbage collected, and the new DataFrame is computed using the cached (no-longer-referenceable) DataFrame, according to the query plan.

    Some useful functions related to cache usage (if you don't want to do this through the Spark UI) are:

    sc._jsc.getPersistentRDDs(), which shows the list of cached RDDs/DataFrames, and

    spark.catalog.clearCache(), which clears all cached RDDs/DataFrames.
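
    For example (a minimal sketch, assuming an active SparkSession named spark and its SparkContext sc, as in the experiments above), you could inspect and then clear the cache like this:

    # Show which RDDs/DataFrames are currently cached (an id -> JavaObject map).
    # Note that _jsc is an internal handle to the JVM SparkContext, so treat this
    # as a debugging aid rather than a stable public API.
    print(sc._jsc.getPersistentRDDs())

    # Drop every cached DataFrame/table registered in this SparkSession.
    spark.catalog.clearCache()

    # The map of persistent RDDs should now be empty.
    print(sc._jsc.getPersistentRDDs())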

    Am I deviating from best practice in doing the above?


    I'm not in a position to judge you on that, but, as one of the comments suggested, avoid re-assigning to df, because DataFrames are immutable. Imagine you were coding in Scala and had defined df as a val: executing df = df.filter(...) would simply not compile. Python can't enforce that by itself, but I think the best practice is to avoid overwriting any DataFrame variable, so that you can always call df.unpersist() afterwards once the cached result is no longer needed.
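
    A minimal sketch of that pattern (the variable names here are just illustrative; gen_func is the function from the question) keeps a separate name for the cached DataFrame so that it can still be unpersisted later:

    cached_df = gen_func(inputs)                  # cached inside gen_func via cache() + count()
    filtered_df = cached_df.filter('col1 != 2')   # derive new DataFrames under new names

    # ... work with filtered_df; its plan reads from cached_df's InMemoryRelation ...

    cached_df.unpersist()                         # release the cached blocks once they are no longer needed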

    Hoping to shed some light on Spark's behaviour with regard to caching.

  • When you have

    df = ... do stuff...
    df.cache()
    df.count()
    
  • ... and then elsewhere in your application

       another_df = ... do *same* stuff...
       another_df.*some_action()*
    
    ..., you would expect another_df to reuse the cached df DataFrame. After all, reusing previously computed results is the goal of caching. Realizing this, the Spark developers decided to use the analyzed logical plan as the "key" for identifying cached DataFrames, instead of relying purely on references from the application side. In Spark, CacheManager is the component that keeps track of cached computations, in the indexed sequence cachedData:

      /**
       * Maintains the list of cached plans as an immutable sequence.  Any updates to the list
       * should be protected in a "this.synchronized" block which includes the reading of the
       * existing value and the update of the cachedData var.
       */
      @transient @volatile
      private var cachedData = IndexedSeq[CachedData]()
    
    During query planning (in the CacheManager phase), this structure is scanned for all subtrees of the plan being analyzed, to check whether any of them have already been computed. If a match is found, Spark replaces that subtree with the corresponding InMemoryRelation from cachedData. (A short sketch after this list illustrates the plan-matching behaviour.)

  • The cache() function (a simple synonym for persist()) stores the DataFrame with the default storage level MEMORY_AND_DISK by delegating to the CacheManager.
  • Note that this differs from RDD caching, which by default uses the MEMORY_ONLY level. Once cached, a DataFrame remains cached in memory or on local executor disk until it is explicitly unpersist'ed, or until the CacheManager's clearCache() is called. When executor storage memory fills up completely, cached blocks start being evicted to disk using LRU (least recently used), but they are never simply "dropped".
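
    As a rough, self-contained sketch of that plan-matching behaviour (assuming an active SparkSession named spark; spark.range is used here only to get a deterministic plan), building the same logical plan twice under two unrelated variables should let the second one pick up the first one's cache:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    df = spark.range(100).withColumn('double_id', col('id') * 2)
    df.cache()    # registers df's analyzed plan with the CacheManager
    df.count()    # action that actually materializes the cached data

    # Built independently - no reference to df - but with an equivalent analyzed plan:
    another_df = spark.range(100).withColumn('double_id', col('id') * 2)

    # Expect an InMemoryTableScan / InMemoryRelation here: another_df reuses df's
    # cached data even though it never touched the df variable.
    another_df.explain()

    df.unpersist()        # removes the entry from cachedData
    another_df.explain()  # now the plan recomputes from Range, with no InMemoryRelation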


    Good question, by the way...

    You are running out of memory because you do not unpersist df after the last action. You need to clean up after the lineage graph of the action.
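
    A sketch of that cleanup (gen_func is the function from the question; the output path is a placeholder, and the try/finally is just one way to make sure the unpersist runs even if a downstream action fails):

    df = gen_func(inputs)    # comes back cached (cache() + count() inside gen_func)
    try:
        filtered = df.filter('col1 != 2')
        filtered.write.parquet('/tmp/filtered_output')   # hypothetical last action on this lineage
    finally:
        df.unpersist()       # release executor storage once the lineage has been consumed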