Query execution error through PySpark: GC error (pyspark, garbage-collection, impala)


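I need to extract the row count of every table in a Hive database that has multiple schemas. I wrote a PySpark job that extracts the count for each table. It works fine when I run it on a few schemas, but when I run it on all schemas it fails with a GC overhead error. I tried building a UNION ALL of the count queries for all tables in the whole database, and also a UNION ALL for all tables within a single schema. Both attempts failed with the GC error.

Can you suggest how to avoid this error? Here is my script: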

from pyspark.sql.functions import col, concat, lit

# Helpers such as tables_df, formatted_list, prep_query, table_count and file_output,
# and the initial level_1_count_df / level_2_count_df / header_df, are not shown here.

# For loop for schemas starts here
for schema in schemas_list:

    # DataFrames with all table names available in the given schema for level1 and level2
    tables_1_df = tables_df(schema, 1)
    tables_1_list = formatted_list(tables_1_df, 1)
    tables_2_df = tables_df(schema, 2)
    tables_2_list = formatted_list(tables_2_df, 2)
    # Intersection of level1 and level2 tables per schema name
    tables_list = list(set(tables_1_list) & set(tables_2_list))

    # For loop for tables starts here
    for table in tables_list:

        # Build DataFrames with the row count of the given table for level1 and level2
        level_1_query = prep_query(schema, table, 1)
        level_2_query = prep_query(schema, table, 2)
        level_1_count_df = level_1_count_df.union(table_count(level_1_query))
        level_1_count_df.persist()
        level_2_count_df = level_2_count_df.union(table_count(level_2_query))
        level_2_count_df.persist()

# Validate that level1 and level2 are reconciled; if not, write the row into a
# DataFrame which will in turn be written to a file in the S3 location
level_1_2_join_df = (
    level_1_count_df.alias("one")
    .join(
        level_2_count_df.alias("two"),
        (level_1_count_df.schema_name == level_2_count_df.schema_name)
        & (level_1_count_df.table_name == level_2_count_df.table_name),
        'inner',
    )
    .select(col("one.schema_name"), col("two.table_name"),
            col("level_1_count"), col("level_2_count"))
)
main_df = header_df.union(level_1_2_join_df)
if extracttype == 'DELTA':
    main_df = main_df.filter(main_df.level_1_count != main_df.level_2_count)
main_df = main_df.select(concat(col("schema_name"), lit(","), col("table_name"), lit(","),
                                col("level_1_count"), lit(","), col("level_2_count")))

# Creates the file in the temp location
file_output(main_df, tempfolder)  # writes to a txt file in Hadoop

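Can you run .explain() on main_df and post the logical and physical plans that show how Spark executes it? The cause may be the persist calls (the heap may be overloaded). I think the join after the persist is what produces the GC overhead.

I see: you are persisting every iteration of the union. So instead of persisting one large level_1_count/level_2_count DataFrame, you persist each intermediate result, and those keep growing. This creates many unnecessary temporary objects on the heap whose total weight exceeds that of the single final DataFrame. Try persisting the final level_1_count/level_2_count outside the for loop.

Thanks, Mohammad. I did run the same script without persisting any DataFrame and I still got the GC overhead error. So I think the query keeps getting bigger with each UNION ALL and fails with this error because of that. I then tried persisting the DataFrame for each table and afterwards joining the level1 and level2 DataFrames. That fails while executing the count queries, not at the DataFrame join.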
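One way to act on the comments above, sketched under assumptions: instead of chaining one union per table (which makes the plan grow with every iteration), each count can be brought back to the driver as a plain Python tuple and turned into a single DataFrame at the end. The names `collect_counts`, `mismatches` and the `count_rows` callable are hypothetical and not part of the original script; `count_rows` stands in for whatever runs the per-table count query.

```python
from typing import Callable, Dict, Iterable, List, Tuple

CountRow = Tuple[str, str, int]  # (schema_name, table_name, row_count)


def collect_counts(
    tables: Iterable[Tuple[str, str]],
    count_rows: Callable[[str, str], int],
) -> List[CountRow]:
    """Return one (schema, table, count) tuple per table.

    Every count is computed on its own, so no query plan or cached
    DataFrame accumulates from one iteration to the next.
    """
    results: List[CountRow] = []
    for schema, table in tables:
        results.append((schema, table, count_rows(schema, table)))
    return results


def mismatches(
    level_1: List[CountRow], level_2: List[CountRow]
) -> List[Tuple[str, str, int, int]]:
    """Inner-join the two lists on (schema, table); keep rows whose counts differ."""
    level_2_by_key: Dict[Tuple[str, str], int] = {
        (s, t): c for s, t, c in level_2
    }
    return [
        (s, t, c1, level_2_by_key[(s, t)])
        for s, t, c1 in level_1
        if (s, t) in level_2_by_key and c1 != level_2_by_key[(s, t)]
    ]


# Possible Spark wiring (assumption: a SparkSession named `spark` exists and
# the tables are readable by name; the level-specific queries are not shown):
#   count_rows = lambda schema, table: spark.table(f"{schema}.{table}").count()
#   rows = collect_counts(all_tables, count_rows)
#   counts_df = spark.createDataFrame(rows, ["schema_name", "table_name", "row_count"])
```

Because the result is only one small row per table, holding it in driver memory should be cheap compared with keeping thousands of unioned and persisted DataFrames alive. Whether this removes the GC error in the original job depends on details the question does not show.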