How a DataFrame for loop affects the performance of Spark code


I have two pieces of code below that I believe are logically the same, and I'm curious which one is better and why.

One,


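  • The two snippets are not logically equivalent. The second one is probably what you want. In the first one you end up selecting duplicated columns, because select does not overwrite columns (it emits one output column per expression), whereas withColumn overwrites an existing column of the same name.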

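    from pyspark.sql import SparkSession
    import pyspark.sql.functions as func
    
    spark = SparkSession.builder.getOrCreate()
    
    # (regex pattern, replacement) pairs: escape backslashes, newlines and single quotes
    char_list = [('\\\\', '\\\\\\\\'), ('\n', '\\\\n'), ('\'', '\\\\\'')]
    
    # select: one output column per expression, so 3 pairs x 2 columns = 6 columns
    df = spark.createDataFrame([['1','2']])
    col_names = df.schema.names  # read the names only after df exists
    print(len(df.select( *[func.regexp_replace(col_name, char_set[0], char_set[1]) for char_set in char_list for col_name in col_names]).columns))
    # gives 6
    
    # withColumn: an existing column name is overwritten, so the count stays at 2
    df = spark.createDataFrame([['1','2']])
    for char_set in char_list:
        for col_name in col_names:
            df = df.withColumn(col_name, func.regexp_replace(col_name, char_set[0], char_set[1]))
    
    print(len(df.columns))
    # gives 2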
    

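    Comment: In the second snippet, does the loop execute on the driver or on the executors? I assume the for loop written as part of the select statement executes on the executors. Am I right?
    
    Reply: Because of lazy evaluation, nothing is executed at that point. The code contains only transformations and no actions, so only a query plan is created. You can call df.explain() to inspect the query plan.
    
    Comment: Is the second snippet efficient, or is there another way?
    
    Reply: Yes, it looks fine to me.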