How does a for loop over a DataFrame affect the performance of Spark code?
I have two pieces of supposedly logically identical code below, and I'm curious which one is better and why.
The logic of the two snippets actually differs, and the second one is probably what you want. In the first, you end up selecting duplicate columns, because select does not overwrite columns, while withColumn does:
import pyspark.sql.functions as func

char_list = [('\\\\', '\\\\\\\\'), ('\n', '\\\\n'), ('\'', '\\\\\'')]

df = spark.createDataFrame([['1', '2']])
col_names = df.schema.names  # read the names after df is created

print(len(df.select(*[func.regexp_replace(col_name, char_set[0], char_set[1])
                      for char_set in char_list
                      for col_name in col_names]).columns))
# gives 6: one new column per (char_set, col_name) pair

df = spark.createDataFrame([['1', '2']])
for char_set in char_list:
    for col_name in col_names:
        df = df.withColumn(col_name, func.regexp_replace(col_name, char_set[0], char_set[1]))
print(len(df.columns))
# gives 2: withColumn overwrites each column in place
May I know whether, in the second snippet, the loop executes on the driver or on the executors? I assumed that a for loop written as part of a select statement runs on the executors. Am I right?

Both loops run on the driver, and because of lazy evaluation nothing is actually executed at that point: the code contains only transformations, no actions, so Spark merely builds a query plan. You can run df.explain() to inspect that plan.

May I know whether the second snippet is efficient, or is there another way? Yes, it looks fine to me:
char_list = [('\\\\', '\\\\\\\\'), ('\n', '\\\\n'), ('\'', '\\\\\'')]
col_names = df.schema.names
for char_set in char_list:
    for col_name in col_names:
        df = df.withColumn(col_name, func.regexp_replace(col_name, char_set[0], char_set[1]))