Using df.withColumn() on multiple columns in Python


python python-2.7 pyspark


I am extending SPSS Modeler using Python and PySpark.

I want to manipulate roughly 5000 columns, so I use the following construct:

for target in targets:
    inputData = inputData.withColumn(target+appendString, function(target))

This is very slow. Is there a more efficient way to do this for all target columns?

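targets contains the list of column names I want to use, and function(target) is a placeholder where I do different things with the columns, such as adding and dividing.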

I would be glad if you could help me :)

pandayo

Try this:

# select returns a new DataFrame, so assign the result back
inputData = inputData.select(
    '*',
    *(function(target).alias(target+appendString) for target in targets)
)
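To make the suggestion concrete, here is a minimal sketch rather than a definitive implementation. It assumes Spark 2.0 or later (for SparkSession) and that function(target) returns a PySpark Column. The names build_exprs, add_one_and_halve and demo, the suffix "_new", and the sample data are hypothetical and do not come from the question.

```python
def build_exprs(targets, append_string, make_expr):
    """Build one aliased expression per target column (no DataFrame is touched here)."""
    return [make_expr(t).alias(t + append_string) for t in targets]


def demo():
    # Imports are local so that defining the helpers does not require a Spark install.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[1]").appName("select-demo").getOrCreate()
    input_data = spark.createDataFrame([(1.0, 10.0), (2.0, 20.0)], ["a", "b"])

    # Hypothetical stand-in for function(target): add 1, then divide by 2.
    def add_one_and_halve(name):
        return (F.col(name) + 1) / 2

    exprs = build_exprs(["a", "b"], "_new", add_one_and_halve)

    # A single select keeps every original column ('*') and appends all new ones at once.
    result = input_data.select("*", *exprs)
    result.show()
    spark.stop()
```

Calling demo() on a machine with PySpark installed should display the original columns a and b followed by a_new and b_new. Building the list of expressions first and passing it to one select means the DataFrame is derived once instead of once per column.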

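Could you compare the execution plan of this approach with that of the OP's approach? I suspect that although this looks neater, it actually does the same thing under the hood.

This approach does not reassign the DataFrame on every iteration. You only generate a single DataFrame. But yes, the execution plan may well be the same.

Thank you, this was very helpful.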
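For anyone who wants to check the execution-plan question themselves, the sketch below builds both variants and prints their plans with DataFrame.explain(True). The helper names are hypothetical; df is assumed to be a PySpark DataFrame and make_expr a function that takes a column name and returns a Column. The PySpark documentation for withColumn notes that each call introduces a projection internally and recommends a single select when adding many columns, so even if the final physical plans turn out alike, the loop may still cost more while the plan is being built.

```python
def with_column_loop(df, targets, append_string, make_expr):
    """The question's approach: one withColumn call per target column."""
    for t in targets:
        df = df.withColumn(t + append_string, make_expr(t))
    return df


def single_select(df, targets, append_string, make_expr):
    """The answer's approach: one select carrying every new column."""
    new_cols = [make_expr(t).alias(t + append_string) for t in targets]
    return df.select("*", *new_cols)


def print_both_plans(df, targets, append_string, make_expr):
    # explain(True) prints the logical and physical plans to stdout for inspection.
    with_column_loop(df, targets, append_string, make_expr).explain(True)
    single_select(df, targets, append_string, make_expr).explain(True)
```

Running print_both_plans on a small subset of targets first keeps the printed plans short enough to compare by eye.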