Applying a custom function to multiple Spark DataFrames, passing the DataFrame as a parameter


All,

I need to process three large DataFrames, aDf, bDf and cDf, and I want to trim the string columns to remove whitespace:

from pyspark.sql.functions import trim, col

# Trim every string column; leave columns of other types unchanged.
for col_name, col_dtype in aDf.dtypes:
    if col_dtype == "string":
        aDf = aDf.withColumn(col_name, trim(col(col_name)))
    else:
        aDf = aDf.withColumn(col_name, col(col_name))

for col_name, col_dtype in bDf.dtypes:
    if col_dtype == "string":
        bDf = bDf.withColumn(col_name, trim(col(col_name)))
    else:
        bDf = bDf.withColumn(col_name, col(col_name))

for col_name, col_dtype in cDf.dtypes:
    if col_dtype == "string":
        cDf = cDf.withColumn(col_name, trim(col(col_name)))
    else:
        cDf = cDf.withColumn(col_name, col(col_name))
Is there a better and more efficient way to handle this simple transformation? Each DataFrame has almost 40 columns and roughly 100 MM rows.

While this works, I feel that even the DataFrame itself could be parameterized; that way the code becomes more generic.


Any tips are appreciated. Thank you.

You can reuse the code, but the time it takes will stay the same:

from functools import reduce
from pyspark.sql import functions as f

def trimDF(df):
    # Fold over the (column name, dtype) pairs: trim string columns, keep the rest as-is.
    return reduce(
        lambda acc, c: acc.withColumn(c[0], f.trim(f.col(c[0]))) if c[1] == "string" else acc,
        df.dtypes,
        df,
    )

aDf = trimDF(aDf)
bDf = trimDF(bDf)
cDf = trimDF(cDf)
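
If the long chain of withColumn calls is a concern at this scale, the same transformation can also be expressed as a single select projection. A minimal sketch of that variant (the trim_strings helper name is just illustrative):

from pyspark.sql import functions as f

def trim_strings(df):
    # One projection: trim string columns, pass the other columns through unchanged.
    return df.select([
        f.trim(f.col(name)).alias(name) if dtype == "string" else f.col(name)
        for name, dtype in df.dtypes
    ])

aDf = trim_strings(aDf)
bDf = trim_strings(bDf)
cDf = trim_strings(cDf)

Either way the scan of ~100 MM rows dominates the runtime, so the two forms should take similar time; the select form mainly keeps the query plan flat.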

Hope this helps.