Applying a custom function to multiple Spark DataFrames, with the DataFrame as a parameter
All, I need to process 3 large DataFrames, aDf, bDf and cDf, and I want to "trim" the string columns to remove whitespace:
from pyspark.sql.functions import trim, col

for col_name, col_dtype in aDf.dtypes:
    if col_dtype == "string":
        aDf = aDf.withColumn(col_name, trim(col(col_name)))
    else:
        aDf = aDf.withColumn(col_name, col(col_name))

for col_name, col_dtype in bDf.dtypes:
    if col_dtype == "string":
        bDf = bDf.withColumn(col_name, trim(col(col_name)))
    else:
        bDf = bDf.withColumn(col_name, col(col_name))

for col_name, col_dtype in cDf.dtypes:
    if col_dtype == "string":
        cDf = cDf.withColumn(col_name, trim(col(col_name)))
    else:
        cDf = cDf.withColumn(col_name, col(col_name))
Is there a better, more efficient way to handle this simple transformation? Each DataFrame has nearly 40 columns and about 100 million rows.
Although this works, I feel that even the DataFrame itself could be parameterized; that way the code would become more generic.
Any tips are appreciated. Thank you.

You can reuse the code, but the time taken will still be about the same:
from functools import reduce
from pyspark.sql import functions as f

def trimDF(df):
    # Fold over the (name, dtype) pairs from df.dtypes, trimming only
    # the string columns; non-string columns pass through unchanged.
    return reduce(
        lambda acc, c: acc.withColumn(c[0], f.trim(f.col(c[0])))
        if c[1] == "string" else acc,
        df.dtypes,
        df,
    )

aDf = trimDF(aDf)
bDf = trimDF(bDf)
cDf = trimDF(cDf)
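As a further tweak, every withColumn call adds another projection node to the query plan, which gets noticeable with ~40 columns. Building the column expressions up front and applying them in a single selectExpr keeps the plan flat. A minimal sketch; the helper name trim_select_exprs is illustrative (not from the answer above), and the expression-building part is pure Python, so it can be checked without a SparkSession:

```python
def trim_select_exprs(dtypes):
    """Given df.dtypes (a list of (name, type) pairs), return one SQL
    expression string per column, trimming only the string columns."""
    return [
        f"trim(`{name}`) AS `{name}`" if dtype == "string" else f"`{name}`"
        for name, dtype in dtypes
    ]

# With a real DataFrame, apply all expressions in one projection:
# aDf = aDf.selectExpr(*trim_select_exprs(aDf.dtypes))
```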
Hope this helps.
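Since trimDF takes the DataFrame itself as a parameter, the three assignment lines can also be collapsed into a loop over a dict. A sketch of the pattern; the dict name frames is illustrative, and a str.strip stand-in replaces the real trimDF here so the snippet runs without a SparkSession:

```python
def trimDF(df):
    # Stand-in for the real trimDF: on plain strings, "trimming" is just
    # str.strip(); with Spark DataFrames, substitute the real function.
    return df.strip()

frames = {"aDf": "  alpha ", "bDf": " beta", "cDf": "gamma "}
# Apply the same transformation to every DataFrame in one pass.
frames = {name: trimDF(df) for name, df in frames.items()}
# frames["aDf"] is now "alpha"
```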