PySpark: applying a UDF to multiple columns

apache-spark pyspark

I have a UDF like this:

 from pyspark.sql.functions import udf
 from pyspark.sql.types import IntegerType

 tudf = udf(lambda value: 1 if value >= 1 else 0, IntegerType())

I usually apply it one column at a time, like this:

 df = fdf.withColumn('COLUMN1', tudf(fdf.COLUMN1))

I'd like to know whether there is a way to apply it to multiple columns without going through them one by one.

Use a list comprehension:

fdf.select([
  tudf(c).alias(c) if c in cols_to_transform else c for c in fdf.columns
])

A udf is not recommended here, though. The same logic can be written with built-in column functions:

from pyspark.sql.functions import when, col

fdf.select([
  when(col(c) >= 1, 1).otherwise(0).alias(c) if c in cols_to_transform else c 
  for c in fdf.columns
])

Thanks, man. Is it not recommended because it is more computationally expensive?