Python function in Spark

python pyspark

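I am trying to round the deadline_date column forward or backward based on a flag (deadline_rounding) in the activity_prioritization_rounding DataFrame: -1 means round backward, 0 means no rounding, and 1 means round forward.

The function works when I pass a single date as a variable, but I am struggling to apply it to the whole dataset. When I pass columns to the function, I get the error "ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions." I am relatively new to building functions in Python.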

from pyspark.sql.functions import col, next_day, date_sub
from pyspark.sql.functions import to_date


def next_date(column, date, dayOfWeek):
  # When `column` is a Spark Column, this comparison is where the ValueError is raised.
  if column == -1:
    return date_sub(next_day(date, dayOfWeek), 0)
  elif column == 1:
    return date_sub(next_day(date, dayOfWeek), 7)
  else:
    return date


activity_prioritization_rounding = sql("""select * from spa.activity_master""")
activity_prioritization_rounding.withColumn(
    "New_Date",
    next_date(col("deadline_rounding"), col("deadline_date"), "Friday")
)

You need to create a udf from the Python function and send "Friday" as a column, because it is not broadcast across the DataFrame. You can use lit to do this.

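from pyspark.sql.functions import udf, next_day, date_sub, to_date, lit, col
from pyspark.sql.types import DateType

# Wrap the Python function itself in udf(), then call the result with column arguments.
# As the edit below notes, this still fails, because next_date calls Spark functions.
next_date_udf = udf(next_date, DateType())
activity_prioritization_rounding.withColumn(
    "New_Date",
    next_date_udf(col("deadline_rounding"), col("deadline_date"), lit("Friday")))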

Edit: as @jxc correctly pointed out, you cannot use Spark functions inside a UDF.

Simplify it to when().when().otherwise() instead.
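To illustrate why a plain Python if fails here, the following is a minimal sketch that uses a hypothetical stand-in class, FakeColumn, rather than PySpark's real Column. It assumes only the behavior reported in the error message: comparing a column builds a new lazy expression, and asking that expression for a truth value raises ValueError.

```python
class FakeColumn:
    """Hypothetical stand-in for a Spark Column; not PySpark's real class."""

    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # The comparison builds a new expression instead of returning True/False.
        return FakeColumn(f"({self.expr} = {other!r})")

    def __bool__(self):
        # A lazy, per-row expression has no single truth value.
        raise ValueError("Cannot convert column into bool")


def try_python_if(column):
    """Mimic the `if column == -1:` test from next_date()."""
    try:
        if column == -1:  # `if` calls bool() on the expression object
            return "branch taken"
        return "branch skipped"
    except ValueError as exc:
        return f"ValueError: {exc}"


result = try_python_if(FakeColumn("deadline_rounding"))
print(result)
```

Because the condition is evaluated once in Python while the DataFrame is being defined, not once per row, the branching has to be expressed as a column expression, which is what when().otherwise() does.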

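Can you post the traceback? It feels like your error goes deeper than the code provided.

Just convert the if/else in the function next_date() to use when().when().otherwise().

I am still getting an error: ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions. The problem seems to be the function that rounds the deadline to the next date, at "if column == -1:".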

from pyspark.sql.functions import udf, next_day, date_sub, to_date, lit, when, col

day_of_week = "Friday"
activity_prioritization_rounding.withColumn("New_Date", when(
    col("deadline_rounding") == -1, date_sub(next_day(col("deadline_date"), day_of_week), 0)).when(
    col("deadline_rounding") == 1, date_sub(next_day(col("deadline_date"), day_of_week), 7)).otherwise(
    col("deadline_date")))
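As a way to sanity-check what the expression above produces for each row, here is a plain-Python sketch of the same branching. The helper names next_weekday and round_deadline are hypothetical, and the sketch assumes that next_day returns the first matching weekday strictly after the given date.

```python
from datetime import date, timedelta

FRIDAY = 4  # datetime.weekday(): Monday is 0, Friday is 4


def next_weekday(d, weekday):
    """First date strictly after d that falls on the given weekday."""
    days_ahead = (weekday - d.weekday() - 1) % 7 + 1
    return d + timedelta(days=days_ahead)


def round_deadline(flag, d, weekday=FRIDAY):
    """Mirror the when/when/otherwise branches for a single row."""
    if flag == -1:
        return next_weekday(d, weekday)                      # the following Friday
    if flag == 1:
        return next_weekday(d, weekday) - timedelta(days=7)  # the Friday on or before d
    return d                                                 # any other flag: unchanged


wednesday = date(2024, 1, 10)
print(round_deadline(-1, wednesday))
print(round_deadline(1, wednesday))
print(round_deadline(0, wednesday))
```

Under these assumptions, a flag of -1 yields the later Friday and a flag of 1 yields the earlier one, which looks like the reverse of the mapping described in the question, so it may be worth double-checking which branch should subtract the 7 days.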