PySpark lambda expression

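I am trying to compare a column in a Spark DataFrame against a given date: if the column's date is earlier than the given date, add n hours; otherwise add x hours.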

Something like:

from datetime import timedelta
addhours = lambda x, y: x + timedelta(hours=14) if x < y else x + timedelta(hours=10)

Here is a sample of the df:

from pyspark.sql import functions as F
import datetime
df = spark.createDataFrame([('America/NewYork', '2020-02-01 10:00:00'),('Africa/Nairobi', '2020-02-01 10:00:00')],["OriginTz", "Time"])

I'm fairly new to Spark DataFrames :)

Use a when + otherwise expression instead of a udf.

Example:

from pyspark.sql import functions as F

#we are casting to timestamp and date so that we can compare in when
df = spark.createDataFrame([('America/NewYork', '2020-02-01 10:00:00'),('Africa/Nairobi', '2003-02-01 10:00:00')],["OriginTz", "Time"]).\
withColumn("literal",F.lit('2015-01-01').cast("date")).\
withColumn("Time",F.col("Time").cast("timestamp"))

df.show()
#+---------------+-------------------+----------+
#|       OriginTz|               Time|   literal|
#+---------------+-------------------+----------+
#|America/NewYork|2020-02-01 10:00:00|2015-01-01|
#| Africa/Nairobi|2003-02-01 10:00:00|2015-01-01|
#+---------------+-------------------+----------+

#convert Time to epoch seconds with unix_timestamp, add 10 * 3600 (10 hrs) or 14 * 3600 (14 hrs), then convert back to a timestamp
df.withColumn("new_date",F.when(F.col("Time") > F.col("literal"),F.to_timestamp(F.unix_timestamp(F.col("Time"),'yyyy-MM-dd HH:mm:ss')  + 10 * 3600)).\
    otherwise(F.to_timestamp(F.unix_timestamp(F.col("Time"),'yyyy-MM-dd HH:mm:ss')  + 14 * 3600))).\
show()

#+---------------+-------------------+----------+-------------------+
#|       OriginTz|               Time|   literal|           new_date|
#+---------------+-------------------+----------+-------------------+
#|America/NewYork|2020-02-01 10:00:00|2015-01-01|2020-02-01 20:00:00|
#| Africa/Nairobi|2003-02-01 10:00:00|2015-01-01|2003-02-02 00:00:00|
#+---------------+-------------------+----------+-------------------+
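
The epoch arithmetic in the snippet above can be sanity-checked outside Spark. The following is only a minimal sketch in plain Python, not part of the original answer: `shift_hours` is a hypothetical helper name, and the timestamps are assumed to be UTC so that no daylight-saving shift interferes (Spark itself uses the session time zone).

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M:%S"

def shift_hours(ts: str, hours: int) -> str:
    """Hypothetical helper: string -> epoch seconds -> add hours -> string."""
    # Assumption: treat the input as UTC to keep the arithmetic free of DST effects
    dt = datetime.strptime(ts, FMT).replace(tzinfo=timezone.utc)
    # Same idea as unix_timestamp(...) + hours * 3600 in the Spark snippet
    epoch = int(dt.timestamp()) + hours * 3600
    # Same idea as converting the epoch value back with to_timestamp(...)
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime(FMT)

print(shift_hours("2020-02-01 10:00:00", 10))  # 2020-02-01 20:00:00
print(shift_hours("2003-02-01 10:00:00", 14))  # 2003-02-02 00:00:00
```

Both printed values agree with the new_date column shown above.
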
If you don't want to add the literal value as a DataFrame column, use:

lit_val='2015-01-01'

df = spark.createDataFrame([('America/NewYork', '2020-02-01 10:00:00'),('Africa/Nairobi', '2003-02-01 10:00:00')],["OriginTz", "Time"]).\
withColumn("Time",F.col("Time").cast("timestamp"))

df.withColumn("new_date",F.when(F.col("Time") > F.lit(lit_val).cast("date"),F.to_timestamp(F.unix_timestamp(F.col("Time"),'yyyy-MM-dd HH:mm:ss')  + 10 * 3600)).\
    otherwise(F.to_timestamp(F.unix_timestamp(F.col("Time"),'yyyy-MM-dd HH:mm:ss')  + 14 * 3600))).\
show()

#+---------------+-------------------+-------------------+
#|       OriginTz|               Time|           new_date|
#+---------------+-------------------+-------------------+
#|America/NewYork|2020-02-01 10:00:00|2020-02-01 20:00:00|
#| Africa/Nairobi|2003-02-01 10:00:00|2003-02-02 00:00:00|
#+---------------+-------------------+-------------------+
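
One detail worth checking: the question's rule tests x < y, while the when condition tests Time > the literal, so the two can only disagree when the values are exactly equal. The sketch below illustrates this in plain Python rather than Spark; `question_rule` and `when_rule` are hypothetical names introduced here for the comparison.

```python
from datetime import datetime, timedelta

def question_rule(x, y):
    # Rule as stated in the question: strictly earlier than y -> +14 h, else +10 h
    return x + timedelta(hours=14) if x < y else x + timedelta(hours=10)

def when_rule(x, y):
    # Rule as written in the when/otherwise snippet: strictly later than y -> +10 h, else +14 h
    return x + timedelta(hours=10) if x > y else x + timedelta(hours=14)

cutoff = datetime(2015, 1, 1)
later = datetime(2020, 2, 1, 10)
earlier = datetime(2003, 2, 1, 10)

print(question_rule(later, cutoff) == when_rule(later, cutoff))      # True
print(question_rule(earlier, cutoff) == when_rule(earlier, cutoff))  # True
# Only the exact-equality case differs between the two rules
print(question_rule(cutoff, cutoff), when_rule(cutoff, cutoff))
# 2015-01-01 10:00:00 2015-01-01 14:00:00
```

If the equality case matters, writing the when condition as Time < literal with the 14-hour branch first would match the question's rule exactly.
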

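You can also do this with expr and an interval. That way you don't have to convert to another format. Here the comparison date is assumed to be in a column named y: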

from pyspark.sql import functions as F
df.withColumn("new_date", F.expr("""IF(Time<y, Time + interval 14 hours, Time + interval 10 hours)""")).show()

#+---------------+-------------------+----------+-------------------+
#|       OriginTz|               Time|         y|           new_date|
#+---------------+-------------------+----------+-------------------+
#|America/NewYork|2020-02-01 10:00:00|2020-01-01|2020-02-01 20:00:00|
#| Africa/Nairobi|2020-02-01 10:00:00|2020-01-01|2020-02-01 20:00:00|
#+---------------+-------------------+----------+-------------------+