Python PySpark - passing a timestamp to a UDF
I am trying to check a condition based on a timestamp, as shown below, and it throws an error. Can someone point out what I'm doing wrong?
import datetime
import pandas as pd
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

timestamp1 = pd.to_datetime('2018-02-14 12:09:36.0')
timestamp2 = pd.to_datetime('2018-02-14 12:10:00.0')

def check_formula(timestamp2, timestamp1, interval):
    if ((timestamp2 - timestamp1) <= datetime.timedelta(minutes=(interval / 2))):
        return True
    else:
        return False

chck_formula = udf(check_formula, BooleanType())
ts = chck_formula(timestamp2, timestamp1, 5)
print(ts)
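For reference, the comparison itself is fine in plain Python; the error most likely comes from calling the Spark UDF on plain Python values, since a UDF expects DataFrame columns as arguments. A minimal sketch of the intended check, using only the standard library:

```python
import datetime

def check_formula(timestamp2, timestamp1, interval):
    # True when the gap between the timestamps is at most interval/2 minutes
    return (timestamp2 - timestamp1) <= datetime.timedelta(minutes=interval / 2)

t1 = datetime.datetime(2018, 2, 14, 12, 9, 36)
t2 = datetime.datetime(2018, 2, 14, 12, 10, 0)
print(check_formula(t2, t1, 5))  # 24-second gap vs a 2.5-minute limit -> True
```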
Whatever we do, we need to work with an RDD or a DataFrame, and a UDF can only be applied to one of those. So you need to change the way you apply the UDF. Here are two ways to do it:
from pyspark.sql import functions as F
import datetime
df = sqlContext.createDataFrame([
['2018-02-14 12:09:36.0', '2018-02-14 12:10:00.0'],
], ["t1", "t2"])
interval = 5
# datediff returns whole days, so compare epoch seconds instead
df.withColumn("check", (F.unix_timestamp(F.col("t2")) - F.unix_timestamp(F.col("t1"))) <= datetime.timedelta(minutes=(interval/2)).total_seconds()).show(truncate=False)
+---------------------+---------------------+-----+
|t1 |t2 |check|
+---------------------+---------------------+-----+
|2018-02-14 12:09:36.0|2018-02-14 12:10:00.0|true |
+---------------------+---------------------+-----+
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import BooleanType
def check_formula(timestamp2, timestamp1, interval):
    if ((timestamp2 - timestamp1) <= datetime.timedelta(minutes=(interval / 2))):
        return True
    else:
        return False
chck_formula = udf(check_formula, BooleanType())
df.withColumn("check", chck_formula(F.from_utc_timestamp(F.col("t2"), "PST"), F.from_utc_timestamp(F.col("t1"), "PST"), F.lit(5))).show(truncate=False)
+---------------------+---------------------+-----+
|t1 |t2 |check|
+---------------------+---------------------+-----+
|2018-02-14 12:09:36.0|2018-02-14 12:10:00.0|true |
+---------------------+---------------------+-----+
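As an aside: since the question already builds the timestamps with pd.to_datetime, if the data is small enough to live locally, the same check can be vectorized in pandas without Spark at all. A sketch, assuming pandas is installed:

```python
import pandas as pd

df = pd.DataFrame({
    "t1": pd.to_datetime(["2018-02-14 12:09:36.0"]),
    "t2": pd.to_datetime(["2018-02-14 12:10:00.0"]),
})
interval = 5
# elementwise comparison of the gap against half the interval, as a Timedelta
df["check"] = (df["t2"] - df["t1"]) <= pd.Timedelta(minutes=interval / 2)
print(df["check"].tolist())  # [True]
```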