PySpark: checking the datetime format of a column in a dataframe
I have an input dataframe that contains the following data:
id date_column
1 2011-07-09 11:29:31+0000
2 2011-07-09T11:29:31+0000
3 2011-07-09T11:29:31
4 2011-07-09T11:29:31+0000
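I want to check whether date_column matches the format "%Y-%m-%dT%H:%M:%S+0000". If it matches, I want to add a column with the value 1, otherwise 0.
Currently, I have defined a UDF to do this: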
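from datetime import datetime

def date_pattern_matching(value, pattern):
    # Return "1" when value parses with the given strptime pattern, else "0".
    try:
        datetime.strptime(str(value), pattern)
        return "1"
    except ValueError:
        return "0"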
It produces the following output dataframe:
id date_column output
1 2011-07-09 11:29:31+0000 0
2 2011-07-09T11:29:31+0000 1
3 2011-07-09T11:29:31 0
4 2011-07-09T11:29:31+0000 1
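For reference, the check can be reproduced outside Spark. This is a minimal sketch in plain Python, assuming the pattern string is "%Y-%m-%dT%H:%M:%S+0000"; the names PATTERN, samples, lenient and invalid are introduced here only for illustration.

```python
from datetime import datetime

# Assumed pattern: the format from the question, with %M for minutes.
PATTERN = "%Y-%m-%dT%H:%M:%S+0000"

def date_pattern_matching(value, pattern):
    """Return "1" if value parses with pattern, otherwise "0"."""
    try:
        datetime.strptime(str(value), pattern)
        return "1"
    except ValueError:
        return "0"

# The four values from the input dataframe.
samples = [
    "2011-07-09 11:29:31+0000",
    "2011-07-09T11:29:31+0000",
    "2011-07-09T11:29:31",
    "2011-07-09T11:29:31+0000",
]
results = [date_pattern_matching(v, PATTERN) for v in samples]
print(results)

# strptime accepts fields that are not zero-padded ...
lenient = date_pattern_matching("2011-7-9T1:2:3+0000", PATTERN)
# ... but rejects impossible calendar values such as month 13.
invalid = date_pattern_matching("2011-13-09T11:29:31+0000", PATTERN)
print(lenient, invalid)
```

Any replacement for the UDF may differ on these two edge cases, so it is worth deciding up front whether unpadded fields should count as a match.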
Running this through a UDF takes a long time. Is there another way to achieve this?

Try the pyspark.sql.Column.rlike operator with a regex inside a when/otherwise block:
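from pyspark.sql import functions as F

# Sample data; assumes an existing SparkSession named `spark`.
data = [[1, "2011-07-09 11:29:31+0000"],
        [1, "2011-07-09 11:29:31+0000"],
        [2, "2011-07-09T11:29:31+0000"],
        [3, "2011-07-09T11:29:31"],
        [4, "2011-07-09T11:29:31+0000"]]
df = spark.createDataFrame(data, ["id", "date_column"])

# Raw string so the backslashes reach the regex engine unchanged.
regex = r"([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\+?\-?[0-9]{4})"

# 1 where date_column matches the regex, 0 otherwise.
df_w_output = df.select("*", F.when(F.col("date_column").rlike(regex), 1).otherwise(0).alias("output"))

# truncate=False prints the full date strings instead of cutting them at 20 characters.
df_w_output.show(truncate=False)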
Output
+---+------------------------+------+
|id |date_column             |output|
+---+------------------------+------+
|1  |2011-07-09 11:29:31+0000|0     |
|1  |2011-07-09 11:29:31+0000|0     |
|2  |2011-07-09T11:29:31+0000|1     |
|3  |2011-07-09T11:29:31     |0     |
|4  |2011-07-09T11:29:31+0000|1     |
+---+------------------------+------+
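Before running the pattern on a large dataframe, it may help to sanity-check it locally. The sketch below uses Python's re module as a stand-in for the regex engine behind rlike; for a pattern this simple the two should agree, but that is an assumption rather than a guarantee. STRICT is a hypothetical tighter variant (anchored, with a literal +0000 suffix), not part of the answer above, and flags is a helper introduced only for this example.

```python
import re

# The pattern from the answer, written as a raw string.
LOOSE = r"([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\+?\-?[0-9]{4})"
# Hypothetical stricter variant: anchored, and the offset must be exactly +0000.
STRICT = r"^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}\+0000$"

samples = [
    "2011-07-09 11:29:31+0000",
    "2011-07-09T11:29:31+0000",
    "2011-07-09T11:29:31",
    "2011-07-09T11:29:31+0000",
]

def flags(pattern, values):
    # re.search matches anywhere in the string, like rlike does.
    return [1 if re.search(pattern, v) else 0 for v in values]

print(flags(LOOSE, samples))
print(flags(STRICT, samples))

# An unanchored pattern with an optional sign also accepts embedded matches.
embedded = ["prefix 2011-07-09T11:29:310000 suffix"]
print(flags(LOOSE, embedded), flags(STRICT, embedded))

# Neither pattern validates ranges: month 13 or hour 99 still "match".
out_of_range = ["2011-13-45T99:99:99+0000"]
print(flags(STRICT, out_of_range))
```

Both patterns give the same flags on the sample rows. The stricter one only matters if the column can contain surrounding text or other offsets, and neither replaces real date validation.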
Can you add the UDF code here? Let's see whether there is any room for improvement.