
Pyspark: weeks 202001 and 202053 (yyyyww) convert to null dates


I have a dataframe with a yearweek column that I want to convert to a date. The code I wrote seems to work for every week except '202001' and '202053', for example:

import pyspark.sql.functions as F

df = spark.createDataFrame([
    (1, "202001"),
    (2, "202002"),
    (3, "202003"),
    (4, "202052"),
    (5, "202053")
], ['id', 'week_year'])

df.withColumn("date", F.to_date(F.col("week_year"), "yyyyw")).show()

I can't figure out what goes wrong for these weeks or how to fix it. How can I convert weeks 202001 and 202053 into valid dates?

Dealing with ISO weeks in Spark is indeed a headache - in fact, this functionality is deprecated (removed?) in Spark 3. I think using the Python datetime utilities inside a UDF is a more flexible approach:

import datetime
import pyspark.sql.functions as F

@F.udf('date')
def week_year_to_date(week_year):
    # %G = ISO year, %V = ISO week number, %u = ISO weekday;
    # the appended '1' selects Monday, the first day of the week
    return datetime.datetime.strptime(week_year + '1', '%G%V%u')

df = spark.createDataFrame([
    (1, "202001"),
    (2, "202002"),
    (3, "202003"),
    (4, "202052"),
    (5, "202053")
], ['id', 'week_year'])

df.withColumn("date", week_year_to_date('week_year')).show()
+---+---------+----------+
| id|week_year|      date|
+---+---------+----------+
|  1|   202001|2019-12-30|
|  2|   202002|2020-01-06|
|  3|   202003|2020-01-13|
|  4|   202052|2020-12-21|
|  5|   202053|2020-12-28|
+---+---------+----------+
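
As a quick sanity check (my addition, plain Python rather than part of the original answer), each date returned above is the Monday whose ISO calendar round-trips to the original year and week:

import datetime

# Round-trip check: parse the week string, then confirm isocalendar()
# gives back the original ISO year, week number and weekday (1 = Monday).
for week_year in ["202001", "202053"]:
    d = datetime.datetime.strptime(week_year + "1", "%G%V%u").date()
    year, week, weekday = d.isocalendar()
    assert (year, week, weekday) == (int(week_year[:4]), int(week_year[4:]), 1)
    print(week_year, "->", d)   # 202001 -> 2019-12-30, 202053 -> 2020-12-28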

Building on mck's answer, this is the solution I ended up using for Python 3.5.2:

import datetime
from dateutil.relativedelta import relativedelta
import pyspark.sql.functions as F

@F.udf('date')
def week_year_to_date(week_year):
    # the appended '1' selects Monday, the first day of the week;
    # %Y%W%w counts weeks from the first Monday of the calendar year,
    # so subtract one week to line up with the ISO week start
    return datetime.datetime.strptime(week_year + '1', '%Y%W%w') - relativedelta(weeks=1)

df = spark.createDataFrame([
    (9, "201952"),
    (1, "202001"),
    (2, "202002"),
    (3, "202003"),
    (4, "202052"),
    (5, "202053")
], ['id', 'week_year'])

df.withColumn("date", week_year_to_date('week_year')).show()

Since I can't use '%G%V%u', which was only added in Python 3.6, I have to subtract one week from the parsed date to get the correct result.
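
To make that offset concrete (a small plain-Python sketch of my own, not from the original answer): for years such as 2019 and 2020, which do not start on a Monday, '%Y%W%w' lands exactly one week after the ISO week start that '%G%V%u' would give, hence the subtraction above:

import datetime

# '%Y%W%w' counts weeks from the first Monday of the calendar year,
# while '%G%V%u' uses ISO week numbering; for week 202001 they differ by 7 days.
naive = datetime.datetime.strptime("202001" + "1", "%Y%W%w")   # 2020-01-06
iso = datetime.datetime.strptime("202001" + "1", "%G%V%u")     # 2019-12-30
print((naive - iso).days)                                      # 7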

Thanks, this helped me a lot! Since this is my preferred approach, I'll accept your answer. Unfortunately we're running Python 3.5.2 on our cluster, so I had to fall back to an uglier solution - I'll add mine as a separate answer.