Warning: file_get_contents(/data/phpspider/zhask/data//catemap/2/python/352.json): failed to open stream: No such file or directory in /data/phpspider/zhask/libs/function.php on line 167

Warning: Invalid argument supplied for foreach() in /data/phpspider/zhask/libs/tag.function.php on line 1116

Notice: Undefined index: in /data/phpspider/zhask/libs/function.php on line 180

Warning: array_chunk() expects parameter 1 to be array, null given in /data/phpspider/zhask/libs/function.php on line 181
Python 如何计算pyspark中的日期差?_Python_Apache Spark_Dataframe_Pyspark_Apache Spark Sql - Fatal编程技术网

Python 如何计算pyspark中的日期差?

Python 如何计算pyspark中的日期差?,python,apache-spark,dataframe,pyspark,apache-spark-sql,Python,Apache Spark,Dataframe,Pyspark,Apache Spark Sql,我有这样的数据: df = sqlContext.createDataFrame([ ('1986/10/15', 'z', 'null'), ('1986/10/15', 'z', 'null'), ('1986/10/15', 'c', 'null'), ('1986/10/15', 'null', 'null'), ('1986/10/16', 'null', '4.0')], ('low', 'high', 'normal')) 我想计

我有这样的数据:

df = sqlContext.createDataFrame([
    ('1986/10/15', 'z', 'null'), 
    ('1986/10/15', 'z', 'null'),
    ('1986/10/15', 'c', 'null'),
    ('1986/10/15', 'null', 'null'),
    ('1986/10/16', 'null', '4.0')],
    ('low', 'high', 'normal'))

我想计算
low
列和
2017-05-02
列之间的日期差异,并用差异替换
low
列。我在stackoverflow上尝试过相关的解决方案,但两者都不起作用。

您需要将列
low
转换为class date,然后您可以将
datediff()
lit()
结合使用。使用Spark 2.2:

from pyspark.sql.functions import datediff, to_date, lit

df.withColumn("test", 
              datediff(to_date(lit("2017-05-02")),
                       to_date("low","yyyy/MM/dd"))).show()
+----------+----+------+-----+
|       low|high|normal| test|
+----------+----+------+-----+
|1986/10/15|   z|  null|11157|
|1986/10/15|   z|  null|11157|
|1986/10/15|   c|  null|11157|
|1986/10/15|null|  null|11157|
|1986/10/16|null|   4.0|11156|
+----------+----+------+-----+
使用,我们需要首先将
列转换为class
时间戳

from pyspark.sql.functions import datediff, to_date, lit, unix_timestamp

df.withColumn("test", 
              datediff(to_date(lit("2017-05-02")),
                       to_date(unix_timestamp('low', "yyyy/MM/dd").cast("timestamp")))).show()

或者,如何使用pySpark查找两个后续用户操作之间经过的天数:

import pyspark.sql.functions as funcs
from pyspark.sql.window import Window

window = Window.partitionBy('user_id').orderBy('action_date')

df = df.withColumn("days_passed", funcs.datediff(df.action_date, 
                                  funcs.lag(df.action_date, 1).over(window)))



+----------+-----------+-----------+
|   user_id|action_date|days_passed| 
+----------+-----------+-----------+
|623       |2015-10-21|        null|
|623       |2015-11-19|          29|
|623       |2016-01-13|          59|
|623       |2016-01-21|           8|
|623       |2016-03-24|          63|
+----------+----------+------------+

我有一个错误“TypeError:to_date()正好接受1个参数(2个给定)”,如果
low
列中有Nan值会怎么样?这是因为您使用的是Spark<2.2Tanks。以上只是一个测试数据。我的真实数据在
low
列中有许多值,无法转换到
时间戳中。当`cast(“timestamp”)`时,如何将这些值设置为NaN?就像pandas:pd.to_datetime(errors='concurve'),那么它将默认为
null
,这不是您想要的吗?