
Python: How do I compute differences between rows in PySpark?


This is my DataFrame in PySpark:

utc_timestamp               data    feed
2015-10-13 11:00:00+00:00   1       A
2015-10-13 12:00:00+00:00   5       A
2015-10-13 13:00:00+00:00   6       A
2015-10-13 14:00:00+00:00   10      B
2015-10-13 15:00:00+00:00   11      B
The values of data are cumulative.

I want to get this result (the difference between consecutive rows, grouped by feed):
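
For clarity, here is the expected output reconstructed from the rule above (each row minus the previous row within its feed; the first row of each feed is unchanged, i.e. minus 0):

utc_timestamp               data    feed
2015-10-13 11:00:00+00:00   1       A
2015-10-13 12:00:00+00:00   4       A
2015-10-13 13:00:00+00:00   1       A
2015-10-13 14:00:00+00:00   10      B
2015-10-13 15:00:00+00:00   1       B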

In pandas, I would do this:

df["data"] -= (df.groupby("feed")["data"].shift(fill_value=0))
How do I do the same in PySpark?
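
For reproducing the answers below, a minimal sketch that builds this DataFrame (it assumes an existing SparkSession named spark; the timestamp is kept as a string for simplicity):

rows = [
    ("2015-10-13 11:00:00+00:00", 1, "A"),
    ("2015-10-13 12:00:00+00:00", 5, "A"),
    ("2015-10-13 13:00:00+00:00", 6, "A"),
    ("2015-10-13 14:00:00+00:00", 10, "B"),
    ("2015-10-13 15:00:00+00:00", 11, "B"),
]
df = spark.createDataFrame(rows, ["utc_timestamp", "data", "feed"])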

You can do this with a window function:

from pyspark.sql.window import Window
import pyspark.sql.functions as f

# Partition by feed and order by timestamp, so "previous row" means the
# previous timestamp within the same feed
w = Window.partitionBy("feed").orderBy("utc_timestamp")

# lag(col, 1, 0) returns the previous row's value, or 0 for the first row of each feed
df = df.withColumn("data", f.col("data") - f.lag(f.col("data"), 1, 0).over(w))
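
To check the result against the expected output, you can sort within each feed (a minimal usage sketch):

df.orderBy("feed", "utc_timestamp").show(truncate=False)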

You can use lag instead of shift, and coalesce(..., F.lit(0)) instead of fill_value=0:

from pyspark.sql.window import Window
import pyspark.sql.functions as F

window = Window.partitionBy("feed").orderBy("utc_timestamp")

data = F.col("data") - F.coalesce(F.lag(F.col("data")).over(window), F.lit(0))
df.withColumn("data", data)
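
Note that withColumn returns a new DataFrame, so assign the result back (e.g. df = df.withColumn("data", data)) to keep it. The coalesce is needed because lag() returns null for the first row of each partition; a quick way to see this (a sketch that adds a hypothetical prev column just for inspection):

df.withColumn("prev", F.lag("data").over(window)).show()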


You can avoid the coalesce by setting the default value to 0 directly in the lag function: lag(col, offset=1, default=0).
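
A minimal sketch of that variant, assuming the same window as above:

from pyspark.sql.window import Window
import pyspark.sql.functions as F

window = Window.partitionBy("feed").orderBy("utc_timestamp")

# default=0 makes lag() return 0 instead of null on the first row of each feed,
# so no coalesce is needed
df = df.withColumn("data", F.col("data") - F.lag("data", offset=1, default=0).over(window))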