Apache Spark: create a new column in a PySpark DataFrame from future values of another column
I have a DataFrame with 3 columns, and I want to create a new column using the values of the projection column.

How do I pick the values from projection? It should pick 3 consecutive projection-year values; for example, to create the new column for 2020 it should pick the values from 2021, 2022 and 2023.

I tried the following SQL:

spark.sql('''select serial_number, pit_projection as projection_j1, year,
             lead(pit_projection, 3) over (partition by serial_number order by year) as projection_j2
             from table
             where serial_number = 1
             order by year''').show(50, truncate=False)

but that is not the full answer.
You can use the lag function to get previous values (an appropriate sort / order by is required beforehand).
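The lead/lag semantics can be illustrated in plain Python (a minimal sketch, not Spark code): lead(offset) looks `offset` rows ahead and lag(offset) looks behind, returning null where the frame runs off the data.

```python
# Plain-Python illustration of Spark's lead()/lag() over an ordered column.
years = ["2020", "2021", "2022", "2023"]

def lead(values, offset):
    # value `offset` rows ahead, None past the end (like Spark's lead())
    return [values[i + offset] if i + offset < len(values) else None
            for i in range(len(values))]

def lag(values, offset):
    # value `offset` rows behind, None before the start (like Spark's lag())
    return [values[i - offset] if i - offset >= 0 else None
            for i in range(len(values))]

print(lead(years, 1))  # ['2021', '2022', '2023', None]
print(lag(years, 1))   # [None, '2020', '2021', '2022']
```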
If you add an example table with the desired output to the question (since only the source data is there), I can try to reproduce it and update the answer (the code above is untested). This can be achieved using a window function with a rows frame. I am giving the answer in Scala, but the same can certainly be implemented in Python:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._ // for toDF

// sample data: (serial_number, projection, year)
val df = Seq(("1","358934","2020"), ("1","38639","2021"), ("1","983434","2022"), ("1","234","2023"),
  ("1","2325","2024"), ("1","4545","2025"), ("1","7675","2026")).toDF("serial_number", "projection", "year")

// frame covering the 3 rows (years) after the current row, ordered by year
val partitionExpr = Window.orderBy("year").rowsBetween(1, 3)
df.withColumn("total", sum("projection").over(partitionExpr)).show(false)
See the Spark documentation for more information on ROWS and RANGE window frames.
You can do the same in PySpark:
# Import functions
import pyspark.sql.functions as f
from pyspark.sql import Window

# create the data frame
df = spark.createDataFrame(
    [("1", "358934", "2020"),
     ("1", "38639", "2021"),
     ("1", "983434", "2022"),
     ("1", "234", "2023"),
     ("1", "2325", "2024"),
     ("1", "4545", "2025"),
     ("1", "7675", "2026")],
    ("serial_number", "projection", "year"))

# Identify the window partition: the 3 rows after the current row, ordered by year
part_window = Window.orderBy("year").rowsBetween(1, 3)

# apply the window to get the new column
df1 = df.withColumn("projection2", f.sum("projection").over(part_window))
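The same frame can also be written in plain SQL as ROWS BETWEEN 1 FOLLOWING AND 3 FOLLOWING. As a sketch, sqlite3 (3.28 or later) supports the same standard window-frame syntax, so it can stand in for spark.sql here; the table name projections is made up for the demo:

```python
import sqlite3

# In-memory table with the same sample data (projection stored as integers)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE projections (serial_number TEXT, projection INTEGER, year TEXT)")
rows = [("1", 358934, "2020"), ("1", 38639, "2021"), ("1", 983434, "2022"),
        ("1", 234, "2023"), ("1", 2325, "2024"), ("1", 4545, "2025"),
        ("1", 7675, "2026")]
con.executemany("INSERT INTO projections VALUES (?, ?, ?)", rows)

# SUM over the 3 rows following the current one, per serial_number, ordered by year
query = """
SELECT year,
       SUM(projection) OVER (PARTITION BY serial_number ORDER BY year
                             ROWS BETWEEN 1 FOLLOWING AND 3 FOLLOWING) AS projection2
FROM projections
ORDER BY year
"""
for year, projection2 in con.execute(query):
    print(year, projection2)
```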
I added my desired output to the question. Basically, I want to create a new column called projection2 using a formula: the value of projection2 for 2020 should equal the sum of the projection values from 2021 to 2023, and so on.
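As a sanity check of that formula without a Spark session, the rowsBetween(1, 3) sum can be reproduced in plain Python (next_n_sum is a throwaway helper for this sketch, not a Spark API):

```python
# projection values in year order: 2020..2026
projections = [358934, 38639, 983434, 234, 2325, 4545, 7675]

def next_n_sum(values, start, end):
    # for each row, sum the values of rows start..end positions ahead,
    # None where the frame is empty (mirrors rowsBetween(start, end))
    out = []
    for i in range(len(values)):
        window = values[i + start : i + end + 1]
        out.append(sum(window) if window else None)
    return out

print(next_n_sum(projections, 1, 3))
# [1022307, 985993, 7104, 14545, 12220, 7675, None]
```

So projection2 for 2020 is 38639 + 983434 + 234 = 1022307, matching the formula in the comment.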