Closest date from one column to another in a PySpark DataFrame
I have a PySpark DataFrame that lists the buying price of each commodity, but no data on when the commodity was bought; I only have a one-year date range:
+---------+------------+----------------+----------------+
|Commodity| BuyingPrice|Date_Upper_limit|Date_lower_limit|
+---------+------------+----------------+----------------+
| Apple| 5| 2020-07-04| 2019-07-03|
| Banana| 3| 2020-07-03| 2019-07-02|
| Banana| 4| 2019-10-02| 2018-10-01|
| Apple| 6| 2020-01-20| 2019-01-19|
| Banana| 3.5| 2019-08-17| 2018-08-16|
+---------+------------+----------------+----------------+
I have another PySpark DataFrame with the market price of every commodity by date:
+----------+----------+------------+
| Date| Commodity|Market Price|
+----------+----------+------------+
|2020-07-01| Apple| 3|
|2020-07-01| Banana| 3|
|2020-07-02| Apple| 4|
|2020-07-02| Banana| 2.5|
|2020-07-03| Apple| 7|
|2020-07-03| Banana| 4|
+----------+----------+------------+
For each commodity, I want the date closest to Date_Upper_limit on which that commodity's market price (MP) was <= its buying price (BP).
Expected output (top 2 rows):
+---------+-----------+----------------+----------------+--------------------------------+
|Commodity|BuyingPrice|Date_Upper_limit|Date_lower_limit|Closest Date to UL when MP <= BP|
+---------+-----------+----------------+----------------+--------------------------------+
|   Banana|        3.0|      2020-07-03|      2019-07-02|                      2020-07-02|
|    Apple|        5.0|      2020-07-04|      2019-07-03|                      2020-07-02|
+---------+-----------+----------------+----------------+--------------------------------+
Use a conditional join and a window function:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("Commodity")

# first dataframe shown is df1, second is df2
df1.join(
        df2.withColumnRenamed("Commodity", "Commodity1"),
        F.expr("`Market Price` <= BuyingPrice and Date < Date_Upper_limit and Commodity == Commodity1"),
    ) \
    .drop("Market Price", "Commodity1") \
    .withColumn("max", F.max("Date").over(w)) \
    .filter("max == Date") \
    .drop("max") \
    .withColumnRenamed("Date", "Closest Date to UL when MP <= BP") \
    .show()
#+---------+-----------+----------------+----------------+--------------------------------+
#|Commodity|BuyingPrice|Date_Upper_limit|Date_lower_limit|Closest Date to UL when MP <= BP|
#+---------+-----------+----------------+----------------+--------------------------------+
#| Banana| 3.0| 2020-07-03| 2019-07-02| 2020-07-02|
#| Apple| 5.0| 2020-07-04| 2019-07-03| 2020-07-02|
#+---------+-----------+----------------+----------------+--------------------------------+
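The same logic can be checked without Spark. The sketch below is a plain-Python analogue of the conditional join followed by the "latest qualifying date" selection, using the sample data from the question; the variable names and the dict-based result shape are illustrative only, not part of the PySpark answer.

```python
# Plain-Python sketch of: join on commodity where MP <= BP and Date < UL,
# then keep only the latest qualifying date per (commodity, price) row.
df1 = [
    ("Apple", 5.0, "2020-07-04", "2019-07-03"),
    ("Banana", 3.0, "2020-07-03", "2019-07-02"),
    ("Banana", 4.0, "2019-10-02", "2018-10-01"),
    ("Apple", 6.0, "2020-01-20", "2019-01-19"),
    ("Banana", 3.5, "2019-08-17", "2018-08-16"),
]
df2 = [
    ("2020-07-01", "Apple", 3.0),
    ("2020-07-01", "Banana", 3.0),
    ("2020-07-02", "Apple", 4.0),
    ("2020-07-02", "Banana", 2.5),
    ("2020-07-03", "Apple", 7.0),
    ("2020-07-03", "Banana", 4.0),
]

result = {}
for commodity, bp, ul, _ll in df1:
    # conditional join: same commodity, MP <= BP, date strictly before UL
    dates = [d for d, c, mp in df2 if c == commodity and mp <= bp and d < ul]
    if dates:
        # ISO dates compare chronologically as strings; max() = closest to UL
        result[(commodity, bp)] = max(dates)

print(result)
```

Rows with no qualifying market-price date (e.g. Banana at 4.0, whose upper limit predates all of df2's dates) drop out, which is why only two rows appear in the expected output.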
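The `F.max("Date").over(w)` followed by `filter("max == Date")` pattern keeps, within each partition, only the rows whose date equals the partition maximum. A plain-Python sketch of that window step (the helper name is made up for illustration):

```python
# Illustrative analogue of F.max("Date").over(Window.partitionBy("Commodity"))
# followed by filter("max == Date"): keep rows matching their group's maximum.
def keep_partition_max(rows, key, value):
    """Keep only rows whose value equals the max within their key group."""
    maxima = {}
    for r in rows:
        k = key(r)
        maxima[k] = max(maxima.get(k, value(r)), value(r))
    return [r for r in rows if value(r) == maxima[key(r)]]

rows = [
    ("Banana", "2020-07-01"),
    ("Banana", "2020-07-02"),
    ("Apple", "2020-07-01"),
    ("Apple", "2020-07-02"),
]
# keeps ("Banana", "2020-07-02") and ("Apple", "2020-07-02")
print(keep_partition_max(rows, key=lambda r: r[0], value=lambda r: r[1]))
```

Note that, unlike `row_number`, this pattern keeps ties: if two rows in a partition share the maximum date, both survive the filter.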